Insider Training: Data and Reward Poisoning Attacks with Data-Level Defences
Year: 2025 - 2026
Role: AI Researcher
Duration: ~7 months
Relevant Links: Preliminary preprint, Blogpost
Summary: Ongoing research into data and reward poisoning attacks and data-level defences
I’m researching data poisoning attacks on frontier models through RL (GRPO) and SFT fine-tuning experiments, and developing blue-team defences: InspectAI evaluations, statistical dataset analysis, and controllable oversight methods.
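To give a flavour of what "statistical dataset analysis" can mean as a data-level defence, here is a minimal, purely illustrative sketch: screening per-sample statistics (e.g. fine-tuning loss or a reward signal) for outliers using a median-absolute-deviation score. The scores, threshold, and framing are assumptions for illustration, not the project's actual method.

```python
import statistics

def flag_outliers(scores, z_thresh=3.5):
    """Flag indices whose modified z-score (based on the median absolute
    deviation) exceeds z_thresh -- a crude screen for samples whose
    statistics look anomalous enough to warrant manual review."""
    med = statistics.median(scores)
    mad = statistics.median(abs(s - med) for s in scores)
    if mad == 0:  # no spread: nothing can be distinguished
        return []
    return [i for i, s in enumerate(scores)
            if 0.6745 * abs(s - med) / mad > z_thresh]

# Hypothetical per-sample losses; index 5 is suspiciously easy for the model.
losses = [2.1, 2.3, 1.9, 2.0, 2.2, 0.1, 2.4, 2.2]
print(flag_outliers(losses))  # → [5]
```

A robust statistic (median/MAD) is used rather than mean/standard deviation so that a small number of poisoned samples cannot mask themselves by inflating the spread.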
This work is supervised by Mary Phuong (Google DeepMind). The project has received $120K+ in extended funding and will be submitted to a top AI conference (ICML 2026).
A preliminary preprint is linked above for anyone interested. Note that the work will change substantially before final submission.