I'm a PhD researcher at the University of Oxford, where I focus on language model explainability and interpretability. My current research explores whether models can reliably explain their outputs in natural language, a key requirement for trustworthy AI and effective human-computer interaction. I've previously worked on mechanistic interpretability problems and continue to contribute to this area, though it is less of a priority for me at the moment.
Aside from my main PhD research, I work broadly on LLM evals. This includes the LingOly reasoning benchmark (oral at NeurIPS 2024, top 0.5% of papers) and LingOly-TOO. Beyond individual benchmarks, I'm interested in building more rigorous evaluation standards and principled ways to aggregate results across many benchmarks, and I'm currently involved in several projects aimed at advancing this goal.
I'm a member of the Reasoning with Machines Lab, and am supervised by Dr Adam Mahdi (Oxford Internet Institute) and Professor Jakob Foerster (Department of Engineering Science).