I'm a PhD researcher at the University of Oxford, where I work on language model explainability and interpretability. My current research explores whether models can reliably explain their outputs in natural language, a key requirement for effective human-computer interaction and potentially a major tool for monitoring the cognition of advanced AI. I also work on mechanistic interpretability problems, though these are less of a priority for me at the moment.
Alongside my main PhD research, I also work on LLM evals and the science of evals. My publications include the LingOly reasoning benchmark (NeurIPS 2024 oral, top 0.5% of papers) and LingOly-TOO. Beyond individual benchmarks, I'm interested in building more rigorous standards and methods for aggregating results across many benchmarks, and I'm currently involved in several projects aimed at advancing this goal.
I'm a member of the Reasoning with Machines Lab, and am supervised by Prof. Adam Mahdi (Oxford Internet Institute) and Prof. Jakob Foerster (Department of Engineering Sciences).