AI Safety Researchers Should Care About Eval Quality

A response to UK AISI's "Lessons from a Chimp: AI 'Scheming' and the Quest for Ape Language": evals guide the allocation of scarce AI safety resources, and low-quality evals distort this mechanism.


August 28, 2025 · Substack

A theory of change for evals

Recently, UK AISI published "Lessons from a Chimp: AI 'Scheming' and the Quest for Ape Language", a critique of the scheming evals literature. Predictably, it has been met with a mixed response: some say it summarises issues many researchers have had with AI safety evals for years, while others argue, privately, that the paper misses the point.

It is hard to argue against the paper's core recommendations. Good evals shouldn't rely on anecdotal evidence; good evals should have suitable control conditions; good evals should carefully define the concept being tested; and good evals shouldn't exaggerate findings. These are relatively uncontroversial. I suspect that the disagreement is over (i) the extent to which these apply to the scheming literature, and (ii) the importance of eval quality.

AISI's paper takes a good stab at (i), showing how these issues apply to the scheming literature; one may agree or disagree with its case (see Evan Hubinger's take). However, it regrettably gives little attention to (ii), the importance of eval quality, writing as if this were common wisdom.

I suspect most people in AI safety would find it hard to articulate precisely why this matters. This post is my attempt to do so.

To start, here is my theory of change for evals.

Step 1: Enumerate the threats

Taken together, AI safety evals form a list of threats we might care about. A threat makes it onto the list if there is an existence proof. This requires evidence of a capability (the ability to do something) rather than claims about propensity (the AI's willingness or likelihood of doing it). Many papers have explicitly tried to play this enumeration role, e.g. Google's Evaluating Frontier Models for Dangerous Capabilities, culminating in taxonomies like the MIT AI Risk Repository.

Step 2: Quantify severity levels

Once we know what threats we should care about, we need to know how much to care about them. In other words, we need to quantify the severity of each threat.

We do this by measuring the propensity of a threat. How often might this behaviour emerge in a deployment-like scenario? How difficult is it to elicit this behaviour? The propensity, combined with the impact of the threat, informs us of the severity.
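As a rough illustration, you could think of severity as a function of propensity and impact. Here is a minimal sketch of that idea; the multiplicative form, the threat names, and all numbers are assumptions made up for the example, not estimates from any eval or from the AISI paper.

```python
# Illustrative only: a toy severity model where severity = propensity * impact.
# All names and numbers are placeholders, not real estimates.

def severity(propensity: float, impact: float) -> float:
    """Combine how likely a behaviour is to emerge in deployment-like
    conditions (propensity, 0-1) with how costly it would be if it did
    (impact, arbitrary units)."""
    return propensity * impact

threats = {
    "scheming": {"propensity": 0.05, "impact": 100.0},
    "cyber-misuse": {"propensity": 0.20, "impact": 30.0},
}

for name, t in threats.items():
    print(name, severity(t["propensity"], t["impact"]))
```

In practice the combining function would be far messier than a product, but the point stands: an eval that inflates propensity inflates the severity estimate downstream.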

Step 3: Allocate resources efficiently

Finally, with the threats enumerated and their severities quantified, we can allocate AI safety resources efficiently to mitigate them.

Here, it is helpful to view AI safety as a resource allocation problem: there are fewer researchers than optimal, grant programmes receive more applications than they can fund, and so on. Evals are a mechanism for directing the allocation of scarce resources, whether centralised or decentralised. They help assign resources to the issues with the highest marginal benefit.
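To make the allocation framing concrete, here is a minimal sketch of a greedy allocator under an assumed diminishing-returns rule: each unit of budget goes to whichever threat currently has the highest remaining marginal benefit. The severity scores and the diminishing-returns formula are assumptions for illustration; nothing here claims to model how funders actually decide.

```python
# Illustrative only: greedy allocation of a fixed budget by marginal benefit,
# assuming simple diminishing returns. Severity scores are placeholders.

def allocate(severities: dict[str, float], budget: int) -> dict[str, int]:
    allocation = {name: 0 for name in severities}
    for _ in range(budget):
        # Marginal benefit of one more unit on each threat, with diminishing returns.
        best = max(severities, key=lambda n: severities[n] / (1 + allocation[n]))
        allocation[best] += 1
    return allocation

print(allocate({"scheming": 5.0, "cyber-misuse": 6.0, "societal-impacts": 3.0}, budget=10))
```

The toy model makes the later point visible: if one severity score is inflated, the allocation shifts towards it and away from everything else.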

AI safety evals are particularly important compared to other evals because the consequences of misallocation are more severe. Society needs the field to be allocatively efficient to make progress as fast as possible.

Hype and 'scary demos' distort this mechanism

So where can this go wrong? Under this theory, evals need to reflect severity accurately in order to guide resource allocation. When they do, the mechanism works as intended.

To use the example of scheming: if a new eval shows a particularly high propensity for scheming, it will no doubt inspire more technical research on preventing scheming and more grants to fund that research.

This is ideal.

However, if an eval misrepresents the true severity of a threat, it distorts the mechanism at the cost of researching other threats. Returning to the critiques in the paper: if a scheming eval shows models behaving badly, but cannot separate bad behaviour caused by malicious misalignment from a more benign mechanism such as instruction following, it would be an overclaim to suggest there is likely a high propensity for scheming in real deployment scenarios.

In short, good evals cause justifiable shifts in resources; bad evals cause misallocation. Since AISI takes an active role in setting the AI safety agenda, it is unsurprising that they're making these critiques.

A counterargument

There is a counterargument that anecdotal evidence or so-called 'scary demos' can be useful for bringing more funding into the field, thereby expanding the total resources available. The argument goes:

"Yes, some evals may overstate the propensity of a threat, distorting the allocation mechanism, but so what! Who cares! The increase in talent and funding offsets this and the net effect is positive."

Some AI safety researchers, particularly those with short timelines, actively advocate for this approach. It's hard to quantify the net effect, but I'd expect this to be positive: scary demos probably do increase resources more than they distort the allocation mechanism!

However, we're not factoring in all the costs.

First, systematically overestimating the true propensity of a threat in pursuit of funding risks the long-term credibility of the field in the eyes of funders and the general public.

Second, additional resources are likely funnelled into specific threats, and any rebalancing is slow. Scheming demos are unlikely to cause more resources to flow into societal impacts work.

Third, low-quality evals make it harder to measure progress. The quest of AI safety is ultimately not just to evaluate risks, but to mitigate them. Getting signal on technical progress requires rigorous scientific measurement.

Everyone in AI safety wants the field to progress as fast as possible, but if we are to do this sustainably, we must focus on making high-quality evals!

Thanks to Ryan Othniel Kearns, Ze Shen, and Nikita Ostrovsky for insightful discussions about these ideas.