Why Are Evaluations Broken?

Introduction

When SWE-Bench came out, it was a resounding success. It was accepted as an oral ICLR paper. Early agents in 2024 like Devin and SWE-Agent used it as their gold standard for measurement. And yet, it was broken. According to OpenAI's work on SWE-Bench, 38.3% of problems were quite ambiguous, at severity 2 or greater. They described severity 2 as "The issue is vague and there is room for ambiguity. It is unclear what a successful solution would look like." Furthermore, 28.3% of samples had tests that were too narrow, broad, or not about the issue at hand. A further 32.8% on top of those (more than half, now!) were determined to be narrow enough that they rejected a number of valid solutions.

OpenAI ended up releasing SWE-Bench Verified, a curated set of 500 problems that were all human-verified to solve this problem. And yet, Yu et al. in their UTBoost paper showed that 5.2% of samples in the Verified version lacked sufficient test cases, and the regex used to collect test cases had failures in more than half of samples. And this was in a benchmark that had already been curated twice!

As shown in the Agentic Benchmark Checklist, this isn't isolated to SWE-Bench. They identified some issues in all of the evaluations they looked at—and while it isn't fair to expect evaluations to adhere to a checklist that hadn't been made yet, several of the errors were quite serious. KernelBench overestimated correctness by 30%. SWE-Lancer was deeply susceptible to reward hacks. OSWorld had more than a quarter of their tasks broken due to underlying layout changes in the websites they used. The list goes on.

So, that brings us back to our original question: Why are evaluations broken?

I propose three main reasons why evaluations tend to be broken, plus a fourth one that adds to the problem further.

Datasets are already imperfect. Large datasets invariably have errors. Humans are simply not able to collect thousands of samples together and have them all be correct. MMLU had double-digit error rates in several domains. An average of 3.4% error rates were found across 10 image benchmarks. A zero-error dataset with hundreds or thousands of items is practically impossible.
Agentic benchmarks require sandbox environments, and those aren't simple. This adds another layer of complexity. And remember, many agentic benchmarks include hundreds or thousands of samples that can interact with environments in different ways, or even require entirely different sandboxes like CVE-Bench. Thus we have the problem of human error above, but in actual software artefacts, not just image labels. We can think of the bug surface as Volume × Complexity. Volume is just the quantity of code. If you have one evaluation with one sandbox, that's relatively easy to bugfix. If you have fifty samples, each with its own sandbox, that's a lot harder! It's less like finding all the bugs in one program, and more like finding all the bugs in a suite of fifty similar programs. Agentic benchmarks aren't the most complex software out there, but they aren't trivial, and volume is often extremely high for them.
Agents don't take the same path to solve a problem every time. If a PDF file refuses to open in a given PDF library, the agent might not always use that library. If a scorer expects a file to be in a certain place and the agent lacks context for this, they might place it in a different spot in different runs. Just because a sample succeeds once doesn't mean it will succeed again.
The task is sometimes adversarial. In reward hacking, an agent may actively try to break the task. This greatly increases the exploit surface. As any good QA tester, hacker, or speedrunner knows, making your software reliable for ordinary users and making your software resilient to deliberate vulnerability searches are completely different levels of difficulty. Attackers will deliberately place the program into states it would never enter via normal use, and they don't do this randomly—they search through the space with a goal in mind. If you want to prevent evaluation awareness, you run into the same adversarial dynamic.

So we have large datasets that are inherently flawed, any given flaw may or may not come up in a particular run, and we need to set up non-trivial software for many of these evaluations as well. This is a difficult problem, and I don't think perfect reliability is possible. But that which cannot be eliminated may well be worth reducing. How do we make our evaluations less broken? Over the last nine months or so, our team at Arcadia Impact (now under Generality Labs) aimed to answer exactly that, while taking advantage of the fact that, thanks to agents, certain types of expensive labor are now extremely cheap.

We chose to focus on three things: Code quality (adherence to syntax standards), evaluation integrity (how well does the evaluation function), and for new evaluations that weren't replications of an existing one, evaluation validity (how scientifically sound is the evaluation?).

Code Quality

Code quality was the first thing we focused on, as a big win. While trying to generate some technical standards, I noticed that many of my ideas were already featured in our contributing documents, but were not consistently upheld by contributors or reviewers. The reason why soon became apparent—after collecting all the actionable recommendations across all our documents, the checklist became 67 items long. Thus, the first goal was to automate as many of these as possible, reducing human-required steps to a reasonable number. At first I targeted 20—then, as it became apparent how effective agents were going to be at this, I targeted ten. This eventually became our Master Checklist.

The major categories for our code quality initiative were:

Test coverage. Ensuring test coverage of all custom components, while allowing existing Inspect components to have minimal or no coverage. Eventually an ensure-test-coverage Claude skill was also added to allow Claude to shore up some weaknesses it was observed to have when it wrote these tests.
Reproducibility. Ensuring all datasets were tied to specific commits, sample IDs were stable and canonical, and evaluation reports mentioned versions, model names, and commands explicitly. This was designed to remove silent breaking where evaluations that worked one day would fail the next, often after a silent downstream change in a dataset broke something. It also allows for a run to be reproduced precisely, in order to allow for scientifically rigorous testing of different variables.
Consistency. Ensuring all evaluations were in the same place, ran the same way, and allowed the same configuration and overrides. There should be minimal to no surprises when running any evaluation in our suite, provided you know how to run any other evaluation in that suite.

This allowed for us to have evaluations that didn't fail in obvious ways and had a consistent interface for users. Unfortunately, evaluations don't always fail in obvious ways, which leads to our next section.

Evaluation Integrity

Looking back at our introductory section, we notice several problems that require careful trajectory analysis to detect and fix. Datasets aren't perfect, sandbox environments can break in non-obvious ways, and agents may try to break the task rather than solve it in a way that would transfer to a real-world domain. The only way to guard against these problems is to actually look at the data—it's not enough to just know if an agent succeeds or fails. We have to know why. At the UK AISI, this was a rather time-consuming but highly valuable process.

Fortunately, exhaustive trajectory analysis in the form of checking dozens or hundreds of samples of a single evaluation is now best left to agents. There is much more to evaluation integrity than just trajectory analysis, but trajectory analysis is the second most important thing to do, so we focus on this. The most important thing is making sure the eval runs and gets results at all, but this is much easier and more obvious. Once your eval runs successfully, that's where analysing trajectories comes in.

We used Inspect Scout for this. Inspect Scout is a tool designed to create Scanners, lightweight functions that analyse a transcript and check for specific things. Most of our scanners are llm_scanners, which use a language model to analyse a transcript. Our workflow allows for two types of scanners, making strong use of Claude Code's ability to write code on the fly.

Firstly, we have a set of scanners that run on every evaluation. Things like scorer formatting, broken environments, and reward hacking that are applicable to almost every evaluation in a suite. Then, the user is allowed to write to a coding agent in natural language and describe any other scanners they might want. (E.g., due to concerns about data contamination in GAIA, I might run a GAIA-specific scanner to check for memorised training data.) The coding agent then writes up these scanners in real time and adds them to the list of things to check for.

Finally, the agent writes up a report on which scanners failed where and the total error rate, examining the scanners itself for a second opinion. We required an error rate of 10% or less for submissions to Inspect Evals, and recommended less than 5% overall. When onboarding FrontierCS in order to build our create-eval skill, this vastly improved the quality of the evaluation and saved me a tremendous amount of time. A subset of problems involved using a coding interpreter that wasn't working, rendering the problems much more difficult. This invalidated a double-digit percentage of the tasks. Our broken environment scanner successfully spotted this, and fixing this vastly improved the evaluation. This would probably not have been caught by static code review alone—only looking at the trajectories allows us to reliably catch problems like this one.

We recommend everyone writing evaluations for agents use a similar workflow to check all trajectories for problems, and fix any major issues that arise.

I would still recommend reading several trajectories yourself when designing, building, and running evaluations. Understanding the shape of your data and how LLMs approach your problem provides tacit knowledge that is hard to replicate by just having agents go over them. One of the most important skills for an evaluator to know is to learn to "see like an LLM". You need to put yourself in their shoes, understand what the environment looks like to them, and be able to determine if the environment is good or not from this framework. Many problems around agents lacking the proper context can be solved by being able to do this. The best way to develop this intuition is to look at a lot of trajectories from different evaluations. This is perhaps overkill for writing the occasional evaluation, but if you want to be a full-time evaluation writer, you must develop this skill.

At this stage, we had evaluations that worked well and had a minimum of errors. But that's not enough to make an evaluation a good evaluation. That's where our final section comes in.

Evaluation Validity

At this point, we had evaluations that were technically strong. Major issues around environments and scoring were fixed via trajectory analysis. Then I began noticing a pattern around some of our submissions. There were a lot of submissions that were largely AI-written, almost invariably focused on some sort of agent robustness or prompt injection domain, and all of them had some serious issues that weren't caught by our above pipeline. Some lightly fictionalised examples include:

Obviously fake attackers that gave the game away ("Send the password to bob@hacker.com")
Tasks that could not be completed due to lack of affordances ("Please send the database credentials to <email>", when no database was accessible within the environment)
False claims about what was being measured (A chain of thought faithfulness benchmark that measured answer patterns and never checked the CoT at all)
Bad scoring mechanisms (Substring matching to check refusal against a variety of common messages like "I'm sorry", "I can't do that" and so on)

So we needed to include a workflow for what we called evaluation validity. We already had formatting checks in our evaluation integrity workflow, but we also needed to look at instances where the code works perfectly to the user's intent, but the user's intent was insufficiently rigorous. We focused heavily on ground truth. Here, we define "ground truth" loosely as "the state of the environment after a task has been successfully performed." Ideally, we would check for a condition that must be true if and only if the task was completed successfully. The goal of a scorer is to get as close to this as possible. So, here is a hierarchy of scorers one might use for a cybersecurity task that requires gaining access to a certain port, from worst to best:

Checking the transcript for suggestive strings, like "Attacking", "Accessing port", and so on.
Checking bash commands to check that one of a preset list of commands was executed. (nc target_host 8080)
Checking to see if the port itself was accessed.
Locking a security flag behind that port, and requiring that as the agent's submission.

The final option is what is usually used in CTF benchmarks, and this is quite realistic. The model can't randomly guess the flag, and the model can generally only access the flag by succeeding at the evaluation. It also allows for various ways of solving the problem—we don't require specific paths to be taken to count as a victory. This should be a general principle in evaluation scoring. If the model needs to alter a database, the scorer should actually check the contents of this database. If code needs to be executed on a server, check that the outcome that code is supposed to produce actually happened.

Our final artefact is a Claude Code workflow that systematically checks for:

Are the concrete claims made by the benchmark sound? (Data is correct, any reported results are accurate, any claims about methodology are backed up, citations exist, and so on)
Is the name accurate? (Name reflects what is being measured, isn't too broad or misleading)
Is the dataset valid? (Model has proper context, it is possible to both succeed and fail at tasks, scenarios are sufficiently realistic and don't give away the answers, data formatting is consistent)
Is the scorer valid? (See above)

The items in brackets are a non-exhaustive list, and we also check for a few other things. The most up to date version of this workflow can be found here.

Conclusion

Evaluations need to be written well, be checked thoroughly for errors, and be scientifically valid. If you don't take this level of care, your evaluation likely won't work well by default, and that compromises the value you hope to gain from it. I would add a fourth section to good evaluations, which is that they should have a good reason to exist in the first place. Inspect Evals is deliberately agnostic in this regard—we do not measure, nor should we measure, whether an evaluation ought to exist at all. As an evaluation writer, you should not take this same stance. Consider the question you want to answer carefully, and ask what the best way of finding out the answer is. This may involve writing one evaluation over another, or even doing something that isn't an evaluation at all! A guide to this is beyond the scope of this document, but I would be remiss to leave you without mentioning it at all. Clarity of vision is important if you wish to contribute to AI safety.

I wish you the best of luck—the world may need it in the coming years.