Benchmarks, evals, productivity


There is an assortment of attempts to quantify model performance on a variety of tasks (also known as evaluations, or evals). At the moment, there are three significant caveats to keep in mind:

  1. Performance on an eval doesn’t necessarily generalise to any specific task you might be interested in.
  2. It’s quite possible that the training for some models is tailored towards better performance on benchmarks (this is known as “overfitting”) which, again, means benchmark results don’t necessarily reflect real world performance.
  3. When it comes to problem solving, if information related to the eval tasks was somehow present in the training data, the eval short-circuits into the model recalling that data rather than demonstrating genuine capability, which isn’t useful.

Of course, model and tool/agent evaluation is a multi-dimensional problem. In practical terms, we have to consider things like API pricing, context size, or RAM requirements for local models. I’ve listed some resources to assess and compare different facets below.

Models

llm-stats.com maintains an LLM leaderboard for the GPQA benchmark and allows you to compare a variety of models on a large set of parameters including input and output costs.

LLM pricing calculator by Simon Willison is good for a quick lookup of token prices.
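
As a quick back-of-envelope check, API cost is just token counts multiplied by per-token prices. The Python sketch below uses assumed placeholder prices, not any particular provider’s figures:

    # Back-of-envelope API cost estimate. The per-million-token prices are
    # assumed placeholders -- look up current figures for your provider/model.
    input_price_per_mtok = 3.00    # USD per 1M input tokens (assumption)
    output_price_per_mtok = 15.00  # USD per 1M output tokens (assumption)

    input_tokens = 12_000
    output_tokens = 1_500

    cost = (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
    print(f"Estimated cost: ${cost:.4f}")  # $0.0585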

The authors of the Aider CLI tool maintain a leaderboard for a multi-language benchmark based on higher difficulty Exercism exercises. The leaderboard includes both success percentage and cost.

There is work happening to keep benchmark problems out of training data.

There is also work to evaluate different types of tasks, for example GSO sets up performance optimisation tasks which current models struggle to complete successfully.

Going deeper

The Berkeley Function-Calling Leaderboard tracks a benchmark which evaluates an LLM’s ability to call functions/tools accurately.
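
To give a sense of what is being tested: a function-calling eval typically hands the model a tool description and checks whether it selects the right tool and produces well-formed arguments. The Python sketch below is illustrative only; the tool name is hypothetical and the exact schema varies by provider, so this is not the benchmark’s own format.

    # Illustrative shape of a function-calling task, not the benchmark's format.
    tool = {
        "name": "get_weather",  # hypothetical tool
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    }

    # Given the prompt "What's the weather in Berlin?", an accurate model
    # should select this tool and emit valid arguments, e.g.:
    expected_call = {"name": "get_weather", "arguments": {"city": "Berlin"}}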

The RULER experiment showed that claimed context sizes often don’t stack up in practice, with many models experiencing sharp dropoffs in performance past 32K tokens.

Local models

There are extra considerations when running models locally: first, whether you can run them on your hardware at all, and second, what kind of throughput you can expect.

Mozilla maintains a LocalScore leaderboard as well as a dataset of throughput for a variety of models.

You can also use the VRAM & Performance Calculator to get an idea of what you can run on your hardware and how fast it might be.
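
For a rough sanity check before reaching for a calculator, you can estimate memory needs from the parameter count and quantisation level. The Python sketch below uses assumed numbers; in particular, the 1.2 overhead factor for the KV cache and runtime buffers is a guess, not a measured value.

    # Rough VRAM estimate for running a model locally: weights take roughly
    # (parameter count) x (bytes per parameter), plus overhead for the KV
    # cache and runtime buffers. The overhead factor is an assumption.
    def estimate_vram_gb(params_billion: float, bits_per_param: float,
                         overhead_factor: float = 1.2) -> float:
        bytes_per_param = bits_per_param / 8
        weights_gb = params_billion * bytes_per_param  # 1B params ~ 1 GB at 8-bit
        return weights_gb * overhead_factor

    # Example: a 7B model at 4-bit quantisation -> roughly 4 GB.
    print(f"{estimate_vram_gb(7, 4):.1f} GB")  # 4.2 GB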

Some locally-runnable models have been fine-tuned for code generation. Big Code Models Leaderboard is based on evaluating such models on HumanEval, a Python benchmark, as well as MultiPL-E, which translates HumanEval into 18 other programming languages.

Agents and productivity

Solid data on productivity effects and output quality of agents is still hard to come by, and the picture from the available reports is decidedly mixed.

LiveSWEBench and Terminal-Bench are two agent-specific benchmarks I was able to find. PR Arena tracks the volume of agent-generated PRs on GitHub along with the proportion which get merged.

The 2025 Impact of Generative AI report by Google Cloud is based on a survey of AI adoption and suggests that increased genAI adoption is associated with (some) increase in productivity, particularly around documentation, but also with a decrease in delivery stability.

The 2025 AI Copilot Code Quality report by GitClear examines a large dataset of codebases to infer the effects of AI assistance. They observe “a spike in the prevalence of duplicate code blocks, along with increases in short-term churn code, and the continued decline of moved lines (code reuse).” Unfortunately, this points towards growing tech debt and incidental complexity.

A 2024 analysis of 800 developers by Uplevel showed a 41% increase in bug rate:

Developers with Copilot access saw a significantly higher bug rate while their issue throughput remained consistent. This suggests that Copilot may negatively impact code quality.

A 2025 study that compared productivity of experienced open-source developers with and without AI found that they needed 19% more time to complete tasks when using AI. A particularly interesting finding is that developers still believed that they got a speedup from AI despite being measurably slower. It underscores that anecdotal reports of improved productivity may have little to do with reality.

On a more positive note, ThoughtWorks has been collecting some estimates of time savings from GitHub Copilot-assisted development:

The claim that coding assistants can increase delivery speed by 50% is a wild overestimation. Our tests suggest the gains are more likely 10-15%.

Another experiment resulted in a similar increase in team velocity.

There is potential to apply genAI to adjacent tasks such as requirements analysis. In yet another ThoughtWorks experiment, the team estimated a 20% time saving despite the extra work to describe the context for the LLM.

Rolling your own evals

Kevin Schaul recommends creating your own evals and reviews some Python packages that assist with the task. The motivation is that evals don’t generalise well, so you have to evaluate LLMs and other tools on the specific tasks you are going to apply them to.
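
At its core, a custom eval is just a loop over your own task cases plus a scoring function. The Python sketch below assumes a generic model_fn callable standing in for whichever client library you actually use; the cases and the fake model passed at the end are purely illustrative.

    from typing import Callable

    # A hand-rolled eval: run your own cases through a model and score the
    # answers. `model_fn` is any prompt -> reply callable (e.g. a thin
    # wrapper around your provider's client).
    CASES = [
        {"prompt": "Return the sum of 2 and 3 as a number only.", "expect": "5"},
        {"prompt": "Name the capital of France in one word.", "expect": "Paris"},
    ]

    def run_evals(model_fn: Callable[[str], str], cases=CASES) -> float:
        passed = sum(model_fn(c["prompt"]).strip() == c["expect"] for c in cases)
        return passed / len(cases)

    # Fake "model" that always answers "5", for illustration only.
    print(f"pass rate: {run_evals(lambda prompt: '5'):.0%}")  # 50%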

For an example of using custom evals to show the difference in success rate between generating Python, TypeScript, Swift, and Rust code, check out LLM-Powered Programming: A Language Matrix Revealed.

Going deeper

LLM benchmarks, evals and tests provides a conceptual overview of benchmarks and evals.

If you’d like to dive even deeper into constructing evals, The LLM Evaluation guidebook is a good resource. Lighteval is a toolkit for LLM evaluation.
