A real build, scored by what happened in the repo.
We run the real agent CLIs (Claude Code, Codex) against real repos and measure what they do, not what they say. Detection is deterministic, never an LLM grading an LLM.
Data through 2026 Q2 · last updated July 2026
How a run works.
- 01 · ask
A real task, worded the way developers word it. Each task is run under several phrasings, so no single wording decides a number.
- 02 · build
The agent works in a real repo, fresh projects and existing codebases alike. Real installs, real code, real npm run build.
- 03 · record
We read what actually happened in the repo: what was installed, what the code uses, whether the build passed. Nothing is inferred.
- 04 · classify
A classifier written before the runs scores every outcome, so nobody grades a result after seeing it, and no LLM judges another LLM.
- 05 · gate
Every share ships with n and a 95% Wilson interval. Below n=30 a cell is directional, not reportable.
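The gate in step 05 can be sketched in a few lines. This is a minimal illustration, assuming the standard Wilson score formula and the 30-run floor stated above; the function names `wilson_interval` and `label_cell` and the output field names are hypothetical, not the project's actual code.

```python
import math

REPORTABLE_FLOOR = 30  # from the text: below n=30 a cell is directional


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial share; z=1.959964 gives ~95%."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


def label_cell(successes: int, n: int) -> dict:
    """Attach n, the interval, and a reportable/directional label to a share."""
    low, high = wilson_interval(successes, n)
    status = "reportable" if n >= REPORTABLE_FLOOR else "directional"
    return {"share": successes / n, "n": n, "ci95": (low, high), "status": status}
```

With 15 successes in 30 runs the interval is roughly 0.33 to 0.67, which shows how wide a 95% band still is right at the reportability floor.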
Two layers.
Layer 1 · Recommendation probes
The agent is asked what it would use; no code is written. High sample counts. Answers "who gets named" across many wordings of the same task.
Layer 2 · Build journeys
The agent actually builds, in fresh projects and in existing codebases, including ones where a competitor is already installed. Answers "what really happens," including the gap between what agents say and do.
Four metrics, in funnel order.
- 01 · named
Agent Reach
How often agents name you when asked to build. Tracked at every model release.
- 02 · installed
First-Install Rate
Of the times you're named, how often the agent actually installs you.
- 03 · compiled
Build Success
When installed, whether the project compiles. Measures compilation, not runtime correctness.
- 04 · kept
Refactor Retention
Whether your integration survives a “modernize this” pass or gets swapped out.
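As a rough sketch, the funnel can be computed from per-run records as conditional shares. The `Run` fields are hypothetical names chosen for illustration. The denominators for Agent Reach, First-Install Rate, and Build Success follow the definitions above; the denominator for Refactor Retention (installed runs that actually received a refactor pass) is an assumption, since the text does not spell it out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Run:
    named: bool                  # agent named the tool
    installed: bool              # agent installed the tool
    compiled: bool               # project build passed
    kept: Optional[bool] = None  # None means no refactor pass was run (assumption)


def funnel(runs: list[Run]) -> dict[str, tuple[int, int]]:
    """Return (successes, n) per metric so an interval can be attached later."""
    named = [r for r in runs if r.named]
    installed = [r for r in named if r.installed]
    refactored = [r for r in installed if r.kept is not None]
    return {
        "agent_reach": (len(named), len(runs)),
        "first_install_rate": (len(installed), len(named)),
        "build_success": (sum(r.compiled for r in installed), len(installed)),
        "refactor_retention": (sum(bool(r.kept) for r in refactored), len(refactored)),
    }
```

Returning counts rather than percentages keeps each metric's own n visible, which matters because every stage of the funnel has a smaller denominator than the one before it.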
Honest by construction.
Every share carries its interval
Each figure ships with its sample size and 95% Wilson confidence interval. Nothing is reported as a bare percentage.
A 30-run reportability floor
Below n=30 a cell is marked directional, not reportable. We don't headline thin data.
Validated detection
Blind raters put vendor detection at 95% precision / 100% recall; the recommendation parser at 97% recall. We test the tests.
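For readers unfamiliar with the two validation figures, here is a minimal sketch of how precision and recall are computed when a detector's output is compared with rater labels. The function name and the toy labels in the usage note are illustrative only and are not the validation data.

```python
def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compare detector output (predicted) with rater labels (actual)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)        # detected and truly present
    fp = sum(1 for p, a in pairs if p and not a)    # detected but not present
    fn = sum(1 for p, a in pairs if a and not p)    # present but missed
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall
```

For example, a detector that flags four runs, three of them correctly, and misses none of the true cases has precision 0.75 and recall 1.0: `precision_recall([True, True, True, True, False], [True, True, True, False, False])`.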
What we don't claim.
Compiled means compiled
Build Success measures whether the project compiles, not whether it behaves correctly in production.
Agents, not humans
We measure what coding agents do, not what human developers prefer. Those are different populations; ours is the one growing.
Directional until re-run
Current aggregate figures are always labeled directional. Citable figures come from dedicated re-runs under the published protocol.
Prompts are versioned and sampled with a fixed seed, and every number carries a full manifest of exactly how and when it was produced, down to the model and agent version.
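One way to picture fixed-seed sampling plus a manifest is the sketch below. It is an assumption-laden illustration: the field names, the SHA-256 fingerprint, and the function names are all hypothetical, and the real manifest format is not described in the text.

```python
import hashlib
import json
import random


def sample_prompts(prompts: list[str], k: int, seed: int) -> list[str]:
    """Draw k prompts reproducibly; sorting first removes input-order effects."""
    rng = random.Random(seed)
    return rng.sample(sorted(prompts), k)


def build_manifest(prompt_set_version: str, seed: int, sampled: list[str],
                   model: str, agent_version: str, run_date: str) -> dict:
    """Record how a number was produced, plus a hash to detect later edits."""
    body = {
        "prompt_set_version": prompt_set_version,
        "seed": seed,
        "prompts": sampled,
        "model": model,
        "agent_version": agent_version,
        "run_date": run_date,
    }
    digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()
    return {**body, "manifest_sha256": digest}
```

Because the generator is seeded and the input is sorted, calling `sample_prompts` twice with the same arguments returns the same prompts in the same order, so anyone holding the manifest can reproduce the draw.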
Terms.
- top-pick share
Of all runs in a category, the share where a tool was the agent's first choice. Different from being named: an agent can mention a tool constantly and never pick it.
- switch graph
When an agent considers a tool but installs another, the switch graph records who won instead and how often. Computed per vendor; lives in the private scorecard.
- reportable
A figure backed by at least 30 runs, published with its sample size and a 95% Wilson confidence interval.
- directional
A labeled figure from controlled runs that has not met the reportable bar. Useful for direction; never presented as a published benchmark number.
- AgentRank Score
A 0-100 composite of the four metrics, computed over reportable components only and shown with its confidence band.
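The first two terms can be made concrete with a short sketch. The record shape (`top_pick`, `considered`, `installed`) and the tool names used with it are hypothetical; the real scorecard schema is not public in this text.

```python
from collections import Counter, defaultdict


def top_pick_share(runs: list[dict], tool: str) -> tuple[int, int]:
    """(runs where tool was the first choice, all runs in the category)."""
    hits = sum(1 for r in runs if r["top_pick"] == tool)
    return hits, len(runs)


def switch_graph(runs: list[dict]) -> dict[str, Counter]:
    """For each considered-but-not-installed tool, count which tool won instead."""
    graph: dict[str, Counter] = defaultdict(Counter)
    for r in runs:
        winner = r["installed"]
        if winner is None:  # nothing was installed, so there is no switch to record
            continue
        for candidate in r["considered"]:
            if candidate != winner:
                graph[candidate][winner] += 1
    return dict(graph)
```

Reading `switch_graph(runs)["alpha"]` then answers "when alpha was considered but lost, who was installed instead, and how many times".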
Common questions.
How does AgentRank benchmark AI coding agents?
We run the real agent CLIs (Claude Code, Codex) against real repos: real installs, real code, real builds. Detection reads what actually happened in the repo, and a classifier written before the runs scores every outcome. Every figure ships with its sample size and a 95% Wilson interval.
Does an LLM judge the results?
No. Outcomes are scored by deterministic detection plus a pre-registered classifier, never an LLM grading an LLM. Blind raters put vendor detection at 95% precision and 100% recall, and the recommendation parser at 97% recall.
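To illustrate what deterministic detection can look like, the sketch below reads a project's package.json and maps declared dependencies to vendors through a fixed lookup table. It is only an assumed approach: the `VENDOR_PACKAGES` table, the package names in it, and the function name are invented for the example, and real detection would also need to check what the code imports.

```python
import json

# Hypothetical lookup table: npm package name -> vendor label.
VENDOR_PACKAGES = {
    "example-auth-sdk": "ExampleAuth",
    "@samplepay/node": "SamplePay",
}


def detect_installed_vendors(package_json_text: str) -> set[str]:
    """Return vendors whose packages are declared in package.json."""
    manifest = json.loads(package_json_text)
    declared: dict[str, str] = {}
    for section in ("dependencies", "devDependencies"):
        declared.update(manifest.get(section) or {})
    return {VENDOR_PACKAGES[name] for name in declared if name in VENDOR_PACKAGES}
```

The same input always yields the same output, and the rule table can be frozen before any run starts, which is the property the answer above relies on.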
When is a number reportable?
A cell needs at least 30 runs. Below that it is labeled directional and never used as a headline. Reportable figures always carry n and a 95% Wilson confidence interval.
Which agents and models does AgentRank measure?
Claude Code and Codex today, with every metric split by model, because results differ sharply between models (in Q2 2026 the same tasks compiled 87% of the time on Claude Opus and 65% on Sonnet). We re-run at every model release.
Related reading: Wilson score interval · On the impact of AGENTS.md files (arXiv, 2026) · Mintlify: agents are half of docs traffic
The method is public so the numbers can be checked. See what they say.