Coding Agents

POWER IN AI AGENTS, MEASURED FOR PEAK OUTPUT ON REAL SOFTWARE WORK

THE PERFECT SETUP: HARNESS AND MODEL PAIRED TO UNLOCK YOUR DRIVE

CRAFTED FROM LIVE BENCHMARK DATA TO DELIVER RECOMMENDATIONS FOR OUR COMMUNITY

Real software-engineering tasks, run through complete agent stacks — harness plus model — and scored on what actually ships: solve rate across three task suites (higher is better), dollars per task, tokens consumed, and minutes on the clock (all lower is better).

22 Configurations · 5 Harnesses · 18 Models · Snapshot: June 16, 2026

Coding Agent Index

Higher is better

Equal-weight mean of the three suites · top configurations

  • 1. Claude Code · Fable 5 (max) (with fallback): 77.2
  • 2. Codex · GPT-5.5 (xhigh): 76.4
  • 3. Claude Code · Opus 4.8 (max): 72.6
  • 4. Codex · GPT-5.4 (medium): 71.1
  • 5. Claude Code · Opus 4.6 (medium): 71.1
  • 6. Codex · GPT-5.5 (medium): 70.5
  • 7. Cursor CLI · GPT-5.4 (medium): 68.8
  • 8. Claude Code · Opus 4.8 (medium): 67.0
  • 9. Cursor CLI · Composer 2: 66.6

Cost per Task

Lower is better

Mean USD per completed task · cheapest configurations

  • 1. Cursor CLI · Composer 2: $0.04
  • 2. Cursor CLI · Composer 2.5: $0.08
  • 3. Claude Code · DeepSeek V4 Pro (high): $0.27
  • 4. Cursor CLI · Composer 2.5 Fast: $0.55
  • 5. Claude Code · Kimi K2.6: $1.18
  • 6. Claude Code · Opus 4.6 (medium): $1.26
  • 7. Gemini CLI · Gemini 3.1 Pro (high): $1.44
  • 8. Cursor CLI · GPT-5.4 (medium): $1.52
  • 9. Claude Code · Opus 4.7 (medium): $1.68

Time per Task

Lower is better

Mean wall-clock minutes per task · fastest configurations

  • 1. Claude Code · Opus 4.7 (medium): 6.3m
  • 2. Codex · GPT-5.5 (medium): 6.4m
  • 3. Cursor CLI · GPT-5.5 (medium): 6.6m
  • 4. Cursor CLI · Composer 2.5 Fast: 6.8m
  • 5. Codex · GPT-5.4 (medium): 7.1m
  • 6. Gemini CLI · Gemini 3.1 Pro (high): 7.2m
  • 7. Claude Code · Opus 4.6 (medium): 8m
  • 8. Cursor CLI · GPT-5.4 (medium): 8.3m
  • 9. Cursor CLI · Composer 2: 8.6m

Coding agent comparison summary

Every harness × model configuration we track, ranked by Coding Agent Index.

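| # | Agent | Model | Lab | Index | DeepSWE | Term-Bench | Atlas QnA | $/Task | Tokens/Task | Time | Turns |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Code (Anthropic) | Claude Fable 5 (max) (with fallback) | Anthropic | 77.2 | 66.1 | 82.1 | 83.3 | $11.75 | 13.9M | 23.5m | 138 |
| 2 | Codex (OpenAI) | GPT-5.5 (xhigh) | OpenAI | 76.4 | 64.3 | 84.1 | 80.8 | $5.07 | 12.3M | 10.1m | 106 |
| 3 | Claude Code (Anthropic) | Claude Opus 4.8 (max) | Anthropic | 72.6 | 55.8 | 79.4 | 82.5 | $7.70 | 17.8M | 23.1m | 166 |
| 4 | Codex (OpenAI) | GPT-5.4 (medium) | OpenAI | 71.1 | - | 69.8 | 72.4 | $2.27 | 5.7M | 7.1m | 71 |
| 5 | Claude Code (Anthropic) | Claude Opus 4.6 (medium) | Anthropic | 71.1 | - | 70.2 | 71.9 | $1.26 | 4.4M | 8m | 34 |
| 6 | Codex (OpenAI) | GPT-5.5 (medium) | OpenAI | 70.5 | 56.6 | 75.8 | 79.1 | $2.75 | 7.0M | 6.4m | 78 |
| 7 | Cursor CLI (Cursor) | GPT-5.4 (medium) | OpenAI | 68.8 | - | 64.7 | 72.9 | $1.52 | 3.8M | 8.3m | 19 |
| 8 | Claude Code (Anthropic) | Claude Opus 4.8 (medium) | Anthropic | 67.0 | 49.3 | 75.0 | 76.8 | $3.26 | 7.7M | 12.4m | 93 |
| 9 | Cursor CLI (Cursor) | Composer 2 | Cursor | 66.6 | - | 64.3 | 68.9 | $0.04 | 2.9M | 8.6m | 26 |
| 10 | Claude Code (Anthropic) | Claude Opus 4.7 (max) | Anthropic | 65.0 | 40.1 | 73.8 | 81.0 | $5.64 | 15.8M | 15.8m | 107 |
| 11 | Opencode (Opencode) | Claude Opus 4.7 (medium) | Anthropic | 64.5 | 39.5 | 75.0 | 79.0 | $2.93 | 7.5M | 12.2m | 54 |
| 12 | Cursor CLI (Cursor) | GPT-5.5 (medium) | OpenAI | 61.9 | 37.2 | 73.4 | 75.0 | $2.01 | 4.0M | 6.6m | 78 |
| 13 | Cursor CLI (Cursor) | Claude Opus 4.7 (medium) | Anthropic | 60.2 | 31.6 | 70.6 | 78.4 | $2.68 | 5.6M | 13.6m | 86 |
| 14 | Gemini CLI (Google) | Gemini 3.1 Pro (high) | Google | 57.0 | - | 68.3 | 45.6 | $1.44 | 2.9M | 7.2m | 39 |
| 15 | Claude Code (Anthropic) | Claude Opus 4.7 (medium) | Anthropic | 56.8 | 27.4 | 71.4 | 71.7 | $1.68 | 4.5M | 6.3m | 42 |
| 16 | Claude Code (Anthropic) | Claude Sonnet 4.6 (medium) | Anthropic | 54.1 | 28.9 | 63.1 | 70.3 | $1.97 | 8.4M | 13.7m | 67 |
| 17 | Claude Code (Anthropic) | GLM-5.1 | Z AI | 52.3 | 18.6 | 65.1 | 73.2 | $4.33 | 25.9M | 19.6m | 174 |
| 18 | Claude Code (Anthropic) | Qwen3.7 Plus (thinking) | Alibaba | 51.9 | 19.2 | 64.7 | 71.8 | $6.23 | 8.7M | 10.6m | 146 |
| 19 | Cursor CLI (Cursor) | Composer 2.5 Fast | Cursor | 51.8 | 15.9 | 66.9 | 72.5 | $0.55 | 4.3M | 6.8m | 117 |
| 20 | Cursor CLI (Cursor) | Composer 2.5 | Cursor | 51.8 | 15.9 | 66.9 | 72.5 | $0.08 | 3.6M | 9.7m | 117 |
| 21 | Claude Code (Anthropic) | DeepSeek V4 Pro (high) | DeepSeek | 47.0 | 8.6 | 64.7 | 67.8 | $0.27 | 9.7M | 17.9m | 127 |
| 22 | Claude Code (Anthropic) | Kimi K2.6 | Kimi | 46.9 | 16.5 | 64.3 | 59.8 | $1.18 | 11.4M | 41.2m | 130 |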

Coding Agent Index

Equal-weight mean of three real-world suites: hard repository tasks, agentic terminal work, and codebase Q&A. Higher is better.

[Bar chart: Coding Agent Index per configuration, 22 configurations shown, values from 77.2 down to 46.9.]

Composed of DeepSWE (113 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks). The agent is the unit of measurement — the same model lands differently in different harnesses.
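
As a minimal sketch of the index arithmetic, the snippet below averages suite scores with equal weight. The function name `coding_agent_index` is hypothetical. Leaving a missing suite score out of the mean is an assumption: it reproduces the index of configurations that report only two suite scores, but the rule itself is not stated here.

```python
from statistics import mean
from typing import Optional


def coding_agent_index(*suite_scores: Optional[float]) -> float:
    """Equal-weight mean of the available suite scores, rounded to one decimal.

    Assumption: a suite with no score (None) is left out of the mean.
    """
    available = [s for s in suite_scores if s is not None]
    if not available:
        raise ValueError("at least one suite score is required")
    return round(mean(available), 1)


# Claude Code · Fable 5 (max): DeepSWE 66.1, Terminal-Bench 82.1, Atlas QnA 83.3
print(coding_agent_index(66.1, 82.1, 83.3))  # 77.2
# Codex · GPT-5.4 (medium): Terminal-Bench 69.8, Atlas QnA 72.4, no DeepSWE score
print(coding_agent_index(None, 69.8, 72.4))  # 71.1
```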

DeepSWE

Solve rate on 113 real-world software engineering tasks in real repositories, %.

[Bar chart: DeepSWE solve rate (%) per configuration, 17 configurations shown, values from 66 down to 9.]

Terminal-Bench v2

Solve rate on 84 agentic terminal tasks in a live shell, %.

[Bar chart: Terminal-Bench v2 solve rate (%) per configuration, 22 configurations shown, values from 84 down to 63.]

SWE-Atlas-QnA

Rubric score on 124 codebase Q&A tasks, %.

[Bar chart: SWE-Atlas-QnA rubric score (%) per configuration, 22 configurations shown, values from 83 down to 46.]

The Harness Effect — Claude Opus 4.7 (medium)

The same model, three harnesses. The scaffold alone moves the Coding Agent Index.

  • Opencode: 64.5
  • Cursor CLI: 60.2
  • Claude Code: 56.8

Identical model weights and settings; only the harness changes. Prompting, context management, and tooling are worth real points.

Harness Spread

Points of Index

Index points between a model's best and worst harness, for every model run in 2+ harnesses

  • 1. GPT-5.5 (medium) (2 harnesses): 8.6
  • 2. Claude Opus 4.7 (medium) (3 harnesses): 7.7
  • 3. GPT-5.4 (medium) (2 harnesses): 2.3
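
The spread figures can be reproduced from the index values in the comparison table. The sketch below is illustrative only: the `harness_spread` helper and the `RUNS` layout are hypothetical names, and the index values are copied from the table.

```python
from collections import defaultdict

# (harness, model, index) copied from the comparison table
RUNS = [
    ("Codex", "GPT-5.5 (medium)", 70.5),
    ("Cursor CLI", "GPT-5.5 (medium)", 61.9),
    ("Opencode", "Claude Opus 4.7 (medium)", 64.5),
    ("Cursor CLI", "Claude Opus 4.7 (medium)", 60.2),
    ("Claude Code", "Claude Opus 4.7 (medium)", 56.8),
    ("Codex", "GPT-5.4 (medium)", 71.1),
    ("Cursor CLI", "GPT-5.4 (medium)", 68.8),
    ("Codex", "GPT-5.5 (xhigh)", 76.4),  # single harness, so it is filtered out
]


def harness_spread(runs):
    """Best-minus-worst index per model, for models run in two or more harnesses."""
    by_model = defaultdict(list)
    for _harness, model, index in runs:
        by_model[model].append(index)
    spreads = {
        model: round(max(scores) - min(scores), 1)
        for model, scores in by_model.items()
        if len(scores) >= 2
    }
    # Largest spread first
    return sorted(spreads.items(), key=lambda item: item[1], reverse=True)


for model, spread in harness_spread(RUNS):
    print(f"{model}: {spread}")
```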

Cost per Task

Mean USD to complete one task, across all three suites. Lower is better.

[Bar chart: mean cost per task (USD) per configuration, 22 configurations shown, values from $11.75 down to $0.04.]

Measured mean spend per task at list API prices, cache discounts included. The spread is the story: the priciest configuration costs roughly 290× the cheapest.

Coding Agent Index vs. Cost per Task

Capability against mean USD per task (log scale).

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against cost per task (x-axis, USD, log scale from $0.05 to $10.00), one point per configuration. The upper-left quadrant is marked as the most attractive.]

Up and to the left wins: more solved tasks per dollar. Open-weight models power most of the value corner.
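
One way to make "up and to the left" concrete is a Pareto frontier: the configurations for which no cheaper option scores at least as high. The sketch below is an illustration, not the method behind the chart. The `pareto_frontier` helper is a hypothetical name, and the cost and index values are copied from the comparison table.

```python
# (configuration, mean USD per task, Coding Agent Index) from the comparison table
CONFIGS = [
    ("Claude Code · Fable 5 (max)", 11.75, 77.2),
    ("Codex · GPT-5.5 (xhigh)", 5.07, 76.4),
    ("Claude Code · Opus 4.8 (max)", 7.70, 72.6),
    ("Codex · GPT-5.4 (medium)", 2.27, 71.1),
    ("Claude Code · Opus 4.6 (medium)", 1.26, 71.1),
    ("Codex · GPT-5.5 (medium)", 2.75, 70.5),
    ("Cursor CLI · GPT-5.4 (medium)", 1.52, 68.8),
    ("Claude Code · Opus 4.8 (medium)", 3.26, 67.0),
    ("Cursor CLI · Composer 2", 0.04, 66.6),
    ("Claude Code · Opus 4.7 (max)", 5.64, 65.0),
    ("Opencode · Opus 4.7 (medium)", 2.93, 64.5),
    ("Cursor CLI · GPT-5.5 (medium)", 2.01, 61.9),
    ("Cursor CLI · Opus 4.7 (medium)", 2.68, 60.2),
    ("Gemini CLI · Gemini 3.1 Pro (high)", 1.44, 57.0),
    ("Claude Code · Opus 4.7 (medium)", 1.68, 56.8),
    ("Claude Code · Sonnet 4.6 (medium)", 1.97, 54.1),
    ("Claude Code · GLM-5.1", 4.33, 52.3),
    ("Claude Code · Qwen3.7 Plus (thinking)", 6.23, 51.9),
    ("Cursor CLI · Composer 2.5 Fast", 0.55, 51.8),
    ("Cursor CLI · Composer 2.5", 0.08, 51.8),
    ("Claude Code · DeepSeek V4 Pro (high)", 0.27, 47.0),
    ("Claude Code · Kimi K2.6", 1.18, 46.9),
]


def pareto_frontier(configs):
    """Return configurations that no cheaper-or-equal option matches or beats on index."""
    frontier = []
    best_index = float("-inf")
    # Walk from cheapest to priciest; on a cost tie, consider the higher index first.
    for name, cost, index in sorted(configs, key=lambda c: (c[1], -c[2])):
        if index > best_index:
            frontier.append(name)
            best_index = index
    return frontier


for name in pareto_frontier(CONFIGS):
    print(name)
```

On these figures the frontier has four entries: Cursor CLI · Composer 2, Claude Code · Opus 4.6 (medium), Codex · GPT-5.5 (xhigh), and Claude Code · Fable 5 (max).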

Token Usage per Task

Mean tokens consumed per task, split into fresh input, cache reads, and output.

[Stacked bar chart: mean tokens per task per configuration, split into fresh input, cache reads, and output. Totals run from 25.9M down to 2.9M.]

Agents read far more than they write — cache reads are the overwhelming majority of tokens everywhere. Output is the thin dark sliver on top.

Cache Hit Rate

Share of context reads served from prompt cache, %. Higher is better for cost.

[Bar chart: cache hit rate (%) per configuration, 22 configurations shown, values from 50% down to 46%.]

Harnesses that keep context stable cache better. Every point of hit rate is money: cached reads bill at a tenth of the fresh-input price.
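
To put a number on "every point of hit rate is money", here is a small sketch under the stated pricing rule (cached reads at one tenth of the fresh-input price). The function name `effective_input_multiplier` is hypothetical. It returns the average price of one context token as a fraction of the fresh-input price.

```python
def effective_input_multiplier(hit_rate: float, cache_discount: float = 0.10) -> float:
    """Average price of one context token, relative to the fresh-input price.

    hit_rate is the share of context reads served from cache (0.0 to 1.0);
    cache_discount is the cached-read price as a fraction of the fresh price.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return (1.0 - hit_rate) + hit_rate * cache_discount


for rate in (0.46, 0.50):
    multiplier = effective_input_multiplier(rate)
    print(f"{rate:.0%} hit rate -> {multiplier:.3f}x the fresh-input price")
```

Under this rule each additional percentage point of hit rate lowers the average context-token price by 0.9% of the fresh-input price.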

Coding Agent Index vs. Total Tokens

Capability against mean total tokens per task (millions).

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against total tokens per task (x-axis, 0M to 25M), one point per configuration. The upper-left region is marked as more capability per token.]

Reading more of the repo correlates with solving more of it — but the best harnesses get more index per token read.

Execution Time per Task

Mean wall-clock minutes from task start to the agent declaring done. Lower is better.

[Bar chart: mean execution time per task (minutes) per configuration, 22 configurations shown, values from 6.3m up to 41.2m.]

Includes model latency, tool calls, builds, and test runs. The fastest configurations finish in about 6 to 7 minutes; the slowest averages 41.2 minutes, more than six times as long.

Coding Agent Index vs. Execution Time

Capability against mean wall-clock minutes per task.

[Scatter plot: Coding Agent Index (y-axis, 0 to 80) against execution time per task (x-axis, 0m to 40m), one point per configuration. The upper-left region is marked as capable and quick.]

Up and to the left wins: capable and quick. Slow is only worth it if the index follows.

Turns per Task

Mean assistant turns (tool-call rounds) per task.

[Bar chart: mean turns per task per configuration, 22 configurations shown, values from 174 down to 19.]

More turns means more, smaller steps — not necessarily better results. Turn count tracks harness style more than capability.

Run Specifications

Every configuration runs the same way, so the numbers compare cleanly.

  • Environment: Fresh container per task, repo pinned to a fixed commit, network limited to package mirrors.
  • Attempts: One attempt per task (pass@1), no retries, no human nudges.
  • Configuration: Each harness runs at default settings with its recommended model configuration.
  • Budget: Hard cap of 60 minutes wall-clock per task; runs that exceed it score zero.
  • Cost accounting: List API prices at snapshot date; cache reads billed at 10% of the input rate.
  • Reporting: Cost, tokens, time, and turns are means across all completed tasks in the three suites.
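
The cost-accounting rule can be sketched as follows. This is an illustration under assumptions, not the actual pipeline: the `TokenMix` type, the `cost_per_task` helper, and every price and token count in the example are made up for demonstration. Only the 10% cache-read rate comes from the specification above.

```python
from dataclasses import dataclass


@dataclass
class TokenMix:
    """Mean tokens per task, split the same way as the token usage chart."""
    fresh_input: int
    cache_reads: int
    output: int


def cost_per_task(
    mix: TokenMix,
    input_price_per_m: float,
    output_price_per_m: float,
    cache_read_fraction: float = 0.10,  # cache reads billed at 10% of the input rate
) -> float:
    """Mean USD per task given list prices in USD per million tokens."""
    fresh_cost = mix.fresh_input * input_price_per_m
    cache_cost = mix.cache_reads * input_price_per_m * cache_read_fraction
    output_cost = mix.output * output_price_per_m
    return (fresh_cost + cache_cost + output_cost) / 1_000_000


# Hypothetical token mix and hypothetical list prices, for illustration only
example = TokenMix(fresh_input=400_000, cache_reads=4_000_000, output=50_000)
print(round(cost_per_task(example, input_price_per_m=5.0, output_price_per_m=25.0), 2))  # 5.25
```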

Frequently Asked Questions

What is the Coding Agent Index?

The equal-weight mean of a configuration's scores on the three task suites. One number for how much real software work gets done — no extra weighting tricks, no style points.

What do the three suites actually test?

DeepSWE is 113 real-world software-engineering tasks in real repositories, graded end to end. Terminal-Bench v2 is 84 multi-step jobs in a live shell — builds, migrations, debugging, ops. SWE-Atlas-QnA is 124 questions that require navigating a large codebase and answering precisely.

How are tasks scored?

Implementation and terminal tasks are pass@1: one attempt, and the test suite either passes or it doesn't. Codebase Q&A earns partial credit against a rubric. Nothing is cherry-picked or re-run.
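
A minimal sketch of the two scoring styles, under stated assumptions. The `Attempt` type and both function names are hypothetical. The 60-minute cap comes from the run specifications. Scoring a rubric as the mean fraction of criteria met per question is an assumption, since the rubric format is not described here.

```python
from dataclasses import dataclass
from typing import List

TIME_CAP_MINUTES = 60.0  # hard wall-clock cap from the run specifications


@dataclass
class Attempt:
    tests_passed: bool
    minutes: float


def solve_rate(attempts: List[Attempt]) -> float:
    """pass@1 solve rate in percent; runs over the time cap score zero."""
    if not attempts:
        raise ValueError("no attempts to score")
    solved = sum(
        1 for a in attempts if a.tests_passed and a.minutes <= TIME_CAP_MINUTES
    )
    return 100.0 * solved / len(attempts)


def rubric_score(criteria_met: List[List[bool]]) -> float:
    """Partial credit in percent: mean fraction of rubric criteria met per question.

    Assumption: each question has at least one criterion, all weighted equally.
    """
    if not criteria_met:
        raise ValueError("no questions to score")
    per_question = [sum(q) / len(q) for q in criteria_met]
    return 100.0 * sum(per_question) / len(per_question)


attempts = [
    Attempt(True, 12.0),
    Attempt(False, 8.0),
    Attempt(True, 61.5),  # passed its tests but exceeded the cap, so it scores zero
    Attempt(True, 30.0),
]
print(solve_rate(attempts))  # 50.0
print(rubric_score([[True, True, False, True], [True, False]]))  # 62.5
```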

What counts as execution time?

Wall-clock from handing the agent a task to the agent declaring done — model latency, tool calls, builds, and test runs included. It's the number you actually wait.

Why track tokens at all?

Because agents read far more than they write. Cache reads dominate the bill at a 10×-discounted rate, so two agents with the same index can differ several-fold in cost. The token mix is the why behind the cost chart.

Why does the same model score differently across harnesses?

The harness decides what the model sees and which tools it gets — system prompts, context management, edit formats, test loops. Same engine, different car.

Methodology: every configuration runs the same three suites — DeepSWE (113 tasks), Terminal-Bench v2 (84 tasks), and SWE-Atlas-QnA (124 tasks) — and the Coding Agent Index is their equal-weight mean. Suite design and metric definitions follow the public coding-agents methodology of Artificial Analysis (artificialanalysis.ai/agents/coding-agents). Figures are Glsrm editorial estimates calibrated to our model benchmark table, not Artificial Analysis's published results. Cost per task is derived from each run's mean token mix at list API prices, with cache reads billed at 10% of the input rate. Prices change frequently. Logos identify the respective model creators.