AI Benchmarks

Independent-style analysis of frontier AI models and the APIs that serve them, across the three numbers that decide every deployment: intelligence (higher is better), output speed in tokens per second (higher is better), and blended price per million tokens (lower is better).

31 frontier models · 16 labs tracked · 9 evals in the index · Snapshot: June 16, 2026

Intelligence

Higher is better

Intelligence Index · best model per lab

  • 1. Claude Fable 5 (with fallback): 60
  • 2. GPT-5.5 (xhigh): 55
  • 3. Gemini 3.5 Flash: 50
  • 4. Qwen3.7 Max: 46
  • 5. MiniMax-M3: 44
  • 6. DeepSeek V4 Pro (Max): 44
  • 7. Muse Spark: 43
  • 8. Kimi K2.6: 43
  • 9. GLM-5.1: 40

Speed

Higher is better

Median output tokens per second · best per lab

  • 1. Grok 4.3 (high): 170
  • 2. Gemini 3.5 Flash: 159
  • 3. Nemotron 3 Ultra: 155
  • 4. Qwen3.7 Max: 129
  • 5. Nova 2.0 Pro Preview (medium): 117
  • 6. Mistral Medium 3.5: 95
  • 7. GLM-5.1: 76
  • 8. DeepSeek V4 Pro (Max): 67
  • 9. GPT-5.5 (xhigh): 62

Price

Lower is better

USD per 1M tokens, 3:1 blended · cheapest per lab

  • 1. MiniMax-M3: $0.53
  • 2. DeepSeek V4 Pro (Max): $0.54
  • 3. Nemotron 3 Ultra: $1.18
  • 4. MiMo-V2.5-Pro: $1.35
  • 5. Grok 4.3 (high): $1.56
  • 6. Kimi K2.6: $1.71
  • 7. GLM-5.1: $2.15
  • 8. Mistral Medium 3.5: $3.00
  • 9. Gemini 3.5 Flash: $3.38

Model comparison summary

Every model we track, ranked by Intelligence Index.

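| # | Model | Creator | Released | Context | Intelligence | Coding | Agentic | Speed (t/s) | Blended $/1M | Latency |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Claude Fable 5 (with fallback) | Anthropic | Jun 2026 | 1M | 60 | 62 | 81 | - | $20.00 | - |
| 2 | Claude Opus 4.8 (max) | Anthropic | May 2026 | 1M | 56 | 57 | 78 | 63 | $10.00 | 34s |
| 3 | GPT-5.5 (xhigh) | OpenAI | Apr 2026 | 922k | 55 | 59 | 74 | 62 | $11.25 | 117s |
| 4 | Claude Opus 4.7 (max) | Anthropic | Apr 2026 | 1M | 54 | 53 | 71 | 49 | $10.00 | 18s |
| 5 | Gemini 3.5 Flash | Google | May 2026 | 1M | 50 | 45 | 70 | 159 | $3.38 | 19s |
| 6 | Claude Sonnet 4.6 (max) | Anthropic | Feb 2026 | 1M | 47 | 51 | 63 | 47 | $6.00 | 108s |
| 7 | Gemini 3.1 Pro Preview | Google | Feb 2026 | 1M | 47 | 56 | 59 | 118 | $4.50 | 23s |
| 8 | Qwen3.7 Max | Alibaba | May 2026 | 1M | 46 | 50 | 67 | 129 | $3.75 | 21s |
| 9 | MiniMax-M3 [Open] | MiniMax | Jun 2026 | 1M | 44 | 43 | 69 | 59 | $0.53 | 37s |
| 10 | GPT-5.3 Codex (xhigh) | OpenAI | Feb 2026 | 400k | 44 | 53 | 61 | 81 | $4.81 | 95s |
| 11 | DeepSeek V4 Pro (Max) [Open] | DeepSeek | Apr 2026 | 1M | 44 | 48 | 67 | 67 | $0.54 | 67s |
| 12 | Muse Spark | Meta | Apr 2026 | 262k | 43 | 48 | 62 | - | - | - |
| 13 | Kimi K2.6 [Open] | Kimi | Apr 2026 | 256k | 43 | 47 | 66 | 43 | $1.71 | 107s |
| 14 | DeepSeek V4 Flash (Max) [Open] | DeepSeek | Apr 2026 | 1M | 40 | 39 | 61 | 93 | $0.18 | 62s |
| 15 | GLM-5.1 [Open] | Z AI | Apr 2026 | 200k | 40 | 43 | 67 | 76 | $2.15 | 51s |
| 16 | GPT-5.4 mini (xhigh) | OpenAI | Mar 2026 | 400k | 40 | 52 | 59 | 172 | $1.69 | 9.4s |
| 17 | Qwen3.7 Plus | Alibaba | Jun 2026 | 1M | 39 | 47 | 65 | 53 | $0.59 | 40s |
| 18 | MiniMax-M2.7 [Open] | MiniMax | Mar 2026 | 205k | 38 | 42 | 62 | 47 | $0.53 | 55s |
| 19 | Nemotron 3 Ultra [Open] | NVIDIA | Jun 2026 | 262k | 38 | 38 | 57 | 155 | $1.18 | 16s |
| 20 | Grok 4.3 (high) | xAI | Apr 2026 | 1M | 38 | 41 | 66 | 170 | $1.56 | 14s |
| 21 | Qwen3.5 397B A17B [Open] | Alibaba | Feb 2026 | 262k | 34 | 41 | 56 | 51 | $1.35 | 65s |
| 22 | Mistral Medium 3.5 [Open] | Mistral | Apr 2026 | 256k | 30 | 35 | 53 | 95 | $3.00 | 23s |
| 23 | Claude 4.5 Haiku | Anthropic | Oct 2025 | 200k | 30 | 33 | 40 | 100 | $2.00 | 15s |
| 24 | Gemma 4 31B [Open] | Google | Apr 2026 | 256k | 29 | 39 | 41 | 35 | - | 51s |
| 25 | MiMo-V2.5-Pro [Open] | Xiaomi | Apr 2026 | 1M | 28 | 37 | 51 | 49 | $1.35 | 2.9s |
| 26 | gpt-oss-120b (high) [Open] | OpenAI | Aug 2025 | 131k | 24 | 29 | 38 | 344 | $0.26 | 6.7s |
| 27 | Nova 2.0 Pro Preview (medium) | Amazon | Nov 2025 | 256k | 22 | 30 | 47 | 117 | $3.44 | 29s |
| 28 | K2 Think V2 [Open] | MBZUAI Institute of Foundation Models | Dec 2025 | 262k | 17 | 16 | 15 | - | - | - |
| 29 | gpt-oss-20B (high) [Open] | OpenAI | Aug 2025 | 131k | 15 | 19 | 28 | 218 | $0.09 | 9.9s |
| 30 | Solar Pro 3 | Upstage | Apr 2026 | 128k | 14 | 13 | 35 | - | - | - |
| - | GPT-5.5 Pro (xhigh) | OpenAI | Apr 2026 | 922k | - | - | - | - | - | - |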

Intelligence Index

Composite of 10 evaluations spanning reasoning, knowledge, math, coding, and agentic tool use. Higher is better.

  • Claude Fable 5 (with fallback): 60
  • Claude Opus 4.8 (max): 56
  • GPT-5.5 (xhigh): 55
  • Claude Opus 4.7 (max): 54
  • Gemini 3.5 Flash: 50
  • Claude Sonnet 4.6 (max): 47
  • Gemini 3.1 Pro Preview: 47
  • Qwen3.7 Max: 46
  • MiniMax-M3: 44
  • GPT-5.3 Codex (xhigh): 44
  • DeepSeek V4 Pro (Max): 44
  • Muse Spark: 43
  • Kimi K2.6: 43
  • DeepSeek V4 Flash (Max): 40
  • GLM-5.1: 40
  • GPT-5.4 mini (xhigh): 40
  • Qwen3.7 Plus: 39
  • MiniMax-M2.7: 38
  • Nemotron 3 Ultra: 38
  • Grok 4.3 (high): 38
  • Qwen3.5 397B A17B: 34
  • Mistral Medium 3.5: 30
  • Claude 4.5 Haiku: 30
  • Gemma 4 31B: 29
  • MiMo-V2.5-Pro: 28
  • gpt-oss-120b (high): 24
  • Nova 2.0 Pro Preview (medium): 22
  • K2 Think V2: 17
  • gpt-oss-20B (high): 15
  • Solar Pro 3: 14

Incorporates GPQA Diamond, Humanity's Last Exam, AIME 2025, LiveCodeBench, SciCode, IFBench, Terminal-Bench Hard, τ²-Bench, and more.

Intelligence vs. Price

Intelligence Index against blended USD per 1M tokens (3:1 input:output, log scale).

[Scatter plot: Intelligence Index (y-axis, 0–60) against price in USD per 1M tokens, blended 3:1, log scale from $0.1 to $20 (x-axis). Most attractive quadrant: upper left. 26 priced models plotted.]

Up and to the left wins: more intelligence per dollar. Models without public API pricing are excluded.
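One way to make "up and to the left" concrete is a Pareto frontier: keep only the models that no cheaper model matches or beats on intelligence. The sketch below is illustrative and not the site's own method; the `Model` type and `pareto_frontier` function are hypothetical names, and the eight rows are a subset of the comparison table above.

```python
from typing import NamedTuple


class Model(NamedTuple):
    name: str
    price: float        # blended USD per 1M tokens
    intelligence: int   # Intelligence Index


def pareto_frontier(models: list[Model]) -> list[Model]:
    """Keep each model only if no cheaper-or-equal model is at least as smart."""
    # Cheapest first; at equal price, the smarter model comes first.
    ordered = sorted(models, key=lambda m: (m.price, -m.intelligence))
    frontier: list[Model] = []
    best = float("-inf")
    for m in ordered:
        # A pricier model earns its place only by being strictly smarter.
        if m.intelligence > best:
            frontier.append(m)
            best = m.intelligence
    return frontier


# Subset of the model comparison summary (blended price, Intelligence Index).
MODELS = [
    Model("Claude Fable 5 (with fallback)", 20.00, 60),
    Model("Claude Opus 4.8 (max)", 10.00, 56),
    Model("GPT-5.5 (xhigh)", 11.25, 55),
    Model("Gemini 3.5 Flash", 3.38, 50),
    Model("Qwen3.7 Max", 3.75, 46),
    Model("MiniMax-M3", 0.53, 44),
    Model("DeepSeek V4 Flash (Max)", 0.18, 40),
    Model("gpt-oss-20B (high)", 0.09, 15),
]

for m in pareto_frontier(MODELS):
    print(f"{m.name}: ${m.price:.2f}, index {m.intelligence}")
```

In this subset, GPT-5.5 (xhigh) and Qwen3.7 Max drop out because a cheaper model scores higher; the other six remain on the frontier.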

Intelligence vs. Output Speed

Intelligence Index against median output tokens per second.

[Scatter plot: Intelligence Index (y-axis, 0–60) against output speed in tokens per second (x-axis, 0–300). Most attractive quadrant: upper right. 26 models with measured speed plotted.]

Up and to the right wins: smart and fast. Speed is the median across providers serving each model.

Frontier Intelligence Over Time

Intelligence Index by release date. The dashed line tracks the running frontier.

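[Line chart: Intelligence Index (y-axis, 20–60) by release date, Aug 2025 to Jun 2026 (x-axis), with the frontier as a dashed line. Labeled models, in order: gpt-oss-120b (high), Claude 4.5 Haiku, GPT-5.3 Codex (xhigh), Claude Sonnet 4.6 (max), Claude Opus 4.7 (max), GPT-5.5 (xhigh), Claude Opus 4.8 (max), Claude Fable 5 (with fallback).]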

Claude Fable 5 set the current frontier on June 9, 2026 — seven days before this snapshot.
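A running frontier of this kind is a cumulative maximum over releases in date order. The following is a minimal sketch under that assumption, using one model per release month from the comparison table; the variable names are hypothetical, and the source gives only month-level release dates for most models.

```python
from itertools import accumulate

# (release year, release month, model, Intelligence Index), one model per month.
RELEASES = [
    (2026, 6, "Claude Fable 5 (with fallback)", 60),
    (2025, 8, "gpt-oss-120b (high)", 24),
    (2025, 10, "Claude 4.5 Haiku", 30),
    (2025, 11, "Nova 2.0 Pro Preview (medium)", 22),
    (2025, 12, "K2 Think V2", 17),
    (2026, 2, "Claude Sonnet 4.6 (max)", 47),
    (2026, 3, "GPT-5.4 mini (xhigh)", 40),
    (2026, 4, "GPT-5.5 (xhigh)", 55),
    (2026, 5, "Claude Opus 4.8 (max)", 56),
]

ordered = sorted(RELEASES)                    # chronological: year, then month
scores = [score for *_, score in ordered]
running_max = list(accumulate(scores, max))   # frontier value after each release

# A model sets the frontier when it beats every earlier release.
previous_best = [float("-inf")] + running_max[:-1]
frontier_setters = [
    name
    for (_, _, name, score), prev in zip(ordered, previous_best)
    if score > prev
]

print(running_max)
print(frontier_setters)
```

Models released below the running maximum, such as Nova 2.0 Pro Preview (medium) here, leave the frontier unchanged.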

Coding Index

Composite of coding evaluations (LiveCodeBench, SciCode, Terminal-Bench Hard). Higher is better.

  • Claude Fable 5 (with fallback): 62
  • GPT-5.5 (xhigh): 59
  • Claude Opus 4.8 (max): 57
  • Gemini 3.1 Pro Preview: 56
  • GPT-5.3 Codex (xhigh): 53
  • Claude Opus 4.7 (max): 53
  • GPT-5.4 mini (xhigh): 52
  • Claude Sonnet 4.6 (max): 51
  • Qwen3.7 Max: 50
  • DeepSeek V4 Pro (Max): 48
  • Muse Spark: 48
  • Kimi K2.6: 47
  • Qwen3.7 Plus: 47
  • Gemini 3.5 Flash: 45
  • MiniMax-M3: 43
  • GLM-5.1: 43
  • MiniMax-M2.7: 42
  • Qwen3.5 397B A17B: 41
  • Grok 4.3 (high): 41
  • DeepSeek V4 Flash (Max): 39
  • Gemma 4 31B: 39
  • Nemotron 3 Ultra: 38
  • MiMo-V2.5-Pro: 37
  • Mistral Medium 3.5: 35
  • Claude 4.5 Haiku: 33
  • Nova 2.0 Pro Preview (medium): 30
  • gpt-oss-120b (high): 29
  • gpt-oss-20B (high): 19
  • K2 Think V2: 16
  • Solar Pro 3: 13

Agentic Index

Tool calling and long-horizon agent tasks (τ²-Bench, Terminal-Bench). Higher is better.

  • Claude Fable 5 (with fallback): 81
  • Claude Opus 4.8 (max): 78
  • GPT-5.5 (xhigh): 74
  • Claude Opus 4.7 (max): 71
  • Gemini 3.5 Flash: 70
  • MiniMax-M3: 69
  • DeepSeek V4 Pro (Max): 67
  • GLM-5.1: 67
  • Qwen3.7 Max: 67
  • Kimi K2.6: 66
  • Grok 4.3 (high): 66
  • Qwen3.7 Plus: 65
  • Claude Sonnet 4.6 (max): 63
  • Muse Spark: 62
  • MiniMax-M2.7: 62
  • DeepSeek V4 Flash (Max): 61
  • GPT-5.3 Codex (xhigh): 61
  • Gemini 3.1 Pro Preview: 59
  • GPT-5.4 mini (xhigh): 59
  • Nemotron 3 Ultra: 57
  • Qwen3.5 397B A17B: 56
  • Mistral Medium 3.5: 53
  • MiMo-V2.5-Pro: 51
  • Nova 2.0 Pro Preview (medium): 47
  • Gemma 4 31B: 41
  • Claude 4.5 Haiku: 40
  • gpt-oss-120b (high): 38
  • Solar Pro 3: 35
  • gpt-oss-20B (high): 28
  • K2 Think V2: 15

Intelligence Breakdown

Individual evaluation scores (0–100) behind the Intelligence Index. Darker is better, normalized per column.

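| Model | GPQA Diamond | Humanity's Last Exam | SciCode | IFBench | Terminal-Bench Hard | τ²-Bench Telecom | AA-LCR (Long Context) | CritPt | MMMU-Pro |
|---|---|---|---|---|---|---|---|---|---|
| Claude Fable 5 (with fallback) | 92.6 | 53.3 | 60.2 | 63.5 | 62.9 | 98.5 | 70.0 | 28.6 | - |
| Claude Opus 4.8 (max) | 92.0 | 45.7 | 53.5 | 62.2 | 58.3 | 94.4 | 67.7 | 20.9 | - |
| GPT-5.5 (xhigh) | 93.5 | 44.3 | 56.1 | 75.9 | 60.6 | 93.9 | 74.3 | 27.1 | 79.9 |
| Claude Opus 4.7 (max) | 91.4 | 39.6 | 54.5 | 58.6 | 51.5 | 88.6 | 70.3 | 12.0 | 78.8 |
| Gemini 3.5 Flash | 92.2 | 41.0 | 53.1 | 76.3 | 40.9 | 95.3 | 69.3 | 13.1 | 84.3 |
| Claude Sonnet 4.6 (max) | 87.5 | 30.0 | 46.8 | 56.6 | 53.0 | 75.7 | 70.7 | 3.1 | 73.3 |
| Gemini 3.1 Pro Preview | 94.1 | 44.7 | 58.9 | 77.1 | 53.8 | 95.6 | 72.7 | 17.7 | 82.4 |
| Qwen3.7 Max | 92.3 | 38.1 | 48.8 | 80.5 | 50.8 | 94.7 | 69.0 | 13.4 | - |
| MiniMax-M3 | 92.9 | 37.1 | 45.4 | 82.9 | 42.4 | 88.9 | 74.0 | 3.7 | - |
| GPT-5.3 Codex (xhigh) | 91.5 | 39.9 | 53.2 | 75.4 | 53.0 | 86.0 | 74.0 | 16.9 | 78.5 |
| DeepSeek V4 Pro (Max) | 88.8 | 35.9 | 50.0 | 76.5 | 46.2 | 96.2 | 66.3 | 12.9 | - |
| Muse Spark | 88.4 | 39.9 | 51.5 | 75.9 | 45.5 | 91.5 | 69.7 | 11.3 | 80.5 |
| Kimi K2.6 | 91.1 | 35.9 | 53.5 | 76.0 | 43.9 | 95.9 | 69.7 | 8.0 | 79.4 |
| DeepSeek V4 Flash (Max) | 89.4 | 32.1 | 44.9 | 79.2 | 35.6 | 95.0 | 63.0 | 7.1 | - |
| GLM-5.1 | 86.8 | 28.0 | 43.8 | 76.3 | 43.2 | 97.7 | 62.3 | 4.6 | - |
| GPT-5.4 mini (xhigh) | 87.5 | 26.6 | 49.9 | 73.3 | 52.3 | 83.3 | 69.3 | 10.0 | 73.3 |
| Qwen3.7 Plus | 90.0 | 33.4 | 45.5 | 78.0 | 47.0 | 93.0 | 65.0 | 9.1 | 44.8 |
| MiniMax-M2.7 | 87.4 | 28.1 | 47.0 | 75.7 | 39.4 | 84.8 | 68.7 | 0.6 | - |
| Nemotron 3 Ultra | 86.7 | 26.6 | 39.9 | 81.4 | 36.4 | 83.3 | 67.0 | 3.1 | - |
| Grok 4.3 (high) | 90.1 | 35.0 | 47.3 | 81.3 | 37.9 | 97.7 | 64.3 | 8.0 | 78.1 |
| Qwen3.5 397B A17B | 89.3 | 27.3 | 42.0 | 78.8 | 40.9 | 95.6 | 65.7 | 1.7 | 77.3 |
| Mistral Medium 3.5 | 74.8 | 12.8 | 39.6 | 68.8 | 33.3 | 94.2 | 61.0 | 0.0 | 64.9 |
| Claude 4.5 Haiku | 67.2 | 9.7 | 43.3 | 54.3 | 27.3 | 54.7 | 70.3 | 0.0 | 58.6 |
| Gemma 4 31B | 85.7 | 22.7 | 43.4 | 75.6 | 36.4 | 59.9 | 62.0 | 1.4 | 73.4 |
| MiMo-V2.5-Pro | 76.2 | 13.3 | 39.1 | 42.7 | 35.6 | 72.5 | 35.0 | 1.1 | - |
| gpt-oss-120b (high) | 78.2 | 18.5 | 38.9 | 69.0 | 23.5 | 65.8 | 50.7 | 1.1 | - |
| Nova 2.0 Pro Preview (medium) | 78.5 | 8.9 | 42.7 | 79.0 | 24.2 | 92.7 | 54.3 | 0.0 | 64.5 |
| K2 Think V2 | 71.3 | 9.5 | 33.0 | 62.8 | 6.8 | 25.4 | 52.7 | 0.0 | - |
| gpt-oss-20B (high) | 68.8 | 9.8 | 34.4 | 65.1 | 10.6 | 60.2 | 30.7 | 1.4 | - |
| Solar Pro 3 | 72.4 | 10.1 | 24.7 | 71.2 | 7.6 | 86.3 | 27.0 | 0.0 | - |

GPT-5.5 Pro (xhigh): 30.6 (single score listed; its column is not identifiable in the source).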

AIME 2025 and LiveCodeBench are retired for newer models and excluded here; MMMU-Pro applies to multimodal-evaluated models only.
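The page does not say how the heatmap shading is normalized per column. A common choice is min-max scaling, sketched below as an assumption rather than the site's confirmed method; `min_max_normalize` is a hypothetical helper, and the sample values are four CritPt scores from the breakdown above.

```python
def min_max_normalize(column: dict[str, float]) -> dict[str, float]:
    """Scale one column's scores to [0, 1]; 1.0 would be the darkest cell."""
    lo, hi = min(column.values()), max(column.values())
    if hi == lo:
        # A constant column carries no contrast; shade every cell the same.
        return {name: 0.0 for name in column}
    return {name: (value - lo) / (hi - lo) for name, value in column.items()}


# Four CritPt scores from the Intelligence Breakdown.
CRITPT = {
    "Claude Fable 5 (with fallback)": 28.6,
    "GPT-5.5 (xhigh)": 27.1,
    "Claude Sonnet 4.6 (max)": 3.1,
    "Claude 4.5 Haiku": 0.0,
}

shade = min_max_normalize(CRITPT)
for name, value in shade.items():
    print(f"{name}: {value:.2f}")
```

Normalizing per column keeps a low-scoring benchmark such as CritPt as visually informative as a near-saturated one such as GPQA Diamond.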

Omniscience Index

Knowledge reliability from -100 to 100: correct answers score positive, hallucinated ones negative.

  • Claude Fable 5 (with fallback): 40
  • Gemini 3.1 Pro Preview: 33
  • Claude Opus 4.8 (max): 27
  • Claude Opus 4.7 (max): 26
  • Gemini 3.5 Flash: 23
  • GPT-5.5 (xhigh): 20
  • Grok 4.3 (high): 18
  • Qwen3.7 Max: 14
  • Claude Sonnet 4.6 (max): 12
  • GPT-5.3 Codex (xhigh): 10
  • Kimi K2.6: 6
  • Muse Spark: 4
  • Qwen3.7 Plus: 2
  • GLM-5.1: 2
  • MiniMax-M3: 1
  • MiniMax-M2.7: 1
  • Nemotron 3 Ultra: -1
  • Claude 4.5 Haiku: -4
  • DeepSeek V4 Pro (Max): -10
  • GPT-5.4 mini (xhigh): -19
  • DeepSeek V4 Flash (Max): -23
  • Qwen3.5 397B A17B: -30
  • K2 Think V2: -34
  • Mistral Medium 3.5: -36
  • MiMo-V2.5-Pro: -38
  • Gemma 4 31B: -45
  • Nova 2.0 Pro Preview (medium): -48
  • gpt-oss-120b (high): -50
  • Solar Pro 3: -54
  • gpt-oss-20B (high): -64

A negative score means the model hallucinates more than it knows. Declining to answer scores zero — most models would rather guess.
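The scoring rule described here can be sketched as correct answers minus hallucinated answers, as a percentage of all questions, with abstentions counting as zero. The exact formula behind the published index is not given on this page, so the function below is an assumed reading and `knowledge_score` is a hypothetical name.

```python
def knowledge_score(correct: int, incorrect: int, abstained: int) -> float:
    """Score from -100 to 100: +1 per correct, -1 per wrong, 0 per abstention."""
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one question is required")
    return 100 * (correct - incorrect) / total


# A model that always guesses and is right half the time nets zero.
print(knowledge_score(correct=50, incorrect=50, abstained=0))   # 0.0
# Declining when unsure keeps the score positive despite fewer correct answers.
print(knowledge_score(correct=40, incorrect=10, abstained=50))  # 30.0
# More wrong than right gives a negative score.
print(knowledge_score(correct=30, incorrect=70, abstained=0))   # -40.0
```

Under this reading, abstaining on a question the model would otherwise get wrong raises the score, which is why guessing is penalized relative to declining.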

GDPval Elo — Real-World Work

Elo from blind pairwise comparisons on real economically valuable work tasks, with web and shell access.

  • Claude Fable 5 (with fallback): 1815
  • Claude Opus 4.8 (max): 1638
  • Claude Opus 4.7 (max): 1542
  • GPT-5.5 (xhigh): 1531
  • MiniMax-M3: 1431
  • Claude Sonnet 4.6 (max): 1417
  • Gemini 3.5 Flash: 1370
  • DeepSeek V4 Pro (Max): 1332
  • Qwen3.7 Max: 1308
  • GLM-5.1: 1281
  • DeepSeek V4 Flash (Max): 1203
  • Kimi K2.6: 1202
  • GPT-5.4 mini (xhigh): 1190
  • Nemotron 3 Ultra: 1183
  • MiniMax-M2.7: 1178
  • Muse Spark: 1164
  • Grok 4.3 (high): 1100
  • Gemini 3.1 Pro Preview: 976
  • Qwen3.5 397B A17B: 961
  • Qwen3.7 Plus: 946
  • Mistral Medium 3.5: 926
  • Claude 4.5 Haiku: 902
  • Gemma 4 31B: 783
  • gpt-oss-120b (high): 775
  • Nova 2.0 Pro Preview (medium): 634
  • gpt-oss-20B (high): 523
  • Solar Pro 3: 463

Higher is better. Judged across occupations from software engineering to financial analysis.
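An Elo gap can be read as an expected win rate in a pairwise comparison. The sketch below uses the standard Elo logistic curve with a 400-point scale; whether this leaderboard fits its ratings on exactly that scale is an assumption, and `expected_win_rate` is a hypothetical helper.

```python
def expected_win_rate(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Standard Elo expectation that A is preferred over B in one comparison."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


# Top two entries from the list above: 1815 vs. 1638, a 177-point gap.
p = expected_win_rate(1815, 1638)
print(f"Expected preference for the higher-rated model: {p:.0%}")
```

On that assumption, the 177-point lead at the top corresponds to the higher-rated model being preferred in roughly three of four blind comparisons.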

ITBench — SRE Incident Analysis

Average precision at full recall diagnosing live Kubernetes incidents. Higher is better.

  • Claude Opus 4.7 (max): 0.47
  • GPT-5.5 (xhigh): 0.46
  • Qwen3.7 Max: 0.42
  • Gemini 3.5 Flash: 0.40
  • GLM-5.1: 0.40
  • Claude Sonnet 4.6 (max): 0.40
  • DeepSeek V4 Pro (Max): 0.38
  • Gemma 4 31B: 0.37
  • GPT-5.4 mini (xhigh): 0.35
  • Qwen3.5 397B A17B: 0.34
  • Grok 4.3 (high): 0.33
  • DeepSeek V4 Flash (Max): 0.32
  • Kimi K2.6: 0.31
  • Gemini 3.1 Pro Preview: 0.30
  • Claude 4.5 Haiku: 0.27
  • MiniMax-M2.7: 0.27
  • gpt-oss-120b (high): 0.06

Models investigate real cluster telemetry to find root causes. Even the frontier tops out below 0.5 — ops work is far from solved.

Output Tokens Used to Run the Intelligence Suite

Total tokens generated answering the full evaluation suite, split into answer and reasoning tokens.

Total output tokens (answer plus reasoning) per model:

  • DeepSeek V4 Flash (Max): 241M
  • GPT-5.4 mini (xhigh): 235M
  • Claude Sonnet 4.6 (max): 198M
  • DeepSeek V4 Pro (Max): 187M
  • Kimi K2.6: 166M
  • Solar Pro 3: 121M
  • Claude Opus 4.8 (max): 112M
  • Claude Opus 4.7 (max): 112M
  • Qwen3.7 Plus: 111M
  • Nemotron 3 Ultra: 103M
  • K2 Think V2: 99M
  • Qwen3.7 Max: 97M
  • MiniMax-M3: 91M
  • Mistral Medium 3.5: 90M
  • Grok 4.3 (high): 88M
  • Claude 4.5 Haiku: 87M
  • MiniMax-M2.7: 87M
  • Claude Fable 5 (with fallback): 86M
  • Qwen3.5 397B A17B: 86M
  • gpt-oss-120b (high): 78M
  • GPT-5.3 Codex (xhigh): 77M
  • GPT-5.5 (xhigh): 75M
  • Gemini 3.5 Flash: 73M
  • gpt-oss-20B (high): 61M
  • Muse Spark: 58M
  • Gemini 3.1 Pro Preview: 57M
  • Gemma 4 31B: 39M
  • Nova 2.0 Pro Preview (medium): 36M
  • MiMo-V2.5-Pro: 28M

Reasoning-heavy models can burn 20–40× more tokens thinking than answering — which is exactly why blended price alone undersells true cost.
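To see why list price undersells cost, consider the output spend per token of visible answer when reasoning tokens are billed at the output rate. This is a back-of-the-envelope sketch; `cost_per_answer_token` is a hypothetical helper and the $25 price and 20× ratio are illustrative inputs, not measurements for any specific model.

```python
def cost_per_answer_token(output_price: float, reasoning_ratio: float) -> float:
    """USD per 1M answer tokens when reasoning tokens bill as output.

    output_price: list price in USD per 1M output tokens.
    reasoning_ratio: reasoning tokens generated per answer token.
    """
    if reasoning_ratio < 0:
        raise ValueError("reasoning_ratio must be non-negative")
    # Every answer token drags reasoning_ratio extra billed tokens with it.
    return output_price * (1 + reasoning_ratio)


# Illustrative: $25 per 1M output tokens, 20 reasoning tokens per answer token.
print(cost_per_answer_token(25.0, 20))  # 525.0
```

At a 20× ratio the effective price per million answer tokens is 21 times the list output price, before input tokens are counted.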

Cost to Run the Intelligence Suite

USD to complete every evaluation in the Intelligence Index, including reasoning tokens. Lower is better.

  • Claude Fable 5 (with fallback): $6,228
  • Claude Opus 4.7 (max): $3,738
  • Claude Opus 4.8 (max): $3,736
  • Claude Sonnet 4.6 (max): $3,356
  • GPT-5.5 (xhigh): $2,865
  • GPT-5.4 mini (xhigh): $1,158
  • Gemini 3.5 Flash: $1,071
  • Qwen3.7 Max: $1,061
  • Mistral Medium 3.5: $931
  • Kimi K2.6: $839
  • Gemini 3.1 Pro Preview: $829
  • GLM-5.1: $697
  • Qwen3.5 397B A17B: $528
  • Claude 4.5 Haiku: $517
  • Nemotron 3 Ultra: $443
  • Nova 2.0 Pro Preview (medium): $407
  • Grok 4.3 (high): $292
  • MiniMax-M3: $260
  • DeepSeek V4 Pro (Max): $187
  • Qwen3.7 Plus: $158
  • MiniMax-M2.7: $144
  • gpt-oss-120b (high): $96
  • DeepSeek V4 Flash (Max): $90
  • gpt-oss-20B (high): $30

The spread is real: the same suite costs $30 on gpt-oss-20B and over $6,200 on Claude Fable 5 (with fallback), the priciest frontier model.

Output Speed

Median output tokens per second across providers serving each model. Higher is better.

  • gpt-oss-120b (high): 344
  • gpt-oss-20B (high): 218
  • GPT-5.4 mini (xhigh): 172
  • Grok 4.3 (high): 170
  • Gemini 3.5 Flash: 159
  • Nemotron 3 Ultra: 155
  • Qwen3.7 Max: 129
  • Gemini 3.1 Pro Preview: 118
  • Nova 2.0 Pro Preview (medium): 117
  • Claude 4.5 Haiku: 100
  • Mistral Medium 3.5: 95
  • DeepSeek V4 Flash (Max): 93
  • GPT-5.3 Codex (xhigh): 81
  • GLM-5.1: 76
  • DeepSeek V4 Pro (Max): 67
  • Claude Opus 4.8 (max): 63
  • GPT-5.5 (xhigh): 62
  • MiniMax-M3: 59
  • Qwen3.7 Plus: 53
  • Qwen3.5 397B A17B: 51
  • Claude Opus 4.7 (max): 49
  • MiMo-V2.5-Pro: 49
  • Claude Sonnet 4.6 (max): 47
  • MiniMax-M2.7: 47
  • Kimi K2.6: 43
  • Gemma 4 31B: 35

Latency — Time to First Answer Token

Seconds from request to first answer token, including reasoning time. Lower is better.

  • MiMo-V2.5-Pro: 2.9s
  • gpt-oss-120b (high): 6.7s
  • GPT-5.4 mini (xhigh): 9.4s
  • gpt-oss-20B (high): 9.9s
  • Grok 4.3 (high): 14s
  • Claude 4.5 Haiku: 15s
  • Nemotron 3 Ultra: 16s
  • Claude Opus 4.7 (max): 18s
  • Gemini 3.5 Flash: 19s
  • Qwen3.7 Max: 21s
  • Gemini 3.1 Pro Preview: 23s
  • Mistral Medium 3.5: 23s
  • Nova 2.0 Pro Preview (medium): 29s
  • Claude Opus 4.8 (max): 34s
  • MiniMax-M3: 37s
  • Qwen3.7 Plus: 40s
  • GLM-5.1: 51s
  • Gemma 4 31B: 51s
  • MiniMax-M2.7: 55s
  • DeepSeek V4 Flash (Max): 62s
  • Qwen3.5 397B A17B: 65s
  • DeepSeek V4 Pro (Max): 67s
  • GPT-5.3 Codex (xhigh): 95s
  • Kimi K2.6: 107s
  • Claude Sonnet 4.6 (max): 108s
  • GPT-5.5 (xhigh): 117s

Max-effort reasoning modes pay for their scores in wait time: the smartest configurations routinely think for one to two minutes.
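Latency and output speed combine into a rough end-to-end wait: time to first answer token plus answer length divided by tokens per second. The sketch below assumes a constant decode rate after the first token; `time_to_complete` and the 500-token answer length are hypothetical, while the latency and speed figures come from the lists above.

```python
def time_to_complete(latency_s: float, speed_tps: float, answer_tokens: int) -> float:
    """Estimated seconds until the full answer has streamed."""
    if speed_tps <= 0:
        raise ValueError("speed_tps must be positive")
    # Latency already includes reasoning time; the rest is streaming the answer.
    return latency_s + answer_tokens / speed_tps


# Latency (s) and speed (tokens/s) from this page; 500 tokens is illustrative.
print(f"GPT-5.5 (xhigh): {time_to_complete(117, 62, 500):.1f}s")
print(f"Grok 4.3 (high): {time_to_complete(14, 170, 500):.1f}s")
```

For short answers the first-token latency dominates, so a fast decoder cannot compensate for a long reasoning phase.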

Pricing — Input and Output

USD per 1M tokens by direction. Lower is better.

Labeled price per model (chart legend: input price, output price):

  • Claude Fable 5 (with fallback): $50
  • GPT-5.5 (xhigh): $30
  • Claude Opus 4.8 (max): $25
  • Claude Opus 4.7 (max): $25
  • Claude Sonnet 4.6 (max): $15
  • GPT-5.3 Codex (xhigh): $14
  • Gemini 3.1 Pro Preview: $12
  • Qwen3.7 Max: $7.5
  • Nova 2.0 Pro Preview (medium): $10
  • Gemini 3.5 Flash: $9
  • Mistral Medium 3.5: $7.5
  • GLM-5.1: $4.4
  • Claude 4.5 Haiku: $5
  • Kimi K2.6: $4
  • GPT-5.4 mini (xhigh): $4.5
  • Grok 4.3 (high): $2.5
  • Qwen3.5 397B A17B: $3.6
  • MiMo-V2.5-Pro: $2.7
  • Nemotron 3 Ultra: $2.675
  • Qwen3.7 Plus: $1.16
  • DeepSeek V4 Pro (Max): $0.87
  • MiniMax-M3: $1.2
  • MiniMax-M2.7: $1.2
  • gpt-oss-120b (high): $0.6
  • DeepSeek V4 Flash (Max): $0.28
  • gpt-oss-20B (high): $0.2

Output tokens typically cost 2–4× input. Reasoning tokens bill as output, so thinking models multiply effective price.
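The 3:1 blended price used throughout this page can be sketched as a weighted average of three parts input price to one part output price. That reading of "3:1" is an assumption, `blended_price` is a hypothetical helper, and the input prices in the example are illustrative because this page does not list them separately.

```python
def blended_price(input_price: float, output_price: float, input_parts: float = 3.0) -> float:
    """Weighted average price per 1M tokens at input_parts:1 input-to-output."""
    if input_parts < 0:
        raise ValueError("input_parts must be non-negative")
    return (input_parts * input_price + output_price) / (input_parts + 1)


# Illustrative: an assumed $5 input price with a $25 output price.
print(blended_price(5.0, 25.0))    # 10.0
# Illustrative: an assumed $10 input price with a $50 output price.
print(blended_price(10.0, 50.0))   # 20.0
```

Because input tokens carry three quarters of the weight, a blended figure can look modest even when the output price, which also applies to reasoning tokens, is several times higher.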

Context Window

Maximum input tokens per request.

  • 1M: Claude Fable 5 (with fallback), Claude Opus 4.8 (max), Claude Opus 4.7 (max), Gemini 3.5 Flash, Claude Sonnet 4.6 (max), Gemini 3.1 Pro Preview, Qwen3.7 Max, MiniMax-M3, DeepSeek V4 Pro (Max), DeepSeek V4 Flash (Max), Qwen3.7 Plus, Grok 4.3 (high), MiMo-V2.5-Pro
  • 922k: GPT-5.5 (xhigh), GPT-5.5 Pro (xhigh)
  • 400k: GPT-5.3 Codex (xhigh), GPT-5.4 mini (xhigh)
  • 262k: Muse Spark, Nemotron 3 Ultra, Qwen3.5 397B A17B, K2 Think V2
  • 256k: Kimi K2.6, Mistral Medium 3.5, Gemma 4 31B, Nova 2.0 Pro Preview (medium)
  • 205k: MiniMax-M2.7
  • 200k: GLM-5.1, Claude 4.5 Haiku
  • 131k: gpt-oss-120b (high), gpt-oss-20B (high)
  • 128k: Solar Pro 3

Openness Index

Weights availability plus transparency of methodology and training data, 0–100.

  • K2 Think V2: 89
  • Nemotron 3 Ultra: 83
  • DeepSeek V4 Pro (Max): 50
  • DeepSeek V4 Flash (Max): 50
  • GLM-5.1: 44
  • Qwen3.5 397B A17B: 39
  • Gemma 4 31B: 39
  • gpt-oss-120b (high): 39
  • gpt-oss-20B (high): 39
  • MiniMax-M3: 33
  • Kimi K2.6: 33
  • Mistral Medium 3.5: 33
  • MiniMax-M2.7: 22
  • Claude 4.5 Haiku: 11

Only models with published openness scores shown. K2 Think V2 and Nemotron 3 Ultra lead; most frontier labs publish nothing.

Methodology: indices are composites of public evaluations run independently with standardized prompts; speed and latency are medians measured across API providers over the trailing 72 hours. Benchmark data: Artificial Analysis (artificialanalysis.ai) public snapshot, June 16, 2026. Prices are list API prices and change frequently. Company logos identify the respective model creators.