Glsrm Benchmarks · July 31, 2026

Independent vibe check of AI

Understand the AI landscape to choose the best model and provider for your use case — intelligence, speed, and cost, measured independently.

Snapshot

July 31, 2026

Highlights

Higher is better

Intelligence

Intelligence Index · best model per lab

1	Claude Opus 5 (max)	61
2	GPT-5.6 Sol (max)	59
3	Kimi K3 (max)	57
4	Grok 4.5 (high)	54
5	GLM-5.2 (max)	51
6	Muse Spark 1.1 (xhigh)	51
7	Gemini 3.5 Flash	50
8	DeepSeek V4 Flash 0731 (max)	50
9	Qwen3.7 Max	46

Higher is better

Speed

Median output tokens per second · each lab's top-ranked model

1	Qwen3.7 Max	200
2	Command A+	196
3	Nemotron 3 Ultra	195
4	Gemini 3.5 Flash	172
5	Muse Spark 1.1 (xhigh)	130
6	Nova 2.0 Pro Preview (medium)	115
7	GLM-5.2 (max)	111
8	Inkling	86
9	Solar Pro 3	86

Lower is better

Price

USD per 1M tokens, blended 3:1 input:output · each lab's top-ranked model

1	DeepSeek V4 Flash 0731 (max)	$0.18
2	Solar Pro 3	$0.26
3	MiniMax-M3	$0.53
4	MiMo-V2.5-Pro	$0.54
5	Nemotron 3 Ultra	$1.18
6	Muse Spark 1.1 (xhigh)	$2.00
7	GLM-5.2 (max)	$2.15
8	Inkling	$2.57
9	Grok 4.5 (high)	$3.00

Model comparison summary

Every model we track, ranked by Intelligence Index.

#	Model	Creator	Released	Context	Intelligence	Coding	Agentic	Speed (t/s)	Blended $/1M	Latency
1	Claude Opus 5 (max)	Anthropic	Jul 2026	1M	61	78	55	54	$10.00	92s
2	Claude Fable 5 (with fallback)	Anthropic	Jun 2026	1M	60	77	53	64	$20.00	90s
3	GPT-5.6 Sol (max)	OpenAI	Jul 2026	1M	59	77	54	66	$11.25	138s
4	Kimi K3 (max) · Open	Kimi	Jul 2026	1M	57	76	50	36	$6.00	59s
5	Claude Opus 4.8 (max)	Anthropic	May 2026	1M	56	74	47	61	$10.00	9.5s
6	GPT-5.6 Terra (max)	OpenAI	Jul 2026	1M	55	77	47	132	$4.50	162s
7	GPT-5.5 (xhigh)	OpenAI	Apr 2026	922k	55	75	45	65	$11.25	88s
8	Grok 4.5 (high)	xAI	Jul 2026	500k	54	72	46	58	$3.00	8.0s
9	Claude Opus 4.7 (max)	Anthropic	Apr 2026	1M	54	74	44	48	$10.00	14s
10	Claude Sonnet 5 (max)	Anthropic	Jun 2026	1M	53	72	47	79	$4.00	156s
11	GPT-5.6 Luna (max)	OpenAI	Jul 2026	1M	51	71	46	178	$0.45	117s
12	GLM-5.2 (max) · Open	Z AI	Jun 2026	1M	51	69	43	111	$2.15	19s
13	Muse Spark 1.1 (xhigh)	Meta	Jul 2026	1M	51	71	38	130	$2.00	18s
14	Gemini 3.5 Flash	Google	May 2026	1M	50	70	37	172	$3.38	22s
15	Gemini 3.6 Flash	Google	Jul 2026	1M	50	69	39	217	$3.00	15s
16	DeepSeek V4 Flash 0731 (max) · Open	DeepSeek	Jul 2026	1M	50	69	46	—	$0.18	—
17	Claude Sonnet 4.6 (max)	Anthropic	Feb 2026	1M	47	63	41	55	$6.00	142s
18	Gemini 3.1 Pro Preview	Google	Feb 2026	1M	47	69	21	123	$4.50	22s
19	Qwen3.7 Max	Alibaba	May 2026	1M	46	66	31	200	$3.75	15s
20	MiniMax-M3 · Open	MiniMax	Jun 2026	1M	44	59	35	69	$0.53	30s
21	GPT-5.3 Codex (xhigh)	OpenAI	Feb 2026	400k	44	—	—	114	$4.81	70s
22	DeepSeek V4 Pro (max) · Open	DeepSeek	Apr 2026	1M	44	59	36	65	$0.54	69s
23	Kimi K2.6 · Open	Kimi	Apr 2026	256k	44	62	30	40	$1.71	114s
24	Muse Spark	Meta	Apr 2026	262k	43	59	29	—	—	—
25	MiMo-V2.5-Pro · Open	Xiaomi	Apr 2026	1M	42	60	29	50	$0.54	43s
26	Inkling · Open	Thinking Machines	Jul 2026	1M	41	52	32	86	$2.57	25s
27	DeepSeek V4 Flash (max) · Open	DeepSeek	Apr 2026	1M	40	56	31	115	$0.18	50s
28	GLM-5.1 · Open	Z AI	Apr 2026	200k	40	56	30	67	$2.13	59s
29	GPT-5.4 mini (xhigh)	OpenAI	Mar 2026	400k	40	56	30	178	$1.69	9.8s
30	Qwen3.7 Plus	Alibaba	Jun 2026	1M	39	56	21	53	$0.70	40s
31	MiniMax-M2.7 · Open	MiniMax	Mar 2026	205k	38	53	26	54	$0.53	48s
32	Nemotron 3 Ultra · Open	NVIDIA	Jun 2026	262k	38	49	27	195	$1.18	13s
33	Grok 4.3 (high)	xAI	Apr 2026	1M	38	42	24	135	$1.56	19s
34	Gemini 3.5 Flash-Lite	Google	Jul 2026	1M	37	49	27	366	$0.85	7.9s
35	Qwen3.5 397B A17B · Open	Alibaba	Feb 2026	262k	34	48	20	69	$1.35	49s
36	Mistral Medium 3.5 · Open	Mistral	Apr 2026	256k	30	47	19	63	$3.00	34s
37	Claude 4.5 Haiku	Anthropic	Oct 2025	200k	30	44	16	97	$2.00	16s
38	Gemma 4 31B · Open	Google	Apr 2026	256k	29	43	14	35	—	51s
39	gpt-oss-120b (high) · Open	OpenAI	Aug 2025	131k	24	30	13	195	$0.26	11s
40	Command A+ · Open	Cohere	May 2026	192k	23	28	9	196	—	11s
41	Nova 2.0 Pro Preview (medium)	Amazon	Nov 2025	256k	22	34	7	115	$3.44	29s
42	K2 Think V2 · Open	MBZUAI IFM	Dec 2025	262k	17	21	2	—	—	—
43	Solar Pro 3	Upstage	Apr 2026	128k	14	16	3	86	$0.26	26s
—	GPT-5.5 Pro (xhigh)	OpenAI	Apr 2026	922k	—	—	—	—	—	—

Intelligence Index

Composite of 10 evaluations spanning reasoning, knowledge, math, coding, and agentic tool use. Higher is better.

Incorporates GPQA Diamond, Humanity's Last Exam, AIME 2025, LiveCodeBench, SciCode, IFBench, Terminal-Bench Hard, τ²-Bench, and more.

Intelligence vs. Price

Intelligence Index against blended USD per 1M tokens (3:1 input:output, log scale).

Up and to the left wins: more intelligence per dollar. Models without public API pricing are excluded.
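
As a rough illustration of the blended price and of "intelligence per dollar", here is a minimal Python sketch. It assumes that "3:1 input:output" means a weighted average of three parts input price to one part output price; the page does not spell out the formula, and the function names are hypothetical.

```python
def blended_price(input_usd_per_m: float, output_usd_per_m: float,
                  input_parts: float = 3.0, output_parts: float = 1.0) -> float:
    """Weighted average of list prices in USD per 1M tokens.

    Assumption: "3:1 blended" means 3 parts input to 1 part output.
    """
    total_parts = input_parts + output_parts
    return (input_parts * input_usd_per_m + output_parts * output_usd_per_m) / total_parts


def intelligence_per_dollar(index_score: float, blended_usd_per_m: float) -> float:
    """Index points per blended dollar; larger values sit further up and to the left."""
    return index_score / blended_usd_per_m


# Hypothetical per-direction prices, chosen only to show the arithmetic.
print(blended_price(input_usd_per_m=1.00, output_usd_per_m=5.00))  # 2.0

# Index and blended-price pairs taken from the comparison table.
print(round(intelligence_per_dollar(61, 10.00), 1))  # Claude Opus 5 (max): 6.1
print(round(intelligence_per_dollar(51, 0.45), 1))   # GPT-5.6 Luna (max): 113.3
```

The ratio is only a convenience for reading the chart; it ignores how many reasoning tokens a model spends per answer.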

Intelligence vs. Output Speed

Each dot is a model. Higher = smarter · farther right = faster. Purple guides mark market medians; named labels highlight the efficiency frontier.

The shaded corner is the sweet spot: above-median intelligence and above-median speed. Speed is the median tokens/s across providers serving each model.
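
The sweet-spot rule can be sketched in a few lines of Python. The rows below are a small sample copied from the comparison table, so the medians here are medians of the sample, not the market medians drawn on the chart; `sweet_spot` is a hypothetical helper name.

```python
from statistics import median

# (model, Intelligence Index, median output tokens/s), copied from the comparison table.
sample = [
    ("Claude Opus 5 (max)", 61, 54),
    ("GPT-5.6 Terra (max)", 55, 132),
    ("GLM-5.2 (max)", 51, 111),
    ("Gemini 3.6 Flash", 50, 217),
    ("Qwen3.7 Max", 46, 200),
    ("Kimi K2.6", 44, 40),
    ("Solar Pro 3", 14, 86),
]


def sweet_spot(rows):
    """Return models strictly above the median on both intelligence and speed."""
    median_intelligence = median(score for _, score, _ in rows)
    median_speed = median(speed for _, _, speed in rows)
    return [
        name
        for name, score, speed in rows
        if score > median_intelligence and speed > median_speed
    ]


print(sweet_spot(sample))  # ['GPT-5.6 Terra (max)']
```

In this sample the medians are 50 (intelligence) and 111 tokens/s (speed), so GLM-5.2 (max) misses the corner by sitting exactly on the speed median.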

Frontier Intelligence Over Time

Intelligence Index by release date. The dashed line tracks the running frontier.

Claude Opus 5 set the current frontier on July 24, 2026 — seven days before this snapshot.
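
A running frontier of this kind is a cumulative maximum over releases sorted by date. The sketch below uses a subset of rows from the comparison table, which lists release months only, so ordering within a month is not captured; the variable names are illustrative.

```python
from itertools import accumulate

# ((year, month), model, Intelligence Index), a subset of the comparison table.
releases = [
    ((2025, 8), "gpt-oss-120b (high)", 24),
    ((2025, 10), "Claude 4.5 Haiku", 30),
    ((2025, 11), "Nova 2.0 Pro Preview (medium)", 22),
    ((2026, 2), "Claude Sonnet 4.6 (max)", 47),
    ((2026, 3), "GPT-5.4 mini (xhigh)", 40),
    ((2026, 4), "GPT-5.5 (xhigh)", 55),
    ((2026, 5), "Claude Opus 4.8 (max)", 56),
    ((2026, 6), "Claude Fable 5 (with fallback)", 60),
    ((2026, 7), "Claude Opus 5 (max)", 61),
]
releases.sort(key=lambda row: row[0])

# Running frontier: best score seen up to and including each release.
running = list(accumulate((score for _, _, score in releases), max))

# A model "sets the frontier" when it beats every earlier score in the sample.
frontier_setters = [
    name
    for i, (_, name, score) in enumerate(releases)
    if i == 0 or score > running[i - 1]
]

print(running)               # [24, 30, 30, 47, 47, 55, 56, 60, 61]
print(frontier_setters[-1])  # Claude Opus 5 (max)
```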

Coding Index

Composite of coding evaluations (LiveCodeBench, SciCode, Terminal-Bench Hard). Higher is better.

Agentic Index

Tool calling and long-horizon agent tasks (τ²-Bench, Terminal-Bench Hard). Higher is better.

Intelligence Breakdown

Individual evaluation scores (0–100) behind the Intelligence Index. Darker is better, normalized per column.

Model	GPQA Diamond	Humanity's Last Exam	SciCode	IFBench	Terminal-Bench Hard	τ²-Bench Telecom	AA-LCR (Long Context)	CritPt	MMMU-Pro
Claude Opus 5 (max)	93.2	52.6	55.7	—	—	—	70.0	29.1	84.7
Claude Fable 5 (with fallback)	92.6	53.3	60.2	63.5	62.9	98.5	70.0	28.6	—
GPT-5.6 Sol (max)	94.1	47.2	56.1	72.7	65.9	85.1	73.7	32.3	83.4
Kimi K3 (max)	93.5	44.3	58.7	—	—	—	74.7	23.4	80.5
Claude Opus 4.8 (max)	92.0	45.7	53.5	62.2	58.3	94.4	67.7	20.9	—
GPT-5.6 Terra (max)	92.5	41.8	53.9	71.2	57.6	86.3	74.0	30.0	80.7
GPT-5.5 (xhigh)	93.5	44.3	56.1	75.9	60.6	93.9	74.3	27.1	79.9
Grok 4.5 (high)	93.1	40.3	54.1	—	—	—	67.7	15.4	80.4
Claude Opus 4.7 (max)	91.4	39.6	54.5	58.6	51.5	88.6	70.3	12.0	78.8
Claude Sonnet 5 (max)	91.1	39.6	53.6	—	—	—	70.7	16.9	77.3
GPT-5.6 Luna (max)	91.1	37.2	52.5	—	—	—	74.0	20.6	78.6
GLM-5.2 (max)	89.5	40.1	50.5	73.3	50.8	99.1	71.3	20.9	—
Muse Spark 1.1 (xhigh)	89.8	45.1	58.2	—	—	—	63.3	15.1	—
Gemini 3.5 Flash	92.2	41.0	53.1	76.3	40.9	95.3	69.3	13.1	84.3
Gemini 3.6 Flash	92.8	38.3	52.7	—	—	—	69.7	10.6	83.2
DeepSeek V4 Flash 0731 (max)	90.8	36.8	49.9	—	—	—	65.7	16.6	—
Claude Sonnet 4.6 (max)	87.5	30.0	46.8	56.6	53.0	75.7	70.7	3.1	73.3
Gemini 3.1 Pro Preview	94.1	44.7	58.9	77.1	53.8	95.6	72.7	17.7	82.4
Qwen3.7 Max	92.3	38.1	48.8	80.5	50.8	94.7	69.0	13.4	—
MiniMax-M3	92.9	37.1	45.4	82.9	42.4	88.9	74.0	3.7	78.6
GPT-5.3 Codex (xhigh)	91.5	39.9	53.2	75.4	53.0	86.0	74.0	16.9	78.5
DeepSeek V4 Pro (max)	88.8	35.9	50.0	76.5	46.2	96.2	66.3	12.9	—
Kimi K2.6	91.1	35.9	53.5	76.0	43.9	95.9	69.7	8.0	79.4
Muse Spark	88.4	39.9	51.5	75.9	45.5	91.5	69.7	11.3	80.5
MiMo-V2.5-Pro	86.6	33.8	50.2	79.9	43.2	94.2	73.3	4.0	—
Inkling	87.2	29.7	46.1	—	—	—	63.3	5.4	73.5
DeepSeek V4 Flash (max)	89.4	32.1	44.9	79.2	35.6	95.0	63.0	7.1	—
GLM-5.1	86.8	28.0	43.8	76.3	43.2	97.7	62.3	4.6	—
GPT-5.4 mini (xhigh)	87.5	26.6	49.9	73.3	52.3	83.3	69.3	10.0	73.3
Qwen3.7 Plus	90.0	33.4	45.5	78.0	47.0	93.0	65.0	9.1	80.5
MiniMax-M2.7	87.4	28.1	47.0	75.7	39.4	84.8	68.7	0.6	—
Nemotron 3 Ultra	86.7	26.6	39.9	81.4	36.4	83.3	67.0	3.1	—
Grok 4.3 (high)	90.1	35.0	47.3	81.3	37.9	97.7	64.3	8.0	78.1
Gemini 3.5 Flash-Lite	83.8	17.5	40.9	—	—	—	62.0	0.0	79.0
Qwen3.5 397B A17B	89.3	27.3	42.0	78.8	40.9	95.6	65.7	1.7	77.3
Mistral Medium 3.5	74.8	12.8	39.6	68.8	33.3	94.2	61.0	0.0	64.9
Claude 4.5 Haiku	67.2	9.7	43.3	54.3	27.3	54.7	70.3	0.0	58.6
Gemma 4 31B	85.7	22.7	43.4	75.6	36.4	59.9	62.0	1.4	73.4
gpt-oss-120b (high)	78.2	18.5	38.9	69.0	23.5	65.8	50.7	1.1	—
Command A+	76.1	11.4	37.8	73.9	25.0	80.7	46.0	0.3	63.2
Nova 2.0 Pro Preview (medium)	78.5	8.9	42.7	79.0	24.2	92.7	54.3	0.0	64.5
K2 Think V2	71.3	9.5	33.0	62.8	6.8	25.4	52.7	0.0	—
Solar Pro 3	72.4	10.1	24.7	71.2	7.6	86.3	27.0	0.0	—
GPT-5.5 Pro (xhigh)	—	—	—	—	—	—	—	30.6	—

AIME 2025 and LiveCodeBench are retired for newer models and excluded here; MMMU-Pro applies to multimodal-evaluated models only.
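
The page does not say how the per-column normalization behind the shading is computed. One plausible reading is min-max scaling within each column, with missing cells left blank; the sketch below assumes exactly that, and `normalize_column` is a hypothetical name.

```python
def normalize_column(values):
    """Min-max scale one column to [0, 1]; None (a "—" cell) stays None.

    Assumption: the shading uses min-max scaling, which the page does not confirm.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    low, high = min(present), max(present)
    span = high - low
    scaled = []
    for v in values:
        if v is None:
            scaled.append(None)
        elif span == 0:
            scaled.append(0.0)  # every present value is identical
        else:
            scaled.append((v - low) / span)
    return scaled


# IFBench cells for Claude Opus 5 (—), Claude Fable 5, GPT-5.6 Sol, and MiniMax-M3.
ifbench = [None, 63.5, 72.7, 82.9]
print(normalize_column(ifbench))
```

With this scaling the lowest present score in a column maps to 0 and the highest to 1, so shades are comparable within a column but not across columns.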

Omniscience Index

Knowledge reliability from -100 to 100: correct answers score positive, hallucinated ones negative.

A negative score means the model hallucinates more than it knows. Declining to answer scores zero — most models would rather guess.
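
A score with these properties can be sketched as correct answers minus hallucinated answers, as a percentage of all questions. This is an assumed formula that merely matches the description above (range -100 to 100, abstentions worth zero), not the published scoring rule.

```python
def knowledge_reliability(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct, -1 per hallucinated, 0 per abstention.

    Assumption: equal weights for right and wrong answers; the real index may differ.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("no questions scored")
    return 100.0 * (correct - incorrect) / total


# Hypothetical tallies out of 100 questions.
print(knowledge_reliability(correct=60, incorrect=20, abstained=20))  # 40.0
print(knowledge_reliability(correct=30, incorrect=50, abstained=20))  # -20.0
print(knowledge_reliability(correct=30, incorrect=0, abstained=70))   # 30.0
```

The last two calls show the incentive described above: with the same 30 correct answers, guessing wrong on 50 questions scores -20, while abstaining on all the rest scores 30.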

GDPval Elo — Real-World Work

Elo from blind pairwise comparisons on real economically valuable work tasks, with web and shell access.

Higher is better. Judged across occupations from software engineering to financial analysis.
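
For readers unfamiliar with Elo, the classic online update from a single pairwise comparison looks like the sketch below. The page does not state how its ratings are fitted, so the K-factor of 32 and the one-comparison-at-a-time update are assumptions for illustration only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return new (A, B) ratings; score_a is 1 for an A win, 0.5 for a tie, 0 for a loss."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# Two equally rated models; the judge prefers A's output.
print(update(1000.0, 1000.0, score_a=1.0))  # (1016.0, 984.0)
```

Points gained by one side equal points lost by the other, so only rating differences carry meaning.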

ITBench — SRE Incident Analysis

Average precision at full recall diagnosing live Kubernetes incidents. Higher is better.

Models investigate real cluster telemetry to find root causes. Even the frontier tops out below 0.5 — ops work is far from solved.
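
As a rough guide to the metric, the sketch below computes textbook average precision over a ranked list of candidate root causes, averaging precision at each rank where a true cause appears and dividing by the number of true causes. Whether ITBench scores exactly this way is an assumption; the function and its inputs are hypothetical.

```python
def average_precision(ranked_relevance, total_relevant=None):
    """Average precision for a ranked list of True/False relevance flags.

    total_relevant defaults to the number of True flags in the list; pass a
    larger value when some true causes were never proposed at all.
    """
    if total_relevant is None:
        total_relevant = sum(1 for flag in ranked_relevance if flag)
    if total_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / total_relevant


# Hypothetical incident with two true root causes, ranked 1st and 3rd by the model.
print(round(average_precision([True, False, True]), 3))  # 0.833
# Same two causes, but buried at ranks 3 and 4.
print(round(average_precision([False, False, True, True]), 3))  # 0.417
```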

Output Tokens Used to Run the Intelligence Suite

Total tokens generated answering the full evaluation suite, split into answer and reasoning tokens.

Reasoning-heavy models can burn 20–40× more tokens thinking than answering — which is exactly why blended price alone undersells true cost.

Cost to Run the Intelligence Suite

USD to complete every evaluation in the Intelligence Index, including reasoning tokens. Lower is better.

The spread is real: the same suite costs $19 on gpt-oss-20B and over $4,600 on the priciest frontier models.

Output Speed

Median output tokens per second across providers serving each model. Higher is better.

Latency — Time to First Answer Token

Seconds from request to first answer token, including reasoning time. Lower is better.

Max-effort reasoning modes pay for their scores in wait time: the smartest configurations routinely think for one to two minutes.

Pricing — Input and Output

USD per 1M tokens by direction. Lower is better.

Output tokens typically cost 2–4× input. Reasoning tokens bill as output, so thinking models multiply effective price.
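
To see how reasoning tokens inflate the bill, here is a small Python sketch that charges reasoning tokens at the output rate, as described above. All token counts and prices in the example are hypothetical.

```python
def request_cost_usd(input_tokens: int, answer_tokens: int, reasoning_tokens: int,
                     input_usd_per_m: float, output_usd_per_m: float) -> float:
    """Cost of one request in USD, with reasoning tokens billed as output."""
    billed_output = answer_tokens + reasoning_tokens
    return (input_tokens * input_usd_per_m + billed_output * output_usd_per_m) / 1_000_000


# Hypothetical request: 1,000 input tokens, 500 answer tokens, $1 in / $5 out per 1M.
without_thinking = request_cost_usd(1_000, 500, 0, 1.00, 5.00)
# Same request with 30x as many reasoning tokens as answer tokens.
with_thinking = request_cost_usd(1_000, 500, 15_000, 1.00, 5.00)

print(round(without_thinking, 4))  # 0.0035
print(round(with_thinking, 4))     # 0.0785
```

In this made-up case the list prices are identical but the request costs roughly 22 times more once thinking is included.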

Context Window

Maximum input tokens per request.

Openness Index

Weights availability plus transparency of methodology and training data, 0–100.

Only models with published openness scores are shown. K2 Think V2 and Nemotron 3 Ultra lead; most frontier labs publish nothing.

Methodology: indices are composites of public evaluations run independently with standardized prompts; speed and latency are medians measured across API providers over the trailing 72 hours. Benchmark data: public model-evaluation snapshot, July 31, 2026. Prices are list API prices and change frequently.