16x Eval Model Evaluation Results
Comprehensive evaluation results by the 16x Eval team for AI models across various tasks, including coding and writing.
Evaluation Results
🥇 Gemini 2.5 Pro Preview (05-06), Google. Avg: 9/10
Individual experiment ratings:
- Image - kanji: 9/10
- Image analysis - water bottle: 9/10
🥈 o3, OpenAI. Avg: 9/10
Individual experiment ratings:
- Image - kanji: 9/10
- Image analysis - water bottle: 9/10
🥉 Grok 4, xAI. Avg: 8.38/10
Individual experiment ratings:
- Image - kanji: 7.5/10
- Image analysis - water bottle: 9.25/10
Top Models - Image Analysis
Model | Avg Rating |
---|---|
Gemini 2.5 Pro Preview (05-06) | 9.00 |
o3 | 9.00 |
Grok 4 | 8.38 |
GPT-4.1 | 7.00 |
Gemini 2.5 Pro | 7.00 |
Claude Opus 4 | 6.00 |
Claude 3.7 Sonnet | 5.50 |
meta-llama/llama-4-maverick | 5.50 |
Evaluation Rubrics - Benchmark Visualization
Criteria:
- Side-by-side visualization without label: 8.5/10
- Baseline visualization without label: 8/10
- Horizontal bar chart (if it cannot fit on the page): 7.5/10
- Has major formatting issues: 5/10
- Did not run / Code error: 1/10
Additional components:
- Side-by-side visualization:
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: no effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating
Additional instructions for variance:
- If the code did not run or render in the first try, a second try is given to regenerate the code.
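Before the results, here is a minimal sketch of the kind of chart this rubric rewards: side-by-side (grouped) bars with one color per benchmark and clear value labels. The actual prompt, benchmark data, and expected output are not shown on this page, so every model name, benchmark, and number below is invented purely for illustration.

```typescript
// Illustrative sketch only: the benchmarks, models, and scores below are made up;
// the real prompt's data is not shown in these results.
type Scores = Record<string, number>; // benchmark name -> score (0-100)

const data: Record<string, Scores> = {
  "Model A": { MMLU: 82, HumanEval: 74 },
  "Model B": { MMLU: 79, HumanEval: 88 },
};

const benchmarks = ["MMLU", "HumanEval"];
const colors = ["#4e79a7", "#f28e2b"]; // one color per benchmark (+0.5 under this rubric)
const barWidth = 40;
const groupGap = 30;
const chartHeight = 200;

let x = groupGap;
let bars = "";
for (const [model, scores] of Object.entries(data)) {
  benchmarks.forEach((bench, i) => {
    const h = (scores[bench] / 100) * chartHeight;
    const barX = x + i * barWidth;
    bars += `<rect x="${barX}" y="${chartHeight - h}" width="${barWidth - 4}" height="${h}" fill="${colors[i]}"/>`;
    // Clear value labels on the bars also earn a bonus under the rubric.
    bars += `<text x="${barX}" y="${chartHeight - h - 4}" font-size="10">${scores[bench]}</text>`;
  });
  bars += `<text x="${x}" y="${chartHeight + 14}" font-size="10">${model}</text>`;
  x += benchmarks.length * barWidth + groupGap;
}

// Emit a self-contained SVG string; save it to an .svg file to view the chart.
console.log(
  `<svg xmlns="http://www.w3.org/2000/svg" width="${x}" height="${chartHeight + 20}">${bars}</svg>`
);
```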
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | xAI Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | side-by-side clear labels. color by model. Visually pleasing |
#1 | Moonshot AI Kimi K2 | 9.25/10 | Benchmark visualization | N/A | Side-by-side no label. Color by model. Benchmark diff by alpha. Visually pleasing |
#5 | OpenAI GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing |
#5 | Google Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing |
#7 | OpenRouter openai/gpt-oss-120b | 8.5/10 | Benchmark visualization | N/A | baseline. clear labels |
#8 | OpenAI o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice |
#8 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice |
#10 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea |
#11 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.26¢ | No labels. Good colors |
#11 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors |
#11 | OpenRouter mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | OpenRouter mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7/10 | Benchmark visualization | N/A | horizontal bars. minor formatting issues |
#17 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels |
#18 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#18 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#20 | OpenRouter qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small. Hard to read |
#20 | OpenRouter inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read |
#22 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | doesn't run. bugfix not obvious. |
Evaluation Rubrics - Clean Markdown
Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10
Additional components:
- Short code (1000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
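For context, a minimal sketch of the kind of solution this rubric describes: strip common Markdown syntax to plain text while keeping the original line breaks, since losing them is what the newline criteria penalize. The actual prompt, input document, and reference output are not shown here, so the specific syntax rules below are assumptions.

```typescript
// Assumed illustration of a "clean markdown" solution; the real task's rules may differ.
function cleanMarkdown(input: string): string {
  return input
    .split("\n")
    .map((line) =>
      line
        .replace(/^#{1,6}\s+/, "")               // headings
        .replace(/\*\*([^*]+)\*\*/g, "$1")       // bold
        .replace(/\*([^*]+)\*/g, "$1")           // italics
        .replace(/`([^`]+)`/g, "$1")             // inline code
        .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // links -> link text
    )
    .join("\n"); // joining with "\n" preserves the original newlines
}

console.log(cleanMarkdown("# Title\n\nSome **bold** text with a [link](https://example.com)."));
```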
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | correct output. short code |
#1 | Moonshot AI Kimi K2 | 9.25/10 | clean markdown v2 | N/A | correct. short code |
#1 | OpenRouter (Alibaba Plus) Qwen3 Coder | 9.25/10 | clean markdown v2 | N/A | correct. short code |
#4 | Google Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | correct |
#4 | OpenAI o3 | 9/10 | clean markdown v2 | 13.79¢ | correct |
#6 | OpenAI GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | 1 new line issue |
#6 | xAI Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | 1 new line issue |
#6 | Stealth Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | one newline issue |
#6 | OpenRouter openai/gpt-oss-120b | 8.5/10 | clean markdown v2 | N/A | one newline issue |
#10 | Anthropic Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | no new lines |
#10 | DeepSeek DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.05¢ | no new lines |
Evaluation Rubrics - Folder Watcher Fix
Criteria:
- Correctly solved the task: 9/10
Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
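For context, the task asks models to fix a folder-watcher script; the original code, its language, and the bug are not reproduced on this page. Purely as an assumed illustration of the kind of code involved, a generic Node.js folder watcher might look like the sketch below. It is not the task's solution.

```typescript
import { watch } from "node:fs";

// Assumed illustration only: the actual task's code and its bug are not shown
// in this results table. This is just a generic recursive folder watcher.
const watcher = watch("./watched-folder", { recursive: true }, (eventType, filename) => {
  if (filename !== null) {
    console.log(`${eventType}: ${filename}`);
  }
});

// Close the watcher after 60 seconds so the example exits on its own.
setTimeout(() => watcher.close(), 60_000);
```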
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | solved. extra logic. concise |
#1 | Stealth Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | solved. extra logic. concise, respects indentation well |
#3 | OpenAI o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | solved. extra logic |
#3 | Anthropic Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | solved. concise |
#3 | Moonshot AI Kimi K2 | 9.25/10 | Folder watcher fix | N/A | solved. extra logic |
#6 | Anthropic Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | solved. very verbose. extra logic |
#6 | xAI Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | solved. extra logic. verbose |
#6 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Folder watcher fix | N/A | unnecessary code |
#9 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | solved. verbose |
#9 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | solved. verbose |
#9 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | solved. verbose |
#9 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | solved. verbose |
#9 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.22¢ | solved. verbose |
#9 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | solved. verbose |
#9 | OpenRouter openai/gpt-oss-120b | 8.5/10 | Folder watcher fix | N/A | solved. verbose |
#16 | OpenAI o3 | 8/10 | Folder watcher fix | 9.82¢ | solved. diff format |
#16 | Google Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | solved in a different way. diff format |
Evaluation Rubrics - Kanji Image
Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10
Additional components:
- Provides multiple explanations:
  - Includes one wrong explanation: -0.5 rating
  - Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | correct |
#1 | OpenAI o3 | 9/10 | Kanji image | 8.27¢ | correct |
#3 | xAI Grok 4 | 7.5/10 | Kanji image | 15.85¢ | main explanation wrong. alternative explanation correct. verbose |
#4 | Anthropic Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | tangential |
#5 | OpenAI GPT-4.1 | 5/10 | Kanji image | 0.40¢ | failed |
#5 | Anthropic Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | failed |
#5 | OpenAI GPT-4o | 5/10 | Kanji image | 0.70¢ | failed |
#5 | OpenRouter meta-llama/llama-4-maverick | 5/10 | Kanji image | N/A | ambiguous output |
#5 | Anthropic Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | failed |
#5 | Google Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | failed |
#11 | OpenRouter qwen/qwen3-235b-a22b | 1/10 | Kanji image | N/A | Didn't recognize image |
Evaluation Rubrics - Image Analysis (Water Bottle)
Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10
Additional components:
- Detailed explanation: +0.25 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9.25/10 | Image analysis | 8.87¢ | correct. detailed explanation |
#2 | OpenAI GPT-4.1 | 9/10 | Image analysis | 0.24¢ | correct |
#2 | Google Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | correct |
#2 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | correct |
#2 | OpenAI o3 | 9/10 | Image analysis | 2.90¢ | correct |
#2 | Google Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | correct |
#7 | Anthropic Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | missed point |
#7 | OpenRouter meta-llama/llama-4-maverick | 6/10 | Image analysis | N/A | missed point |
#7 | Anthropic Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | missed points |
#7 | Anthropic Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | missed points |
Evaluation Rubrics - TODO Task
Criteria:
- Output only changed code (follows instructions): 9/10
- Output full code (does not follow instructions): 8/10
Additional components:
- Concise response:
  - Very concise response: +0.25 rating
  - Very very concise response: +0.5 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise *2. Follows instruction well |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise *2. Follows instruction well |
#1 | xAI Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise *2. Follows instructions well |
#1 | Google Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise *2. Follows instruction well |
#5 | OpenAI GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise. Follows instruction well |
#5 | Anthropic Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | 0.76¢ | Very concise. Follows instruction well |
#7 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | 0.06¢ | Follows instruction |
#7 | Google Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Follows instruction |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instruction |
#7 | Google Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instruction |
#11 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context! |
#11 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#11 | Anthropic Claude 3.5 Sonnet | 8.5/10 | TODO task | 0.86¢ | Slightly verbose. Follows instruction |
#11 | Stealth Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | Stealth Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/gpt-oss-120b | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/gpt-oss-120b | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#19 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task | 1.20¢ | Output full code |
#19 | OpenRouter inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | output full code |
#19 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | 1.33¢ | Output full code |
#19 | Fireworks AI DeepSeek V3 (0324) | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter mistralai/devstral-small | 8/10 | TODO task | N/A | output full code |
#19 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | output full code |
#19 | Anthropic Claude Opus 4 | 8/10 | TODO task | 5.66¢ | output full code |
#19 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | output full code |
#19 | Moonshot AI Kimi K2 | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | TODO task | N/A | output full code |
#30 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose |
#30 | OpenRouter meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | verbose |
#30 | OpenAI o3 | 7.5/10 | TODO task | 5.65¢ | diff format |
Evaluation Rubrics - TypeScript Narrowing
Criteria:
- Provides a working method (without the in keyword): 8/10
- Uses the in keyword: 6/10
- Did not work (wrong answer): 1/10
Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods:
  - Includes one wrong method: -0.5 rating
  - Final answer wrong: -1 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task to account for large variance in output. The higher rating will be used.
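The exact narrowing prompt is not shown on this page, so as an assumed illustration of the distinction the rubric draws, the sketch below contrasts narrowing a union with the in keyword against narrowing without it via a discriminant check. The types are hypothetical stand-ins, and whether a discriminant check matches the real v3 task is an assumption.

```typescript
// Assumed illustration only: hypothetical shapes stand in for the task's actual types.
interface Circle { kind: "circle"; radius: number }
interface Square { kind: "square"; side: number }
type Shape = Circle | Square;

// Narrowing with the in keyword: it works, but this rubric caps it at 6/10.
function areaUsingIn(s: Shape): number {
  return "radius" in s ? Math.PI * s.radius ** 2 : s.side ** 2;
}

// Narrowing without in, here via the discriminant property: the kind of working
// method the rubric rates 8/10 (matching the real task is an assumption).
function area(s: Shape): number {
  return s.kind === "circle" ? Math.PI * s.radius ** 2 : s.side ** 2;
}

console.log(areaUsingIn({ kind: "circle", radius: 2 }), area({ kind: "square", side: 3 }));
```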
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | both methods work |
#2 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | second and final answer works |
#2 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | second method works |
#4 | OpenRouter openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | 2nd method works |
#4 | OpenRouter openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | 2nd method works |
#6 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | second answer works. final answer wrong |
#7 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | use in keyword |
#7 | OpenRouter mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | almost correct |
#7 | xAI Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | use in keyword |
#10 | Google Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | use in keyword. verbose |
#10 | Google Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | use in keyword. verbose |
#10 | Google Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | use in keyword. verbose |
#10 | Stealth Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | first method didn't work. second method uses in keyword |
#14 | OpenAI o3 | 1/10 | TypeScript narrowing v4 | 4.58¢ | wrong |
#14 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | 0.05¢ | wrong |
#14 | OpenAI o4-mini | 1/10 | TypeScript narrowing v4 | 0.91¢ | wrong |
#14 | Google Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | wrong |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | 2.07¢ | wrong |
#14 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | wrong |
#14 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.05¢ | wrong. mention predicate |
#14 | OpenRouter mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | Google Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | wrong |
#14 | Moonshot AI Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | Moonshot AI Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | all 3 methods did not work |
Evaluation Rubrics - AI Timeline
Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10
Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points |
#3 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. concise |
#3 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. concise |
#3 | xAI Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. concise |
#6 | DeepSeek DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.09¢ | Covers most points. Too concise |
#7 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format |
#7 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format |
#7 | OpenAI o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points |
#7 | Moonshot AI Kimi K2 | 8.5/10 | AI timeline | N/A | covers major points |
#14 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose |
#14 | Fireworks AI DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | OpenRouter meta-llama/llama-3.3-70b-instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format |
#19 | Azure OpenAI gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline |
#19 | OpenRouter qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline |
#19 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format |
Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are used on a best-effort basis to control for style differences across models.