16x Eval Model Evaluation
Comprehensive evaluation results from the 16x Eval team for AI models across various tasks, including coding and writing.
16x Eval Top Models - Coding
Model | Average Human Rating (7 Tasks) | Total Cost
---|---|---
Claude Opus 4 | 8.96 | 64.68¢
GPT-5 (High) | 8.86 | 110.46¢
Claude Sonnet 4 | 8.68 | 13.53¢
Grok 4 | 8.61 | 79.14¢
gpt-oss-120b (Cerebras) | 8.39 | 1.12¢
GPT-4.1 | 8.21 | 7.14¢
Gemini 2.5 Pro | 7.71 | 83.48¢
GPT-5 | 7.71 | 60.39¢
Grok Code Fast 1 | 7.64 | 3.98¢
Qwen3 Coder | 7.25 | 5.68¢
GLM-4.5 | 7.00 | 10.46¢
Kimi K2 0711 | 6.39 | 1.69¢
DeepSeek-V3.1 | 5.68 | 1.64¢
Evaluation Rubrics: Benchmark Visualization
Criteria:
- Side-by-side visualization without labels: 8.5/10
- Baseline visualization without labels: 8/10
- Horizontal bar chart (when the bars cannot fit on the page): 7.5/10
- Major formatting issues: 5/10
- Did not run / code error: 1/10
Additional components:
- Side-by-side visualization
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: No effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating
Additional instructions for variance:
- If the code does not run or render on the first try, the model is given a second try to regenerate the code.
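For readers unfamiliar with the rubric's terms, the sketch below shows roughly what a top-scoring "side-by-side visualization with color by benchmark and clear labels" amounts to: grouped bars per model, one color per benchmark, and a value label on each bar. It is our own minimal illustration with made-up model names and scores, not any model's actual output.

```typescript
// Minimal sketch of a side-by-side (grouped) bar chart rendered as SVG.
// All data, colors, and dimensions are made up for illustration.
type Scores = Record<string, number>; // benchmark -> score (0-100)

const data: Record<string, Scores> = {
  "Model A": { MMLU: 88, GPQA: 60, "SWE-bench": 45 },
  "Model B": { MMLU: 85, GPQA: 72, "SWE-bench": 50 },
};
// one color per benchmark ("color by benchmark" in the rubric)
const colors: Record<string, string> = { MMLU: "#4e79a7", GPQA: "#f28e2b", "SWE-bench": "#59a14f" };

function barChartSvg(): string {
  const barW = 30, gap = 10, groupGap = 40, height = 220, plotH = height - 40;
  const parts: string[] = [];
  let x = groupGap;
  for (const [model, scores] of Object.entries(data)) {
    const groupStart = x;
    for (const [bench, score] of Object.entries(scores)) {
      const h = (score / 100) * plotH;
      parts.push(
        `<rect x="${x}" y="${height - 20 - h}" width="${barW}" height="${h}" fill="${colors[bench]}"/>`,
        // value label above each bar ("clear labels" in the rubric)
        `<text x="${x + barW / 2}" y="${height - 25 - h}" font-size="10" text-anchor="middle">${score}</text>`,
      );
      x += barW + gap;
    }
    // model name centered under its group of bars
    const mid = groupStart + (x - gap - groupStart) / 2;
    parts.push(`<text x="${mid}" y="${height - 5}" font-size="12" text-anchor="middle">${model}</text>`);
    x += groupGap;
  }
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${x}" height="${height}">${parts.join("")}</svg>`;
}

console.log(barChartSvg());
```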
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | xAI Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | side-by-side clear labels. color by model. Visually pleasing |
#1 | Moonshot AI Kimi K2 0711 | 9.25/10 | Benchmark visualization | 0.55¢ | Side-by-side no label. Color by model. Benchmark diff by alpha. Visually pleasing |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Benchmark visualization | 19.21¢ | Clear labels. Visually pleasing. Highlight benchmarks on hover |
#6 | OpenAI GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing |
#6 | Google Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing |
#8 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | Benchmark visualization | 0.20¢ | baseline. clear labels |
#8 | OpenAI GPT-5 | 8.5/10 | Benchmark visualization | 11.92¢ | baseline. clear labels |
#8 | Z.ai GLM-4.5 | 8.5/10 | Benchmark visualization | 1.31¢ | Side-by-side visualization. No labels |
#11 | xAI Grok Code Fast 1 | 8.25/10 | Benchmark visualization | 0.61¢ | No labels. Visually pleasing |
#12 | OpenAI o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice |
#12 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice |
#14 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea |
#15 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#15 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.42¢ | No labels. Good colors |
#15 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors |
#15 | OpenRouter: Mistral Medium 3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#15 | OpenRouter: Mistral: Devstral Small | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#15 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7/10 | Benchmark visualization | 1.64¢ | horizontal bars. minor formatting issues |
#21 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels |
#22 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#22 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#22 | DeepSeek DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.55¢ | Strange visualization with major formatting issues |
#22 | DeepSeek DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.49¢ | Strange visualization with major formatting issues |
#26 | OpenRouter: Qwen3 235B A22B | 5/10 | Benchmark visualization | N/A | Very small. Hard to read
#26 | OpenRouter: Mercury Coder Small Beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read
#28 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 1/10 | Benchmark visualization | N/A | doesn't run. bugfix not obvious
Evaluation Rubrics: Clean MDX
Criteria:
- No text content was removed: 9/10
- Some text content was removed: 8/10
Additional components:
- Left-over elements:
  - Left-over tables: -0.5 rating
  - Left-over mdx import statements: -0.5 rating
  - Left-over mdx components: -0.5 rating
- Newline handling:
  - The output does not contain newlines: -1 rating
  - The output has 1 or more newline issues: -0.5 rating
- Short code (1500 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
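To make the "left-over" criteria concrete: an MDX post contains import statements and JSX components that a cleaned, plain-markdown version should not. The naive regex-based cleaner below is our illustration of the task only (robust MDX handling needs a real parser), not the eval's reference solution.

```typescript
// Naive sketch: strip MDX-specific syntax while keeping the text.
// Regexes are for illustration; real MDX should be parsed properly.
function cleanMdx(source: string): string {
  return source
    // drop import lines such as: import Chart from "../components/Chart"
    .replace(/^import\s+.*$/gm, "")
    // drop self-closing JSX components such as: <Chart data={props} />
    .replace(/^<[A-Z][\w.]*\b[^>]*\/>$/gm, "")
    // collapse blank runs left behind (avoids "newline issues")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```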
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9/10 | Clean mdx | 25.74¢ | 100% match |
#2 | Google Gemini 2.5 Pro | 8.5/10 | Clean mdx | 17.84¢ | left-over components |
#2 | Google Gemini 2.5 Pro | 8.5/10 | Clean mdx | 13.42¢ | left-over components |
#4 | OpenAI GPT-5 | 8/10 | Clean mdx | 14.89¢ | 1 newline issue. left-over component |
#4 | OpenAI GPT-5 | 8/10 | Clean mdx | 16.46¢ | 1 newline issue. left-over component |
#4 | xAI Grok 4 | 8/10 | Clean mdx | 25.82¢ | no newline |
#4 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.17¢ | left-over import and components |
#4 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.15¢ | left-over import and components |
#4 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | Clean mdx | 0.42¢ | 2 newline issues. left-over component |
#4 | xAI Grok Code Fast 1 | 8/10 | Clean mdx | 0.77¢ | 3 newline issues. left-over imports |
#4 | OpenAI GPT-5 (High) | 8/10 | Clean mdx | 28.39¢ | 3 newline issues. left-over component
#12 | Anthropic Claude Sonnet 4 | 7.5/10 | Clean mdx | 0.95¢ | text removed. left-over component |
#12 | OpenAI GPT-4.1 | 7.5/10 | Clean mdx | 0.77¢ | left-over tables. left-over import and components |
#12 | Anthropic Claude Opus 4 | 7.5/10 | Clean mdx | 5.34¢ | text removed. left-over component |
#12 | OpenAI GPT-4.1 | 7.5/10 | Clean mdx | 0.95¢ | 6 newline issues. left-over import and components
#12 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7.5/10 | Clean mdx | 0.40¢ | text removed. left-over component |
#12 | xAI Grok Code Fast 1 | 7.5/10 | Clean mdx | 1.06¢ | 4 newline issues. left-over imports and component |
#12 | OpenAI GPT-5 (High) | 7.5/10 | Clean mdx | 25.71¢ | text removed. left-over component |
#12 | Z.ai GLM-4.5 | 7.5/10 | Clean mdx | 1.37¢ | 1 newline issue. left-over import, component |
#20 | Anthropic Claude Opus 4 | 7/10 | Clean mdx | 5.48¢ | text removed. left-over import and components |
#20 | Moonshot AI Kimi K2 0711 | 7/10 | Clean mdx | 0.13¢ | text removed. left-over import and components |
#20 | Moonshot AI Kimi K2 0711 | 7/10 | Clean mdx | 0.14¢ | text removed. left-over import and components |
#23 | DeepSeek DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | left-over import and component. left-over table |
#23 | DeepSeek DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | left-over import and component. left-over table |
#23 | Z.ai GLM-4.5 | 6.5/10 | Clean mdx | 1.71¢ | text removed. left-over import and components. 1 newline issue |
#26 | Anthropic Claude Sonnet 4 | 6/10 | Clean mdx | 0.92¢ | text removed. no newline. left-over import and component |
Evaluation Rubrics: Clean Markdown v2
Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10
Additional components:
- Short code (1000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
- Missing export statement: -0.5 rating
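The newline criteria are easiest to see in code. A hedged sketch of the kind of normalization a correct answer performs follows; the function name and exact rules are our assumptions, and the export statement nods to the "missing export statement" penalty.

```typescript
// Illustrative only: normalize newlines so the output neither loses
// them entirely nor carries stray ones.
export function normalizeNewlines(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")    // CRLF / CR -> LF
    .replace(/\n{3,}/g, "\n\n") // at most one blank line in a row
    .replace(/\n*$/, "\n");     // exactly one trailing newline
}
```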
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | correct. short code |
#1 | Moonshot AI Kimi K2 0711 | 9.25/10 | clean markdown v2 | 0.11¢ | correct. short code |
#1 | OpenRouter (Alibaba Plus) Qwen3 Coder | 9.25/10 | clean markdown v2 | 0.33¢ | correct. short code |
#1 | DeepSeek DeepSeek-V3.1 | 9.25/10 | clean markdown v2 | 0.09¢ | correct. short code |
#1 | xAI Grok Code Fast 1 | 9.25/10 | clean markdown v2 | 0.89¢ | correct. short code |
#6 | Google Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | correct |
#6 | OpenAI o3 | 9/10 | clean markdown v2 | 13.79¢ | correct |
#6 | OpenAI GPT-5 | 9/10 | clean markdown v2 | 19.28¢ | correct |
#6 | OpenAI GPT-5 (High) | 9/10 | clean markdown v2 | 29.55¢ | correct |
#10 | OpenAI GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | 1 newline issue
#10 | xAI Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | 1 newline issue
#10 | Stealth Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | 1 newline issue
#10 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | clean markdown v2 | 0.15¢ | 1 newline issue
#10 | Z.ai GLM-4.5 | 8.5/10 | clean markdown v2 | 3.64¢ | didn't add export |
#15 | Anthropic Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | no newlines
#15 | DeepSeek DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.08¢ | no newlines
Evaluation Rubrics: Folder Watcher Fix
Criteria:
- Correctly solved the task: 9/10
Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
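The eval's folder-watcher code and its bug are not reproduced here, but the sketch below shows a typical Node.js watcher of the kind this task exercises; the debounce is one example of the "helpful extra logic" the rubric rewards. Note that fs.watch's recursive option is platform-dependent (long available on macOS and Windows, on Linux from Node 20).

```typescript
import { watch } from "node:fs";

// Generic illustration, not the eval's actual code or fix.
const timers = new Map<string, NodeJS.Timeout>();

watch("./watched", { recursive: true }, (event, filename) => {
  if (!filename) return;
  // debounce: editors often emit several events per save
  clearTimeout(timers.get(filename));
  timers.set(
    filename,
    setTimeout(() => {
      console.log(`${event}: ${filename}`);
      timers.delete(filename);
    }, 100),
  );
});
```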
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | solved. extra logic. concise |
#1 | Stealth Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | solved. extra logic. concise, respects indentation well |
#1 | xAI Grok Code Fast 1 | 9.5/10 | Folder watcher fix | 0.43¢ | solved. extra logic. concise |
#1 | Z.ai GLM-4.5 | 9.5/10 | Folder watcher fix | 1.06¢ | solved. extra logic |
#5 | OpenAI o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | solved. extra logic |
#5 | Anthropic Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | solved. concise |
#5 | Moonshot AI Kimi K2 0711 | 9.25/10 | Folder watcher fix | 0.48¢ | solved. extra logic |
#5 | Anthropic Claude Opus 4 | 9.25/10 | Folder watcher fix v2 | 13.18¢ | solved. concise |
#9 | Anthropic Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | solved. very verbose. extra logic |
#9 | xAI Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | solved. extra logic. verbose |
#9 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Folder watcher fix | 1.50¢ | unnecessary code |
#9 | OpenAI GPT-5 | 8.75/10 | Folder watcher fix | 4.46¢ | solved. extra logic. verbose |
#13 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | solved. verbose |
#13 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | solved. verbose |
#13 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | solved. verbose |
#13 | OpenRouter: Mistral Medium 3 | 8.5/10 | Folder watcher fix | N/A | solved. verbose
#13 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.43¢ | solved. verbose |
#13 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | solved. verbose |
#13 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | Folder watcher fix | 0.28¢ | solved. verbose |
#13 | DeepSeek DeepSeek-V3.1 | 8.5/10 | Folder watcher fix | 0.45¢ | solved. verbose |
#13 | OpenAI GPT-5 (High) | 8.5/10 | Folder watcher fix | 8.91¢ | solved. verbose |
#13 | OpenAI GPT-5 (High) | 8.5/10 | Folder watcher fix v2 | 10.13¢ | solved. verbose |
#23 | OpenAI o3 | 8/10 | Folder watcher fix | 9.82¢ | solved. diff format |
#23 | Google Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | solved in a different way. diff format |
Evaluation Rubrics: Kanji Image
Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10
Additional components:
- Provides multiple explanations
  - Includes one wrong explanation: -0.5 rating
  - Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | correct |
#1 | OpenAI o3 | 9/10 | Kanji image | 8.27¢ | correct |
#1 | OpenAI GPT-5 | 9/10 | Kanji image | 13.53¢ | correct explanation |
#1 | OpenAI GPT-5 (High) | 9/10 | Kanji image | 12.65¢ | correct |
#5 | xAI Grok 4 | 7.5/10 | Kanji image | 15.85¢ | main explanation wrong. alt explanation correct. verbose
#6 | Anthropic Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | tangential |
#7 | OpenAI GPT-4.1 | 5/10 | Kanji image | 0.40¢ | failed |
#7 | Anthropic Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | failed |
#7 | OpenAI GPT-4o | 5/10 | Kanji image | 0.70¢ | failed |
#7 | OpenRouter: Meta: Llama 4 Maverick | 5/10 | Kanji image | N/A | ambiguous output
#7 | Anthropic Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | failed |
#7 | Google Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | failed |
#13 | OpenRouter: Qwen3 235B A22B | 1/10 | Kanji image | N/A | Didn't recognize image
#13 | Anthropic Claude Opus 4.1 | 1/10 | Kanji image | 4.20¢ | failed |
Evaluation Rubrics: Image Analysis
Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10
Additional components:
- Detailed explanation: +0.25 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9.25/10 | Image analysis | 8.87¢ | correct. detailed explanation |
#2 | OpenAI GPT-4.1 | 9/10 | Image analysis | 0.24¢ | correct |
#2 | Google Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | correct |
#2 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | correct |
#2 | OpenAI o3 | 9/10 | Image analysis | 2.90¢ | correct |
#2 | Google Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | correct |
#2 | Stealth Horizon Alpha | 9/10 | Image analysis | N/A | correct |
#2 | OpenRouter: Horizon Beta | 9/10 | Image analysis | N/A | correct
#2 | OpenAI GPT-5 | 9/10 | Image analysis | 2.32¢ | correct |
#2 | OpenAI GPT-5 (High) | 9/10 | Image analysis | 6.16¢ | correct |
#11 | Anthropic Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | missed point |
#11 | OpenRouter: Meta: Llama 4 Maverick | 6/10 | Image analysis | N/A | missed point
#11 | Anthropic Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | missed point
#11 | Anthropic Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | missed point
#11 | Anthropic Claude Opus 4.1 | 6/10 | Image analysis | 3.35¢ | missed point |
Evaluation Rubrics: Image Table Data Extraction
Criteria:
- All 4 models are correctly extracted: 9.5/10
- 3 models are correctly extracted: 8/10
- 2 models are correctly extracted: 6/10
- 1 model is correctly extracted: 3/10
- 0 models are correctly extracted: 1/10
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.71¢ | All 4 models correct |
#1 | Google Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.70¢ | All 4 models correct |
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.73¢ | All 4 models correct |
#1 | Google Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.24¢ | All 4 models correct |
#5 | Google Gemini 2.5 Flash | 8/10 | Image table data extraction | 0.18¢ | 3 models correct |
#5 | OpenAI GPT-5 (High) | 8/10 | Image table data extraction | 12.39¢ | 3 models correct |
#7 | Google Gemini 2.5 Flash | 6/10 | Image table data extraction | 0.18¢ | 2 models correct |
#7 | OpenAI GPT-5 (High) | 6/10 | Image table data extraction | 14.95¢ | 2 models correct |
#9 | Anthropic Claude Opus 4.1 | 3/10 | Image table data extraction | 3.49¢ | 1 model correct |
#9 | Anthropic Claude Opus 4.1 | 3/10 | Image table data extraction | 4.54¢ | 1 model correct |
#9 | OpenAI GPT-5 | 3/10 | Image table data extraction | 11.11¢ | 1 model correct |
#12 | OpenAI GPT-5 | 1/10 | Image table data extraction | 11.50¢ | all wrong |
#12 | xAI Grok 4 | 1/10 | Image table data extraction | 15.63¢ | all wrong |
#12 | xAI Grok 4 | 1/10 | Image table data extraction | 15.26¢ | all wrong |
Evaluation Rubrics: TODO Task
Criteria:
- Outputs only the changed code (follows instructions): 9/10
- Outputs the full code (does not follow instructions): 8/10
Additional components:
- Concise response
  - Very concise response (<=1300 characters): +0.25 rating
  - Very very concise response (<=1200 characters): +0.5 rating
- Verbose output (>=1500 characters): -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | TODO task (Claude) | 0.76¢ | Very concise *2 |
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise *2 |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise *2 |
#1 | xAI Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise *2 |
#1 | Google Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise *2 |
#1 | OpenAI GPT-5 | 9.5/10 | TODO task v2 (concise) | 3.77¢ | very concise *2 |
#1 | xAI Grok Code Fast 1 | 9.5/10 | TODO task v2 (concise) | 0.17¢ | Very concise *2 |
#1 | OpenAI GPT-5 (High) | 9.5/10 | TODO task v2 (concise) | 7.62¢ | Very concise *2 |
#9 | OpenAI GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise |
#9 | Google Gemini 2.5 Pro Experimental | 9.25/10 | TODO task v2 (concise) | N/A | Very concise |
#11 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | 0.10¢ | Follows instruction |
#11 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instruction |
#11 | Google Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instruction |
#11 | Anthropic Claude 3.5 Sonnet | 9/10 | TODO task | 0.86¢ | Follows instruction |
#11 | OpenAI GPT-5 | 9/10 | TODO task | 3.01¢ | follows instructions |
#16 | OpenRouter: OpenAI: Codex Mini | 8.5/10 | TODO task | N/A | Asked for more context!
#16 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#16 | Stealth Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#16 | Stealth Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#16 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | TODO task | 0.07¢ | Follows instruction. Verbose |
#21 | OpenRouter: Mercury Coder Small Beta | 8/10 | TODO task | N/A | output full code
#21 | OpenRouter: Qwen3 235B A22B | 8/10 | TODO task | N/A | output full code
#21 | OpenRouter: Mistral Medium 3 | 8/10 | TODO task | N/A | output full code
#21 | Fireworks AI DeepSeek V3 (0324) | 8/10 | TODO task | N/A | output full code |
#21 | OpenRouter: Mistral: Devstral Small | 8/10 | TODO task | N/A | output full code
#21 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | output full code |
#21 | Anthropic Claude Opus 4 | 8/10 | TODO task | 5.66¢ | output full code |
#21 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 8/10 | TODO task v2 (concise) | N/A | output full code
#21 | Moonshot AI Kimi K2 0711 | 8/10 | TODO task | 0.15¢ | output full code |
#21 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | TODO task | 0.43¢ | output full code |
#21 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | TODO task v2 (concise) | 0.08¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task v2 (concise) | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | output full code |
#21 | xAI Grok Code Fast 1 | 8/10 | TODO task | 0.32¢ | output full code |
#21 | Z.ai GLM-4.5 | 8/10 | TODO task v2 (concise) | 0.57¢ | output full code |
#38 | Anthropic Claude 3.7 Sonnet | 7.5/10 | TODO task | 1.20¢ | verbose |
#38 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose |
#38 | Anthropic Claude 3.7 Sonnet | 7.5/10 | TODO task (Claude) | 1.33¢ | verbose |
#38 | OpenRouter: Meta: Llama 4 Maverick | 7.5/10 | TODO task | N/A | verbose
#38 | OpenAI o3 | 7.5/10 | TODO task | 5.65¢ | diff format |
Evaluation Rubrics: Tailwind CSS v3 z-index
Criteria:
- Bug identified and fixed: 9/10
- Bug identified but not fixed: 7/10
- Bug not identified: 1/10
Additional components:
- Removes extra z-index values: +0.25 rating
- Uses correct custom values syntax (e.g., z-[60]): +0.25 rating
- Wrong explanation of the bug despite the correct fix: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
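For context on the custom-values criterion: Tailwind CSS v3's default z-index scale stops at z-50, so stacking an element above a z-50 layer requires an arbitrary value such as z-[60]. A minimal TSX illustration (ours, not the eval's actual component):

```tsx
// Hypothetical overlay: the dialog must sit above a z-50 header,
// which the default scale (z-0 ... z-50) cannot express.
export const Overlay = () => (
  <div className="fixed inset-0 z-40 bg-black/50">
    <div className="relative z-[60] rounded bg-white p-4 shadow" role="dialog">
      Renders above the header because 60 &gt; 50.
    </div>
  </div>
);
```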
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.44¢ | fixed. removes extra |
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.26¢ | fixed. removes extra |
#1 | OpenAI GPT-5 | 9.25/10 | Tailwind css v3 z-index | 3.81¢ | fixed. correct custom value syntax |
#1 | OpenAI GPT-5 | 9.25/10 | Tailwind css v3 z-index | 4.28¢ | fixed. correct custom value syntax |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Tailwind css v3 z-index | 11.49¢ | fixed. removes extra |
#1 | xAI Grok 4 | 9.25/10 | Tailwind css v3 z-index | 12.30¢ | fixed. correct custom value syntax |
#1 | xAI Grok 4 | 9.25/10 | Tailwind css v3 z-index | 20.80¢ | fixed. correct custom value syntax |
#1 | OpenRouter gpt-oss-120b (Cerebras) | 9.25/10 | Tailwind css v3 z-index | 0.10¢ | fixed. correct custom value syntax |
#1 | Google Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 12.96¢ | fixed. removes extra |
#1 | Google Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 8.07¢ | fixed. removes extra |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 8.28¢ | fixed. correct custom value syntax |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 7.49¢ | fixed. removes extra |
#13 | OpenAI GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.04¢ | fixed |
#13 | OpenAI GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.05¢ | fixed |
#13 | Anthropic Claude Opus 4 | 9/10 | Tailwind css v3 z-index | 11.30¢ | fixed |
#13 | OpenRouter gpt-oss-120b (Cerebras) | 9/10 | Tailwind css v3 z-index | 0.20¢ | fixed |
#17 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Tailwind css v3 z-index | 1.02¢ | fixed. removes extra. wrong explanation |
#18 | Moonshot AI Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.15¢ | not identified |
#18 | Moonshot AI Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.12¢ | not identified |
#18 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | Tailwind css v3 z-index | 0.95¢ | not identified |
#18 | DeepSeek DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.24¢ | not identified |
#18 | DeepSeek DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.22¢ | not identified |
#18 | xAI Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.71¢ | not identified |
#18 | xAI Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.38¢ | not identified |
#18 | Z.ai GLM-4.5 | 1/10 | Tailwind css v3 z-index | 1.71¢ | not identified |
#18 | Z.ai GLM-4.5 | 1/10 | Tailwind css v3 z-index | 0.46¢ | not identified |
Evaluation Rubrics: TypeScript Narrowing v3
Criteria:
- Provides a working method (without the 'in' keyword): 8/10
- Uses the 'in' keyword: 6/10
- Did not work (wrong answer): 1/10
Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods
  - Includes one wrong method: -0.5 rating
  - Final answer wrong: -1 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task to account for large variance in output. The higher rating will be used.
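To make the two graded approaches concrete, here is what they look like on a made-up discriminated union (the eval's actual code is not shown here):

```typescript
// Hypothetical types for illustration only.
type Square = { kind: "square"; size: number };
type Circle = { kind: "circle"; radius: number };
type Shape = Square | Circle;

// Scores 6/10 under this rubric: narrows with the 'in' keyword.
function areaWithIn(s: Shape): number {
  return "size" in s ? s.size ** 2 : Math.PI * s.radius ** 2;
}

// Scores 8/10: a working method without 'in', here a user-defined
// type predicate (checking the discriminant also works).
function isSquare(s: Shape): s is Square {
  return s.kind === "square";
}
function area(s: Shape): number {
  return isSquare(s) ? s.size ** 2 : Math.PI * s.radius ** 2;
}
```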
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | both methods work |
#1 | OpenAI GPT-5 (High) | 8.5/10 | TypeScript narrowing v3 | 8.50¢ | both methods work |
#3 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | second and final answer works |
#3 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | second method works |
#3 | xAI Grok Code Fast 1 | 8/10 | TypeScript narrowing v3 | 0.40¢ | correct |
#6 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.15¢ | 2nd method works |
#6 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.11¢ | 2nd method works |
#8 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | second answer works. final answer wrong |
#9 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | use in keyword |
#9 | OpenRouter: Mistral: Devstral Small | 6/10 | TypeScript narrowing v3 | N/A | almost correct
#9 | xAI Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | use in keyword |
#9 | xAI Grok Code Fast 1 | 6/10 | TypeScript narrowing v3 | 0.40¢ | use in keyword |
#9 | Z.ai GLM-4.5 | 6/10 | TypeScript narrowing v3 | 0.80¢ | use in keyword |
#14 | Google Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | use in keyword. verbose |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | use in keyword. verbose |
#14 | Google Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | use in keyword. verbose |
#14 | Stealth Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | first method didn't work. second method uses in keyword |
#14 | OpenAI GPT-5 (High) | 5.5/10 | TypeScript narrowing v3 | 5.26¢ | use in keyword. one wrong method |
#19 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | wrong |
#19 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong. mention predicate |
#19 | OpenRouter: Mistral Medium 3 | 1/10 | TypeScript narrowing v3 | N/A | wrong
#19 | OpenRouter: Mercury Coder Small Beta | 1/10 | TypeScript narrowing v3 | N/A | wrong
#19 | Google Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | wrong |
#19 | Moonshot AI Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.12¢ | wrong |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.34¢ | wrong |
#19 | Moonshot AI Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.10¢ | wrong |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.38¢ | none of the 3 methods worked
#19 | OpenAI GPT-5 | 1/10 | TypeScript narrowing v3 | 2.26¢ | wrong |
#19 | OpenAI GPT-5 | 1/10 | TypeScript narrowing v3 | 1.89¢ | wrong |
#19 | DeepSeek DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong |
#19 | DeepSeek DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong |
#19 | Z.ai GLM-4.5 | 1/10 | TypeScript narrowing v3 | 0.55¢ | wrong |
Evaluation Rubrics: AI Timeline
Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10
Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points |
#1 | OpenAI GPT-5 | 9.5/10 | AI timeline | 11.45¢ | Covers almost all points |
#4 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. concise |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. concise |
#4 | xAI Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. concise |
#7 | DeepSeek DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.15¢ | Covers most points. Too concise |
#8 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format |
#8 | OpenRouter: Mistral Medium 3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format
#8 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format |
#8 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format |
#8 | OpenAI o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format |
#8 | Google Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points |
#8 | Moonshot AI Kimi K2 0711 | 8.5/10 | AI timeline | 0.26¢ | covers major points |
#8 | DeepSeek DeepSeek-V3.1 | 8.5/10 | AI timeline | 0.23¢ | Covers most points. Wrong format |
#8 | DeepSeek DeepSeek-V3.1 | 8.5/10 | AI timeline v3 | 0.25¢ | Covers most points. Wrong format |
#17 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose |
#17 | Fireworks AI DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#17 | OpenRouter: Qwen3 235B A22B | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#17 | OpenRouter: Meta: Llama 3.3 70B Instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#17 | Google Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format |
#22 | Azure OpenAI gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline |
#22 | OpenRouter: Qwen: Qwen3 8B (free) | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline
#22 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format
#22 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | AI timeline | 0.11¢ | missed points. wrong format
Evaluation Methodology
All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are used on a best-effort basis to control for stylistic differences across models.