16x Eval Model Evaluation

Comprehensive evaluation results from the 16x Eval team for AI models across various tasks, including coding and writing.

16x Eval Top Models - Coding

| Model | Average Human Rating (7 tasks) | Total Cost |
|---|---|---|
| Claude Opus 4 | 8.96 | 64.68¢ |
| GPT-5 (High) | 8.86 | 110.46¢ |
| Claude Sonnet 4 | 8.68 | 13.53¢ |
| Grok 4 | 8.61 | 79.14¢ |
| gpt-oss-120b (Cerebras) | 8.39 | 1.12¢ |
| GPT-4.1 | 8.21 | 7.14¢ |
| Gemini 2.5 Pro | 7.71 | 83.48¢ |
| GPT-5 | 7.71 | 60.39¢ |
| Grok Code Fast 1 | 7.64 | 3.98¢ |
| Qwen3 Coder | 7.25 | 5.68¢ |
| GLM-4.5 | 7.00 | 10.46¢ |
| Kimi K2 0711 | 6.39 | 1.69¢ |
| DeepSeek-V3.1 | 5.68 | 1.64¢ |

Evaluation Rubrics

Criteria:
- Side-by-side visualization without label: 8.5/10
- Baseline visualization without label: 8/10
- Horizontal bar chart (if it cannot fit on the page): 7.5/10
- Has major formatting issues: 5/10
- Did not run / code error: 1/10

Additional components:
- Side-by-side visualization:
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: no effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating

Additional instructions for variance:
- If the code does not run or render on the first try, a second try is given to regenerate the code.
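The listed ratings follow from the rubric arithmetic. A minimal sketch of that composition, assuming (our reading of the rubric, not a formula published by 16x Eval) that the modifiers add to the base criterion score:

```python
# Assumed additive composition of a rubric rating: base criterion score
# plus the applicable +/- modifiers.
def rating(base: float, modifiers: list[float]) -> float:
    return base + sum(modifiers)

# Claude Sonnet 4's 9.25/10 below: side-by-side without label (8.5)
# + color by benchmark (+0.5) + visually pleasing (+0.25).
assert rating(8.5, [0.5, 0.25]) == 9.25
```
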
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side, no label. Color by benchmark. Visually pleasing |
| #1 | Anthropic | Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side, no label. Color by benchmark. Visually pleasing |
| #1 | xAI | Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | Side-by-side, clear labels. Color by model. Visually pleasing |
| #1 | Moonshot AI | Kimi K2 0711 | 9.25/10 | Benchmark visualization | 0.55¢ | Side-by-side, no label. Color by model. Benchmarks differentiated by alpha. Visually pleasing |
| #1 | OpenAI | GPT-5 (High) | 9.25/10 | Benchmark visualization | 19.21¢ | Clear labels. Visually pleasing. Highlights benchmarks on hover |
| #6 | OpenAI | GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing |
| #6 | Google | Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing |
| #8 | OpenRouter | gpt-oss-120b (Cerebras) | 8.5/10 | Benchmark visualization | 0.20¢ | Baseline. Clear labels |
| #8 | OpenAI | GPT-5 | 8.5/10 | Benchmark visualization | 11.92¢ | Baseline. Clear labels |
| #8 | Z.ai | GLM-4.5 | 8.5/10 | Benchmark visualization | 1.31¢ | Side-by-side visualization. No labels |
| #11 | xAI | Grok Code Fast 1 | 8.25/10 | Benchmark visualization | 0.61¢ | No labels. Visually pleasing |
| #12 | OpenAI | o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice |
| #12 | Google | Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice |
| #14 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea |
| #15 | Google | Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
| #15 | DeepSeek | DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.42¢ | No labels. Good colors |
| #15 | Google | Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors |
| #15 | OpenRouter | Mistral Medium 3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
| #15 | OpenRouter | Mistral: Devstral Small | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
| #15 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 7/10 | Benchmark visualization | 1.64¢ | Horizontal bars. Minor formatting issues |
| #21 | Google | Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels |
| #22 | Stealth | Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
| #22 | Stealth | Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
| #22 | DeepSeek | DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.55¢ | Strange visualization with major formatting issues |
| #22 | DeepSeek | DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.49¢ | Strange visualization with major formatting issues |
| #26 | OpenRouter | Qwen3 235B A22B | 5/10 | Benchmark visualization | N/A | Very small. Hard to read |
| #26 | OpenRouter | Mercury Coder Small Beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read |
| #28 | OpenRouter | Deepseek R1 0528 Qwen3 8B | 1/10 | Benchmark visualization | N/A | Doesn't run. Bugfix not obvious |

Evaluation Rubrics

Criteria:
- No text content was removed: 9/10
- Some text content was removed: 8/10

Additional components:
- Left-over elements:
  - Left-over tables: -0.5 rating
  - Left-over mdx import statements: -0.5 rating
  - Left-over mdx components: -0.5 rating
- Newline handling:
  - The output does not contain newlines: -1 rating
  - The output has 1 or more newline issues: -0.5 rating
- Short code (1500 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating

Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
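Assuming the same additive composition as above (our reading of the rubric), the ratings below decompose cleanly; for example:

```python
# GPT-5's 8/10: no text removed (9), 1 newline issue (-0.5),
# left-over component (-0.5). Assumed additive composition.
assert 9 - 0.5 - 0.5 == 8
```
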
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | xAI | Grok 4 | 9/10 | Clean mdx | 25.74¢ | 100% match |
| #2 | Google | Gemini 2.5 Pro | 8.5/10 | Clean mdx | 17.84¢ | Left-over components |
| #2 | Google | Gemini 2.5 Pro | 8.5/10 | Clean mdx | 13.42¢ | Left-over components |
| #4 | OpenAI | GPT-5 | 8/10 | Clean mdx | 14.89¢ | 1 newline issue. Left-over component |
| #4 | OpenAI | GPT-5 | 8/10 | Clean mdx | 16.46¢ | 1 newline issue. Left-over component |
| #4 | xAI | Grok 4 | 8/10 | Clean mdx | 25.82¢ | No newlines |
| #4 | OpenRouter | gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.17¢ | Left-over import and components |
| #4 | OpenRouter | gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.15¢ | Left-over import and components |
| #4 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8/10 | Clean mdx | 0.42¢ | 2 newline issues. Left-over component |
| #4 | xAI | Grok Code Fast 1 | 8/10 | Clean mdx | 0.77¢ | 3 newline issues. Left-over imports |
| #4 | OpenAI | GPT-5 (High) | 8/10 | Clean mdx | 28.39¢ | 3 newline issues. Left-over component |
| #12 | Anthropic | Claude Sonnet 4 | 7.5/10 | Clean mdx | 0.95¢ | Text removed. Left-over component |
| #12 | OpenAI | GPT-4.1 | 7.5/10 | Clean mdx | 0.77¢ | Left-over tables. Left-over import and components |
| #12 | Anthropic | Claude Opus 4 | 7.5/10 | Clean mdx | 5.34¢ | Text removed. Left-over component |
| #12 | OpenAI | GPT-4.1 | 7.5/10 | Clean mdx | 0.95¢ | 6 newline issues. Left-over import and components |
| #12 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 7.5/10 | Clean mdx | 0.40¢ | Text removed. Left-over component |
| #12 | xAI | Grok Code Fast 1 | 7.5/10 | Clean mdx | 1.06¢ | 4 newline issues. Left-over imports and component |
| #12 | OpenAI | GPT-5 (High) | 7.5/10 | Clean mdx | 25.71¢ | Text removed. Left-over component |
| #12 | Z.ai | GLM-4.5 | 7.5/10 | Clean mdx | 1.37¢ | 1 newline issue. Left-over import and component |
| #20 | Anthropic | Claude Opus 4 | 7/10 | Clean mdx | 5.48¢ | Text removed. Left-over import and components |
| #20 | Moonshot AI | Kimi K2 0711 | 7/10 | Clean mdx | 0.13¢ | Text removed. Left-over import and components |
| #20 | Moonshot AI | Kimi K2 0711 | 7/10 | Clean mdx | 0.14¢ | Text removed. Left-over import and components |
| #23 | DeepSeek | DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | Left-over import and component. Left-over table |
| #23 | DeepSeek | DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | Left-over import and component. Left-over table |
| #23 | Z.ai | GLM-4.5 | 6.5/10 | Clean mdx | 1.71¢ | Text removed. Left-over import and components. 1 newline issue |
| #26 | Anthropic | Claude Sonnet 4 | 6/10 | Clean mdx | 0.92¢ | Text removed. No newlines. Left-over import and component |

Evaluation Rubrics

Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10

Additional components:
- Short code (1000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
- Missing export statement: -0.5 rating
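As a worked example, under the additive reading of the rubric used above:

```python
# Claude Opus 4's 9.25/10: correct output (9) + short code (+0.25).
assert 9 + 0.25 == 9.25
```
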
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | Correct. Short code |
| #1 | Moonshot AI | Kimi K2 0711 | 9.25/10 | clean markdown v2 | 0.11¢ | Correct. Short code |
| #1 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 9.25/10 | clean markdown v2 | 0.33¢ | Correct. Short code |
| #1 | DeepSeek | DeepSeek-V3.1 | 9.25/10 | clean markdown v2 | 0.09¢ | Correct. Short code |
| #1 | xAI | Grok Code Fast 1 | 9.25/10 | clean markdown v2 | 0.89¢ | Correct. Short code |
| #6 | Google | Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | Correct |
| #6 | OpenAI | o3 | 9/10 | clean markdown v2 | 13.79¢ | Correct |
| #6 | OpenAI | GPT-5 | 9/10 | clean markdown v2 | 19.28¢ | Correct |
| #6 | OpenAI | GPT-5 (High) | 9/10 | clean markdown v2 | 29.55¢ | Correct |
| #10 | OpenAI | GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | 1 newline issue |
| #10 | xAI | Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | 1 newline issue |
| #10 | Stealth | Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | 1 newline issue |
| #10 | OpenRouter | gpt-oss-120b (Cerebras) | 8.5/10 | clean markdown v2 | 0.15¢ | 1 newline issue |
| #10 | Z.ai | GLM-4.5 | 8.5/10 | clean markdown v2 | 3.64¢ | Didn't add export |
| #15 | Anthropic | Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | No newlines |
| #15 | DeepSeek | DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.08¢ | No newlines |

Evaluation Rubrics

Criteria:
- Correctly solved the task: 9/10

Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
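A worked example under the same additive reading of the rubric:

```python
# Claude Opus 4's 9.5/10: solved (9) + helpful extra logic (+0.25)
# + concise response (+0.25). Assumed additive composition.
assert 9 + 0.25 + 0.25 == 9.5
```
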
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | Solved. Extra logic. Concise |
| #1 | Stealth | Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | Solved. Extra logic. Concise; respects indentation well |
| #1 | xAI | Grok Code Fast 1 | 9.5/10 | Folder watcher fix | 0.43¢ | Solved. Extra logic. Concise |
| #1 | Z.ai | GLM-4.5 | 9.5/10 | Folder watcher fix | 1.06¢ | Solved. Extra logic |
| #5 | OpenAI | o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | Solved. Extra logic |
| #5 | Anthropic | Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | Solved. Concise |
| #5 | Moonshot AI | Kimi K2 0711 | 9.25/10 | Folder watcher fix | 0.48¢ | Solved. Extra logic |
| #5 | Anthropic | Claude Opus 4 | 9.25/10 | Folder watcher fix v2 | 13.18¢ | Solved. Concise |
| #9 | Anthropic | Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | Solved. Very verbose. Extra logic |
| #9 | xAI | Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | Solved. Extra logic. Verbose |
| #9 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8.75/10 | Folder watcher fix | 1.50¢ | Unnecessary code |
| #9 | OpenAI | GPT-5 | 8.75/10 | Folder watcher fix | 4.46¢ | Solved. Extra logic. Verbose |
| #13 | OpenAI | GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | Solved. Verbose |
| #13 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | Solved. Verbose |
| #13 | Anthropic | Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | Solved. Verbose |
| #13 | OpenRouter | Mistral Medium 3 | 8.5/10 | Folder watcher fix | N/A | Solved. Verbose |
| #13 | DeepSeek | DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.43¢ | Solved. Verbose |
| #13 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | Solved. Verbose |
| #13 | OpenRouter | gpt-oss-120b (Cerebras) | 8.5/10 | Folder watcher fix | 0.28¢ | Solved. Verbose |
| #13 | DeepSeek | DeepSeek-V3.1 | 8.5/10 | Folder watcher fix | 0.45¢ | Solved. Verbose |
| #13 | OpenAI | GPT-5 (High) | 8.5/10 | Folder watcher fix | 8.91¢ | Solved. Verbose |
| #13 | OpenAI | GPT-5 (High) | 8.5/10 | Folder watcher fix v2 | 10.13¢ | Solved. Verbose |
| #23 | OpenAI | o3 | 8/10 | Folder watcher fix | 9.82¢ | Solved. Diff format |
| #23 | Google | Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | Solved in a different way. Diff format |

Image - kanji

Tags: Image Analysis, Japanese, Chinese

Evaluation Rubrics

Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10

Additional components:
- Provides multiple explanations:
  - Includes one wrong explanation: -0.5 rating
  - Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
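The Grok 4 entry below illustrates how the penalties appear to stack (our reading; the exact composition is not stated):

```python
# Grok 4's 7.5/10: correct alternative explanation (9),
# main explanation wrong (-1), verbose (-0.5). Assumed additive composition.
assert 9 - 1 - 0.5 == 7.5
```
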
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | Correct |
| #1 | OpenAI | o3 | 9/10 | Kanji image | 8.27¢ | Correct |
| #1 | OpenAI | GPT-5 | 9/10 | Kanji image | 13.53¢ | Correct explanation |
| #1 | OpenAI | GPT-5 (High) | 9/10 | Kanji image | 12.65¢ | Correct |
| #5 | xAI | Grok 4 | 7.5/10 | Kanji image | 15.85¢ | Main explanation wrong; alternative explanation correct. Verbose |
| #6 | Anthropic | Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | Tangential |
| #7 | OpenAI | GPT-4.1 | 5/10 | Kanji image | 0.40¢ | Failed |
| #7 | Anthropic | Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | Failed |
| #7 | OpenAI | GPT-4o | 5/10 | Kanji image | 0.70¢ | Failed |
| #7 | OpenRouter | Meta: Llama 4 Maverick | 5/10 | Kanji image | N/A | Ambiguous output |
| #7 | Anthropic | Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | Failed |
| #7 | Google | Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | Failed |
| #13 | OpenRouter | Qwen3 235B A22B | 1/10 | Kanji image | N/A | Didn't recognize image |
| #13 | Anthropic | Claude Opus 4.1 | 1/10 | Kanji image | 4.20¢ | Failed |

Image analysis - water bottle

Tags: Image Analysis, Physics

Evaluation Rubrics

Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10

Additional components:
- Detailed explanation: +0.25 rating
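Worked example (assuming the bonus adds to the base criterion score):

```python
# Grok 4's 9.25/10: correct explanation (9) + detailed explanation (+0.25).
assert 9 + 0.25 == 9.25
```
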
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | xAI | Grok 4 | 9.25/10 | Image analysis | 8.87¢ | Correct. Detailed explanation |
| #2 | OpenAI | GPT-4.1 | 9/10 | Image analysis | 0.24¢ | Correct |
| #2 | Google | Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | Correct |
| #2 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | Correct |
| #2 | OpenAI | o3 | 9/10 | Image analysis | 2.90¢ | Correct |
| #2 | Google | Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | Correct |
| #2 | Stealth | Horizon Alpha | 9/10 | Image analysis | N/A | Correct |
| #2 | OpenRouter | Horizon Beta | 9/10 | Image analysis | N/A | Correct |
| #2 | OpenAI | GPT-5 | 9/10 | Image analysis | 2.32¢ | Correct |
| #2 | OpenAI | GPT-5 (High) | 9/10 | Image analysis | 6.16¢ | Correct |
| #11 | Anthropic | Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | Missed the point |
| #11 | OpenRouter | Meta: Llama 4 Maverick | 6/10 | Image analysis | N/A | Missed the point |
| #11 | Anthropic | Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | Missed the point |
| #11 | Anthropic | Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | Missed the point |
| #11 | Anthropic | Claude Opus 4.1 | 6/10 | Image analysis | 3.35¢ | Missed the point |

Evaluation Rubrics

Rubrics for the Image Table Data Extraction project.

Criteria:
- All 4 models are correctly extracted: 9.5/10
- 3 models are correctly extracted: 8/10
- 2 models are correctly extracted: 6/10
- 1 model is correctly extracted: 3/10
- 0 models are correctly extracted: 1/10

Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.71¢ | All 4 models correct |
| #1 | Google | Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.70¢ | All 4 models correct |
| #1 | Anthropic | Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.73¢ | All 4 models correct |
| #1 | Google | Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.24¢ | All 4 models correct |
| #5 | Google | Gemini 2.5 Flash | 8/10 | Image table data extraction | 0.18¢ | 3 models correct |
| #5 | OpenAI | GPT-5 (High) | 8/10 | Image table data extraction | 12.39¢ | 3 models correct |
| #7 | Google | Gemini 2.5 Flash | 6/10 | Image table data extraction | 0.18¢ | 2 models correct |
| #7 | OpenAI | GPT-5 (High) | 6/10 | Image table data extraction | 14.95¢ | 2 models correct |
| #9 | Anthropic | Claude Opus 4.1 | 3/10 | Image table data extraction | 3.49¢ | 1 model correct |
| #9 | Anthropic | Claude Opus 4.1 | 3/10 | Image table data extraction | 4.54¢ | 1 model correct |
| #9 | OpenAI | GPT-5 | 3/10 | Image table data extraction | 11.11¢ | 1 model correct |
| #12 | OpenAI | GPT-5 | 1/10 | Image table data extraction | 11.50¢ | All wrong |
| #12 | xAI | Grok 4 | 1/10 | Image table data extraction | 15.63¢ | All wrong |
| #12 | xAI | Grok 4 | 1/10 | Image table data extraction | 15.26¢ | All wrong |

Evaluation Rubrics

Criteria:
- Output only changed code (follows instructions): 9/10
- Output full code (does not follow instructions): 8/10

Additional components:
- Concise response:
  - Very concise response (<=1300 characters): +0.25 rating
  - Very very concise response (<=1200 characters): +0.5 rating
- Verbose output (>=1500 characters): -0.5 rating
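In the table below, "Very concise ×2" appears to denote the +0.5 tier (our interpretation of the original "*2" shorthand):

```python
# A 9.5/10 entry: only changed code (9) + very very concise,
# <=1200 characters (+0.5). Assumed additive composition.
assert 9 + 0.5 == 9.5
```
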
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Sonnet 4 | 9.5/10 | TODO task (Claude) | 0.76¢ | Very concise ×2 |
| #1 | Google | Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise ×2 |
| #1 | Anthropic | Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise ×2 |
| #1 | xAI | Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise ×2 |
| #1 | Google | Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise ×2 |
| #1 | OpenAI | GPT-5 | 9.5/10 | TODO task v2 (concise) | 3.77¢ | Very concise ×2 |
| #1 | xAI | Grok Code Fast 1 | 9.5/10 | TODO task v2 (concise) | 0.17¢ | Very concise ×2 |
| #1 | OpenAI | GPT-5 (High) | 9.5/10 | TODO task v2 (concise) | 7.62¢ | Very concise ×2 |
| #9 | OpenAI | GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise |
| #9 | Google | Gemini 2.5 Pro Experimental | 9.25/10 | TODO task v2 (concise) | N/A | Very concise |
| #11 | DeepSeek | DeepSeek-V3 (New) | 9/10 | TODO task | 0.10¢ | Follows instructions |
| #11 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instructions |
| #11 | Google | Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instructions |
| #11 | Anthropic | Claude 3.5 Sonnet | 9/10 | TODO task | 0.86¢ | Follows instructions |
| #11 | OpenAI | GPT-5 | 9/10 | TODO task | 3.01¢ | Follows instructions |
| #16 | OpenRouter | OpenAI: Codex Mini | 8.5/10 | TODO task | N/A | Asked for more context! |
| #16 | OpenRouter | google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instructions. Verbose |
| #16 | Stealth | Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instructions. Verbose |
| #16 | Stealth | Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instructions. Verbose |
| #16 | OpenRouter | gpt-oss-120b (Cerebras) | 8.5/10 | TODO task | 0.07¢ | Follows instructions. Verbose |
| #21 | OpenRouter | Mercury Coder Small Beta | 8/10 | TODO task | N/A | Output full code |
| #21 | OpenRouter | Qwen3 235B A22B | 8/10 | TODO task | N/A | Output full code |
| #21 | OpenRouter | Mistral Medium 3 | 8/10 | TODO task | N/A | Output full code |
| #21 | Fireworks AI | DeepSeek V3 (0324) | 8/10 | TODO task | N/A | Output full code |
| #21 | OpenRouter | Mistral: Devstral Small | 8/10 | TODO task | N/A | Output full code |
| #21 | Anthropic | Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | Output full code |
| #21 | Anthropic | Claude Opus 4 | 8/10 | TODO task | 5.66¢ | Output full code |
| #21 | OpenRouter | Deepseek R1 0528 Qwen3 8B | 8/10 | TODO task v2 (concise) | N/A | Output full code |
| #21 | Moonshot AI | Kimi K2 0711 | 8/10 | TODO task | 0.15¢ | Output full code |
| #21 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8/10 | TODO task | 0.43¢ | Output full code |
| #21 | OpenRouter | gpt-oss-120b (Cerebras) | 8/10 | TODO task v2 (concise) | 0.08¢ | Output full code |
| #21 | DeepSeek | DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | Output full code |
| #21 | DeepSeek | DeepSeek-V3.1 | 8/10 | TODO task | 0.11¢ | Output full code |
| #21 | DeepSeek | DeepSeek-V3.1 | 8/10 | TODO task v2 (concise) | 0.11¢ | Output full code |
| #21 | DeepSeek | DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | Output full code |
| #21 | xAI | Grok Code Fast 1 | 8/10 | TODO task | 0.32¢ | Output full code |
| #21 | Z.ai | GLM-4.5 | 8/10 | TODO task v2 (concise) | 0.57¢ | Output full code |
| #38 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | TODO task | 1.20¢ | Verbose |
| #38 | Google | Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose |
| #38 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | TODO task (Claude) | 1.33¢ | Verbose |
| #38 | OpenRouter | Meta: Llama 4 Maverick | 7.5/10 | TODO task | N/A | Verbose |
| #38 | OpenAI | o3 | 7.5/10 | TODO task | 5.65¢ | Diff format |

Evaluation Rubrics

Criteria:
- Bug identified and fixed: 9/10
- Bug identified but not fixed: 7/10
- Bug not identified: 1/10

Additional components:
- Removes extra z-index values: +0.25 rating
- Uses correct custom values syntax (e.g., z-[60]): +0.25 rating
- Wrong explanation of the bug despite the correct fix: -0.5 rating

Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
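Worked example under the additive reading:

```python
# Qwen3 Coder's 8.75/10: fixed (9) + removes extra z-index values (+0.25)
# + wrong explanation of the bug (-0.5). Assumed additive composition.
assert 9 + 0.25 - 0.5 == 8.75
```
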
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.44¢ | Fixed. Removes extra |
| #1 | Anthropic | Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.26¢ | Fixed. Removes extra |
| #1 | OpenAI | GPT-5 | 9.25/10 | Tailwind css v3 z-index | 3.81¢ | Fixed. Correct custom value syntax |
| #1 | OpenAI | GPT-5 | 9.25/10 | Tailwind css v3 z-index | 4.28¢ | Fixed. Correct custom value syntax |
| #1 | Anthropic | Claude Opus 4 | 9.25/10 | Tailwind css v3 z-index | 11.49¢ | Fixed. Removes extra |
| #1 | xAI | Grok 4 | 9.25/10 | Tailwind css v3 z-index | 12.30¢ | Fixed. Correct custom value syntax |
| #1 | xAI | Grok 4 | 9.25/10 | Tailwind css v3 z-index | 20.80¢ | Fixed. Correct custom value syntax |
| #1 | OpenRouter | gpt-oss-120b (Cerebras) | 9.25/10 | Tailwind css v3 z-index | 0.10¢ | Fixed. Correct custom value syntax |
| #1 | Google | Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 12.96¢ | Fixed. Removes extra |
| #1 | Google | Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 8.07¢ | Fixed. Removes extra |
| #1 | OpenAI | GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 8.28¢ | Fixed. Correct custom value syntax |
| #1 | OpenAI | GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 7.49¢ | Fixed. Removes extra |
| #13 | OpenAI | GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.04¢ | Fixed |
| #13 | OpenAI | GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.05¢ | Fixed |
| #13 | Anthropic | Claude Opus 4 | 9/10 | Tailwind css v3 z-index | 11.30¢ | Fixed |
| #13 | OpenRouter | gpt-oss-120b (Cerebras) | 9/10 | Tailwind css v3 z-index | 0.20¢ | Fixed |
| #17 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8.75/10 | Tailwind css v3 z-index | 1.02¢ | Fixed. Removes extra. Wrong explanation |
| #18 | Moonshot AI | Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.15¢ | Not identified |
| #18 | Moonshot AI | Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.12¢ | Not identified |
| #18 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 1/10 | Tailwind css v3 z-index | 0.95¢ | Not identified |
| #18 | DeepSeek | DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.24¢ | Not identified |
| #18 | DeepSeek | DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.22¢ | Not identified |
| #18 | xAI | Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.71¢ | Not identified |
| #18 | xAI | Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.38¢ | Not identified |
| #18 | Z.ai | GLM-4.5 | 1/10 | Tailwind css v3 z-index | 1.71¢ | Not identified |
| #18 | Z.ai | GLM-4.5 | 1/10 | Tailwind css v3 z-index | 0.46¢ | Not identified |

Evaluation Rubrics

Criteria:
- Provides a working method (without the in keyword): 8/10
- Uses the in keyword: 6/10
- Did not work (wrong answer): 1/10

Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods, including one wrong method: -0.5 rating
- Final answer wrong: -1 rating
- Verbose output: -0.5 rating

Additional instructions for variance:
- Each model is given two tries for this task, to account for large variance in output. The higher rating will be used.
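Worked example (assuming additive modifiers, per our reading of the rubric):

```python
# Claude Opus 4's 8.5/10: working method without `in` (8)
# + multiple working methods (+0.5). Assumed additive composition.
assert 8 + 0.5 == 8.5
```
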
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | Both methods work |
| #1 | OpenAI | GPT-5 (High) | 8.5/10 | TypeScript narrowing v3 | 8.50¢ | Both methods work |
| #3 | Anthropic | Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | Second and final answer works |
| #3 | Anthropic | Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | Second method works |
| #3 | xAI | Grok Code Fast 1 | 8/10 | TypeScript narrowing v3 | 0.40¢ | Correct |
| #6 | OpenRouter | gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.15¢ | 2nd method works |
| #6 | OpenRouter | gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.11¢ | 2nd method works |
| #8 | Anthropic | Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | Second answer works. Final answer wrong |
| #9 | OpenAI | GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | Uses in keyword |
| #9 | OpenRouter | Mistral: Devstral Small | 6/10 | TypeScript narrowing v3 | N/A | Almost correct |
| #9 | xAI | Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | Uses in keyword |
| #9 | xAI | Grok Code Fast 1 | 6/10 | TypeScript narrowing v3 | 0.40¢ | Uses in keyword |
| #9 | Z.ai | GLM-4.5 | 6/10 | TypeScript narrowing v3 | 0.80¢ | Uses in keyword |
| #14 | Google | Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | Uses in keyword. Verbose |
| #14 | Google | Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | Uses in keyword. Verbose |
| #14 | Google | Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | Uses in keyword. Verbose |
| #14 | Stealth | Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | First method didn't work; second method uses in keyword |
| #14 | OpenAI | GPT-5 (High) | 5.5/10 | TypeScript narrowing v3 | 5.26¢ | Uses in keyword. One wrong method |
| #19 | OpenAI | o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | Wrong |
| #19 | DeepSeek | DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.08¢ | Wrong. Mentions predicate |
| #19 | OpenRouter | Mistral Medium 3 | 1/10 | TypeScript narrowing v3 | N/A | Wrong |
| #19 | OpenRouter | Mercury Coder Small Beta | 1/10 | TypeScript narrowing v3 | N/A | Wrong |
| #19 | Google | Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | Wrong |
| #19 | Moonshot AI | Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.12¢ | Wrong |
| #19 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.34¢ | Wrong |
| #19 | Moonshot AI | Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.10¢ | Wrong |
| #19 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.38¢ | All 3 methods did not work |
| #19 | OpenAI | GPT-5 | 1/10 | TypeScript narrowing v3 | 2.26¢ | Wrong |
| #19 | OpenAI | GPT-5 | 1/10 | TypeScript narrowing v3 | 1.89¢ | Wrong |
| #19 | DeepSeek | DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | Wrong |
| #19 | DeepSeek | DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | Wrong |
| #19 | Z.ai | GLM-4.5 | 1/10 | TypeScript narrowing v3 | 0.55¢ | Wrong |

Writing an AI Timeline

Tags: Technical Writing, AI, History

Evaluation Rubrics

Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10

Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
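Worked example (assuming additive modifiers, per our reading of the rubric):

```python
# GPT-4.1's 9.25/10: covers most points, >=15 (9) + concise (+0.25).
assert 9 + 0.25 == 9.25
```
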
| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|---|---|---|---|---|---|---|
| #1 | Anthropic | Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points |
| #1 | Anthropic | Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points |
| #1 | OpenAI | GPT-5 | 9.5/10 | AI timeline | 11.45¢ | Covers almost all points |
| #4 | OpenAI | GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. Concise |
| #4 | Google | Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. Concise |
| #4 | xAI | Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. Concise |
| #7 | DeepSeek | DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.15¢ | Covers most points. Too concise |
| #8 | Anthropic | Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format |
| #8 | OpenRouter | Mistral Medium 3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format |
| #8 | Google | Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format |
| #8 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format |
| #8 | OpenAI | o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format |
| #8 | Google | Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points |
| #8 | Moonshot AI | Kimi K2 0711 | 8.5/10 | AI timeline | 0.26¢ | Covers major points |
| #8 | DeepSeek | DeepSeek-V3.1 | 8.5/10 | AI timeline | 0.23¢ | Covers most points. Wrong format |
| #8 | DeepSeek | DeepSeek-V3.1 | 8.5/10 | AI timeline v3 | 0.25¢ | Covers most points. Wrong format |
| #17 | Google | Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose |
| #17 | Fireworks AI | DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
| #17 | OpenRouter | Qwen3 235B A22B | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
| #17 | OpenRouter | Meta: Llama 3.3 70B Instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
| #17 | Google | Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format |
| #22 | Azure OpenAI | gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline |
| #22 | OpenRouter | Qwen: Qwen3 8B (free) | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline |
| #22 | OpenRouter | Deepseek R1 0528 Qwen3 8B | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format |
| #22 | OpenRouter | GPT OSS 120B (Cerebras) | 7.5/10 | AI timeline | 0.11¢ | Missed points. Wrong format |

Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to control for style differences across models.

→ View rubrics and latest results for model evaluations

→ View raw evaluation data
