16x Eval Model Evaluation Results

Comprehensive evaluation results from the 16x Eval team for AI models across a range of tasks, including coding and writing.

🥇 Grok 4 (xAI)
Avg: 8.31/10
Individual Experiment Ratings:
Benchmark Visualization (Difficult): 8/10
Clean markdown (Medium): 8.5/10
Folder watcher fix (Normal): 8.5/10
Image - kanji: 7.5/10
Image analysis - water bottle: 9.25/10
Next.js TODO add feature (Simple): 9.5/10
TypeScript narrowing (Uncommon): 6/10
Writing an AI Timeline: 9.25/10

🥈 Claude Opus 4 (Anthropic)
Avg: 8.28/10
Individual Experiment Ratings:
Benchmark Visualization (Difficult): 8.5/10
Clean markdown (Medium): 9.25/10
Folder watcher fix (Normal): 9/10
Image - kanji: 6/10
Image analysis - water bottle: 6/10
Next.js TODO add feature (Simple): 9.5/10
TypeScript narrowing (Uncommon): 8.5/10
Writing an AI Timeline: 9.5/10

🥉 GPT-4.1 (OpenAI)
Avg: 8/10
Individual Experiment Ratings:
Benchmark Visualization (Difficult): 8.5/10
Clean markdown (Medium): 8.5/10
Folder watcher fix (Normal): 8.5/10
Image - kanji: 5/10
Image analysis - water bottle: 9/10
Next.js TODO add feature (Simple): 9.25/10
TypeScript narrowing (Uncommon): 6/10
Writing an AI Timeline: 9.25/10
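
The Avg figure on each card appears to be the plain unweighted mean of that model's eight experiment ratings (Grok 4: 66.5 / 8 = 8.31; Claude Opus 4: 66.25 / 8 = 8.28; GPT-4.1: 64 / 8 = 8). A minimal TypeScript sketch of that calculation, assuming equal weighting of all experiments; the averageRating helper is our own name, not part of 16x Eval:

// Unweighted mean of per-experiment ratings, rounded to two decimals.
// Assumes every experiment counts equally; no weighting is stated on the page.
function averageRating(ratings: number[]): number {
  const sum = ratings.reduce((total, r) => total + r, 0);
  return Math.round((sum / ratings.length) * 100) / 100;
}

// Grok 4's eight experiment ratings from the card above.
console.log(averageRating([8, 8.5, 8.5, 7.5, 9.25, 9.5, 6, 9.25])); // 8.31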

Benchmark Visualization (Difficult) Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | OpenAI | GPT-4.1 | 8.5/10 | Benchmark visualization | $0.0188 | Clear labels
#1 | Anthropic | Claude Sonnet 4 | 8.5/10 | Benchmark visualization | $0.0502 | Side-by-side; no label; no color-coding
#1 | Anthropic | Claude Opus 4 | 8.5/10 | Benchmark visualization | $0.2104 | Side-by-side; no label; no color-coding
#4 | OpenAI | o3 | 8/10 | Benchmark visualization | $0.1274 | Clear labels; poor color choice
#4 | Google | Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | $0.1397 | Clear labels; poor color choice
#4 | xAI | Grok 4 | 8/10 | Benchmark visualization | $0.1126 | Side-by-side; clear labels; bad color usage
#7 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | $0.0510 | Number labels; good idea
#8 | Google | Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels; good colors
#8 | DeepSeek | DeepSeek-V3 (New) | 7/10 | Benchmark visualization | $0.0026 | No labels; good colors
#8 | Google | Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | $0.0461 | No labels; good colors
#8 | OpenRouter | mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels; good colors
#8 | OpenRouter | mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels; good colors
#13 | Google | Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | $0.0535 | Minor bug; no labels
#14 | OpenRouter | qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small; hard to read
#14 | OpenRouter | inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color; hard to read
#16 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Doesn't run; bugfix not obvious
#16 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Output JSON instead of HTML
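
Ranks within each table appear to follow standard competition ranking: models with the same rating share a rank, and the next distinct rating skips ahead by the number of tied entries (hence #1, #1, #1, #4 above). A small TypeScript sketch of that convention; competitionRanks is a hypothetical helper, not 16x Eval code:

// Standard competition ranking over ratings already sorted best-first:
// tied ratings share the rank of their first occurrence.
function competitionRanks(sortedRatings: number[]): number[] {
  return sortedRatings.map((r) => sortedRatings.indexOf(r) + 1);
}

console.log(competitionRanks([8.5, 8.5, 8.5, 8, 8, 8, 7.5, 7]));
// [1, 1, 1, 4, 4, 4, 7, 8]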

Clean markdown (Medium) Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 9.25/10 | clean markdown v2 | $0.0402 | Correct output; short code
#2 | Google | Gemini 2.5 Pro | 9/10 | clean markdown v2 | $0.1360 | Correct output
#2 | OpenAI | o3 | 9/10 | clean markdown v2 | $0.1379 | Correct output
#4 | OpenAI | GPT-4.1 | 8.5/10 | clean markdown v2 | $0.0112 | 2 newline issues
#4 | xAI | Grok 4 | 8.5/10 | clean markdown v2 | $0.1306 | 2 newline issues
#6 | Anthropic | Claude Sonnet 4 | 8/10 | clean markdown v2 | $0.0077 | No newlines
#6 | DeepSeek | DeepSeek-V3 (New) | 8/10 | clean markdown v2 | $0.0005 | No newlines

Folder watcher fix (Normal) Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 9/10 | Folder watcher fix | $0.1370 | Solved; extra logic; concise
#1 | Anthropic | Claude Sonnet 4 | 9/10 | Folder watcher fix | $0.0261 | Solved; concise
#3 | OpenAI | o4-mini | 8.75/10 | Folder watcher fix | $0.0128 | Solved; extra logic
#4 | OpenAI | GPT-4.1 | 8.5/10 | Folder watcher fix | $0.0158 | Solved; verbose
#4 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | $0.0259 | Solved; verbose
#4 | Anthropic | Claude Opus 4 | 8.5/10 | Folder watcher fix | $0.1676 | Solved; verbose
#4 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | Solved; verbose
#4 | DeepSeek | DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | $0.0022 | Solved; verbose
#4 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | $0.1620 | Solved; verbose
#4 | xAI | Grok 4 | 8.5/10 | Folder watcher fix | $0.0436 | Solved; extra logic; verbose
#11 | OpenAI | o3 | 8/10 | Folder watcher fix | $0.0982 | Solved; diff format
#11 | Anthropic | Claude 3.7 Sonnet | 8/10 | Folder watcher fix | $0.0441 | Solved; very verbose

Image - kanji Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | $0.0227 | Correct
#1 | OpenAI | o3 | 9/10 | Kanji image | $0.0827 | Correct
#3 | xAI | Grok 4 | 7.5/10 | Kanji image | $0.1585 | Main exp wrong; alt exp correct; verbose
#4 | Anthropic | Claude Opus 4 | 6/10 | Kanji image | $0.0396 | A bit ambiguous
#5 | OpenAI | GPT-4.1 | 5/10 | Kanji image | $0.0040 | Failed
#5 | Anthropic | Claude 3.7 Sonnet | 5/10 | Kanji image | $0.0080 | Failed
#5 | OpenAI | GPT-4o | 5/10 | Kanji image | $0.0070 | Failed
#5 | OpenRouter | meta-llama/llama-4-maverick | 5/10 | Kanji image | N/A | Ambiguous output
#5 | Anthropic | Claude Sonnet 4 | 5/10 | Kanji image | $0.0091 | Failed
#10 | OpenRouter | qwen/qwen3-235b-a22b | 1/10 | Kanji image | N/A | Didn't recognize image

Image analysis - water bottle Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | xAI | Grok 4 | 9.25/10 | Image analysis | $0.0887 | Correct; detailed explanation
#2 | OpenAI | GPT-4.1 | 9/10 | Image analysis | $0.0024 | Correct
#2 | Google | Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | Correct
#2 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | $0.0183 | Correct
#2 | OpenAI | o3 | 9/10 | Image analysis | $0.0290 | Correct
#6 | Anthropic | Claude 3.7 Sonnet | 6/10 | Image analysis | $0.0068 | Missed point
#6 | OpenRouter | meta-llama/llama-4-maverick | 6/10 | Image analysis | N/A | Missed point
#6 | Anthropic | Claude Sonnet 4 | 6/10 | Image analysis | $0.0065 | Missed points
#6 | Anthropic | Claude Opus 4 | 6/10 | Image analysis | $0.0335 | Missed points

Next.js TODO add feature (Simple) Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Google | Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | $0.0180 | Very concise *2; follows instruction well
#1 | Anthropic | Claude Opus 4 | 9.5/10 | TODO task (Claude) | $0.0388 | Very concise *2; follows instruction well
#1 | xAI | Grok 4 | 9.5/10 | TODO task | $0.0387 | Very concise *2; follows instructions well
#4 | OpenAI | GPT-4.1 | 9.25/10 | TODO task | $0.0038 | Very concise; follows instruction well
#4 | Anthropic | Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | $0.0076 | Very concise; follows instruction well
#6 | DeepSeek | DeepSeek-V3 (New) | 9/10 | TODO task | $0.0006 | Concise; follows instruction well
#6 | Google | Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Concise; follows instruction well
#6 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | $0.0212 | Concise; follows instruction well
#9 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Concise
#9 | OpenRouter | openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context!
#9 | OpenRouter | google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Okay
#9 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | TODO task | $0.0391 | Follows instruction
#9 | Anthropic | Claude 3.5 Sonnet | 8.5/10 | TODO task | $0.0086 | Slightly verbose; follows instruction
#14 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task | $0.0120 | Output full code
#14 | OpenRouter | inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | Output full code
#14 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | Output full code
#14 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | $0.0133 | Output full code
#14 | Custom | accounts/fireworks/models/deepseek-v3-0324 | 8/10 | TODO task | N/A | Output full code
#14 | OpenRouter | mistralai/devstral-small | 8/10 | TODO task | N/A | Output full code
#14 | Anthropic | Claude Sonnet 4 | 8/10 | TODO task | $0.0112 | Output full code
#14 | Anthropic | Claude Opus 4 | 8/10 | TODO task | $0.0566 | Output full code
#14 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | Output full code
#23 | Google | Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | $0.0114 | Verbose
#23 | OpenRouter | meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | Verbose
#23 | OpenAI | o3 | 7.5/10 | TODO task | $0.0565 | Diff format
#26 | Google | Gemini 2.5 Flash Preview (05-20) | 7/10 | TODO task v2 (concise) | N/A | Verbose
#27 | OpenRouter | google/gemini-2.5-flash-preview-05-20 | 6/10 | TODO task v2 (concise) | N/A | Diff format; verbose

TypeScript narrowing (Uncommon) Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | $0.0521 | Both methods work
#2 | Anthropic | Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | $0.0085 | Second and final answer works
#2 | Anthropic | Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | $0.0098 | Second method works
#4 | Anthropic | Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | $0.0122 | Second answer works; final answer wrong
#5 | OpenAI | GPT-4.1 | 6/10 | TypeScript narrowing v3 | $0.0037 | Uses "in" keyword
#5 | OpenRouter | mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | Almost correct
#5 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 6/10 | TypeScript narrowing v3 | N/A | Uses "in" keyword
#5 | xAI | Grok 4 | 6/10 | TypeScript narrowing v3 | $0.0855 | Uses "in" keyword
#9 | Google | Gemini 2.5 Pro Experimental | 5/10 | TypeScript narrowing v3 | N/A | Uses "in" keyword; verbose
#9 | Google | Gemini 2.5 Pro Preview (05-06) | 5/10 | TypeScript narrowing v3 | $0.0227 | Uses "in" keyword; verbose
#9 | Google | Gemini 2.5 Pro Preview (06-05) | 5/10 | TypeScript narrowing v3 | $0.1254 | Uses "in" keyword; verbose
#12 | DeepSeek | DeepSeek-V3 (New) | 4/10 | TypeScript narrowing v3 | $0.0005 | Mentions predicate
#13 | OpenAI | o3 | 1/10 | TypeScript narrowing v4 | $0.0458 | Wrong
#13 | DeepSeek | DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | $0.0005 | Wrong
#13 | OpenAI | o4-mini | 1/10 | TypeScript narrowing v4 | $0.0091 | Wrong
#13 | Google | Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | Wrong
#13 | Google | Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | $0.0207 | Wrong
#13 | OpenAI | o3 | 1/10 | TypeScript narrowing v3 | $0.0546 | Wrong
#13 | OpenRouter | mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#13 | OpenRouter | inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | Wrong
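
For context on the notes in this table: the task involves narrowing a TypeScript union type, and the notes distinguish answers that rely on the "in" operator from answers that use a user-defined type-guard predicate. The actual eval prompt is not reproduced here, so the sketch below only illustrates the two approaches on a hypothetical Circle/Square union; the type names and the isCircle guard are our own, not the eval's:

// Hypothetical union used only to illustrate the two narrowing styles
// referenced in the notes; this is not the actual eval prompt.
type Circle = { kind: "circle"; radius: number };
type Square = { kind: "square"; side: number };
type Shape = Circle | Square;

// Approach 1: narrowing with the "in" operator.
function area1(shape: Shape): number {
  if ("radius" in shape) {
    return Math.PI * shape.radius ** 2; // shape narrowed to Circle
  }
  return shape.side ** 2; // shape narrowed to Square
}

// Approach 2: a user-defined type-guard predicate ("shape is Circle").
function isCircle(shape: Shape): shape is Circle {
  return shape.kind === "circle";
}

function area2(shape: Shape): number {
  return isCircle(shape) ? Math.PI * shape.radius ** 2 : shape.side ** 2;
}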

Writing an AI Timeline Evaluation Results

Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Sonnet 4 | 9.5/10 | AI timeline | $0.0250 | Covers almost all points
#1 | Anthropic | Claude Opus 4 | 9.5/10 | AI timeline | $0.1413 | Covers almost all points
#3 | OpenAI | GPT-4.1 | 9.25/10 | AI timeline | $0.0076 | Covers most points; concise
#3 | Google | Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | $0.0492 | Covers most points; concise
#3 | xAI | Grok 4 | 9.25/10 | AI timeline | $0.0315 | Covers most points; concise
#6 | DeepSeek | DeepSeek-V3 (New) | 9/10 | AI timeline | $0.0009 | Covers most points; too concise
#7 | Anthropic | Claude 3.7 Sonnet | 8.5/10 | AI timeline | $0.0244 | Covers most points; wrong format
#7 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points; wrong format
#7 | Google | Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points; wrong format
#7 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0213 | Covers most points; wrong format
#7 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0547 | Covers major points; wrong format
#7 | OpenAI | o3 | 8.5/10 | AI timeline | $0.1656 | Covers most points; wrong format
#13 | Google | Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points; wrong format; verbose
#13 | Custom | accounts/fireworks/models/deepseek-v3-0324 | 8/10 | AI timeline | N/A | Covers major points; wrong format
#13 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points; wrong format
#13 | OpenRouter | meta-llama/llama-3.3-70b-instruct | 8/10 | AI timeline | N/A | Covers major points; wrong format
#17 | Azure OpenAI | GPT-4o | 7.5/10 | AI timeline | $0.0095 | Missed some points; bad headline
#17 | OpenRouter | qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points; bad headline
#17 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Wrong format; major points

Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to perform style control across models.

View raw evaluation data →

Download 16x Eval

Join AI builders and power users in running your own evaluations