16x Eval Model Evaluation Results

Comprehensive evaluation results from the 16x Eval team for AI models across a range of tasks, including coding and writing.

| Rank | Model | Provider | Overall Rating | Tasks Evaluated | Rating Range |
|------|-------|----------|----------------|-----------------|--------------|
| 🥇 | Claude Opus 4 | Anthropic | 8.8/10 | 5 | 8.0 - 9.5 |
| 🥈 | Claude Sonnet 4 | Anthropic | 8.7/10 | 5 | 8.0 - 9.5 |
| 🥉 | GPT-4.1 | OpenAI | 8.3/10 | 5 | 6.0 - 9.3 |

Benchmark Visualization (Difficult) Evaluation Results

Category: coding

| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|------|----------|-------|--------|--------|------|-------|
| #1 | OpenAI | GPT-4.1 | 8.5/10 | Benchmark visualization | $0.0188 | Clear labels |
| #1 | Anthropic | Claude Sonnet 4 | 8.5/10 | Benchmark visualization | $0.0502 | Side-by-side; no labels; no color-coding |
| #1 | Anthropic | Claude Opus 4 | 8.5/10 | Benchmark visualization | $0.2104 | Side-by-side; no labels; no color-coding |
| #4 | OpenAI | o3 | 8/10 | Benchmark visualization | $0.1274 | Clear labels; poor color choice |
| #4 | Google | Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | $0.1397 | Clear labels; poor color choice |
| #6 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | $0.0510 | Number labels; good idea |
| #7 | Google | Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels; good colors |
| #7 | DeepSeek | DeepSeek-V3 (New) | 7/10 | Benchmark visualization | $0.0026 | No labels; good colors |
| #7 | Google | Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | $0.0461 | No labels; good colors |
| #7 | OpenRouter | mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels; good colors |
| #7 | OpenRouter | mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels; good colors |
| #12 | Google | Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | $0.0535 | Minor bug; no labels |
| #13 | OpenRouter | qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small; hard to read |
| #13 | OpenRouter | inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color; hard to read |
| #15 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Doesn't run; bugfix not obvious |
| #15 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Output JSON instead of HTML |
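
The notes above grade outputs on criteria such as value labels and per-model color-coding, with the expected output being HTML. As a purely hypothetical illustration of what those criteria refer to (the actual task prompt and benchmark data are not reproduced on this page), the sketch below renders scores as labeled, color-coded HTML bars:

```typescript
// Hypothetical sketch: the real task prompt and data are not shown here.
// It only illustrates the criteria from the notes above: each bar gets a
// visible value label and a distinct color per model.
interface Result {
  model: string;
  score: number; // score out of 10
}

const results: Result[] = [
  { model: "Model A", score: 8.5 },
  { model: "Model B", score: 7.0 },
];

const colors = ["#4e79a7", "#f28e2b", "#59a14f", "#e15759"];

const bars = results
  .map((r, i) => {
    const width = (r.score / 10) * 100;
    // Label each bar with the model name and numeric score,
    // and color-code it so models are distinguishable.
    return `<div style="margin:4px 0">
      <span>${r.model}</span>
      <div style="width:${width}%;background:${colors[i % colors.length]};color:#fff;padding:2px">${r.score}</div>
    </div>`;
  })
  .join("\n");

console.log(`<!doctype html><html><body>${bars}</body></html>`);
```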

Folder Watcher Fix (Normal) Evaluation Results

Category: coding

| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|------|----------|-------|--------|--------|------|-------|
| #1 | Anthropic | Claude Opus 4 | 9/10 | Folder watcher fix | $0.1370 | Solved; extra logic; concise |
| #1 | Anthropic | Claude Sonnet 4 | 9/10 | Folder watcher fix | $0.0261 | Solved; concise |
| #3 | OpenAI | o4-mini | 8.75/10 | Folder watcher fix | $0.0128 | Solved; extra logic |
| #4 | OpenAI | GPT-4.1 | 8.5/10 | Folder watcher fix | $0.0158 | Solved; verbose |
| #4 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | $0.0259 | Solved; verbose |
| #4 | Anthropic | Claude Opus 4 | 8.5/10 | Folder watcher fix | $0.1676 | Solved; verbose |
| #4 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | Solved; verbose |
| #4 | DeepSeek | DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | $0.0022 | Solved; verbose |
| #4 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | $0.1620 | Solved; verbose |
| #10 | OpenAI | o3 | 8/10 | Folder watcher fix | $0.0982 | Solved; diff format |
| #10 | Anthropic | Claude 3.7 Sonnet | 8/10 | Folder watcher fix | $0.0441 | Solved; very verbose |
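
The task asks models to fix a bug in a folder watcher. The actual buggy code is not reproduced on this page; as a hypothetical sketch of the kind of code such a task involves, the snippet below watches a folder with Node's fs.watch and debounces duplicate change events (fs.watch commonly fires multiple events for a single file save):

```typescript
import { watch } from "node:fs";

// Hypothetical sketch only: the actual task's folder watcher code is not
// reproduced on this page. fs.watch often emits several events for a single
// file save, so one common class of fix is debouncing events per file.
const timers = new Map<string, NodeJS.Timeout>();

watch("./watched-folder", (eventType, filename) => {
  if (!filename) return; // filename may be null on some platforms
  const name = filename.toString();

  // Restart the pending timer for this file on every new event,
  // so only the last event in a burst is handled.
  const pending = timers.get(name);
  if (pending) clearTimeout(pending);

  timers.set(
    name,
    setTimeout(() => {
      timers.delete(name);
      console.log(`${eventType}: ${name}`);
    }, 100),
  );
});
```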

Next.js TODO Add Feature (Simple) Evaluation Results

Category: coding

| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|------|----------|-------|--------|--------|------|-------|
| #1 | Google | Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | $0.0180 | Very concise; follows instructions well |
| #1 | Anthropic | Claude Opus 4 | 9.5/10 | TODO task (Claude) | $0.0388 | Very concise; follows instructions well |
| #3 | OpenAI | GPT-4.1 | 9.25/10 | TODO task | $0.0038 | Very concise; follows instructions well |
| #3 | Anthropic | Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | $0.0076 | Very concise; follows instructions well |
| #5 | DeepSeek | DeepSeek-V3 (New) | 9/10 | TODO task | $0.0006 | Concise; follows instructions well |
| #5 | Google | Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Concise; follows instructions well |
| #5 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | $0.0212 | Concise; follows instructions well |
| #8 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Concise |
| #8 | OpenRouter | openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context! |
| #8 | OpenRouter | google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Okay |
| #8 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | TODO task | $0.0391 | Follows instructions |
| #8 | Anthropic | Claude 3.5 Sonnet | 8.5/10 | TODO task | $0.0086 | Slightly verbose; follows instructions |
| #13 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task | $0.0120 | Output full code |
| #13 | OpenRouter | inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | Output full code |
| #13 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | Output full code |
| #13 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | $0.0133 | Output full code |
| #13 | Custom | accounts/fireworks/models/deepseek-v3-0324 | 8/10 | TODO task | N/A | Output full code |
| #13 | OpenRouter | mistralai/devstral-small | 8/10 | TODO task | N/A | Output full code |
| #13 | Anthropic | Claude Sonnet 4 | 8/10 | TODO task | $0.0112 | Output full code |
| #13 | Anthropic | Claude Opus 4 | 8/10 | TODO task | $0.0566 | Output full code |
| #13 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | Output full code |
| #22 | Google | Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | $0.0114 | Verbose |
| #22 | OpenRouter | meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | Verbose |
| #22 | OpenAI | o3 | 7.5/10 | TODO task | $0.0565 | Diff format |
| #25 | Google | Gemini 2.5 Flash Preview (05-20) | 7/10 | TODO task v2 (concise) | N/A | Verbose |
| #25 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 7/10 | TODO task (Claude) | N/A | Verbose; slow |
| #27 | OpenRouter | google/gemini-2.5-flash-preview-05-20 | 6/10 | TODO task v2 (concise) | N/A | Diff format; verbose |

TypeScript Narrowing (Uncommon) Evaluation Results

Category: coding

| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|------|----------|-------|--------|--------|------|-------|
| #1 | Anthropic | Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | $0.0521 | Both methods work |
| #2 | Anthropic | Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | $0.0085 | Second and final answer works |
| #2 | Anthropic | Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | $0.0098 | Second method works |
| #4 | Anthropic | Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | $0.0122 | Second answer works; final answer wrong |
| #5 | OpenAI | GPT-4.1 | 6/10 | TypeScript narrowing v3 | $0.0037 | Uses `in` keyword |
| #5 | OpenRouter | mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | Almost correct |
| #5 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 6/10 | TypeScript narrowing v3 | N/A | Uses `in` keyword |
| #8 | Google | Gemini 2.5 Pro Experimental | 5/10 | TypeScript narrowing v3 | N/A | Uses `in` keyword; verbose |
| #8 | Google | Gemini 2.5 Pro Preview (05-06) | 5/10 | TypeScript narrowing v3 | $0.0227 | Uses `in` keyword; verbose |
| #8 | Google | Gemini 2.5 Pro Preview (06-05) | 5/10 | TypeScript narrowing v3 | $0.1254 | Uses `in` keyword; verbose |
| #11 | DeepSeek | DeepSeek-V3 (New) | 4/10 | TypeScript narrowing v3 | $0.0005 | Mentions predicate |
| #12 | OpenAI | o3 | 1/10 | TypeScript narrowing v4 | $0.0458 | Wrong |
| #12 | DeepSeek | DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | $0.0005 | Wrong |
| #12 | OpenAI | o4-mini | 1/10 | TypeScript narrowing v4 | $0.0091 | Wrong |
| #12 | Google | Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | Wrong |
| #12 | Google | Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | $0.0207 | Wrong |
| #12 | OpenAI | o3 | 1/10 | TypeScript narrowing v3 | $0.0546 | Wrong |
| #12 | OpenRouter | mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | Wrong |
| #12 | OpenRouter | inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | Wrong |
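
For context on the notes above: the exact task prompt is not reproduced here, but the two narrowing techniques the notes refer to are the `in` operator and a user-defined type predicate, and answers demonstrating both scored higher than answers using only `in`. The sketch below illustrates the two techniques; the names (`Cat`, `Dog`, `isCat`) are hypothetical, not taken from the task:

```typescript
// Hypothetical illustration of the two narrowing techniques named in the
// notes above; the actual task code is not shown on this page.

interface Cat {
  meow: () => void;
}

interface Dog {
  bark: () => void;
}

// Method 1: structural narrowing with the `in` operator.
function speakWithIn(animal: Cat | Dog): void {
  if ("meow" in animal) {
    animal.meow(); // narrowed to Cat
  } else {
    animal.bark(); // narrowed to Dog
  }
}

// Method 2: a user-defined type predicate (`animal is Cat`).
function isCat(animal: Cat | Dog): animal is Cat {
  return (animal as Cat).meow !== undefined;
}

function speakWithPredicate(animal: Cat | Dog): void {
  if (isCat(animal)) {
    animal.meow(); // narrowed to Cat
  } else {
    animal.bark(); // narrowed to Dog
  }
}
```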

Writing an AI Timeline Evaluation Results

Category: writing

| Rank | Provider | Model | Rating | Prompt | Cost | Notes |
|------|----------|-------|--------|--------|------|-------|
| #1 | Anthropic | Claude Sonnet 4 | 9.5/10 | AI timeline | $0.0250 | Covers almost all points |
| #1 | Anthropic | Claude Opus 4 | 9.5/10 | AI timeline | $0.1413 | Covers almost all points |
| #3 | OpenAI | GPT-4.1 | 9.25/10 | AI timeline | $0.0076 | Covers most points; concise |
| #3 | Google | Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | $0.0492 | Covers most points; concise |
| #5 | DeepSeek | DeepSeek-V3 (New) | 9/10 | AI timeline | $0.0009 | Covers most points; too concise |
| #6 | Anthropic | Claude 3.7 Sonnet | 8.5/10 | AI timeline | $0.0244 | Covers most points; wrong format |
| #6 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points; wrong format |
| #6 | OpenRouter | meta-llama/llama-3.3-70b-instruct | 8.5/10 | AI timeline | N/A | Covers most points; wrong format |
| #6 | Google | Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points; wrong format |
| #6 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0213 | Covers most points; wrong format |
| #6 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0547 | Covers major points; wrong format |
| #12 | Google | Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points; wrong format; verbose |
| #12 | Custom | accounts/fireworks/models/deepseek-v3-0324 | 8/10 | AI timeline | N/A | Covers major points; wrong format |
| #12 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points; wrong format |
| #15 | Azure OpenAI | GPT-4o | 7.5/10 | AI timeline | $0.0095 | Missed some points; bad headline |
| #15 | OpenRouter | qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points; bad headline |
| #15 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Covers major points; wrong format |

Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to control for response style across models.

View raw evaluation data →

Download 16x Eval

Join AI builders and power users in running your own evaluations