
16x Eval Model Evaluation Results

Comprehensive evaluation results from the 16x Eval team for AI models across a variety of tasks, including coding and writing.

Evaluation Results

🥇 Claude Sonnet 4 (Anthropic)
Avg: 9.5/10
Individual Experiment Ratings:
- Writing an AI Timeline: 9.5/10

🥈 Claude Opus 4 (Anthropic)
Avg: 9.5/10
Individual Experiment Ratings:
- Writing an AI Timeline: 9.5/10

🥉 GPT-4.1 (OpenAI)
Avg: 9.25/10
Individual Experiment Ratings:
- Writing an AI Timeline: 9.25/10

Top Models - Technical Writing

Claude Sonnet 4: 9.50
Claude Opus 4: 9.50
GPT-4.1: 9.25
Gemini 2.5 Pro Preview (06-05): 9.25
Grok 4: 9.25
DeepSeek V3 (New): 8.75
Claude 3.7 Sonnet: 8.50
mistralai/mistral-medium-3: 8.50


Benchmark Visualization (Difficult)

Coding
JavaScript
Visualization

Evaluation Rubrics

Criteria:
- Side-by-side visualization without label: 8.5/10
- Baseline visualization without label: 8/10
- Horizontal bar chart (if it cannot fit in the page): 7.5/10
- Has major formatting issues: 5/10
- Did not run / code error: 1/10

Additional components:
- Side-by-side visualization:
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: no effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating

Additional instructions for variance:
- If the code did not run or render on the first try, a second try is given to regenerate the code.
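
The actual prompt and dataset are not reproduced on this page, but as a rough illustration of what the rubric rewards (side-by-side bars grouped per model, color by benchmark, clear value labels), here is a minimal TypeScript sketch that emits a grouped bar chart as SVG. All data, names, and layout constants below are made up.

```typescript
// Hypothetical benchmark data; the dataset from the actual prompt is not shown here.
type Score = { model: string; benchmark: string; value: number };

const scores: Score[] = [
  { model: "Model A", benchmark: "MMLU", value: 88 },
  { model: "Model A", benchmark: "GPQA", value: 71 },
  { model: "Model B", benchmark: "MMLU", value: 85 },
  { model: "Model B", benchmark: "GPQA", value: 67 },
];

// Render side-by-side bars grouped by model, colored by benchmark,
// with a value label above each bar (the +0.5 items in the rubric).
function renderChart(data: Score[]): string {
  const models = Array.from(new Set(data.map((d) => d.model)));
  const benchmarks = Array.from(new Set(data.map((d) => d.benchmark)));
  const colors = ["#4e79a7", "#f28e2b", "#59a14f", "#e15759"];
  const barWidth = 40;
  const groupGap = 30;
  const chartHeight = 200;
  const groupWidth = benchmarks.length * barWidth + groupGap;

  const bars = data.map((d) => {
    const x =
      models.indexOf(d.model) * groupWidth +
      benchmarks.indexOf(d.benchmark) * barWidth;
    const h = (d.value / 100) * chartHeight;
    const fill = colors[benchmarks.indexOf(d.benchmark) % colors.length];
    return (
      `<rect x="${x}" y="${chartHeight - h}" width="${barWidth - 4}" height="${h}" fill="${fill}"/>` +
      `<text x="${x + barWidth / 2 - 2}" y="${chartHeight - h - 4}" font-size="10" text-anchor="middle">${d.value}</text>`
    );
  });

  // Model names under each group so the chart reads without a legend.
  const labels = models.map(
    (m, i) =>
      `<text x="${i * groupWidth + (benchmarks.length * barWidth) / 2}" y="${chartHeight + 14}" font-size="11" text-anchor="middle">${m}</text>`
  );

  const width = models.length * groupWidth;
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${width}" height="${chartHeight + 20}">${[...bars, ...labels].join("")}</svg>`;
}

console.log(renderChart(scores));
```

Per the criteria above, a horizontal-bar layout would be the fallback when groups cannot fit in the page width, at the cost of a lower base rating.
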
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side, no label. Color by benchmark. Visually pleasing
#1 | Anthropic | Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side, no label. Color by benchmark. Visually pleasing
#1 | xAI | Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | Side-by-side, clear labels. Color by model. Visually pleasing
#1 | Moonshot AI | Kimi K2 | 9.25/10 | Benchmark visualization | N/A | Side-by-side, no label. Color by model. Benchmarks differentiated by alpha. Visually pleasing
#5 | OpenAI | GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing
#5 | Google | Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing
#7 | OpenRouter | openai/gpt-oss-120b | 8.5/10 | Benchmark visualization | N/A | Baseline. Clear labels
#8 | OpenAI | o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice
#8 | Google | Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice
#10 | Anthropic | Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea
#11 | Google | Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#11 | DeepSeek | DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.26¢ | No labels. Good colors
#11 | Google | Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors
#11 | OpenRouter | mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#11 | OpenRouter | mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#11 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 7/10 | Benchmark visualization | N/A | Horizontal bars. Minor formatting issues
#17 | Google | Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels
#18 | Stealth | Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues
#18 | Stealth | Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues
#20 | OpenRouter | qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small. Hard to read
#20 | OpenRouter | inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read
#22 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Doesn't run. Bugfix not obvious

Clean markdown (Medium)

Coding
TypeScript

Evaluation Rubrics

Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10

Additional components:
- Short code (1,000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
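
The prompt itself is not reproduced here, but the criteria above suggest the task strips markdown formatting while keeping newlines intact, with collapsed newlines as the main failure mode. A minimal sketch under that assumption (the function name and regexes are illustrative, not from the eval):

```typescript
// Hypothetical cleaner; the real task prompt is not reproduced on this page.
// Strips common markdown syntax while preserving newlines, since collapsed
// newlines are what the rubric above penalizes.
function cleanMarkdown(md: string): string {
  return md
    .replace(/^#{1,6}\s+/gm, "") // headings
    .replace(/\*\*([^*]+)\*\*/g, "$1") // bold
    .replace(/\*([^*]+)\*/g, "$1") // italics
    .replace(/`([^`]+)`/g, "$1") // inline code
    .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1"); // links -> link text
  // Deliberately no .replace(/\n+/g, " "): collapsing newlines is the
  // failure mode the rubric penalizes.
}

console.log(cleanMarkdown("# Title\n\nSome **bold** text and a [link](https://example.com)."));
```

At roughly 600 characters, a correct solution of this shape would also pick up the short-code bonus in the rubric.
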
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | Correct output. Short code
#1 | Moonshot AI | Kimi K2 | 9.25/10 | clean markdown v2 | N/A | Correct. Short code
#1 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 9.25/10 | clean markdown v2 | N/A | Correct. Short code
#4 | Google | Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | Correct
#4 | OpenAI | o3 | 9/10 | clean markdown v2 | 13.79¢ | Correct
#6 | OpenAI | GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | One newline issue
#6 | xAI | Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | One newline issue
#6 | Stealth | Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | One newline issue
#6 | OpenRouter | openai/gpt-oss-120b | 8.5/10 | clean markdown v2 | N/A | One newline issue
#10 | Anthropic | Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | No newlines
#10 | DeepSeek | DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.05¢ | No newlines

Folder watcher fix (Normal)

Coding
TypeScript
Vue

Evaluation Rubrics

Criteria:
- Correctly solved the task: 9/10

Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
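
The Vue component under test and its actual bug are not shown on this page, so the following is only a sketch of one common folder-watcher pitfall in a Node/TypeScript context: leaving the previous watcher running when the watched path changes, so stale events keep firing and handles leak. All names are assumed, and this is not necessarily the bug in this eval.

```typescript
import { watch, FSWatcher } from "node:fs";

// Hypothetical pattern only; the component from the actual task is not shown here.
let watcher: FSWatcher | null = null;

function watchFolder(path: string, onChange: (file: string) => void): void {
  watcher?.close(); // the easy-to-miss step: dispose of the previous watcher
  watcher = watch(path, (_event, filename) => {
    if (filename) onChange(filename);
  });
}

watchFolder(".", (file) => console.log("changed:", file));
```
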
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | Solved. Extra logic. Concise
#1 | Stealth | Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | Solved. Extra logic. Concise. Respects indentation well
#3 | OpenAI | o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | Solved. Extra logic
#3 | Anthropic | Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | Solved. Concise
#3 | Moonshot AI | Kimi K2 | 9.25/10 | Folder watcher fix | N/A | Solved. Extra logic
#6 | Anthropic | Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | Solved. Very verbose. Extra logic
#6 | xAI | Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | Solved. Extra logic. Verbose
#6 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8.75/10 | Folder watcher fix | N/A | Unnecessary code
#9 | OpenAI | GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | Solved. Verbose
#9 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | Solved. Verbose
#9 | Anthropic | Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | Solved. Verbose
#9 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | Solved. Verbose
#9 | DeepSeek | DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.22¢ | Solved. Verbose
#9 | Google | Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | Solved. Verbose
#9 | OpenRouter | openai/gpt-oss-120b | 8.5/10 | Folder watcher fix | N/A | Solved. Verbose
#16 | OpenAI | o3 | 8/10 | Folder watcher fix | 9.82¢ | Solved. Diff format
#16 | Google | Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | Solved in a different way. Diff format

Image - kanji

Image Analysis
Japanese
Chinese

Evaluation Rubrics

Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10

Additional components:
- Provides multiple explanations:
  - Includes one wrong explanation: -0.5 rating
- Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | Correct
#1 | OpenAI | o3 | 9/10 | Kanji image | 8.27¢ | Correct
#3 | xAI | Grok 4 | 7.5/10 | Kanji image | 15.85¢ | Main explanation wrong. Alternative explanation correct. Verbose
#4 | Anthropic | Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | Tangential
#5 | OpenAI | GPT-4.1 | 5/10 | Kanji image | 0.40¢ | Failed
#5 | Anthropic | Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | Failed
#5 | OpenAI | GPT-4o | 5/10 | Kanji image | 0.70¢ | Failed
#5 | OpenRouter | meta-llama/llama-4-maverick | 5/10 | Kanji image | N/A | Ambiguous output
#5 | Anthropic | Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | Failed
#5 | Google | Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | Failed
#11 | OpenRouter | qwen/qwen3-235b-a22b | 1/10 | Kanji image | N/A | Didn't recognize image

Image analysis - water bottle

Image Analysis
Physics

Evaluation Rubrics

Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10

Additional components:
- Detailed explanation: +0.25 rating
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | xAI | Grok 4 | 9.25/10 | Image analysis | 8.87¢ | Correct. Detailed explanation
#2 | OpenAI | GPT-4.1 | 9/10 | Image analysis | 0.24¢ | Correct
#2 | Google | Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | Correct
#2 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | Correct
#2 | OpenAI | o3 | 9/10 | Image analysis | 2.90¢ | Correct
#2 | Google | Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | Correct
#7 | Anthropic | Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | Missed the point
#7 | OpenRouter | meta-llama/llama-4-maverick | 6/10 | Image analysis | N/A | Missed the point
#7 | Anthropic | Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | Missed the point
#7 | Anthropic | Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | Missed the point

TODO task

Evaluation Rubrics

Criteria:
- Output only changed code (follows instructions): 9/10
- Output full code (does not follow instructions): 8/10

Additional components:
- Concise response:
  - Very concise response: +0.25 rating
  - Very very concise response: +0.5 rating
- Verbose output: -0.5 rating
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Google | Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise ×2. Follows instructions well
#1 | Anthropic | Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise ×2. Follows instructions well
#1 | xAI | Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise ×2. Follows instructions well
#1 | Google | Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise ×2. Follows instructions well
#5 | OpenAI | GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise. Follows instructions well
#5 | Anthropic | Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | 0.76¢ | Very concise. Follows instructions well
#7 | DeepSeek | DeepSeek-V3 (New) | 9/10 | TODO task | 0.06¢ | Follows instructions
#7 | Google | Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Follows instructions
#7 | Google | Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instructions
#7 | Google | Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instructions
#11 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Follows instructions. Verbose
#11 | OpenRouter | openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context!
#11 | OpenRouter | google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instructions. Verbose
#11 | Anthropic | Claude 3.5 Sonnet | 8.5/10 | TODO task | 0.86¢ | Slightly verbose. Follows instructions
#11 | Stealth | Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instructions. Verbose
#11 | Stealth | Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instructions. Verbose
#11 | OpenRouter | openai/gpt-oss-120b | 8.5/10 | TODO task | N/A | Follows instructions. Verbose
#11 | OpenRouter | openai/gpt-oss-120b | 8.5/10 | TODO task v2 (concise) | N/A | Follows instructions. Verbose
#19 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task | 1.20¢ | Output full code
#19 | OpenRouter | inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | Output full code
#19 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | Output full code
#19 | Anthropic | Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | 1.33¢ | Output full code
#19 | Fireworks AI | DeepSeek V3 (0324) | 8/10 | TODO task | N/A | Output full code
#19 | OpenRouter | mistralai/devstral-small | 8/10 | TODO task | N/A | Output full code
#19 | Anthropic | Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | Output full code
#19 | Anthropic | Claude Opus 4 | 8/10 | TODO task | 5.66¢ | Output full code
#19 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | Output full code
#19 | Moonshot AI | Kimi K2 | 8/10 | TODO task | N/A | Output full code
#19 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 8/10 | TODO task | N/A | Output full code
#30 | Google | Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose
#30 | OpenRouter | meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | Verbose
#30 | OpenAI | o3 | 7.5/10 | TODO task | 5.65¢ | Diff format

TypeScript narrowing

Evaluation Rubrics

Criteria:
- Provide a working method (without the in keyword): 8/10
- Use the in keyword: 6/10
- Did not work (wrong answer): 1/10

Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods:
  - Includes one wrong method: -0.5 rating
- Final answer wrong: -1 rating
- Verbose output: -0.5 rating

Additional instructions for variance:
- Each model is given two tries for this task to account for large variance in output. The higher rating is used.
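
For context on the distinction the rubric draws, here is a small sketch with made-up types (the actual types from the prompt are not shown on this page): narrowing a union with the in keyword, the 6/10 route, versus a user-defined type predicate that avoids it, a working method per the 8/10 criterion.

```typescript
// Hypothetical union; the actual types from the eval prompt are not shown here.
type Circle = { radius: number };
type Square = { side: number };
type Shape = Circle | Square;

// The 6/10 route per the rubric: narrowing with the in keyword.
function areaWithIn(s: Shape): number {
  return "radius" in s ? Math.PI * s.radius ** 2 : s.side ** 2;
}

// An 8/10 route: a user-defined type predicate narrows without in.
function isCircle(s: Shape): s is Circle {
  return (s as Circle).radius !== undefined;
}

function area(s: Shape): number {
  return isCircle(s) ? Math.PI * s.radius ** 2 : s.side ** 2;
}

console.log(areaWithIn({ radius: 2 }), area({ side: 3 }));
```

A discriminated union (a shared literal tag field) would be another in-free narrowing method, which may be why one note below mentions a predicate.
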
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | Both methods work
#2 | Anthropic | Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | Second and final answer works
#2 | Anthropic | Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | Second method works
#4 | OpenRouter | openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | Second method works
#4 | OpenRouter | openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | Second method works
#6 | Anthropic | Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | Second answer works. Final answer wrong
#7 | OpenAI | GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | Uses in keyword
#7 | OpenRouter | mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | Almost correct
#7 | xAI | Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | Uses in keyword
#10 | Google | Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | Uses in keyword. Verbose
#10 | Google | Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | Uses in keyword. Verbose
#10 | Google | Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | Uses in keyword. Verbose
#10 | Stealth | Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | First method didn't work. Second method uses in keyword
#14 | OpenAI | o3 | 1/10 | TypeScript narrowing v4 | 4.58¢ | Wrong
#14 | DeepSeek | DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | 0.05¢ | Wrong
#14 | OpenAI | o4-mini | 1/10 | TypeScript narrowing v4 | 0.91¢ | Wrong
#14 | Google | Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | Wrong
#14 | Google | Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | 2.07¢ | Wrong
#14 | OpenAI | o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | Wrong
#14 | DeepSeek | DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.05¢ | Wrong. Mentions predicate
#14 | OpenRouter | mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#14 | OpenRouter | inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#14 | Google | Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | Wrong
#14 | Moonshot AI | Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#14 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#14 | Moonshot AI | Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | Wrong
#14 | OpenRouter (Alibaba Plus) | Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | All 3 methods did not work

Writing an AI Timeline

Technical Writing
AI
History

Evaluation Rubrics

Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10

Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
Rank | Provider | Model | Rating | Prompt | Cost | Notes
#1 | Anthropic | Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points
#1 | Anthropic | Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points
#3 | OpenAI | GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. Concise
#3 | Google | Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. Concise
#3 | xAI | Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. Concise
#6 | DeepSeek | DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.09¢ | Covers most points. Too concise
#7 | Anthropic | Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format
#7 | OpenRouter | mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format
#7 | Google | Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format
#7 | Google | Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format
#7 | OpenAI | o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format
#7 | Google | Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points
#7 | Moonshot AI | Kimi K2 | 8.5/10 | AI timeline | N/A | Covers major points
#14 | Google | Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose
#14 | Fireworks AI | DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#14 | OpenRouter | qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#14 | OpenRouter | meta-llama/llama-3.3-70b-instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#14 | Google | Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format
#19 | Azure OpenAI | gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline
#19 | OpenRouter | qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline
#19 | OpenRouter | deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format

Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to perform style control across models.

→ View rubrics and latest results for model evaluations

→ View raw evaluation data

Download 16x Eval

No sign-up or login required. Create your own evals in minutes.