16x Eval Model Evaluation Results
Comprehensive evaluation results by the 16x Eval team for AI models across various tasks, including coding and writing.
Benchmark Visualization (Difficult) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | OpenAI GPT-4.1 | 8.5/10 | Benchmark visualization | $0.0188 | Clear labels |
#1 | Anthropic Claude Sonnet 4 | 8.5/10 | Benchmark visualization | $0.0502 | Side-by-side; no label; no color-coding |
#1 | Anthropic Claude Opus 4 | 8.5/10 | Benchmark visualization | $0.2104 | Side-by-side; no label; no color-coding |
#4 | OpenAI o3 | 8/10 | Benchmark visualization | $0.1274 | Clear labels; poor color choice |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | $0.1397 | Clear labels; poor color choice |
#4 | xAI Grok 4 | 8/10 | Benchmark visualization | $0.1126 | Side-by-side; clear labels; bad color usage |
#7 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | $0.0510 | Number labels; good idea |
#8 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels; good colors |
#8 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | $0.0026 | No labels; good colors |
#8 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | $0.0461 | No labels; good colors |
#8 | OpenRouter mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels; good colors |
#8 | OpenRouter mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels; good colors |
#13 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | $0.0535 | Minor bug; no labels |
#14 | OpenRouter qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small; hard to read |
#14 | OpenRouter inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color; hard to read |
#16 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Doesn't run; bugfix not obvious |
#16 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Output JSON instead of HTML |
Clean markdown (Medium) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.25/10 | clean markdown v2 | $0.0402 | correct output; short code |
#2 | Google Gemini 2.5 Pro | 9/10 | clean markdown v2 | $0.1360 | correct output |
#2 | OpenAI o3 | 9/10 | clean markdown v2 | $0.1379 | correct output |
#4 | OpenAI GPT-4.1 | 8.5/10 | clean markdown v2 | $0.0112 | 2 newline issues |
#4 | xAI Grok 4 | 8.5/10 | clean markdown v2 | $0.1306 | 2 newline issues |
#6 | Anthropic Claude Sonnet 4 | 8/10 | clean markdown v2 | $0.0077 | no newlines |
#6 | DeepSeek DeepSeek-V3 (New) | 8/10 | clean markdown v2 | $0.0005 | no newlines |
Folder watcher fix (Normal) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9/10 | Folder watcher fix | $0.1370 | solved; extra logic; concise |
#1 | Anthropic Claude Sonnet 4 | 9/10 | Folder watcher fix | $0.0261 | solved; concise |
#3 | OpenAI o4-mini | 8.75/10 | Folder watcher fix | $0.0128 | solved; extra logic |
#4 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | $0.0158 | solved; verbose |
#4 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | $0.0259 | solved; verbose |
#4 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | $0.1676 | solved; verbose |
#4 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | solved; verbose |
#4 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | $0.0022 | solved; verbose |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | $0.1620 | solved; verbose |
#4 | xAI Grok 4 | 8.5/10 | Folder watcher fix | $0.0436 | solved; extra logic; verbose |
#11 | OpenAI o3 | 8/10 | Folder watcher fix | $0.0982 | solved; diff format |
#11 | Anthropic Claude 3.7 Sonnet | 8/10 | Folder watcher fix | $0.0441 | solved; very verbose |
Image - kanji Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | $0.0227 | correct |
#1 | OpenAI o3 | 9/10 | Kanji image | $0.0827 | correct |
#3 | xAI Grok 4 | 7.5/10 | Kanji image | $0.1585 | main explanation wrong; alternative explanation correct; verbose |
#4 | Anthropic Claude Opus 4 | 6/10 | Kanji image | $0.0396 | a bit ambiguous |
#5 | OpenAI GPT-4.1 | 5/10 | Kanji image | $0.0040 | failed |
#5 | Anthropic Claude 3.7 Sonnet | 5/10 | Kanji image | $0.0080 | failed |
#5 | OpenAI GPT-4o | 5/10 | Kanji image | $0.0070 | failed |
#5 | OpenRouter meta-llama/llama-4-maverick | 5/10 | Kanji image | N/A | ambiguous output |
#5 | Anthropic Claude Sonnet 4 | 5/10 | Kanji image | $0.0091 | failed |
#10 | OpenRouter qwen/qwen3-235b-a22b | 1/10 | Kanji image | N/A | Didn't recognize image |
Image analysis - water bottle Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9.25/10 | Image analysis | $0.0887 | correct; detailed explanation |
#2 | OpenAI GPT-4.1 | 9/10 | Image analysis | $0.0024 | correct |
#2 | Google Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | correct |
#2 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | $0.0183 | correct |
#2 | OpenAI o3 | 9/10 | Image analysis | $0.0290 | correct |
#6 | Anthropic Claude 3.7 Sonnet | 6/10 | Image analysis | $0.0068 | missed point |
#6 | OpenRouter meta-llama/llama-4-maverick | 6/10 | Image analysis | N/A | missed point |
#6 | Anthropic Claude Sonnet 4 | 6/10 | Image analysis | $0.0065 | missed points |
#6 | Anthropic Claude Opus 4 | 6/10 | Image analysis | $0.0335 | missed points |
Next.js TODO add feature (Simple) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | $0.0180 | Very concise *2; follows instruction well |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | $0.0388 | Very concise *2; follows instruction well |
#1 | xAI Grok 4 | 9.5/10 | TODO task | $0.0387 | Very concise *2; follows instructions well |
#4 | OpenAI GPT-4.1 | 9.25/10 | TODO task | $0.0038 | Very concise; follows instruction well |
#4 | Anthropic Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | $0.0076 | Very concise; follows instruction well |
#6 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | $0.0006 | Concise; follows instruction well |
#6 | Google Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Concise; follows instruction well |
#6 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | $0.0212 | Concise; follows instruction well |
#9 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Concise |
#9 | OpenRouter openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context! |
#9 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | okay |
#9 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | TODO task | $0.0391 | follows instruction |
#9 | Anthropic Claude 3.5 Sonnet | 8.5/10 | TODO task | $0.0086 | Slightly verbose; follows instruction |
#14 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task | $0.0120 | Output full code |
#14 | OpenRouter inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | output full code |
#14 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | output full code |
#14 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | $0.0133 | Output full code |
#14 | Custom accounts/fireworks/models/deepseek-v3-0324 | 8/10 | TODO task | N/A | output full code |
#14 | OpenRouter mistralai/devstral-small | 8/10 | TODO task | N/A | output full code |
#14 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | $0.0112 | output full code |
#14 | Anthropic Claude Opus 4 | 8/10 | TODO task | $0.0566 | output full code |
#14 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | output full code |
#23 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | $0.0114 | Verbose |
#23 | OpenRouter meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | verbose |
#23 | OpenAI o3 | 7.5/10 | TODO task | $0.0565 | diff format |
#26 | Google Gemini 2.5 Flash Preview (05-20) | 7/10 | TODO task v2 (concise) | N/A | verbose |
#27 | OpenRouter google/gemini-2.5-flash-preview-05-20 | 6/10 | TODO task v2 (concise) | N/A | diff format; verbose |
TypeScript narrowing (Uncommon) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | $0.0521 | both methods work |
#2 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | $0.0085 | second and final answer works |
#2 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | $0.0098 | second method works |
#4 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | $0.0122 | second answer works; final answer wrong |
#5 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | $0.0037 | use in keyword |
#5 | OpenRouter mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | almost correct |
#5 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 6/10 | TypeScript narrowing v3 | N/A | use in keyword |
#5 | xAI Grok 4 | 6/10 | TypeScript narrowing v3 | $0.0855 | use in keyword |
#9 | Google Gemini 2.5 Pro Experimental | 5/10 | TypeScript narrowing v3 | N/A | use in keyword; verbose |
#9 | Google Gemini 2.5 Pro Preview (05-06) | 5/10 | TypeScript narrowing v3 | $0.0227 | use in keyword; verbose |
#9 | Google Gemini 2.5 Pro Preview (06-05) | 5/10 | TypeScript narrowing v3 | $0.1254 | use in keyword; verbose |
#12 | DeepSeek DeepSeek-V3 (New) | 4/10 | TypeScript narrowing v3 | $0.0005 | mention predicate |
#13 | OpenAI o3 | 1/10 | TypeScript narrowing v4 | $0.0458 | wrong |
#13 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | $0.0005 | wrong |
#13 | OpenAI o4-mini | 1/10 | TypeScript narrowing v4 | $0.0091 | wrong |
#13 | Google Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | wrong |
#13 | Google Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | $0.0207 | wrong |
#13 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | $0.0546 | wrong |
#13 | OpenRouter mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#13 | OpenRouter inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | wrong |
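Several notes in the table above refer to narrowing with the in keyword or to mentioning a type predicate. For readers unfamiliar with the terminology, the sketch below illustrates those two standard TypeScript narrowing techniques using hypothetical Cat/Dog types; it is not the actual evaluation prompt, which targets a less common narrowing scenario.

```typescript
// Hypothetical union members for illustration only (not the eval prompt).
interface Cat { meow: () => void; }
interface Dog { bark: () => void; }

// Technique 1: narrowing with the `in` keyword.
function speakUsingIn(animal: Cat | Dog): void {
  if ("meow" in animal) {
    animal.meow(); // narrowed to Cat
  } else {
    animal.bark(); // narrowed to Dog
  }
}

// Technique 2: narrowing with a user-defined type predicate.
function isCat(animal: Cat | Dog): animal is Cat {
  return "meow" in animal;
}

function speakUsingPredicate(animal: Cat | Dog): void {
  if (isCat(animal)) {
    animal.meow(); // narrowed to Cat
  } else {
    animal.bark(); // narrowed to Dog
  }
}
```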
Writing an AI Timeline Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | $0.0250 | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | $0.1413 | Covers almost all points |
#3 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | $0.0076 | Covers most points; concise |
#3 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | $0.0492 | Covers most points; concise |
#3 | xAI Grok 4 | 9.25/10 | AI timeline | $0.0315 | Covers most points; concise |
#6 | DeepSeek DeepSeek-V3 (New) | 9/10 | AI timeline | $0.0009 | Covers most points; too concise |
#7 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | $0.0244 | Covers most points; wrong format |
#7 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points; wrong format |
#7 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points; wrong format |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0213 | Covers most points; wrong format |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0547 | Covers major points; wrong format |
#7 | OpenAI o3 | 8.5/10 | AI timeline | $0.1656 | Covers most points; wrong format |
#13 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points; wrong format; verbose |
#13 | Custom accounts/fireworks/models/deepseek-v3-0324 | 8/10 | AI timeline | N/A | Covers major points; wrong format |
#13 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points; wrong format |
#13 | OpenRouter meta-llama/llama-3.3-70b-instruct | 8/10 | AI timeline | N/A | Covers major points; wrong format |
#17 | Azure OpenAI GPT-4o | 7.5/10 | AI timeline | $0.0095 | Missed some points; bad headline |
#17 | OpenRouter qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points; bad headline |
#17 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Wrong format; covers major points |
Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are used on a best-effort basis to perform style control across models.