16x Eval Model Evaluation Results
Comprehensive evaluation results from the 16x Eval team for AI models across various tasks, including coding and writing.
Rank | Model | Provider | Rating | Tasks Evaluated | Rating Range |
---|---|---|---|---|---|
🥇 | Claude Opus 4 | Anthropic | 8.8/10 | 5 | 8.0 - 9.5 |
🥈 | Claude Sonnet 4 | Anthropic | 8.7/10 | 5 | 8.0 - 9.5 |
🥉 | GPT-4.1 | OpenAI | 8.3/10 | 5 | 6.0 - 9.3 |
Benchmark Visualization Evaluation
coding
Benchmark Visualization (Difficult) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | OpenAI GPT-4.1 | 8.5/10 | Benchmark visualization | $0.0188 | Clear labels |
#1 | Anthropic Claude Sonnet 4 | 8.5/10 | Benchmark visualization | $0.0502 | Side-by-side; No label; No color-coding |
#1 | Anthropic Claude Opus 4 | 8.5/10 | Benchmark visualization | $0.2104 | Side-by-side; No label; No color-coding |
#4 | OpenAI o3 | 8/10 | Benchmark visualization | $0.1274 | Clear labels; Poor color choice |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | $0.1397 | Clear labels; Poor color choice |
#6 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | $0.0510 | Number labels; Good idea |
#7 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels; Good colors |
#7 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | $0.0026 | No labels; Good colors |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | $0.0461 | No labels; Good colors |
#7 | OpenRouter mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels; Good colors |
#7 | OpenRouter mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels; Good colors |
#12 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | $0.0535 | Minor bug; No labels |
#13 | OpenRouter qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small; Hard to read |
#13 | OpenRouter inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color; Hard to read |
#15 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | Doesn't run; bug fix not obvious |
#15 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | output json instead of html |
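For context on the notes above: the task appears to expect an HTML visualization of benchmark scores (one low-rated answer emitted JSON instead of HTML), and ratings reward charts with clear labels and distinct colors. The sketch below is only a hypothetical illustration of those criteria, not the actual prompt or any model's output; the data, styling, and function names are made up.

```typescript
// Hypothetical illustration only: shows what "clear labels" and "color-coding"
// mean for a simple HTML bar chart of benchmark scores.
interface BenchmarkResult {
  model: string;
  score: number; // rating out of 10
}

// Distinct colors per bar, so series are visually distinguishable.
const PALETTE = ["#4e79a7", "#f28e2b", "#59a14f", "#e15759", "#76b7b2"];

function renderBarChart(results: BenchmarkResult[]): string {
  const maxScore = Math.max(...results.map((r) => r.score));
  const bars = results
    .map((r, i) => {
      const widthPct = (r.score / maxScore) * 100;
      const color = PALETTE[i % PALETTE.length];
      // Each bar carries a text label (model name plus numeric score),
      // which is what the "Clear labels" notes reward.
      return (
        `<div style="margin:4px 0">` +
        `<span style="display:inline-block;width:220px">${r.model}</span>` +
        `<span style="display:inline-block;width:${widthPct}%;background:${color};color:#fff;padding:2px 6px">${r.score.toFixed(1)}</span>` +
        `</div>`
      );
    })
    .join("\n");
  // Return a complete HTML document, since HTML (not JSON) seems to be the
  // expected output format for this task.
  return `<html><body><h1>Benchmark Results</h1>${bars}</body></html>`;
}

// Example usage with made-up scores:
console.log(
  renderBarChart([
    { model: "Model A", score: 8.5 },
    { model: "Model B", score: 8.0 },
    { model: "Model C", score: 7.0 },
  ])
);
```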
Folder Watcher Fix Evaluation
coding
Folder Watcher Fix (Normal) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9/10 | Folder watcher fix | $0.1370 | solved; extra logic; concise |
#1 | Anthropic Claude Sonnet 4 | 9/10 | Folder watcher fix | $0.0261 | solved; concise |
#3 | OpenAI o4-mini | 8.75/10 | Folder watcher fix | $0.0128 | solved; extra logic |
#4 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | $0.0158 | solved; verbose |
#4 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | $0.0259 | solved; verbose |
#4 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | $0.1676 | solved; verbose |
#4 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | solved; verbose |
#4 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | $0.0022 | solved; verbose |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | $0.1620 | solved; verbose |
#10 | OpenAI o3 | 8/10 | Folder watcher fix | $0.0982 | solved; diff format |
#10 | Anthropic Claude 3.7 Sonnet | 8/10 | Folder watcher fix | $0.0441 | solved; very verbose |
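The codebase and the specific bug behind the "Folder watcher fix" task are not reproduced here, so the following is only a generic TypeScript sketch of the kind of program the task exercises, not the evaluated fix; the function name and paths are made up.

```typescript
// Generic sketch only: not the code or the bug from the evaluated task.
import { watch } from "node:fs";
import { join } from "node:path";

function watchFolder(dir: string, onChange: (file: string) => void): () => void {
  // Note: { recursive: true } is supported on macOS and Windows, and on Linux
  // from Node.js 20 onward; older Linux versions need per-subfolder watchers.
  const watcher = watch(dir, { recursive: true }, (_eventType, filename) => {
    if (filename) {
      onChange(join(dir, filename.toString()));
    }
  });
  // Return a cleanup function so callers can stop watching; leaked watchers
  // are a common bug in this kind of code.
  return () => watcher.close();
}

// Example usage: log changes under ./data for one minute, then stop.
const stop = watchFolder("./data", (file) => console.log("changed:", file));
setTimeout(stop, 60_000);
```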
Next.js TODO Evaluation
coding
Next.js TODO add feature (Simple) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | $0.0180 | Very concise; Follows instruction well |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | $0.0388 | Very concise; Follows instruction well |
#3 | OpenAI GPT-4.1 | 9.25/10 | TODO task | $0.0038 | Very concise; Follows instruction well |
#3 | Anthropic Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | $0.0076 | Very concise; Follows instruction well |
#5 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | $0.0006 | Concise; Follows instruction well |
#5 | Google Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Concise; Follows instruction well |
#5 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | $0.0212 | Concise; Follows instruction well |
#8 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Concise |
#8 | OpenRouter openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context! |
#8 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | okay |
#8 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | TODO task | $0.0391 | follows instruction |
#8 | Anthropic Claude 3.5 Sonnet | 8.5/10 | TODO task | $0.0086 | Slightly verbose; Follows instruction |
#13 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task | $0.0120 | Output full code |
#13 | OpenRouter inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | output full code |
#13 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | output full code |
#13 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | $0.0133 | Output full code |
#13 | Custom accounts/fireworks/models/deepseek-v3-0324 | 8/10 | TODO task | N/A | output full code |
#13 | OpenRouter mistralai/devstral-small | 8/10 | TODO task | N/A | output full code |
#13 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | $0.0112 | output full code |
#13 | Anthropic Claude Opus 4 | 8/10 | TODO task | $0.0566 | output full code |
#13 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | output full code |
#22 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | $0.0114 | Verbose |
#22 | OpenRouter meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | verbose |
#22 | OpenAI o3 | 7.5/10 | TODO task | $0.0565 | diff format |
#25 | Google Gemini 2.5 Flash Preview (05-20) | 7/10 | TODO task v2 (concise) | N/A | verbose |
#25 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 7/10 | TODO task (Claude) | N/A | verbose; slow |
#27 | OpenRouter google/gemini-2.5-flash-preview-05-20 | 6/10 | TODO task v2 (concise) | N/A | diff format; verbose |
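The exact feature requested by the TODO task and the underlying Next.js codebase are not shown, so the snippet below is only a hypothetical stand-in: a small, self-contained client component for adding items to a todo list. It illustrates the scale of change the "Very concise" notes reward, as opposed to re-emitting full files or answering in diff format.

```tsx
"use client";
// Hypothetical stand-in: not the evaluated codebase or the exact requested feature.
import { useState } from "react";

export default function TodoList() {
  const [todos, setTodos] = useState<string[]>([]);
  const [draft, setDraft] = useState("");

  // Add the drafted item to the list, ignoring empty input.
  function addTodo() {
    const text = draft.trim();
    if (!text) return;
    setTodos((prev) => [...prev, text]);
    setDraft("");
  }

  return (
    <div>
      <input
        value={draft}
        onChange={(e) => setDraft(e.target.value)}
        placeholder="New todo"
      />
      <button onClick={addTodo}>Add</button>
      <ul>
        {todos.map((todo, i) => (
          <li key={i}>{todo}</li>
        ))}
      </ul>
    </div>
  );
}
```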
TypeScript Narrowing Evaluation
coding
TypeScript Narrowing (Uncommon) Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | $0.0521 | both methods work |
#2 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | $0.0085 | second and final answer works |
#2 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | $0.0098 | second method works |
#4 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | $0.0122 | second answer works; final answer wrong |
#5 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | $0.0037 | use in keyword |
#5 | OpenRouter mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | almost correct |
#5 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 6/10 | TypeScript narrowing v3 | N/A | use in keyword |
#8 | Google Gemini 2.5 Pro Experimental | 5/10 | TypeScript narrowing v3 | N/A | use in keyword; verbose |
#8 | Google Gemini 2.5 Pro Preview (05-06) | 5/10 | TypeScript narrowing v3 | $0.0227 | use in keyword; verbose |
#8 | Google Gemini 2.5 Pro Preview (06-05) | 5/10 | TypeScript narrowing v3 | $0.1254 | use in keyword; verbose |
#11 | DeepSeek DeepSeek-V3 (New) | 4/10 | TypeScript narrowing v3 | $0.0005 | mention predicate |
#12 | OpenAI o3 | 1/10 | TypeScript narrowing v4 | $0.0458 | wrong |
#12 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | $0.0005 | wrong |
#12 | OpenAI o4-mini | 1/10 | TypeScript narrowing v4 | $0.0091 | wrong |
#12 | Google Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | wrong |
#12 | Google Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | $0.0207 | wrong |
#12 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | $0.0546 | wrong |
#12 | OpenRouter mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#12 | OpenRouter inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | wrong |
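The notes in this table contrast answers that "use in keyword" with ones that "mention predicate" or where "both methods work". The original prompt and types are not included, so the sketch below only illustrates, on made-up Circle/Square types, the two standard narrowing techniques those notes refer to; it does not indicate which approach the task actually required.

```typescript
// Generic illustration of the two narrowing techniques the notes refer to.
// Circle and Square are made up; the evaluation's real types are not shown.
interface Circle {
  kind: "circle";
  radius: number;
}

interface Square {
  kind: "square";
  side: number;
}

type Shape = Circle | Square;

// Approach 1: narrowing with the `in` operator ("use in keyword" in the notes).
function areaWithIn(shape: Shape): number {
  if ("radius" in shape) {
    return Math.PI * shape.radius ** 2; // narrowed to Circle
  }
  return shape.side ** 2; // narrowed to Square
}

// Approach 2: narrowing with a user-defined type predicate
// ("mention predicate" in the notes).
function isCircle(shape: Shape): shape is Circle {
  return shape.kind === "circle";
}

function areaWithPredicate(shape: Shape): number {
  return isCircle(shape) ? Math.PI * shape.radius ** 2 : shape.side ** 2;
}
```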
AI Timeline Evaluation
writing
Writing an AI Timeline Evaluation Results
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | $0.0250 | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | $0.1413 | Covers almost all points |
#3 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | $0.0076 | Covers most points; concise |
#3 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | $0.0492 | Covers most points; concise |
#5 | DeepSeek DeepSeek-V3 (New) | 9/10 | AI timeline | $0.0009 | Covers most points; Too concise |
#6 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | $0.0244 | Covers most points; Wrong format |
#6 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points; Wrong format |
#6 | OpenRouter meta-llama/llama-3.3-70b-instruct | 8.5/10 | AI timeline | N/A | Covers most points; Wrong format |
#6 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points; Wrong format |
#6 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0213 | Covers most points; Wrong format |
#6 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | $0.0547 | Covers major points; Wrong format |
#12 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points; Wrong format; Verbose |
#12 | Custom accounts/fireworks/models/deepseek-v3-0324 | 8/10 | AI timeline | N/A | Covers major points; Wrong format |
#12 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points; Wrong format |
#15 | Azure OpenAI GPT-4o | 7.5/10 | AI timeline | $0.0095 | Missed some points; Bad headline |
#15 | OpenRouter qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points; Bad headline |
#15 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Wrong format; covers major points |
Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are applied on a best-effort basis to control for stylistic differences across models.