16x Eval Model Evaluation Results
Comprehensive evaluation results by the 16x Eval team for AI models across various tasks, including coding and writing.
Evaluation Results
🥇 Gemini 2.5 Pro Preview (05-06), Google. Avg: 9/10
Individual experiment ratings:
- Image - kanji: 9/10
- Image analysis - water bottle: 9/10
🥈 o3, OpenAI. Avg: 9/10
Individual experiment ratings:
- Image - kanji: 9/10
- Image analysis - water bottle: 9/10
🥉 Grok 4, xAI. Avg: 8.38/10
Individual experiment ratings:
- Image - kanji: 7.5/10
- Image analysis - water bottle: 9.25/10
Top Models - Image Analysis
Model | Avg Rating |
---|---|
Gemini 2.5 Pro Preview (05-06) | 9.00 |
o3 | 9.00 |
Grok 4 | 8.38 |
GPT-4.1 | 7.00 |
Gemini 2.5 Pro | 7.00 |
Claude Opus 4 | 6.00 |
Claude 3.7 Sonnet | 5.50 |
meta-llama/llama-4-maverick | 5.50 |
Evaluation Rubrics - Benchmark Visualization
Criteria:
- Side-by-side visualization without label: 8.5/10
- Baseline visualization without label: 8/10
- Horizontal bar chart (if it cannot fit on the page): 7.5/10
- Has major formatting issues: 5/10
- Did not run / Code error: 1/10
Additional components:
- Side-by-side visualization:
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: no effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating
Additional instructions for variance:
- If the code did not run or render in the first try, a second try is given to regenerate the code.
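Before the results, here is a minimal sketch of the kind of chart this rubric rewards: side-by-side (grouped) bars with one color per benchmark and clear value labels. The actual prompt, benchmark data, and expected output are not shown on this page, so every model name, benchmark, and number below is invented purely for illustration.

```typescript
// Illustrative sketch only: the benchmarks, models, and scores below are made up;
// the real prompt's data is not shown in these results.
type Scores = Record<string, number>; // benchmark name -> score (0-100)

const data: Record<string, Scores> = {
  "Model A": { MMLU: 82, HumanEval: 74 },
  "Model B": { MMLU: 79, HumanEval: 88 },
};

const benchmarks = ["MMLU", "HumanEval"];
const colors = ["#4e79a7", "#f28e2b"]; // one color per benchmark (+0.5 under this rubric)
const barWidth = 40;
const groupGap = 30;
const chartHeight = 200;

let x = groupGap;
let bars = "";
for (const [model, scores] of Object.entries(data)) {
  benchmarks.forEach((bench, i) => {
    const h = (scores[bench] / 100) * chartHeight;
    const barX = x + i * barWidth;
    bars += `<rect x="${barX}" y="${chartHeight - h}" width="${barWidth - 4}" height="${h}" fill="${colors[i]}"/>`;
    // Clear value labels on the bars also earn a bonus under the rubric.
    bars += `<text x="${barX}" y="${chartHeight - h - 4}" font-size="10">${scores[bench]}</text>`;
  });
  bars += `<text x="${x}" y="${chartHeight + 14}" font-size="10">${model}</text>`;
  x += benchmarks.length * barWidth + groupGap;
}

// Emit a self-contained SVG string; save it to an .svg file to view the chart.
console.log(
  `<svg xmlns="http://www.w3.org/2000/svg" width="${x}" height="${chartHeight + 20}">${bars}</svg>`
);
```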
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | xAI Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | side-by-side clear labels. color by model. Visually pleasing |
#1 | Moonshot AI Kimi K2 | 9.25/10 | Benchmark visualization | N/A | Side-by-side no label. Color by model. Benchmark diff by alpha. Visually pleasing |
#5 | OpenAI GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing |
#5 | Google Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing |
#7 | OpenRouter openai/gpt-oss-120b | 8.5/10 | Benchmark visualization | N/A | baseline. clear labels |
#8 | OpenAI o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice |
#8 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice |
#10 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea |
#11 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.26¢ | No labels. Good colors |
#11 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors |
#11 | OpenRouter mistralai/mistral-medium-3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | OpenRouter mistralai/devstral-small | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#11 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7/10 | Benchmark visualization | N/A | horizontal bars. minor formatting issues |
#17 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels |
#18 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#18 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#20 | OpenRouter qwen/qwen3-235b-a22b | 5/10 | Benchmark visualization | N/A | Very small. Hard to read |
#20 | OpenRouter inception/mercury-coder-small-beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read |
#22 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 1/10 | Benchmark visualization | N/A | doesn't run. bugfix not obvious. |
Evaluation Rubrics - Clean Markdown
Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10
Additional components:
- Short code (1000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
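For context, a minimal sketch of the kind of solution this rubric describes: strip common Markdown syntax to plain text while keeping the original line breaks, since losing them is what the newline criteria penalize. The actual prompt, input document, and reference output are not shown here, so the specific syntax rules below are assumptions.

```typescript
// Assumed illustration of a "clean markdown" solution; the real task's rules may differ.
function cleanMarkdown(input: string): string {
  return input
    .split("\n")
    .map((line) =>
      line
        .replace(/^#{1,6}\s+/, "")               // headings
        .replace(/\*\*([^*]+)\*\*/g, "$1")       // bold
        .replace(/\*([^*]+)\*/g, "$1")           // italics
        .replace(/`([^`]+)`/g, "$1")             // inline code
        .replace(/\[([^\]]+)\]\([^)]*\)/g, "$1") // links -> link text
    )
    .join("\n"); // joining with "\n" preserves the original newlines
}

console.log(cleanMarkdown("# Title\n\nSome **bold** text with a [link](https://example.com)."));
```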
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | correct output. short code |
#1 | Moonshot AI Kimi K2 | 9.25/10 | clean markdown v2 | N/A | correct. short code |
#1 | OpenRouter (Alibaba Plus) Qwen3 Coder | 9.25/10 | clean markdown v2 | N/A | correct. short code |
#4 | Google Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | correct |
#4 | OpenAI o3 | 9/10 | clean markdown v2 | 13.79¢ | correct |
#6 | OpenAI GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | 1 new line issue |
#6 | xAI Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | 1 new line issue |
#6 | Stealth Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | one newline issue |
#6 | OpenRouter openai/gpt-oss-120b | 8.5/10 | clean markdown v2 | N/A | one newline issue |
#10 | Anthropic Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | no new lines |
#10 | DeepSeek DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.05¢ | no new lines |
Evaluation Rubrics - Folder Watcher Fix
Criteria:
- Correctly solved the task: 9/10
Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
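For context, the task asks models to fix a folder-watcher script; the original code, its language, and the bug are not reproduced on this page. Purely as an assumed illustration of the kind of code involved, a generic Node.js folder watcher might look like the sketch below. It is not the task's solution.

```typescript
import { watch } from "node:fs";

// Assumed illustration only: the actual task's code and its bug are not shown
// in this results table. This is just a generic recursive folder watcher.
const watcher = watch("./watched-folder", { recursive: true }, (eventType, filename) => {
  if (filename !== null) {
    console.log(`${eventType}: ${filename}`);
  }
});

// Close the watcher after 60 seconds so the example exits on its own.
setTimeout(() => watcher.close(), 60_000);
```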
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | solved. extra logic. concise |
#1 | Stealth Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | solved. extra logic. concise, respects indentation well |
#3 | OpenAI o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | solved. extra logic |
#3 | Anthropic Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | solved. concise |
#3 | Moonshot AI Kimi K2 | 9.25/10 | Folder watcher fix | N/A | solved. extra logic |
#6 | Anthropic Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | solved. very verbose. extra logic |
#6 | xAI Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | solved. extra logic. verbose |
#6 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Folder watcher fix | N/A | unnecessary code |
#9 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | solved. verbose |
#9 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | solved. verbose |
#9 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | solved. verbose |
#9 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | Folder watcher fix | N/A | solved. verbose |
#9 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.22¢ | solved. verbose |
#9 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | solved. verbose |
#9 | OpenRouter openai/gpt-oss-120b | 8.5/10 | Folder watcher fix | N/A | solved. verbose |
#16 | OpenAI o3 | 8/10 | Folder watcher fix | 9.82¢ | solved. diff format |
#16 | Google Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | solved in a different way. diff format |
Evaluation Rubrics - Kanji Image
Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10
Additional components:
- Provides multiple explanations:
  - Includes one wrong explanation: -0.5 rating
  - Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | correct |
#1 | OpenAI o3 | 9/10 | Kanji image | 8.27¢ | correct |
#3 | xAI Grok 4 | 7.5/10 | Kanji image | 15.85¢ | main explanation wrong. alternative explanation correct. verbose |
#4 | Anthropic Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | tangential |
#5 | OpenAI GPT-4.1 | 5/10 | Kanji image | 0.40¢ | failed |
#5 | Anthropic Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | failed |
#5 | OpenAI GPT-4o | 5/10 | Kanji image | 0.70¢ | failed |
#5 | OpenRouter meta-llama/llama-4-maverick | 5/10 | Kanji image | N/A | ambiguous output |
#5 | Anthropic Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | failed |
#5 | Google Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | failed |
#11 | OpenRouter qwen/qwen3-235b-a22b | 1/10 | Kanji image | N/A | Didn't recognize image |
Evaluation Rubrics - Image Analysis (Water Bottle)
Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10
Additional components:
- Detailed explanation: +0.25 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9.25/10 | Image analysis | 8.87¢ | correct. detailed explanation |
#2 | OpenAI GPT-4.1 | 9/10 | Image analysis | 0.24¢ | correct |
#2 | Google Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | correct |
#2 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | correct |
#2 | OpenAI o3 | 9/10 | Image analysis | 2.90¢ | correct |
#2 | Google Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | correct |
#7 | Anthropic Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | missed point |
#7 | OpenRouter meta-llama/llama-4-maverick | 6/10 | Image analysis | N/A | missed point |
#7 | Anthropic Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | missed points |
#7 | Anthropic Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | missed points |
Evaluation Rubrics - TODO Task
Criteria:
- Output only changed code (follows instructions): 9/10
- Output full code (does not follow instructions): 8/10
Additional components:
- Concise response:
  - Very concise response: +0.25 rating
  - Very very concise response: +0.5 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise *2. Follows instruction well |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise *2. Follows instruction well |
#1 | xAI Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise *2. Follows instructions well |
#1 | Google Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise *2. Follows instruction well |
#5 | OpenAI GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise. Follows instruction well |
#5 | Anthropic Claude Sonnet 4 | 9.25/10 | TODO task (Claude) | 0.76¢ | Very concise. Follows instruction well |
#7 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | 0.06¢ | Follows instruction |
#7 | Google Gemini 2.5 Pro Experimental | 9/10 | TODO task v2 (concise) | N/A | Follows instruction |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instruction |
#7 | Google Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instruction |
#11 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/codex-mini | 8.5/10 | TODO task | N/A | Asked for more context! |
#11 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#11 | Anthropic Claude 3.5 Sonnet | 8.5/10 | TODO task | 0.86¢ | Slightly verbose. Follows instruction |
#11 | Stealth Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | Stealth Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/gpt-oss-120b | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#11 | OpenRouter openai/gpt-oss-120b | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#19 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task | 1.20¢ | Output full code |
#19 | OpenRouter inception/mercury-coder-small-beta | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | TODO task | N/A | output full code |
#19 | Anthropic Claude 3.7 Sonnet | 8/10 | TODO task (Claude) | 1.33¢ | Output full code |
#19 | Fireworks AI DeepSeek V3 (0324) | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter mistralai/devstral-small | 8/10 | TODO task | N/A | output full code |
#19 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | output full code |
#19 | Anthropic Claude Opus 4 | 8/10 | TODO task | 5.66¢ | output full code |
#19 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 8/10 | TODO task v2 (concise) | N/A | output full code |
#19 | Moonshot AI Kimi K2 | 8/10 | TODO task | N/A | output full code |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | TODO task | N/A | output full code |
#30 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose |
#30 | OpenRouter meta-llama/llama-4-maverick | 7.5/10 | TODO task | N/A | verbose |
#30 | OpenAI o3 | 7.5/10 | TODO task | 5.65¢ | diff format |
Evaluation Rubrics - TypeScript Narrowing
Criteria:
- Provides a working method (without the in keyword): 8/10
- Uses the in keyword: 6/10
- Did not work (wrong answer): 1/10
Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods:
  - Includes one wrong method: -0.5 rating
  - Final answer wrong: -1 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task to account for large variance in output. The higher rating will be used.
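The exact narrowing prompt is not shown on this page, so as an assumed illustration of the distinction the rubric draws, the sketch below contrasts narrowing a union with the in keyword against narrowing without it via a discriminant check. The types are hypothetical stand-ins, and whether a discriminant check matches the real v3 task is an assumption.

```typescript
// Assumed illustration only: hypothetical shapes stand in for the task's actual types.
interface Circle { kind: "circle"; radius: number }
interface Square { kind: "square"; side: number }
type Shape = Circle | Square;

// Narrowing with the in keyword: it works, but this rubric caps it at 6/10.
function areaUsingIn(s: Shape): number {
  return "radius" in s ? Math.PI * s.radius ** 2 : s.side ** 2;
}

// Narrowing without in, here via the discriminant property: the kind of working
// method the rubric rates 8/10 (matching the real task is an assumption).
function area(s: Shape): number {
  return s.kind === "circle" ? Math.PI * s.radius ** 2 : s.side ** 2;
}

console.log(areaUsingIn({ kind: "circle", radius: 2 }), area({ kind: "square", side: 3 }));
```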
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | both methods work |
#2 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | second and final answer works |
#2 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | second method works |
#4 | OpenRouter openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | 2nd method works |
#4 | OpenRouter openai/gpt-oss-120b | 7.5/10 | TypeScript narrowing v3 | N/A | 2nd method works |
#6 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | second answer works. final answer wrong |
#7 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | use in keyword |
#7 | OpenRouter mistralai/devstral-small | 6/10 | TypeScript narrowing v3 | N/A | almost correct |
#7 | xAI Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | use in keyword |
#10 | Google Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | use in keyword. verbose |
#10 | Google Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | use in keyword. verbose |
#10 | Google Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | use in keyword. verbose |
#10 | Stealth Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | first method didn't work. second method uses in keyword |
#14 | OpenAI o3 | 1/10 | TypeScript narrowing v4 | 4.58¢ | wrong |
#14 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v4 | 0.05¢ | wrong |
#14 | OpenAI o4-mini | 1/10 | TypeScript narrowing v4 | 0.91¢ | wrong |
#14 | Google Gemini 2.5 Pro Experimental | 1/10 | TypeScript narrowing v4 | N/A | wrong |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 1/10 | TypeScript narrowing v4 | 2.07¢ | wrong |
#14 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | wrong |
#14 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.05¢ | wrong. mention predicate |
#14 | OpenRouter mistralai/mistral-medium-3 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter inception/mercury-coder-small-beta | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | Google Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | wrong |
#14 | Moonshot AI Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | Moonshot AI Kimi K2 | 1/10 | TypeScript narrowing v3 | N/A | wrong |
#14 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | N/A | all 3 methods did not work |
Evaluation Rubrics - AI Timeline
Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10
Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points |
#3 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. concise |
#3 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. concise |
#3 | xAI Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. concise |
#6 | DeepSeek DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.09¢ | Covers most points. Too concise |
#7 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format |
#7 | OpenRouter mistralai/mistral-medium-3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format |
#7 | OpenAI o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format |
#7 | Google Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points |
#7 | Moonshot AI Kimi K2 | 8.5/10 | AI timeline | N/A | covers major points |
#14 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose |
#14 | Fireworks AI DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | OpenRouter qwen/qwen3-235b-a22b | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | OpenRouter meta-llama/llama-3.3-70b-instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format |
#19 | Azure OpenAI gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline |
#19 | OpenRouter qwen/qwen3-8b:free | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline |
#19 | OpenRouter deepseek/deepseek-r1-0528-qwen3-8b | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format |
Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are used on a best-effort basis to control for style differences across models.