16x Eval Model Evaluation
Comprehensive evaluation results from the 16x Eval team for AI models across various tasks, including coding and writing.
16x Eval Top Models - Coding
Model | Average Human Rating (7 Tasks) | Total Cost
---|---|---
Claude Opus 4 | 8.96 | 64.68¢
GPT-5 (High) | 8.86 | 110.46¢
Claude Sonnet 4 | 8.68 | 13.53¢
Grok 4 | 8.61 | 79.14¢
gpt-oss-120b (Cerebras) | 8.39 | 1.12¢
GPT-4.1 | 8.21 | 7.14¢
Gemini 2.5 Pro | 7.71 | 83.48¢
GPT-5 | 7.71 | 60.39¢
Grok Code Fast 1 | 7.64 | 3.98¢
Qwen3 Coder | 7.25 | 5.68¢
GLM-4.5 | 7.00 | 10.46¢
Kimi K2 0711 | 6.39 | 1.69¢
DeepSeek-V3.1 | 5.68 | 1.64¢
Evaluation Rubrics: Benchmark Visualization
Criteria:
- Side-by-side visualization without labels: 8.5/10
- Baseline visualization without labels: 8/10
- Horizontal bar chart (when the bars cannot fit on the page): 7.5/10
- Major formatting issues: 5/10
- Did not run / code error: 1/10
Additional components:
- Side-by-side visualization
  - Color by benchmark: +0.5 rating
  - Alternative ways to differentiate benchmarks: +0.5 rating
  - Color by model: No effect on rating
- Clear labels on bar chart: +0.5 rating
- Visually pleasing: +0.25 rating
- Poor color choice: -0.5 rating
- Minor formatting issues: -0.5 rating
Additional instructions for variance:
- If the code does not run or render on the first try, the model is given a second try to regenerate the code.
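For readers unfamiliar with the rubric's terms, the sketch below shows roughly what a top-scoring "side-by-side visualization with color by benchmark and clear labels" amounts to: grouped bars per model, one color per benchmark, and a value label on each bar. It is our own minimal illustration with made-up model names and scores, not any model's actual output.

```typescript
// Minimal sketch of a side-by-side (grouped) bar chart rendered as SVG.
// All data, colors, and dimensions are made up for illustration.
type Scores = Record<string, number>; // benchmark -> score (0-100)

const data: Record<string, Scores> = {
  "Model A": { MMLU: 88, GPQA: 60, "SWE-bench": 45 },
  "Model B": { MMLU: 85, GPQA: 72, "SWE-bench": 50 },
};
// one color per benchmark ("color by benchmark" in the rubric)
const colors: Record<string, string> = { MMLU: "#4e79a7", GPQA: "#f28e2b", "SWE-bench": "#59a14f" };

function barChartSvg(): string {
  const barW = 30, gap = 10, groupGap = 40, height = 220, plotH = height - 40;
  const parts: string[] = [];
  let x = groupGap;
  for (const [model, scores] of Object.entries(data)) {
    const groupStart = x;
    for (const [bench, score] of Object.entries(scores)) {
      const h = (score / 100) * plotH;
      parts.push(
        `<rect x="${x}" y="${height - 20 - h}" width="${barW}" height="${h}" fill="${colors[bench]}"/>`,
        // value label above each bar ("clear labels" in the rubric)
        `<text x="${x + barW / 2}" y="${height - 25 - h}" font-size="10" text-anchor="middle">${score}</text>`,
      );
      x += barW + gap;
    }
    // model name centered under its group of bars
    const mid = groupStart + (x - gap - groupStart) / 2;
    parts.push(`<text x="${mid}" y="${height - 5}" font-size="12" text-anchor="middle">${model}</text>`);
    x += groupGap;
  }
  return `<svg xmlns="http://www.w3.org/2000/svg" width="${x}" height="${height}">${parts.join("")}</svg>`;
}

console.log(barChartSvg());
```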
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Benchmark visualization | 5.02¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Benchmark visualization | 21.04¢ | Side-by-side no label. Color by benchmark. Visually pleasing |
#1 | xAI Grok 4 | 9.25/10 | Benchmark visualization | 11.26¢ | side-by-side clear labels. color by model. Visually pleasing |
#1 | Moonshot AI Kimi K2 0711 | 9.25/10 | Benchmark visualization | 0.55¢ | Side-by-side no label. Color by model. Benchmark diff by alpha. Visually pleasing |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Benchmark visualization | 19.21¢ | Clear labels. Visually pleasing. Highlight benchmarks on hover |
#6 | OpenAI GPT-4.1 | 8.75/10 | Benchmark visualization | 1.88¢ | Clear labels. Visually pleasing |
#6 | Google Gemini 2.5 Pro | 8.75/10 | Benchmark visualization | 11.00¢ | Clear labels. Visually pleasing |
#8 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | Benchmark visualization | 0.20¢ | baseline. clear labels |
#8 | OpenAI GPT-5 | 8.5/10 | Benchmark visualization | 11.92¢ | baseline. clear labels |
#8 | Z.ai GLM-4.5 | 8.5/10 | Benchmark visualization | 1.31¢ | Side-by-side visualization. No labels |
#11 | xAI Grok Code Fast 1 | 8.25/10 | Benchmark visualization | 0.61¢ | No labels. Visually pleasing |
#12 | OpenAI o3 | 8/10 | Benchmark visualization | 12.74¢ | Clear labels. Poor color choice |
#12 | Google Gemini 2.5 Pro Preview (06-05) | 8/10 | Benchmark visualization | 13.97¢ | Clear labels. Poor color choice |
#14 | Anthropic Claude 3.7 Sonnet | 7.5/10 | Benchmark visualization | 5.10¢ | Number labels. Good idea |
#15 | Google Gemini 2.5 Pro Experimental | 7/10 | Benchmark visualization | N/A | No labels. Good colors |
#15 | DeepSeek DeepSeek-V3 (New) | 7/10 | Benchmark visualization | 0.42¢ | No labels. Good colors |
#15 | Google Gemini 2.5 Pro Preview (05-06) | 7/10 | Benchmark visualization | 4.61¢ | No labels. Good colors |
#15 | OpenRouter: Mistral Medium 3 | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#15 | OpenRouter: Mistral: Devstral Small | 7/10 | Benchmark visualization | N/A | No labels. Good colors
#15 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7/10 | Benchmark visualization | 1.64¢ | horizontal bars. minor formatting issues |
#21 | Google Gemini 2.5 Pro Preview (03-25) | 6/10 | Benchmark visualization | 5.35¢ | Minor bug. No labels |
#22 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#22 | Stealth Horizon Alpha | 5.5/10 | Benchmark visualization | N/A | Strange visualization with major formatting issues |
#22 | DeepSeek DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.55¢ | Strange visualization with major formatting issues |
#22 | DeepSeek DeepSeek-V3.1 | 5.5/10 | Benchmark visualization | 0.49¢ | Strange visualization with major formatting issues |
#26 | OpenRouter: Qwen3 235B A22B | 5/10 | Benchmark visualization | N/A | Very small. Hard to read
#26 | OpenRouter: Mercury Coder Small Beta | 5/10 | Benchmark visualization | N/A | No color. Hard to read
#28 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 1/10 | Benchmark visualization | N/A | doesn't run. bugfix not obvious
Evaluation Rubrics: Clean MDX
Criteria:
- No text content was removed: 9/10
- Some text content was removed: 8/10
Additional components:
- Left-over elements:
  - Left-over tables: -0.5 rating
  - Left-over mdx import statements: -0.5 rating
  - Left-over mdx components: -0.5 rating
- Newline handling:
  - The output does not contain newlines: -1 rating
  - The output has 1 or more newline issues: -0.5 rating
- Short code (1500 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
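To make the "left-over" criteria concrete: an MDX post contains import statements and JSX components that a cleaned, plain-markdown version should not. The naive regex-based cleaner below is our illustration of the task only (robust MDX handling needs a real parser), not the eval's reference solution.

```typescript
// Naive sketch: strip MDX-specific syntax while keeping the text.
// Regexes are for illustration; real MDX should be parsed properly.
function cleanMdx(source: string): string {
  return source
    // drop import lines such as: import Chart from "../components/Chart"
    .replace(/^import\s+.*$/gm, "")
    // drop self-closing JSX components such as: <Chart data={props} />
    .replace(/^<[A-Z][\w.]*\b[^>]*\/>$/gm, "")
    // collapse blank runs left behind (avoids "newline issues")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}
```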
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9/10 | Clean mdx | 25.74¢ | 100% match |
#2 | Google Gemini 2.5 Pro | 8.5/10 | Clean mdx | 17.84¢ | left-over components |
#2 | Google Gemini 2.5 Pro | 8.5/10 | Clean mdx | 13.42¢ | left-over components |
#4 | OpenAI GPT-5 | 8/10 | Clean mdx | 14.89¢ | 1 newline issue. left-over component |
#4 | OpenAI GPT-5 | 8/10 | Clean mdx | 16.46¢ | 1 newline issue. left-over component |
#4 | xAI Grok 4 | 8/10 | Clean mdx | 25.82¢ | no newline |
#4 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.17¢ | left-over import and components |
#4 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | Clean mdx | 0.15¢ | left-over import and components |
#4 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | Clean mdx | 0.42¢ | 2 newline issues. left-over component |
#4 | xAI Grok Code Fast 1 | 8/10 | Clean mdx | 0.77¢ | 3 newline issues. left-over imports |
#4 | OpenAI GPT-5 (High) | 8/10 | Clean mdx | 28.39¢ | 3 newline issues. left-over component
#12 | Anthropic Claude Sonnet 4 | 7.5/10 | Clean mdx | 0.95¢ | text removed. left-over component |
#12 | OpenAI GPT-4.1 | 7.5/10 | Clean mdx | 0.77¢ | left-over tables. left-over import and components |
#12 | Anthropic Claude Opus 4 | 7.5/10 | Clean mdx | 5.34¢ | text removed. left-over component |
#12 | OpenAI GPT-4.1 | 7.5/10 | Clean mdx | 0.95¢ | 6 newline issues. left-over import and components
#12 | OpenRouter (Alibaba Plus) Qwen3 Coder | 7.5/10 | Clean mdx | 0.40¢ | text removed. left-over component |
#12 | xAI Grok Code Fast 1 | 7.5/10 | Clean mdx | 1.06¢ | 4 newline issues. left-over imports and component |
#12 | OpenAI GPT-5 (High) | 7.5/10 | Clean mdx | 25.71¢ | text removed. left-over component |
#12 | Z.ai GLM-4.5 | 7.5/10 | Clean mdx | 1.37¢ | 1 newline issue. left-over import, component |
#20 | Anthropic Claude Opus 4 | 7/10 | Clean mdx | 5.48¢ | text removed. left-over import and components |
#20 | Moonshot AI Kimi K2 0711 | 7/10 | Clean mdx | 0.13¢ | text removed. left-over import and components |
#20 | Moonshot AI Kimi K2 0711 | 7/10 | Clean mdx | 0.14¢ | text removed. left-over import and components |
#23 | DeepSeek DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | left-over import and component. left-over table |
#23 | DeepSeek DeepSeek-V3.1 | 6.5/10 | Clean mdx | 0.12¢ | left-over import and component. left-over table |
#23 | Z.ai GLM-4.5 | 6.5/10 | Clean mdx | 1.71¢ | text removed. left-over import and components. 1 newline issue |
#26 | Anthropic Claude Sonnet 4 | 6/10 | Clean mdx | 0.92¢ | text removed. no newline. left-over import and component |
Evaluation Rubrics: Clean Markdown v2
Criteria:
- Code runs and gives correct (expected) output: 9/10
- The output has 1 or more newline issues: 8.5/10
- The output does not contain newlines: 8/10
Additional components:
- Short code (1000 characters or less) that is correct: +0.25 rating
- Verbose output: -0.5 rating
- Missing export statement: -0.5 rating
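The newline criteria are easiest to see in code. A hedged sketch of the kind of normalization a correct answer performs follows; the function name and exact rules are our assumptions, and the export statement nods to the "missing export statement" penalty.

```typescript
// Illustrative only: normalize newlines so the output neither loses
// them entirely nor carries stray ones.
export function normalizeNewlines(text: string): string {
  return text
    .replace(/\r\n?/g, "\n")    // CRLF / CR -> LF
    .replace(/\n{3,}/g, "\n\n") // at most one blank line in a row
    .replace(/\n*$/, "\n");     // exactly one trailing newline
}
```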
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.25/10 | clean markdown v2 | 4.02¢ | correct. short code |
#1 | Moonshot AI Kimi K2 0711 | 9.25/10 | clean markdown v2 | 0.11¢ | correct. short code |
#1 | OpenRouter (Alibaba Plus) Qwen3 Coder | 9.25/10 | clean markdown v2 | 0.33¢ | correct. short code |
#1 | DeepSeek DeepSeek-V3.1 | 9.25/10 | clean markdown v2 | 0.09¢ | correct. short code |
#1 | xAI Grok Code Fast 1 | 9.25/10 | clean markdown v2 | 0.89¢ | correct. short code |
#6 | Google Gemini 2.5 Pro | 9/10 | clean markdown v2 | 13.60¢ | correct |
#6 | OpenAI o3 | 9/10 | clean markdown v2 | 13.79¢ | correct |
#6 | OpenAI GPT-5 | 9/10 | clean markdown v2 | 19.28¢ | correct |
#6 | OpenAI GPT-5 (High) | 9/10 | clean markdown v2 | 29.55¢ | correct |
#10 | OpenAI GPT-4.1 | 8.5/10 | clean markdown v2 | 1.12¢ | 1 newline issue
#10 | xAI Grok 4 | 8.5/10 | clean markdown v2 | 13.06¢ | 1 newline issue
#10 | Stealth Horizon Alpha | 8.5/10 | clean markdown v2 | N/A | 1 newline issue
#10 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | clean markdown v2 | 0.15¢ | 1 newline issue
#10 | Z.ai GLM-4.5 | 8.5/10 | clean markdown v2 | 3.64¢ | didn't add export |
#15 | Anthropic Claude Sonnet 4 | 8/10 | clean markdown v2 | 0.77¢ | no newlines
#15 | DeepSeek DeepSeek-V3 (New) | 8/10 | clean markdown v2 | 0.08¢ | no newlines
Evaluation Rubrics: Folder Watcher Fix
Criteria:
- Correctly solved the task: 9/10
Additional components:
- Added helpful extra logic: +0.25 rating
- Added unnecessary code: -0.25 rating
- Returned code in diff format: -1 rating
- Verbose output: -0.5 rating
- Concise response (only changed code): +0.25 rating
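The eval's folder-watcher code and its bug are not reproduced here, but the sketch below shows a typical Node.js watcher of the kind this task exercises; the debounce is one example of the "helpful extra logic" the rubric rewards. Note that fs.watch's recursive option is platform-dependent (long available on macOS and Windows, on Linux from Node 20).

```typescript
import { watch } from "node:fs";

// Generic illustration, not the eval's actual code or fix.
const timers = new Map<string, NodeJS.Timeout>();

watch("./watched", { recursive: true }, (event, filename) => {
  if (!filename) return;
  // debounce: editors often emit several events per save
  clearTimeout(timers.get(filename));
  timers.set(
    filename,
    setTimeout(() => {
      console.log(`${event}: ${filename}`);
      timers.delete(filename);
    }, 100),
  );
});
```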
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 9.5/10 | Folder watcher fix | 13.70¢ | solved. extra logic. concise |
#1 | Stealth Horizon Alpha | 9.5/10 | Folder watcher fix | N/A | solved. extra logic. concise, respects indentation well |
#1 | xAI Grok Code Fast 1 | 9.5/10 | Folder watcher fix | 0.43¢ | solved. extra logic. concise |
#1 | Z.ai GLM-4.5 | 9.5/10 | Folder watcher fix | 1.06¢ | solved. extra logic |
#5 | OpenAI o4-mini | 9.25/10 | Folder watcher fix | 1.28¢ | solved. extra logic |
#5 | Anthropic Claude Sonnet 4 | 9.25/10 | Folder watcher fix | 2.61¢ | solved. concise |
#5 | Moonshot AI Kimi K2 0711 | 9.25/10 | Folder watcher fix | 0.48¢ | solved. extra logic |
#5 | Anthropic Claude Opus 4 | 9.25/10 | Folder watcher fix v2 | 13.18¢ | solved. concise |
#9 | Anthropic Claude 3.7 Sonnet | 8.75/10 | Folder watcher fix | 4.41¢ | solved. very verbose. extra logic |
#9 | xAI Grok 4 | 8.75/10 | Folder watcher fix | 4.36¢ | solved. extra logic. verbose |
#9 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Folder watcher fix | 1.50¢ | unnecessary code |
#9 | OpenAI GPT-5 | 8.75/10 | Folder watcher fix | 4.46¢ | solved. extra logic. verbose |
#13 | OpenAI GPT-4.1 | 8.5/10 | Folder watcher fix | 1.58¢ | solved. verbose |
#13 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | Folder watcher fix | 2.59¢ | solved. verbose |
#13 | Anthropic Claude Opus 4 | 8.5/10 | Folder watcher fix | 16.76¢ | solved. verbose |
#13 | OpenRouter: Mistral Medium 3 | 8.5/10 | Folder watcher fix | N/A | solved. verbose
#13 | DeepSeek DeepSeek-V3 (New) | 8.5/10 | Folder watcher fix | 0.43¢ | solved. verbose |
#13 | Google Gemini 2.5 Pro Preview (06-05) | 8.5/10 | Folder watcher fix | 16.20¢ | solved. verbose |
#13 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | Folder watcher fix | 0.28¢ | solved. verbose |
#13 | DeepSeek DeepSeek-V3.1 | 8.5/10 | Folder watcher fix | 0.45¢ | solved. verbose |
#13 | OpenAI GPT-5 (High) | 8.5/10 | Folder watcher fix | 8.91¢ | solved. verbose |
#13 | OpenAI GPT-5 (High) | 8.5/10 | Folder watcher fix v2 | 10.13¢ | solved. verbose |
#23 | OpenAI o3 | 8/10 | Folder watcher fix | 9.82¢ | solved. diff format |
#23 | Google Gemini 2.5 Pro | 8/10 | Folder watcher fix | 21.98¢ | solved in a different way. diff format |
Evaluation Rubrics: Kanji Image
Criteria:
- Correct explanation: 9/10
- Tangentially related explanation: 6/10
- Incorrect or ambiguous explanation: 5/10
- Did not recognize image: 1/10
Additional components:
- Provides multiple explanations
  - Includes one wrong explanation: -0.5 rating
  - Final or main explanation wrong: -1 rating
- Verbose output: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Kanji image | 2.27¢ | correct |
#1 | OpenAI o3 | 9/10 | Kanji image | 8.27¢ | correct |
#1 | OpenAI GPT-5 | 9/10 | Kanji image | 13.53¢ | correct explanation |
#1 | OpenAI GPT-5 (High) | 9/10 | Kanji image | 12.65¢ | correct |
#5 | xAI Grok 4 | 7.5/10 | Kanji image | 15.85¢ | main explanation wrong. alt explanation correct. verbose
#6 | Anthropic Claude Opus 4 | 6/10 | Kanji image | 3.96¢ | tangential |
#7 | OpenAI GPT-4.1 | 5/10 | Kanji image | 0.40¢ | failed |
#7 | Anthropic Claude 3.7 Sonnet | 5/10 | Kanji image | 0.80¢ | failed |
#7 | OpenAI GPT-4o | 5/10 | Kanji image | 0.70¢ | failed |
#7 | OpenRouter: Meta: Llama 4 Maverick | 5/10 | Kanji image | N/A | ambiguous output
#7 | Anthropic Claude Sonnet 4 | 5/10 | Kanji image | 0.91¢ | failed |
#7 | Google Gemini 2.5 Pro | 5/10 | Kanji image | 5.48¢ | failed |
#13 | OpenRouter: Qwen3 235B A22B | 1/10 | Kanji image | N/A | Didn't recognize image
#13 | Anthropic Claude Opus 4.1 | 1/10 | Kanji image | 4.20¢ | failed |
Evaluation Rubrics: Image Analysis
Criteria:
- Correct explanation: 9/10
- Missed the point: 6/10
Additional components:
- Detailed explanation: +0.25 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | xAI Grok 4 | 9.25/10 | Image analysis | 8.87¢ | correct. detailed explanation |
#2 | OpenAI GPT-4.1 | 9/10 | Image analysis | 0.24¢ | correct |
#2 | Google Gemini 2.5 Pro Experimental | 9/10 | Image analysis | N/A | correct |
#2 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | Image analysis | 1.83¢ | correct |
#2 | OpenAI o3 | 9/10 | Image analysis | 2.90¢ | correct |
#2 | Google Gemini 2.5 Pro | 9/10 | Image analysis | 1.45¢ | correct |
#2 | Stealth Horizon Alpha | 9/10 | Image analysis | N/A | correct |
#2 | OpenRouter: Horizon Beta | 9/10 | Image analysis | N/A | correct
#2 | OpenAI GPT-5 | 9/10 | Image analysis | 2.32¢ | correct |
#2 | OpenAI GPT-5 (High) | 9/10 | Image analysis | 6.16¢ | correct |
#11 | Anthropic Claude 3.7 Sonnet | 6/10 | Image analysis | 0.68¢ | missed point |
#11 | OpenRouter: Meta: Llama 4 Maverick | 6/10 | Image analysis | N/A | missed point
#11 | Anthropic Claude Sonnet 4 | 6/10 | Image analysis | 0.65¢ | missed point
#11 | Anthropic Claude Opus 4 | 6/10 | Image analysis | 3.35¢ | missed point
#11 | Anthropic Claude Opus 4.1 | 6/10 | Image analysis | 3.35¢ | missed point |
Evaluation Rubrics: Image Table Data Extraction
Criteria:
- All 4 models are correctly extracted: 9.5/10
- 3 models are correctly extracted: 8/10
- 2 models are correctly extracted: 6/10
- 1 model is correctly extracted: 3/10
- 0 models are correctly extracted: 1/10
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.71¢ | All 4 models correct |
#1 | Google Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.70¢ | All 4 models correct |
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | Image table data extraction | 0.73¢ | All 4 models correct |
#1 | Google Gemini 2.5 Pro | 9.5/10 | Image table data extraction | 1.24¢ | All 4 models correct |
#5 | Google Gemini 2.5 Flash | 8/10 | Image table data extraction | 0.18¢ | 3 models correct |
#5 | OpenAI GPT-5 (High) | 8/10 | Image table data extraction | 12.39¢ | 3 models correct |
#7 | Google Gemini 2.5 Flash | 6/10 | Image table data extraction | 0.18¢ | 2 models correct |
#7 | OpenAI GPT-5 (High) | 6/10 | Image table data extraction | 14.95¢ | 2 models correct |
#9 | Anthropic Claude Opus 4.1 | 3/10 | Image table data extraction | 3.49¢ | 1 model correct |
#9 | Anthropic Claude Opus 4.1 | 3/10 | Image table data extraction | 4.54¢ | 1 model correct |
#9 | OpenAI GPT-5 | 3/10 | Image table data extraction | 11.11¢ | 1 model correct |
#12 | OpenAI GPT-5 | 1/10 | Image table data extraction | 11.50¢ | all wrong |
#12 | xAI Grok 4 | 1/10 | Image table data extraction | 15.63¢ | all wrong |
#12 | xAI Grok 4 | 1/10 | Image table data extraction | 15.26¢ | all wrong |
Evaluation Rubrics: TODO Task
Criteria:
- Outputs only the changed code (follows instructions): 9/10
- Outputs the full code (does not follow instructions): 8/10
Additional components:
- Concise response
  - Very concise response (<=1300 characters): +0.25 rating
  - Very very concise response (<=1200 characters): +0.5 rating
- Verbose output (>=1500 characters): -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | TODO task (Claude) | 0.76¢ | Very concise *2 |
#1 | Google Gemini 2.5 Pro Preview (06-05) | 9.5/10 | TODO task v2 (concise) | 1.80¢ | Very concise *2 |
#1 | Anthropic Claude Opus 4 | 9.5/10 | TODO task (Claude) | 3.88¢ | Very concise *2 |
#1 | xAI Grok 4 | 9.5/10 | TODO task | 3.87¢ | Very concise *2 |
#1 | Google Gemini 2.5 Pro | 9.5/10 | TODO task v2 (concise) | 2.12¢ | Very concise *2 |
#1 | OpenAI GPT-5 | 9.5/10 | TODO task v2 (concise) | 3.77¢ | very concise *2 |
#1 | xAI Grok Code Fast 1 | 9.5/10 | TODO task v2 (concise) | 0.17¢ | Very concise *2 |
#1 | OpenAI GPT-5 (High) | 9.5/10 | TODO task v2 (concise) | 7.62¢ | Very concise *2 |
#9 | OpenAI GPT-4.1 | 9.25/10 | TODO task | 0.38¢ | Very concise |
#9 | Google Gemini 2.5 Pro Experimental | 9.25/10 | TODO task v2 (concise) | N/A | Very concise |
#11 | DeepSeek DeepSeek-V3 (New) | 9/10 | TODO task | 0.10¢ | Follows instruction |
#11 | Google Gemini 2.5 Pro Preview (05-06) | 9/10 | TODO task v2 (concise) | 2.12¢ | Follows instruction |
#11 | Google Gemini 2.5 Pro Preview (06-05) | 9/10 | TODO task | 3.91¢ | Follows instruction |
#11 | Anthropic Claude 3.5 Sonnet | 9/10 | TODO task | 0.86¢ | Follows instruction |
#11 | OpenAI GPT-5 | 9/10 | TODO task | 3.01¢ | follows instructions |
#16 | OpenRouter: OpenAI: Codex Mini | 8.5/10 | TODO task | N/A | Asked for more context!
#16 | OpenRouter google/gemini-2.5-flash-preview-05-20:thinking | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#16 | Stealth Horizon Alpha | 8.5/10 | TODO task | N/A | Follows instruction. Verbose |
#16 | Stealth Horizon Alpha | 8.5/10 | TODO task v2 (concise) | N/A | Follows instruction. Verbose |
#16 | OpenRouter gpt-oss-120b (Cerebras) | 8.5/10 | TODO task | 0.07¢ | Follows instruction. Verbose |
#21 | OpenRouter: Mercury Coder Small Beta | 8/10 | TODO task | N/A | output full code
#21 | OpenRouter: Qwen3 235B A22B | 8/10 | TODO task | N/A | output full code
#21 | OpenRouter: Mistral Medium 3 | 8/10 | TODO task | N/A | output full code
#21 | Fireworks AI DeepSeek V3 (0324) | 8/10 | TODO task | N/A | output full code |
#21 | OpenRouter: Mistral: Devstral Small | 8/10 | TODO task | N/A | output full code
#21 | Anthropic Claude Sonnet 4 | 8/10 | TODO task | 1.12¢ | output full code |
#21 | Anthropic Claude Opus 4 | 8/10 | TODO task | 5.66¢ | output full code |
#21 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 8/10 | TODO task v2 (concise) | N/A | output full code
#21 | Moonshot AI Kimi K2 0711 | 8/10 | TODO task | 0.15¢ | output full code |
#21 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8/10 | TODO task | 0.43¢ | output full code |
#21 | OpenRouter gpt-oss-120b (Cerebras) | 8/10 | TODO task v2 (concise) | 0.08¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task v2 (concise) | 0.11¢ | output full code |
#21 | DeepSeek DeepSeek-V3.1 | 8/10 | TODO task (Claude) | 0.11¢ | output full code |
#21 | xAI Grok Code Fast 1 | 8/10 | TODO task | 0.32¢ | output full code |
#21 | Z.ai GLM-4.5 | 8/10 | TODO task v2 (concise) | 0.57¢ | output full code |
#38 | Anthropic Claude 3.7 Sonnet | 7.5/10 | TODO task | 1.20¢ | verbose |
#38 | Google Gemini 2.5 Pro Preview (03-25) | 7.5/10 | TODO task | 1.14¢ | Verbose |
#38 | Anthropic Claude 3.7 Sonnet | 7.5/10 | TODO task (Claude) | 1.33¢ | verbose |
#38 | OpenRouter: Meta: Llama 4 Maverick | 7.5/10 | TODO task | N/A | verbose
#38 | OpenAI o3 | 7.5/10 | TODO task | 5.65¢ | diff format |
Evaluation Rubrics: Tailwind CSS v3 z-index
Criteria:
- Bug identified and fixed: 9/10
- Bug identified but not fixed: 7/10
- Bug not identified: 1/10
Additional components:
- Removes extra z-index values: +0.25 rating
- Uses correct custom values syntax (e.g., z-[60]): +0.25 rating
- Wrong explanation of the bug despite the correct fix: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task. The higher rating will be used.
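For context on the custom-values criterion: Tailwind CSS v3's default z-index scale stops at z-50, so stacking an element above a z-50 layer requires an arbitrary value such as z-[60]. A minimal TSX illustration (ours, not the eval's actual component):

```tsx
// Hypothetical overlay: the dialog must sit above a z-50 header,
// which the default scale (z-0 ... z-50) cannot express.
export const Overlay = () => (
  <div className="fixed inset-0 z-40 bg-black/50">
    <div className="relative z-[60] rounded bg-white p-4 shadow" role="dialog">
      Renders above the header because 60 &gt; 50.
    </div>
  </div>
);
```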
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.44¢ | fixed. removes extra |
#1 | Anthropic Claude Sonnet 4 | 9.25/10 | Tailwind css v3 z-index | 2.26¢ | fixed. removes extra |
#1 | OpenAI GPT-5 | 9.25/10 | Tailwind css v3 z-index | 3.81¢ | fixed. correct custom value syntax |
#1 | OpenAI GPT-5 | 9.25/10 | Tailwind css v3 z-index | 4.28¢ | fixed. correct custom value syntax |
#1 | Anthropic Claude Opus 4 | 9.25/10 | Tailwind css v3 z-index | 11.49¢ | fixed. removes extra |
#1 | xAI Grok 4 | 9.25/10 | Tailwind css v3 z-index | 12.30¢ | fixed. correct custom value syntax |
#1 | xAI Grok 4 | 9.25/10 | Tailwind css v3 z-index | 20.80¢ | fixed. correct custom value syntax |
#1 | OpenRouter gpt-oss-120b (Cerebras) | 9.25/10 | Tailwind css v3 z-index | 0.10¢ | fixed. correct custom value syntax |
#1 | Google Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 12.96¢ | fixed. removes extra |
#1 | Google Gemini 2.5 Pro | 9.25/10 | Tailwind css v3 z-index | 8.07¢ | fixed. removes extra |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 8.28¢ | fixed. correct custom value syntax |
#1 | OpenAI GPT-5 (High) | 9.25/10 | Tailwind css v3 z-index | 7.49¢ | fixed. removes extra |
#13 | OpenAI GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.04¢ | fixed |
#13 | OpenAI GPT-4.1 | 9/10 | Tailwind css v3 z-index | 1.05¢ | fixed |
#13 | Anthropic Claude Opus 4 | 9/10 | Tailwind css v3 z-index | 11.30¢ | fixed |
#13 | OpenRouter gpt-oss-120b (Cerebras) | 9/10 | Tailwind css v3 z-index | 0.20¢ | fixed |
#17 | OpenRouter (Alibaba Plus) Qwen3 Coder | 8.75/10 | Tailwind css v3 z-index | 1.02¢ | fixed. removes extra. wrong explanation |
#18 | Moonshot AI Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.15¢ | not identified |
#18 | Moonshot AI Kimi K2 0711 | 1/10 | Tailwind css v3 z-index | 0.12¢ | not identified |
#18 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | Tailwind css v3 z-index | 0.95¢ | not identified |
#18 | DeepSeek DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.24¢ | not identified |
#18 | DeepSeek DeepSeek-V3.1 | 1/10 | Tailwind css v3 z-index | 0.22¢ | not identified |
#18 | xAI Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.71¢ | not identified |
#18 | xAI Grok Code Fast 1 | 1/10 | Tailwind css v3 z-index | 0.38¢ | not identified |
#18 | Z.ai GLM-4.5 | 1/10 | Tailwind css v3 z-index | 1.71¢ | not identified |
#18 | Z.ai GLM-4.5 | 1/10 | Tailwind css v3 z-index | 0.46¢ | not identified |
Evaluation Rubrics: TypeScript Narrowing v3
Criteria:
- Provides a working method (without the 'in' keyword): 8/10
- Uses the 'in' keyword: 6/10
- Did not work (wrong answer): 1/10
Additional components:
- Provides multiple working methods: +0.5 rating
- Provides multiple methods
  - Includes one wrong method: -0.5 rating
  - Final answer wrong: -1 rating
- Verbose output: -0.5 rating
Additional instructions for variance:
- Each model is given two tries for this task to account for large variance in output. The higher rating will be used.
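To make the two graded approaches concrete, here is what they look like on a made-up discriminated union (the eval's actual code is not shown here):

```typescript
// Hypothetical types for illustration only.
type Square = { kind: "square"; size: number };
type Circle = { kind: "circle"; radius: number };
type Shape = Square | Circle;

// Scores 6/10 under this rubric: narrows with the 'in' keyword.
function areaWithIn(s: Shape): number {
  return "size" in s ? s.size ** 2 : Math.PI * s.radius ** 2;
}

// Scores 8/10: a working method without 'in', here a user-defined
// type predicate (checking the discriminant also works).
function isSquare(s: Shape): s is Square {
  return s.kind === "square";
}
function area(s: Shape): number {
  return isSquare(s) ? s.size ** 2 : Math.PI * s.radius ** 2;
}
```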
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Opus 4 | 8.5/10 | TypeScript narrowing v3 | 5.21¢ | both methods work |
#1 | OpenAI GPT-5 (High) | 8.5/10 | TypeScript narrowing v3 | 8.50¢ | both methods work |
#3 | Anthropic Claude 3.5 Sonnet | 8/10 | TypeScript narrowing v3 | 0.85¢ | second and final answer works |
#3 | Anthropic Claude Sonnet 4 | 8/10 | TypeScript narrowing v3 | 0.98¢ | second method works |
#3 | xAI Grok Code Fast 1 | 8/10 | TypeScript narrowing v3 | 0.40¢ | correct |
#6 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.15¢ | 2nd method works |
#6 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | TypeScript narrowing v3 | 0.11¢ | 2nd method works |
#8 | Anthropic Claude 3.7 Sonnet | 7/10 | TypeScript narrowing v3 | 1.22¢ | second answer works. final answer wrong |
#9 | OpenAI GPT-4.1 | 6/10 | TypeScript narrowing v3 | 0.37¢ | use in keyword |
#9 | OpenRouter: Mistral: Devstral Small | 6/10 | TypeScript narrowing v3 | N/A | almost correct
#9 | xAI Grok 4 | 6/10 | TypeScript narrowing v3 | 8.55¢ | use in keyword |
#9 | xAI Grok Code Fast 1 | 6/10 | TypeScript narrowing v3 | 0.40¢ | use in keyword |
#9 | Z.ai GLM-4.5 | 6/10 | TypeScript narrowing v3 | 0.80¢ | use in keyword |
#14 | Google Gemini 2.5 Pro Experimental | 5.5/10 | TypeScript narrowing v3 | N/A | use in keyword. verbose |
#14 | Google Gemini 2.5 Pro Preview (05-06) | 5.5/10 | TypeScript narrowing v3 | 2.27¢ | use in keyword. verbose |
#14 | Google Gemini 2.5 Pro Preview (06-05) | 5.5/10 | TypeScript narrowing v3 | 12.54¢ | use in keyword. verbose |
#14 | Stealth Horizon Alpha | 5.5/10 | TypeScript narrowing v3 | N/A | first method didn't work. second method uses in keyword |
#14 | OpenAI GPT-5 (High) | 5.5/10 | TypeScript narrowing v3 | 5.26¢ | use in keyword. one wrong method |
#19 | OpenAI o3 | 1/10 | TypeScript narrowing v3 | 5.46¢ | wrong |
#19 | DeepSeek DeepSeek-V3 (New) | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong. mention predicate |
#19 | OpenRouter: Mistral Medium 3 | 1/10 | TypeScript narrowing v3 | N/A | wrong
#19 | OpenRouter: Mercury Coder Small Beta | 1/10 | TypeScript narrowing v3 | N/A | wrong
#19 | Google Gemini 2.5 Pro | 1/10 | TypeScript narrowing v3 | 3.98¢ | wrong |
#19 | Moonshot AI Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.12¢ | wrong |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.34¢ | wrong |
#19 | Moonshot AI Kimi K2 0711 | 1/10 | TypeScript narrowing v3 | 0.10¢ | wrong |
#19 | OpenRouter (Alibaba Plus) Qwen3 Coder | 1/10 | TypeScript narrowing v3 | 0.38¢ | none of the 3 methods worked
#19 | OpenAI GPT-5 | 1/10 | TypeScript narrowing v3 | 2.26¢ | wrong |
#19 | OpenAI GPT-5 | 1/10 | TypeScript narrowing v3 | 1.89¢ | wrong |
#19 | DeepSeek DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong |
#19 | DeepSeek DeepSeek-V3.1 | 1/10 | TypeScript narrowing v3 | 0.08¢ | wrong |
#19 | Z.ai GLM-4.5 | 1/10 | TypeScript narrowing v3 | 0.55¢ | wrong |
Evaluation Rubrics: AI Timeline
Criteria:
- Covers all points (20): 10/10
- Covers almost all points (>=18): 9.5/10
- Covers most points (>=15): 9/10
- Covers major points (>=13): 8.5/10
- Missed some points (<13): 8/10
Additional components:
- Bad headline: -0.5 rating
- Concise response: +0.25 rating
- Too concise response: -0.25 rating
- Verbose output: -0.5 rating
- Wrong format: -0.5 rating
Rank | Model | Rating | Prompt | Cost | Notes |
---|---|---|---|---|---|
#1 | Anthropic Claude Sonnet 4 | 9.5/10 | AI timeline | 2.50¢ | Covers almost all points |
#1 | Anthropic Claude Opus 4 | 9.5/10 | AI timeline | 14.13¢ | Covers almost all points |
#1 | OpenAI GPT-5 | 9.5/10 | AI timeline | 11.45¢ | Covers almost all points |
#4 | OpenAI GPT-4.1 | 9.25/10 | AI timeline | 0.76¢ | Covers most points. concise |
#4 | Google Gemini 2.5 Pro Preview (06-05) | 9.25/10 | AI timeline v2 (concise) | 4.92¢ | Covers most points. concise |
#4 | xAI Grok 4 | 9.25/10 | AI timeline | 3.15¢ | Covers most points. concise |
#7 | DeepSeek DeepSeek-V3 (New) | 8.75/10 | AI timeline | 0.15¢ | Covers most points. Too concise |
#8 | Anthropic Claude 3.7 Sonnet | 8.5/10 | AI timeline | 2.44¢ | Covers most points. Wrong format |
#8 | OpenRouter: Mistral Medium 3 | 8.5/10 | AI timeline | N/A | Covers most points. Wrong format
#8 | Google Gemini 2.5 Pro Experimental | 8.5/10 | AI timeline v2 (concise) | N/A | Covers most points. Wrong format |
#8 | Google Gemini 2.5 Pro Preview (05-06) | 8.5/10 | AI timeline v2 (concise) | 2.13¢ | Covers most points. Wrong format |
#8 | OpenAI o3 | 8.5/10 | AI timeline | 16.56¢ | Covers most points. Wrong format |
#8 | Google Gemini 2.5 Pro | 8.5/10 | AI timeline v2 (concise) | 5.44¢ | Covers major points |
#8 | Moonshot AI Kimi K2 0711 | 8.5/10 | AI timeline | 0.26¢ | covers major points |
#8 | DeepSeek DeepSeek-V3.1 | 8.5/10 | AI timeline | 0.23¢ | Covers most points. Wrong format |
#8 | DeepSeek DeepSeek-V3.1 | 8.5/10 | AI timeline v3 | 0.25¢ | Covers most points. Wrong format |
#17 | Google Gemini 2.5 Pro Experimental | 8/10 | AI timeline | N/A | Covers most points. Wrong format. Verbose |
#17 | Fireworks AI DeepSeek V3 (0324) | 8/10 | AI timeline | N/A | Covers major points. Wrong format |
#17 | OpenRouter: Qwen3 235B A22B | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#17 | OpenRouter: Meta: Llama 3.3 70B Instruct | 8/10 | AI timeline | N/A | Covers major points. Wrong format
#17 | Google Gemini 2.5 Pro Preview (05-06) | 8/10 | AI timeline v2 (concise) | 5.47¢ | Covers major points. Wrong format |
#22 | Azure OpenAI gpt-4o | 7.5/10 | AI timeline | 0.95¢ | Missed some points. Bad headline |
#22 | OpenRouter: Qwen: Qwen3 8B (free) | 7.5/10 | AI timeline | N/A | Missed some points. Bad headline
#22 | OpenRouter: DeepSeek R1 0528 Qwen3 8B | 7.5/10 | AI timeline | N/A | Missed some points. Wrong format
#22 | OpenRouter gpt-oss-120b (Cerebras) | 7.5/10 | AI timeline | 0.11¢ | missed points. wrong format
Evaluation Methodology
All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.
Prompt variations are used on a best-effort basis to control for stylistic differences across models.