
Grok 4 Evaluation Results: Strong Performance with Reasoning Trade-offs

Posted on July 16, 2025 by Zhu Liang

xAI recently released Grok 4, positioning it as "the most intelligent model in the world" with native tool use and real-time search integration. The model uses reinforcement learning at scale to enhance reasoning capabilities.

We tested Grok 4 across 7 evaluation tasks covering coding, writing, and image analysis to understand its capabilities and limitations. Our results show Grok 4 performs strongly but comes with notable trade-offs.

Grok 4 Evaluation Summary on 16x Eval across 7 tasks on coding, writing, and image analysis.

Coding Performance

Grok 4 demonstrates excellent performance on straightforward coding tasks. In our Next.js TODO app feature addition test, it achieved a 9.5/10 rating, tying with Claude Opus 4 and Gemini 2.5 Pro Preview for top performance.

Grok 4 Coding Task Performance across 4 Tasks

The model produces concise, well-structured code and follows instructions precisely. On the simple folder watcher fix task, Grok 4 scored 8.5/10: it solved the problem but added extra logic, making its solution more verbose than those of the top-performing Claude models.

However, Grok 4 struggles with more complex coding challenges. On the TypeScript narrowing task, it scored 6/10, falling behind Claude Opus 4 (8.5/10) and relying on less optimal approaches, such as narrowing with the in operator.
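
For readers unfamiliar with the distinction, here is a minimal sketch (using hypothetical types, not the actual evaluation prompt) contrasting narrowing via the in operator with a discriminated union, which is generally the more idiomatic TypeScript pattern:

```typescript
// Hypothetical shapes for illustration only; not taken from the evaluation prompt.

// Narrowing with the `in` operator works, but couples the check to a property name.
type Circle = { radius: number };
type Square = { side: number };

function area(shape: Circle | Square): number {
  if ("radius" in shape) {
    return Math.PI * shape.radius ** 2; // narrowed to Circle
  }
  return shape.side ** 2; // narrowed to Square
}

// A discriminated union makes the narrowing explicit and exhaustiveness-checked.
type Shape =
  | { kind: "circle"; radius: number }
  | { kind: "square"; side: number };

function areaByKind(shape: Shape): number {
  switch (shape.kind) {
    case "circle":
      return Math.PI * shape.radius ** 2;
    case "square":
      return shape.side ** 2;
  }
}
```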

The benchmark visualization task revealed similar limitations, with Grok 4 scoring 8/10 compared to the Claude models' 8.5/10. Grok 4 produced a decent side-by-side visualization with clear labels. However, its color choices were not optimal: it did not assign a distinct color to each benchmark, which would have improved readability.

Grok 4 Output for Benchmark Visualization Task

Technical Writing Task

We also put Grok 4 through our standard AI timeline writing task (technical writing). Grok 4 shows strong writing abilities, earning a 9.25/10 rating.

Grok 4 Shows Good Performance on Writing Task

The model covers most required points while maintaining conciseness, placing it among the top performers, just slightly behind Claude Sonnet 4 and Claude Opus 4, and on par with Gemini 2.5 Pro Preview and GPT-4.1.

Strong Image Understanding

One surprising finding from our evaluation is that Grok 4 excels in image analysis tasks, showing particularly strong performance in certain scenarios.

Grok 4 Shows Strong Performance on Image Analysis Tasks

On the water bottle image analysis task, it achieved a 9.25/10 rating with detailed explanations, though it took significantly longer than other models. This is the highest rating we have seen on this task to date.

Grok 4 Water Bottle Image Analysis Response

On the kanji recognition task, Grok 4 scored 7.5/10. The main explanation was incorrect, but it did mention the correct answer in the alternative explanations. This places it behind Gemini 2.5 Pro Preview and o3, which gave the right answer in the main explanation.

Slow Response Time

The most significant limitation of Grok 4 is its slow response time. As a thinking model, it generates extensive reasoning tokens before producing responses, leading to response times of 2-4 minutes for complex tasks.

For the TypeScript narrowing task, Grok 4 took 150.3 seconds to respond, compared to Gemini 2.5 Pro Preview's 84.1 seconds. The long response time is mainly due to the model's thinking process before outputting the final response.

| Prompt | Model | Response Time | Output Tokens | Token Speed |
| --- | --- | --- | --- | --- |
| TypeScript narrowing v3 | GPT-4.1 | 4.1 seconds | 401 | 96.7 t/sec |
| TypeScript narrowing v3 | Claude Sonnet 4 | 10.5 seconds | 597 | 56.8 t/sec |
| TypeScript narrowing v3 | Claude Opus 4 | 21.6 seconds | 637 (*) | 29.5 t/sec |
| TypeScript narrowing v3 | Gemini 2.5 Pro Preview | 84.1 seconds | 8313 | 98.9 t/sec |
| TypeScript narrowing v3 | Grok 4 | 150.3 seconds | 5651 | 37.6 t/sec |

*: The evaluation was conducted without enabling extended thinking for the Claude models; the Anthropic API does not report stats on thinking tokens.
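
For reference, token speed in the table is output tokens divided by response time: Grok 4's 5651 tokens over 150.3 seconds works out to roughly 37.6 tokens per second, compared to Gemini 2.5 Pro Preview's 8313 tokens in 84.1 seconds, or about 98.9 tokens per second.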

This long response time makes Grok 4 unsuitable for applications requiring quick responses or real-time interactions. The extended thinking process, while potentially beneficial for complex reasoning, significantly impacts user experience in most practical scenarios.

Summary and Practical Recommendations

Here are the areas where Grok 4 excels:

  • Grok 4 works best for deep research tasks where quality matters more than speed. Consider using it for complex analysis, detailed explanations, or tasks requiring thorough reasoning where you can afford to wait for responses.
  • The model is particularly effective for image-related tasks where its detailed explanations provide valuable insights.
  • Though we did not test it in our evaluation, Grok 4 is capable of performing live search on X, which is a powerful feature for real-time information retrieval.

Here are the areas where Grok 4 is not suitable:

  • For coding tasks, it performs competitively with other top models, but the speed penalty may not be justified.
  • For applications requiring fast responses, real-time interactions, or iterative development workflows, Grok 4 is not suitable, as its response times are too long.

Evaluating Models with 16x Eval

These comprehensive evaluations were conducted using 16x Eval, a desktop application that simplifies AI model comparison across different tasks. The tool allows you to test models systematically and compare their performance on your specific use cases.

Screenshot of 16x Eval sample evaluations

16x Eval enables you to create custom evaluations, compare multiple models, and analyze detailed performance metrics including response quality, speed, and cost.


Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to perform style control across models.

View raw evaluation data →

Related Posts

Claude Opus 4 and Claude Sonnet 4 Evaluation Results

A detailed analysis of Claude Opus 4 and Claude Sonnet 4 performance on coding and writing tasks, with comparisons to GPT-4.1, DeepSeek V3, and other leading models.

Mistral Medium 3 Coding and Writing Evaluation

A detailed look at Mistral Medium 3's performance on coding and writing tasks, compared to top models like GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro.

Why Gemini 2.5 Pro Won't Stop Talking (And How to Fix It)

Learn how to manage Gemini 2.5 Pro's verbose output, especially for coding, and compare its behavior with other models like Claude and GPT.

