
Grok 4 Evaluation Results: Strong Performance with Reasoning Trade-offs

Posted on July 16, 2025 by Zhu Liang

xAI recently released Grok 4, positioning it as "the most intelligent model in the world" with native tool use and real-time search integration. The model uses reinforcement learning at scale to enhance reasoning capabilities.

We tested Grok 4 across 7 evaluation tasks covering coding, writing, and image analysis to understand its capabilities and limitations. Our results show Grok 4 performs strongly but comes with notable trade-offs.

Grok 4 Evaluation Summary on 16x Eval across 7 tasks on coding, writing, and image analysis.

Coding Performance

Grok 4 demonstrates excellent performance on straightforward coding tasks. In our Next.js TODO app feature addition test, it achieved a 9.5/10 rating, tying with Claude Opus 4 and Gemini 2.5 Pro Preview for top performance.

Grok 4 Coding Task Performance across 4 Tasks

The model produces concise, well-structured code and follows instructions precisely. On the simple folder watcher fix task, Grok 4 scored 8.5/10: it solved the problem but added extra logic, making its solution more verbose than those of the top-performing Claude models.

However, Grok 4 struggles with more complex coding challenges. On the TypeScript narrowing task, it scored 6/10, falling behind Claude Opus 4 (8.5/10) and relying on less optimal approaches, such as narrowing with the in operator.
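
For readers unfamiliar with the distinction, here is a minimal sketch (using hypothetical types, not the actual evaluation prompt) contrasting narrowing via the in operator with a discriminated union, which is generally the more idiomatic TypeScript pattern:

```typescript
// Hypothetical shapes for illustration only; not taken from the evaluation prompt.

// Narrowing with the `in` operator works, but couples the check to a property name.
type Circle = { radius: number };
type Square = { side: number };

function area(shape: Circle | Square): number {
  if ("radius" in shape) {
    return Math.PI * shape.radius ** 2; // narrowed to Circle
  }
  return shape.side ** 2; // narrowed to Square
}

// A discriminated union makes the narrowing explicit and exhaustiveness-checked.
type Shape =
  | { kind: "circle"; radius: number }
  | { kind: "square"; side: number };

function areaByKind(shape: Shape): number {
  switch (shape.kind) {
    case "circle":
      return Math.PI * shape.radius ** 2;
    case "square":
      return shape.side ** 2;
  }
}
```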

The benchmark visualization task revealed similar limitations, with Grok 4 scoring 8/10 compared to the Claude models' 8.5/10. Grok 4 produced a decent side-by-side visualization with clear labels. However, its color choices were not optimal: it did not assign a distinct color to each benchmark, which would have improved readability.

Grok 4 Output for Benchmark Visualization Task

Technical Writing Task

We also put Grok 4 through our standard AI timeline writing task (technical writing). Grok 4 shows strong writing abilities, earning a 9.25/10 rating.

Grok 4 Shows Good Performance on Writing Task

The model covers most required points while maintaining conciseness, placing it among the top performers, just slightly behind Claude Sonnet 4 and Claude Opus 4, and on par with Gemini 2.5 Pro Preview and GPT-4.1.

Strong Image Understanding

One surprising finding from our evaluation is that Grok 4 excels in image analysis tasks, showing particularly strong performance in certain scenarios.

Grok 4 Shows Strong Performance on Image Analysis Tasks

On the water bottle image analysis task, it achieved a 9.25/10 rating with detailed explanations, though it took significantly longer than other models. This is the highest rating we have seen on this task to date.

Grok 4 Water Bottle Image Analysis Response

On the kanji recognition task, Grok 4 scored 7.5/10. The main explanation was incorrect, but it did mention the correct answer in the alternative explanations. This places it behind Gemini 2.5 Pro Preview and o3, which gave the right answer in the main explanation.

Slow Response Time

The most significant limitation of Grok 4 is its slow response time. As a thinking model, it generates extensive reasoning tokens before producing responses, leading to response times of 2-4 minutes for complex tasks.

For the TypeScript narrowing task, Grok 4 took 150.3 seconds to respond, compared to Gemini 2.5 Pro Preview's 84.1 seconds. The long response time is mainly due to the model's thinking process before outputting the final response.

| Prompt | Model | Response Time | Output Tokens | Token Speed |
| --- | --- | --- | --- | --- |
| TypeScript narrowing v3 | GPT-4.1 | 4.1 seconds | 401 | 96.7 t/sec |
| TypeScript narrowing v3 | Claude Sonnet 4 | 10.5 seconds | 597 | 56.8 t/sec |
| TypeScript narrowing v3 | Claude Opus 4 | 21.6 seconds | 637 (*) | 29.5 t/sec |
| TypeScript narrowing v3 | Gemini 2.5 Pro Preview | 84.1 seconds | 8313 | 98.9 t/sec |
| TypeScript narrowing v3 | Grok 4 | 150.3 seconds | 5651 | 37.6 t/sec |

*: The evaluation was conducted without enabling extended thinking for the Claude models; the Anthropic API does not report stats on thinking tokens.
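
For reference, token speed in the table is output tokens divided by response time: Grok 4's 5651 tokens over 150.3 seconds works out to roughly 37.6 tokens per second, compared to Gemini 2.5 Pro Preview's 8313 tokens in 84.1 seconds, or about 98.9 tokens per second.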

This long response time makes Grok 4 unsuitable for applications requiring quick responses or real-time interactions. The extended thinking process, while potentially beneficial for complex reasoning, significantly impacts user experience in most practical scenarios.

Summary and Practical Recommendations

Here are the areas where Grok 4 excels:

  • Grok 4 works best for deep research tasks where quality matters more than speed. Consider using it for complex analysis, detailed explanations, or tasks requiring thorough reasoning where you can afford to wait for responses.
  • The model is particularly effective for image-related tasks where its detailed explanations provide valuable insights.
  • Though we did not test it in our evaluation, Grok 4 is capable of performing live search on X, which is a powerful feature for real-time information retrieval.

Here are the areas where Grok 4 is not suitable:

  • For coding tasks, it performs competitively with other top models, but the speed penalty may not be justified.
  • For applications requiring fast responses, real-time interactions, or iterative development workflows, Grok 4 is not suitable, as its response times are too long.

Evaluating Models with 16x Eval

These comprehensive evaluations were conducted using 16x Eval, a desktop application that simplifies AI model comparison across different tasks. The tool allows you to test models systematically and compare their performance on your specific use cases.

Screenshot of 16x Eval sample evaluations

16x Eval enables you to create custom evaluations, compare multiple models, and analyze detailed performance metrics including response quality, speed, and cost.


Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to perform style control across models.

View raw evaluation data →

Related Posts

Claude Opus 4 and Claude Sonnet 4 Evaluation Results

A detailed analysis of Claude Opus 4 and Claude Sonnet 4 performance on coding and writing tasks, with comparisons to GPT-4.1, DeepSeek V3, and other leading models.

Mistral Medium 3 Coding and Writing Evaluation

A detailed look at Mistral Medium 3's performance on coding and writing tasks, compared to top models like GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro.

Why Gemini 2.5 Pro Won't Stop Talking (And How to Fix It)

Learn how to manage Gemini 2.5 Pro's verbose output, especially for coding, and compare its behavior with other models like Claude and GPT.

