
Mistral Medium 3 Coding and Writing Evaluation

Posted on May 9, 2025 by Zhu Liang

Mistral Medium 3 is the latest medium-sized model from Mistral.

We recently evaluated Mistral Medium 3 via OpenRouter, comparing it to leading models such as GPT-4.1, Claude 3.7 Sonnet, DeepSeek V3, and Gemini 2.5 Pro.
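For reference, OpenRouter exposes models through an OpenAI-compatible chat completions API. Below is a minimal TypeScript sketch of this kind of call; the model slug and prompt are illustrative assumptions, so verify the exact identifier on OpenRouter's model listing.

```typescript
// Minimal sketch of querying Mistral Medium 3 via OpenRouter's
// OpenAI-compatible chat completions endpoint. The model slug and
// prompt are illustrative, not the exact ones from our eval.
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    model: "mistralai/mistral-medium-3", // assumed slug; check openrouter.ai
    messages: [
      { role: "user", content: "Add a new feature to this Next.js TODO app." },
    ],
  }),
});

const data = await response.json();
console.log(data.choices[0].message.content);
```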

Here's a summary of the results:

[Image: Mistral Medium 3 Evaluation Summary]

Coding: Simple TODO App Feature

For the simple task of adding a feature to a Next.js TODO app, Mistral Medium 3 delivered a concise and correct response.

The model earned a solid 8.5/10 rating, just behind DeepSeek V3 (New), which remains the best-performing model in the medium-sized category:

[Image: Mistral Medium 3 Simple TODO App Performance]

While Mistral Medium 3's output was concise, it did not follow the instructions as closely as top models such as GPT-4.1 and Gemini 2.5 Pro Experimental, which produced code that was both more concise and more faithful to the instructions.
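For a sense of the scale of change this task involves, here is a minimal sketch in the spirit of the prompt, assuming a typical Next.js client component that keeps todos in React state. The component, feature, and names below are illustrative; the actual eval prompt and codebase differ.

```typescript
// Hypothetical Next.js client component illustrating the kind of small
// feature addition the TODO task asks for. Names are illustrative.
"use client";
import { useState } from "react";

interface Todo {
  id: number;
  text: string;
  done: boolean;
}

export default function TodoList() {
  const [todos, setTodos] = useState<Todo[]>([]);

  // The "added feature" in this sketch: toggle a todo's completion state.
  function toggleTodo(id: number) {
    setTodos((prev) =>
      prev.map((t) => (t.id === id ? { ...t, done: !t.done } : t))
    );
  }

  return (
    <ul>
      {todos.map((t) => (
        <li key={t.id} onClick={() => toggleTodo(t.id)}>
          {t.done ? <s>{t.text}</s> : t.text}
        </li>
      ))}
    </ul>
  );
}
```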

Coding: Complex Benchmark Visualization

When tested on a more complex coding problem, generating a visualization of benchmark results with code, Mistral Medium 3's performance was average.

It produced code with good color choices, but the visualization lacked clear labels, scoring 7/10, similar to DeepSeek V3 (New) and Gemini 2.5 Pro Preview.

[Image: Benchmark visualization generated by Mistral Medium 3]

Models like GPT-4.1 and o3 outperformed Mistral Medium 3 in this area, offering clearer, more user-friendly visualizations with better labels.
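To illustrate what separates the top scores here, the sketch below renders a simple SVG bar chart with a model name and numeric score attached to every bar, the kind of clear labeling that GPT-4.1 and o3 provided. The data points are a few ratings from the table below, not our full results.

```typescript
// Minimal sketch of a labeled benchmark bar chart rendered as SVG.
// The model names and ratings are a small illustrative subset.
import { writeFileSync } from "node:fs";

interface Result {
  model: string;
  rating: number; // score out of 10
}

const results: Result[] = [
  { model: "GPT-4.1", rating: 8.5 },
  { model: "o3", rating: 8 },
  { model: "Claude 3.7 Sonnet", rating: 7.5 },
  { model: "Mistral Medium 3", rating: 7 },
];

const barHeight = 24;
const gap = 12;
const maxBarWidth = 400;

// One bar per model, with the model name to the left and the numeric
// rating to the right, so the chart is readable without a legend.
const bars = results
  .map((r, i) => {
    const y = i * (barHeight + gap);
    const w = (r.rating / 10) * maxBarWidth;
    return [
      `<rect x="150" y="${y}" width="${w}" height="${barHeight}" fill="#4e79a7" />`,
      `<text x="145" y="${y + 16}" text-anchor="end" font-size="12">${r.model}</text>`,
      `<text x="${w + 155}" y="${y + 16}" font-size="12">${r.rating}/10</text>`,
    ].join("");
  })
  .join("");

const height = results.length * (barHeight + gap);
writeFileSync(
  "benchmark.svg",
  `<svg xmlns="http://www.w3.org/2000/svg" width="620" height="${height}">${bars}</svg>`
);
```

Running this with Node (for example via tsx) writes benchmark.svg to disk.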

Find the full visualizations from various models in our GitHub repository.

Here is how the model scored against other models:

[Image: Mistral Medium 3 Benchmark Visualization]

Writing Tasks

On our writing evaluation task, which involved producing an AI timeline, Mistral Medium 3 covered most of the required points, demonstrating strong comprehension and content coverage.

However, it did not follow the exact format that was requested, which affected its overall rating. Its 8.5/10 score put it in line with Claude 3.7 Sonnet and Gemini 2.5 Pro Experimental, but a step behind GPT-4.1 and DeepSeek V3 (New), which excelled in both content and formatting.

[Image: Mistral Medium 3 AI Timeline Writing Output]

For users who prioritize substance over strict formatting, Mistral Medium 3 is a solid option. But if you need perfect adherence to specific formats, you may want to consider one of the higher-rated models.

Evaluation Results Table

Here are the evaluation results in table format for Mistral Medium 3 compared to other models.

We may use a variant of the prompt for certain models to perform "style control". This is our best effort to ensure all outputs are consistent in style during evaluation, as we prefer concise responses over verbose ones.

For the simple Next.js TODO app coding task:

| Ranking | Prompt | Model | Rating | Notes |
|---|---|---|---|---|
| 1 | TODO task | GPT-4.1 | 9.5/10 | Very concise; follows instruction well |
| 2 | TODO task | DeepSeek-V3 (New) | 9/10 | Concise; follows instruction well |
| 2 | TODO task v2 (concise) | Gemini 2.5 Pro Experimental | 9/10 | Concise; follows instruction well |
| 4 | TODO task | Claude 3.5 Sonnet | 8.5/10 | Slightly verbose; follows instruction |
| 4 | TODO task | Gemini 2.5 Pro Preview (New) | 8.5/10 | Slightly verbose; follows instruction |
| 4 | TODO task | OpenRouter: Mistral Medium 3 | 8.5/10 | Concise |
| 4 | TODO task v2 (concise) | Gemini 2.5 Pro Preview | 8.5/10 | Slightly verbose; follows instruction |
| 4 | TODO task v3 Claude 3.7 | Claude 3.7 Sonnet | 8.5/10 | Concise |
| 9 | TODO task | OpenRouter: Qwen3 235B A22B | 8/10 | Slightly verbose |

And for the more complex benchmark visualization task:

| Ranking | Prompt | Model | Rating | Notes |
|---|---|---|---|---|
| 1 | Benchmark visualization | GPT-4.1 | 8.5/10 | Clear labels |
| 2 | Benchmark visualization | o3 | 8/10 | Clear labels; poor color choice |
| 3 | Benchmark visualization | Claude 3.7 Sonnet | 7.5/10 | Number labels; good idea |
| 4 | Benchmark visualization | Gemini 2.5 Pro Experimental | 7/10 | No labels; good colors |
| 4 | Benchmark visualization | DeepSeek-V3 (New) | 7/10 | No labels; good colors |
| 4 | Benchmark visualization | Gemini 2.5 Pro Preview (New) | 7/10 | No labels; good colors |
| 4 | Benchmark visualization | OpenRouter: Mistral Medium 3 | 7/10 | No labels; good colors |
| 8 | Benchmark visualization | Gemini 2.5 Pro Preview | 6/10 | Minor bug; no labels |
| 9 | Benchmark visualization | OpenRouter: Qwen3 235B A22B | 5/10 | Very small; hard to read |

And for the AI timeline writing task:

| Ranking | Prompt | Model | Rating | Notes |
|---|---|---|---|---|
| 1 | AI timeline | GPT-4.1 | 9.5/10 | Covers most points; correct format |
| 2 | AI timeline | DeepSeek-V3 (New) | 9/10 | Covers major points; concise; correct format |
| 3 | AI timeline | Claude 3.7 Sonnet | 8.5/10 | Covers most points; wrong format |
| 3 | AI timeline | OpenRouter: Mistral Medium 3 | 8.5/10 | Covers most points; wrong format |
| 3 | AI timeline | OpenRouter: Meta: Llama 3.3 70B Instruct | 8.5/10 | Covers most points; wrong format |
| 3 | AI timeline v2 (concise) | Gemini 2.5 Pro Experimental | 8.5/10 | Covers most points; wrong format |
| 3 | AI timeline v2 (concise) | Gemini 2.5 Pro Preview (New) | 8.5/10 | Covers most points; wrong format |
| 8 | AI timeline | Gemini 2.5 Pro Experimental | 8/10 | Covers most points; wrong format; verbose |
| 8 | AI timeline | OpenRouter: Qwen3 235B A22B | 8/10 | Covers major points; wrong format |

Conclusion

Mistral Medium 3 is a well-rounded model that consistently ranks among the top five for both coding and writing tasks.

If you're looking for a dependable, mid-tier model for everyday development and writing needs, Mistral Medium 3 is a strong alternative to DeepSeek V3 (New).

Evaluation Data and Prompts

You can find all the prompts and evaluation data on our GitHub repository.

Effortless Evaluation with 16x Eval

This evaluation was conducted locally using 16x Eval.

If you are looking to run your own evaluations for various models, try out 16x Eval.

[Image: 16x Eval screenshot]
