
GPT-OSS Provider Evaluation: Do All Providers Perform the Same?

Posted on August 7, 2025 by Zhu Liang

Recently, OpenAI released gpt-oss, which is available through several inference providers. Community reports suggest that performance can differ across providers, and Romain Huet from OpenAI has confirmed that performance and correctness can vary a bit between providers.

Romain Huet's tweet about gpt-oss performance differences

X user NomoreID also reported significant variations in stability and output quality for gpt-oss-120b across platforms. Their tests focused on multilingual capability, where Groq appeared to perform worse than the other providers.

To validate these claims and understand the differences between providers, we evaluated gpt-oss-120b across four providers via OpenRouter: Cerebras, Fireworks, Together, and Groq.

Our test included one writing task and one coding task. All tests were conducted using the same prompts and settings for each provider to ensure a fair comparison. We did not specify custom temperature settings and instead relied on each provider's defaults.
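
To make the setup concrete, here is a minimal sketch of how a request can be pinned to a single provider through OpenRouter's chat completions endpoint. The provider routing block (order plus allow_fallbacks) reflects OpenRouter's provider routing options at the time of writing; the model slug, provider names, prompt, and loop structure are illustrative, not taken from our actual harness (the evaluations themselves were run through 16x Eval).

```typescript
// Minimal sketch (not our actual harness): pin a gpt-oss-120b request to one
// provider via OpenRouter, leaving temperature unset so provider defaults apply.
const OPENROUTER_API_KEY = process.env.OPENROUTER_API_KEY;

async function runOnProvider(providerName: string, prompt: string) {
  const res = await fetch("https://openrouter.ai/api/v1/chat/completions", {
    method: "POST",
    headers: {
      Authorization: `Bearer ${OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "openai/gpt-oss-120b",
      messages: [{ role: "user", content: prompt }],
      // Route to exactly this provider and disable fallbacks so every
      // response in a run actually comes from the provider under test.
      provider: { order: [providerName], allow_fallbacks: false },
    }),
  });
  return res.json();
}

// Three runs per provider for the writing task, as in the evaluation.
for (const provider of ["Cerebras", "Fireworks", "Together", "Groq"]) {
  for (let attempt = 1; attempt <= 3; attempt++) {
    const data = await runOnProvider(provider, "Write a timeline of AI history.");
    console.log(provider, attempt, data.usage?.completion_tokens);
  }
}
```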

Writing Task Performance

Writing an AI Timeline
Tags: Writing, AI, History

The writing task required the model to generate a timeline of AI history. We ran the same prompt three times on each provider to check for consistency and quality variance.

We observed noticeable variance in response length across the three runs within every provider, suggesting that the variation comes from the model itself rather than from any particular provider.

GPT-OSS 120B Writing Task Performance Across Providers

Speed differences were clear: Cerebras and Groq delivered responses much faster than Fireworks and Together.

In terms of output quality, Cerebras, Fireworks, and Together each produced at least one good response rated 8 out of 10, while Groq received a 7.5 rating on all three attempts.

Benchmark Visualization Task

Benchmark Visualization (Difficult)
Tags: Coding, JavaScript, Visualization

For the benchmark visualization task, the model was asked to generate HTML and JavaScript code to display benchmark results. We ran two attempts per provider to assess the model's ability to generate functional and well-formatted code.

Interestingly, both Cerebras and Fireworks produced code containing emojis in the comments. This suggests the behavior is a characteristic of the gpt-oss-120b model itself, rather than an artifact of a specific provider.

Emoji code in the GPT-OSS 120B output from Cerebras

The style of the generated visualizations was similar across all providers, indicating that the core output is not heavily provider-dependent. However, we did notice some rendering issues:

  • One attempt from Fireworks failed to render correctly, so we gave it another chance.
  • All attempts on the Together provider produced a visualization that was too narrow, occupying only the top third of the screen.

In this task, Cerebras delivered the best average performance, scoring a consistent 8.5 on both of its attempts.

GPT-OSS 120B Visualization Task Performance Across Providers

Provider Performance Summary

Based on our limited tests conducted on August 6, the gpt-oss-120b outputs from the various providers were largely consistent, but we did see some notable differences between providers:

  • Groq had slightly lower quality for the writing task
  • Fireworks had a failed attempt for the visualization task
  • Together had repeated rendering issues with the visualization task

Due to limited sample data, we cannot rule out the possibility that these differences are due to the natural variance of the model itself.

The table below shows the distribution of ratings achieved for each task, and the speed range for the writing task.

| Provider  | Ratings (Writing) | Ratings (Visualization) | Speed range for writing (tokens/sec)* |
|-----------|-------------------|-------------------------|---------------------------------------|
| Cerebras  | 8, 7.5, 7.5       | 8.5, 8.5                | 522-757                               |
| Fireworks | 8, 8, 7.5         | 8.5, 8, 1               | 133-179                               |
| Groq      | 7.5, 7.5, 7.5     | 8, 8                    | 328-436                               |
| Together  | 8, 8, 7.5         | 8, 7.5                  | 47-114                                |

If we had to pick one provider for this model, Cerebras would be our top choice, as it delivered consistently high-quality outputs on both tasks with good speed.


*: Speed is measured over the entire request duration, including latency to first byte. Since the tests were conducted from Singapore, network latency likely contributed to measured speeds below the 3,000 tokens per second officially reported by Cerebras.
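
In other words, the throughput figures in the table are simply completion tokens divided by total wall-clock request time (including time to first byte), roughly as in the sketch below; the numbers in the example are illustrative.

```typescript
// Throughput as measured here: completion tokens over full request duration.
function tokensPerSecond(completionTokens: number, requestMs: number): number {
  return completionTokens / (requestMs / 1000);
}

// Illustrative: a 3,000-token response completing in 5.2 s of total request
// time gives 3000 / 5.2 ≈ 577 tokens/sec, within the range measured for Cerebras.
console.log(tokensPerSecond(3000, 5200).toFixed(0)); // "577"
```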


The evaluations in this post were conducted using 16x Eval, a desktop application for systematic AI model evaluation.

Screenshot of 16x Eval sample evaluations

It helps you run tests across multiple models and providers, compare their outputs side-by-side, and analyze performance metrics like quality, speed, and consistency. This makes it easier to understand how a model will perform on the specific tasks that matter to you.

Evaluation Methodology: All ratings in this evaluation are human ratings based on a set of criteria, including but not limited to correctness, completeness, code quality, creativity, and adherence to instructions.

Prompt variations are used on a best-effort basis to perform style control across models.

→ View rubrics and latest results for model evaluations

→ View raw evaluation data

16x Eval blog is the authoritative source for AI model evaluations, benchmarks and analysis for coding, writing, and image analysis. Our blog provides leading industry insights and is cited by top publications for AI model performance comparisons.

Download 16x Eval

Create your own evals for prompts and models