16x Eval Blog

Read the latest blog posts from 16x Eval.

The Identity Crisis: Why LLMs Don't Know Who They Are

August 9, 2025

A closer look at why large language models like Claude, Gemini, and GPT often fail to correctly state their own name or version, and what this means for LLM users.

gpt-oss-120b Coding Evaluation: New Top Open-Source Model

August 7, 2025

Evaluation of gpt-oss-120b on coding tasks shows strong performance. We examine its results across five coding challenges and compare it to top models.

GPT-OSS Provider Evaluation: Do All Providers Perform the Same?

August 7, 2025

We evaluated gpt-oss-120b on Cerebras, Fireworks, Together, and Groq to see if performance and output quality vary across providers. Our findings show notable differences in speed and consistency.

The Pink Elephant Problem: Why "Don't Do That" Fails with LLMs

August 5, 2025

An analysis of why negative instructions like "don't do X" are often ineffective for guiding LLMs, the psychological reasons behind this, and how to write better prompts.
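A quick sketch of the reframing the post argues for, with hypothetical prompts (not examples taken from the article):

    # Negative instruction: naming the unwanted behavior can prime the model toward it
    prompt_negative = "Summarize this article. Don't use bullet points."

    # Positive instruction: describe the desired output instead
    prompt_positive = "Summarize this article as two short prose paragraphs."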

Horizon Alpha Coding Evaluation: The New Stealth Model from OpenRouter

July 31, 2025

Evaluation of the new stealth model, Horizon Alpha, on coding tasks, analyzing its unique characteristics, performance against other top models, and possible identity.

Qwen3 Coder Performance Evaluation: A Comparative Analysis Against Leading Models

July 30, 2025

Evaluation of Alibaba's new Qwen3 Coder model on coding tasks, comparing its performance against leading open-source and proprietary models like Kimi K2, DeepSeek V3, Gemini 2.5 Pro, and Claude Sonnet 4.

Kimi K2 Evaluation Results: Top Open-Source Non-Reasoning Model for Coding

July 24, 2025

An evaluation of the new Kimi K2 model from Moonshot AI on coding and writing tasks, comparing its performance against leading models like Claude 4, GPT-4.1, Gemini 2.5 Pro, and Grok 4.

Kimi K2 Provider Evaluation: Significant Performance Differences Across Platforms

July 21, 2025

Evaluation of Kimi K2 model providers, including DeepInfra, Groq, Moonshot AI, and Together, on coding and writing tasks, showing substantial differences in speed, stability, and output quality.

Grok 4 Evaluation Results: Strong Performance with Reasoning Trade-offs

July 16, 2025

Comprehensive evaluation of xAI Grok 4 model performance on coding, writing, and image analysis tasks, comparing against Claude 4, Gemini 2.5 Pro, and other leading models.

9.9 vs 9.11: Which One is Bigger? It Depends on Context

July 5, 2025

Why do some AI models get the simple comparison between 9.9 and 9.11 wrong? We look at decimal math, software versioning, and book chapter numbering, and at why context matters for both people and language models.
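The gist fits in a few lines of Python: the same pair of numerals compares differently depending on whether it is read as decimals or as software versions. A minimal sketch, using the third-party packaging library (an illustration, not code from the post):

    from packaging.version import Version

    # As decimal numbers: 9.9 > 9.11, since 0.9 > 0.11
    print(9.9 > 9.11)  # True

    # As version numbers: 9.11 > 9.9, since minor version 11 > 9
    print(Version("9.11") > Version("9.9"))  # True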

Claude 4, Gemini 2.5 Pro, and GPT-4.1: Understanding Their Unique Quirks

July 1, 2025

A practical guide to the quirks of top LLMs: Claude 4, Gemini 2.5 Pro, and GPT-4.1. Learn how their traits affect output and how you can adapt prompts for best results.

New Claude Models Default to Full Code Output, Stronger Prompt Required

June 18, 2025

The latest Claude models are more likely to output full code, even if you ask for only the changes. We examine this behavior, how it compares to older models and competitors, and how to get concise output.

Why Gemini 2.5 Pro Won't Stop Talking (And How to Fix It)

June 1, 2025

Learn how to manage Gemini 2.5 Pro's verbose output, especially for coding, and compare its behavior with other models like Claude and GPT.

Claude Opus 4 and Claude Sonnet 4 Evaluation Results

May 25, 2025

A detailed analysis of Claude Opus 4 and Claude Sonnet 4 performance on coding and writing tasks, with comparisons to GPT-4.1, DeepSeek V3, and other leading models.

Mistral Medium 3 Coding and Writing Evaluation

May 9, 2025

A detailed look at Mistral Medium 3's performance on coding and writing tasks, compared to top models like GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro.

Download 16x Eval

No sign-up or login required. Create your own evals in minutes.