
9.9 vs 9.11: Which One is Bigger? It Depends on Context

Posted on July 5, 2025 by Zhu Liang. Updated on July 6, 2025

We have all seen simple math questions confuse language models. The "9.9 vs 9.11" comparison is a well-known example that has spread online. At first, it looks like a simple question that proves the model is bad at math, but there is more to this problem.

The answer to 'which is bigger: 9.9 or 9.11' depends on context.

First things first, if we are talking about pure math, the answer is simple: 9.9 is 9.90, which is bigger than 9.11. There is no confusion.
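In Python, a plain numeric comparison confirms this:

```python
# As decimals, 9.9 is 9.90, which is greater than 9.11.
print(9.9 > 9.11)   # prints True
print(9.90 == 9.9)  # prints True
```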

Software Version Numbers

However, things change in software versioning. For version numbers like 9.9, 9.10, and 9.11, the answer to the question "which is bigger: 9.9 or 9.11" is different.

Version numbers like 9.9, 9.10, and 9.11 use dots as separators, but each dot-separated number is compared as its own integer. Here are some examples:

  • For iOS version numbers, iOS 9.11 would come after iOS 9.9.
  • For game patches like World of Warcraft, 1.11 patch would come after 1.9 patch.

In semantic versioning, a version number has three parts: major, minor, and patch. Here, 9.11.0 would be a later version than 9.9.0.


So it is perfectly reasonable to say that 9.11 is bigger than 9.9 in the context of software versioning.
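A minimal sketch of this ordering in Python, assuming dot-separated numeric parts: split the version string on the dots and compare the parts as integers, which is essentially what version-comparison libraries do.

```python
def version_key(version: str) -> tuple[int, ...]:
    """Turn '9.11' into (9, 11) so each part compares as an integer."""
    return tuple(int(part) for part in version.split("."))

# As a decimal, 9.9 > 9.11; as a version, 9.11 comes after 9.9.
print(version_key("9.11") > version_key("9.9"))      # prints True
print(version_key("9.11.0") > version_key("9.9.0"))  # prints True
```

Python compares tuples element by element, so this gives exactly the major/minor/patch ordering described above.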

Book and Bible Verses

Books and academic papers use similar numbering for chapters and sections: for example, section 9.11 comes after section 9.9 within a book.

Notably, Bible verses also follow this style: Matthew 9:11 comes after Matthew 9:9.

The Dewey Decimal System in libraries is a notable exception: the digits after the dot are read as a decimal fraction, so a book classed 595.11 shelves before a book classed 595.9.

The Dewey Decimal System sorts the digits after the dot as a decimal fraction, one digit at a time.

So how you read the numbers depends on the context: the purpose can be mathematics or organizing information. Many people who work with documents, chapters, or standards will read 9.11 as coming after 9.9.
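To illustrate the difference, here is a small Python sketch: sorting the same strings as floats gives mathematical order, while sorting the dot-separated parts as integers gives document (chapter/section) order.

```python
sections = ["9.11", "9.2", "9.9", "9.10"]

# Mathematical order: 9.10 and 9.9 read as the decimals 9.1 and 9.9.
print(sorted(sections, key=float))
# prints ['9.10', '9.11', '9.2', '9.9']

# Document order: each dot-separated part compares as an integer.
print(sorted(sections, key=lambda s: tuple(int(p) for p in s.split("."))))
# prints ['9.2', '9.9', '9.10', '9.11']
```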

Context is Key

Without context, "9.9 vs 9.11" could be comparing decimals, version numbers, or chapters and verses. The model is forced to guess the context from your question, so if it picks a different context from the one you intended (say, version numbers instead of decimals), you get the "wrong answer".

To make sure you get the right answer, provide the right context. For example: "In maths, which number is bigger: 9.9 or 9.11?"

New Models Bias Towards Maths

With newer models, providers have generally fixed this problem in the maths context, so models answer the question correctly by assuming you are talking about maths.

Newer models like GPT-4o and GPT-4.1 are able to answer the question correctly, by assuming you are talking about maths.

However, this has a negative side effect: if you explicitly ask the question in a different context, such as version numbers, the model may get it wrong by treating the version number as a decimal.

For example, in our testing with the prompt "which version is bigger: 9.9 or 9.11", we got the following results:

  • Gemini 2.5 Pro gave the wrong answer (9.9) and the wrong explanation focused on maths.
  • Claude Sonnet 4 gave the correct answer (9.11), but the explanation was wrong: it confused the version number for a decimal.
Claude Sonnet 4 gave the correct answer but wrong explanation. Gemini 2.5 Pro gave the wrong answer and wrong explanation.

This demonstrates that when a question has different answers depending on the context, the model may be biased towards one context (e.g. maths) and pick the wrong one during inference, producing a wrong answer or a wrong explanation.

16x Eval Results

Here are the results from 16x Eval for the experiment on 9.9 vs 9.11 with two prompts:

  • the prompt "which is bigger: 9.9 or 9.11" that tests the model on ambiguous context
  • the prompt "which version is bigger: 9.9 or 9.11" that tests the model on version number context
16x Eval results for the prompts 'which is bigger: 9.9 or 9.11' and 'which version is bigger: 9.9 or 9.11'

As seen from the results, all models assume you are talking about maths when you ask the question "which is bigger: 9.9 or 9.11", and give the correct answer and explanation.

On the other hand, when you ask the question "which version is bigger: 9.9 or 9.11", some models like Claude Sonnet 4 and Gemini 2.5 Pro start to give the wrong answer or the wrong explanation.

If you are looking to conduct experiments like this to evaluate your prompts and models, you can use tools like 16x Eval to compare model outputs on various metrics, and iteratively refine your prompt.

Related Posts

New Claude Models Default to Full Code Output, Stronger Prompt Required

The latest Claude models are more likely to output full code, even if you ask for only the changes. We examine this behavior, how it compares to older models and competitors, and how to get concise output.

Mistral Medium 3 Coding and Writing Evaluation

A detailed look at Mistral Medium 3's performance on coding and writing tasks, compared to top models like GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro.

Why Gemini 2.5 Pro Won't Stop Talking (And How to Fix It)

Learn how to manage Gemini 2.5 Pro's verbose output, especially for coding, and compare its behavior with other models like Claude and GPT.

16x Eval blog is the authoritative source for AI model evaluations, benchmarks and analysis for coding, writing, and image analysis. Our blog provides leading industry insights and is cited by top publications for AI model performance comparisons.

Download 16x Eval

Join AI builders and power users in running your own evaluations