More on Tokens: Why French Costs More, Temperature Controls Creativity, and Your Context Window is Smaller Than You Think
Published: December 4, 2025 • 12 min read
Yes, yes, I know it's my fourth blog post of the day, and you guessed it: it is still about tokens. You see, laptops don't resurrect from the dead, so I'm still stuck with my Chromebook. The most effective thing I can do right now is write, and it works out well since this topic is actually really, really interesting.
I am assuming that if you are reading this, you've read the first, second, and third blog posts on tokens already. If you haven't, I suggest you do.
In this blog post, I'll talk about 3 more interesting things I think you should know about tokens.
1. Why French (and Other Languages) Cost More Tokens
I'll start with one that is directly relevant to me.
My French Learning Experiment
In an earlier blog post, I mentioned that I added the sentence, "For every response you provide to me, 3-5 sentences should be in French to help me learn French on the go," to my primary project's instructions.
Here's one thing about this: Claude sometimes gets too excited and I get an entire response in French. Naturally, I don't have a problem with this since it simply becomes a learning process for me.
However, there is an implication for receiving output in French when it comes to tokens.
The Token Cost of Different Languages
Different languages require different numbers of tokens to express the same idea.
The English language happens to be the cheapest and most "efficient" because tokenizers are usually trained primarily on English text. Common English words are often single tokens.
However, French and other languages often require more tokens to express the same idea. Look at it this way:
"Hello, how are you?" → approximately 6 tokens
"Bonjour, comment allez-vous?" → approximately 9 tokens
This means that when I communicate with Claude or any model in French:
- It fills up my context window faster
- It also costs more money
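If you want to see this gap for yourself, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer, as a stand-in (Claude's own tokenizer isn't publicly available, so the exact counts will differ per model, but the English-vs-French gap shows up either way):

# pip install tiktoken
import tiktoken

# "cl100k_base" is one of tiktoken's built-in encodings; treat the counts
# as illustrative rather than Claude's exact numbers.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["Hello, how are you?", "Bonjour, comment allez-vous?"]:
    token_count = len(enc.encode(text))
    print(f"{text!r} -> {token_count} tokens")

The French sentence should come out a few tokens longer than the English one, even though both say the same thing.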
What This Means for Me
Does this mean I will remove the sentence I showed earlier from my system prompt? No, I still have a long way to go with the language.
However, it is good to fully understand the implications of my decision, especially since token usage affects billing as well.
A Note on Other Languages
As an extra side note:
- Japanese and Chinese can be 1.5x to 2x more expensive than English for the same content
- Languages that use non-Latin scripts are often the most expensive because the tokenizer doesn't recognize their character patterns as efficiently
2. The Token Probability Lottery (Why LLMs Give Different Responses)
Back in late 2023 or early 2024, a friend of mine told me that her professor had asked her and every other student in her class to ask ChatGPT a very specific question.
The professor was trying to show them that ChatGPT generated different responses for the exact same question.
I don't know what the ultimate goal of this was, but have you ever wondered exactly why that happens? Why are LLMs able to generate different responses given the exact same prompt?
How LLMs Generate Responses
Well, remember that in an earlier blog post I explained that LLMs don't know how long their responses will be until they're done. They don't plan their response and then write it out. Instead, they generate their response token by token, each time asking, "What should come next?"
Sampling from Probabilities
Every single token generated by the model involves a probability distribution across the LLM's entire vocabulary, which contains roughly 100,000+ possible tokens.
When an LLM asks itself, "What should come next?" with the goal of generating the next token, it is not exactly choosing a "right answer" from its vocabulary of 100,000+ possible tokens. It is instead sampling from probabilities.
Here is a simple visual representation of this:
Token "the" → 23.5% probability
Token "a" → 12.1% probability
Token "this" → 8.7% probability
Token "my" → 4.2% probability
Token "quantum" → 0.003% probability
Token "banana" → 0.0001% probability
...100,000+ more options
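To make "sampling from probabilities" concrete, here is a tiny Python sketch using the made-up numbers above (a real model scores its entire 100,000+ token vocabulary; I'm only keeping five options for illustration):

import random

tokens = ["the", "a", "this", "my", "quantum"]
probs = [0.235, 0.121, 0.087, 0.042, 0.00003]  # the rest of the vocabulary is omitted

# random.choices samples according to the weights: "the" wins most often,
# but "a" or "this" can still be picked on any given draw.
next_token = random.choices(tokens, weights=probs, k=1)[0]
print(next_token)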
What is Temperature?
Now there is a setting called "temperature". You can usually set this value when making calls to an LLM's API, be it Claude's (Anthropic), ChatGPT's (OpenAI), or another provider's.
This temperature setting controls how the model samples from this distribution of tokens in its vocabulary.
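Roughly speaking, temperature divides the model's raw scores (logits) before they are turned into probabilities, which works out to raising the probabilities to the power 1/temperature and renormalizing. Here is a sketch of that reshaping using the same toy numbers (renormalized over just these four tokens, so the percentages are only illustrative):

tokens = ["the", "a", "this", "my"]
probs = [0.235, 0.121, 0.087, 0.042]

def apply_temperature(probs, temperature):
    # Lower temperature sharpens the distribution; higher temperature flattens it.
    scaled = [p ** (1.0 / temperature) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

for t in (0.2, 1.0, 1.5):
    reshaped = apply_temperature(probs, t)
    print(t, [f"{tok}: {p:.1%}" for tok, p in zip(tokens, reshaped)])

At 0.2 the top token ends up with the overwhelming majority of the probability mass; at 1.5 the four options sit much closer together.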
Low Temperature (0.0 - 0.3): Deterministic Mode
When the temperature value is low (0.0 - 0.3), the model will almost always pick the highest-probability token:
- A temperature of 0.0 effectively guarantees that the highest-probability token is always picked
- At 0.3, there's still slight variation possible
Probability distribution:
"the" (23.5%) ← ALWAYS PICKED at temperature 0.0
"a" (12.1%)
"this" (8.7%)
...
When you set the temperature value low, you are asking Claude to not take any creative risks. You are telling it that you want a nearly identical response every single time you send the same prompt.
Setting a low temperature is great for:
- Generating code
- Providing factual answers
- Responding to math questions
- Any other tasks that lean more towards being deterministic
And by deterministic, I mean same input always leads to the same output.
Examples of Low Temperature Behavior
Temperature 0.0 (Fully Deterministic):
Prompt: "Describe the sky"
Response 1: "The sky is blue during the day and dark at night."
Response 2: "The sky is blue during the day and dark at night."
Response 3: "The sky is blue during the day and dark at night."
Notice above that every response is identical.
Temperature 0.2 (Tiny Bit of Randomness):
Prompt: "Describe the sky"
Response 1: "The sky is blue during the day and dark at night."
Response 2: "The sky is blue during the day and becomes dark at night."
Response 3: "The sky appears blue during the day and dark at night."
Notice here that the core structure of each response remains nearly identical and only minor word substitutions occur ("is" vs "appears", "and" vs "and becomes"). This is because those alternatives have close-enough probabilities to occasionally win when there's a tiny bit of randomness allowed.
High Temperature (0.7+): Creative Mode
However, when the temperature value is high, the model is then more likely to pick lower probability tokens.
When you set a high temperature, you give lower probability tokens a fighting chance of being chosen when generating a response. You are telling the model that it is allowed to take risks with the output, which leads to more creative output.
Setting a high temperature is great for:
- Brainstorming
- Creative writing
- Exploring ideas
However, there is also a higher chance of the model:
- Generating nonsense
- Hallucinating
- Producing incoherent responses
Examples of High Temperature Behavior
Temperature 0.9:
Prompt: "Describe the sky"
Response 1: "The sky stretches like a wounded canvas, bleeding orange into purple."
Response 2: "Above us hangs an infinite ocean where clouds swim like lazy whales."
Response 3: "The sky? It's just the earth's way of showing off."
Different every time, more creative, but less predictable.
Very High Temperature (1.5+):
At very high temperatures, the model is essentially "drunk" and:
- Is willing to say almost anything
- Is likely to contradict itself
- May even make up words in some cases
The Weighted Dice Analogy
One final way of seeing this so that it sticks to your brain:
As a model generates a response token by token, it rolls a weighted die. If the term is unfamiliar, a weighted die is one that has been unfairly modified to make certain numbers more likely to come up than others.
- At a low temperature: The die is heavily loaded, so the same number wins over and over again when the model is deciding what token should come next. This creates more deterministic output.
- At a high temperature: The die is "fairer" (i.e., not as heavily loaded), so you are more likely to notice randomness in the tokens selected to create the final output.
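If you want to watch the loaded die in action, here is a tiny simulation that reuses the toy distribution from earlier and "rolls" it ten times at a low and a high temperature:

import random

tokens = ["the", "a", "this", "my"]
probs = [0.235, 0.121, 0.087, 0.042]

def rolls(temperature, n=10):
    # Same trick as before: temperature reshapes the weights before sampling.
    weights = [p ** (1.0 / temperature) for p in probs]
    return [random.choices(tokens, weights=weights, k=1)[0] for _ in range(n)]

print("temperature 0.1:", rolls(0.1))  # heavily loaded die: "the" almost every time
print("temperature 1.5:", rolls(1.5))  # fairer die: the other tokens show up regularly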
Why One Different Token Changes Everything
A 500-token response means 500 dice rolls. If even ONE roll goes unexpectedly, the entire trajectory of the response shifts because each token influences the next prediction.
For instance:
Low temp path:  "The" → "sky" → "is" → "blue" → "."
                (every step is the forced, highest-probability pick)
High temp path: "The" → "cosmos" → "whispers" → "secrets" → "..."
                (a surprise at step 2, and the response is now committed to a poetic direction)
One unexpected token at position 2 completely changes what tokens make sense at positions 3, 4, 5, and beyond.
So Why Are Responses Different?
Back to the main point of why each response is different:
Chat interfaces like Claude and ChatGPT do not run with the temperature set to 0.0, which would lead to (nearly) the same output every single time. They likely use a default somewhere around 0.6 - 0.9 to balance creativity and coherence for general use.
However, working with an LLM's API gives you full control over the temperature value.
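As a sketch, here is roughly what that looks like with the Anthropic Python SDK (the model name and max_tokens value below are placeholders; check the current docs for the model you actually want to use):

# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from your environment

message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; use whichever model you have access to
    max_tokens=200,
    temperature=0.2,            # closer to deterministic; try 0.9 for more creative output
    messages=[{"role": "user", "content": "Describe the sky"}],
)
print(message.content[0].text)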
3. The Hidden System Prompt Tax
Now I am not done yet. There is a hidden system prompt tax we must talk about.
The Advertised Context Window
According to Claude's support website at the time of writing this article:
"Claude's context window size is 200K tokens across all models and paid plans, with one exception: Claude Sonnet 4.5 has a 500K context window for users on Enterprise plans."
What I Actually Saw
However, in an earlier blog post, I showed you the output I got when I prompted my blog-refinement Claude instance with:
"Read any file in this project, then tell me our current token usage."
The part of its output relevant to this blog post was:
Current Token Usage
Tokens Used: 150,267 out of 190,000
Tokens Remaining: 39,733
Percentage Used: ~79%
Wait... Where Did 10,000 Tokens Go?
Why is it saying that I have used 150,267 out of 190,000 and not 150,267 out of 200,000?
Well, this is where we introduce the "Hidden System Prompt Tax".
What is the System Prompt Tax?
Every single conversation with Claude has a system prompt - instructions that shape how the model behaves even before you provide your own instructions.
You never see these system prompts, but they use tokens. Sometimes thousands of them.
So when you think you have 200,000 tokens to work with, you might actually have 190,000 because 10,000 are already consumed by:
- System instructions
- Safety guidelines
- Behavioral rules
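Put in numbers, here is a back-of-the-envelope sketch of my own session (the ~10,000-token overhead is my estimate based on the output above):

ADVERTISED_WINDOW = 200_000  # what Claude's website advertises
SYSTEM_PROMPT_TAX = 10_000   # estimated hidden overhead: system instructions, safety rules, etc.
TOKENS_USED = 150_267        # what my Claude instance reported

effective_window = ADVERTISED_WINDOW - SYSTEM_PROMPT_TAX
remaining = effective_window - TOKENS_USED

print(f"Effective window: {effective_window:,}")  # 190,000
print(f"Remaining:        {remaining:,}")         # 39,733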
The Bottom Line
This is exactly why the model informed me that I had used 150,267 out of 190,000 tokens.
The "effective" context window is always smaller than the "advertised" number as you can see on Claude's website.
Note that the system prompt tax isn't always 10,000 tokens as it may vary on different Claude interfaces (web vs API vs Claude Code).
Conclusion
I really hope your mind is starting to piece together these different puzzle pieces to better understand how AI really works in the background.
As always, thanks for reading!