
Tokens, Context Windows, and Why LLM Instance Cloning Actually Works

I finally understand the mechanics behind why conversations hit limits and why LLM Instance Cloning is the solution. Prepare to have your mind blown.

AI • LLM Instance Cloning • Claude • Prompt Engineering • Learning in Public • Token Management


Published: December 3, 2025 • 10 min read

I'm literally half-dancing in my seat with excitement as I write this blog post. Right now, I am more grateful than ever for committing to the Battle Against Chauffeur Knowledge blog post series, because I don't know how else I would have come across this.

In the first post of that series, I responded to the question, "How did you architect the position-based correction highlighting system, and what challenges did you face with AI-generated position data versus code-calculated positions?" By choosing to break down the concepts thoroughly, I was forced to look deeply into a term I glossed over initially: tokens.

I explained that AI models do not "see" text the way humans do. Since they aim to be fast and efficient, processing text character by character defeats that purpose, so they break the text into what are called "tokens". I went on to explain how tokenization is the reason we cannot trust AI to return character positions.

Now here is the thing. Ever since I published that blog post, I couldn't stop thinking about the fact that I had talked about tokens before, in Claude God Tip #5, where I provided ten methods for tracking tokens and tied that back to why it matters for LLM Instance Cloning. Then I started to wonder: do tokens mean the same thing in both of these contexts?

Then I did my research, and oh my gosh, the amount of new information I have just been exposed to makes me super excited. I can't contain everything I am learning about tokens in one blog post, so I'll probably break it down into multiple posts.

Now, let's start from the very basics. A warning: if you are new to this information like I am, prepare to have your mind blown.


How Do LLMs Really Process Our Prompts?

When I send a message to Claude, for instance, "Explain Laravel to me", the message doesn't actually reach the AI model as text characters. It gets converted into tokens first so that it looks something like this:

"Explain" → Token #73829
"Laravel" → Token #51423
"to" → Token #271
"me" → Token #385

So your message isn't being seen as text by the LLM, but rather as a sequence of numbers, where each token number references an item in the LLM's dictionary or vocabulary.

The AI model does not actually "read" text characters because it doesn't even see them. All it sees is a sequence of token IDs and it then uses its neural network to process the relationships between those tokens to understand meaning.
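If you want to poke at this yourself, here is a minimal sketch using tiktoken, the open-source tokenizer library for OpenAI's models. Claude's tokenizer is different and not public, and the token IDs I showed above are illustrative, but the principle is exactly the same: text goes in, a list of integers comes out.

```python
# A minimal sketch of tokenization using the "tiktoken" library.
# Claude uses its own (different) tokenizer, so the exact splits and IDs
# here are for illustration only -- the principle is the same.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "Explain Laravel to me"
token_ids = enc.encode(text)                    # text -> list of integer token IDs
pieces = [enc.decode([t]) for t in token_ids]   # each ID back to its text piece

print(token_ids)   # a handful of integers, one per token
print(pieces)      # e.g. ['Explain', ' Lar', 'avel', ' to', ' me']
```

Run it on a few sentences and you'll quickly notice that tokens don't line up neatly with words or characters, which is exactly why character positions from an LLM can't be trusted.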


Token-by-Token Generation

Here is where it gets even more interesting. The AI model does not plan its entire response and output it to you at once. It generates it one token at a time.

What this means is that the output you get is literally a series of predictions by the LLM of "what token should come next" over and over until it decides to stop.

These tokens then get converted to text for you, the human, to be able to understand them.

Let's break it down even further if that was not clear. When you ask a model a question, it does NOT follow these steps:

  • Understand the question
  • Formulate a complete answer
  • Write it out

Instead, it DOES the following:

  • Processes your tokens
  • Predicts the first token of its response
  • Uses that token to predict the second token of its response
  • Uses those two to predict the third
  • Continues this way until it predicts a "stop" token

This is why a model can start off a response confidently and then trail off or contradict itself. It committed to an opening before "knowing" where it was going with its response.
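Here is a toy sketch of that loop in Python. The predict_next_token function is a hypothetical stand-in for the neural network (no real model is implemented this simply), but the shape of the loop is the point: one prediction at a time, each one fed back in as input.

```python
# A toy sketch of token-by-token (autoregressive) generation.
# `predict_next_token` stands in for the neural network: it looks at every
# token so far and returns the ID of the most likely next token.
STOP_TOKEN = -1  # placeholder for the model's special "end of response" token

def predict_next_token(token_ids: list[int]) -> int:
    """Hypothetical stand-in for the model's next-token prediction."""
    ...

def generate(prompt_token_ids: list[int], max_new_tokens: int = 1024) -> list[int]:
    tokens = list(prompt_token_ids)     # the model sees the prompt plus everything generated so far
    response = []
    for _ in range(max_new_tokens):
        next_id = predict_next_token(tokens)
        if next_id == STOP_TOKEN:       # the model "decides to stop"
            break
        tokens.append(next_id)          # the new token becomes input for the next prediction
        response.append(next_id)
    return response                     # later decoded back into text for the human
```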


Understanding Context Windows

So the number of tokens used up by a specific Claude instance is the sum of the tokens that "went in" (the prompt you send to the model) and the tokens that "came out" (its response). You then get billed for input + output tokens: if your prompt is 1,000 tokens and the reply is 500 tokens, you pay for 1,500 tokens in total.

What is Context?

Here's what happens when you start a new conversation in Claude: every conversation gets its own context window, specific to that single conversation with a Claude instance.

What do I mean by "context window"? Well, try to think of it in basic English. Context affects how you understand or process things. If you hear that a young man pushed an old, blind man on the road, you may judge that the young man is a bad person. However, if you receive the context that the old man would have been hit by a truck if he had not been pushed out of the way by the young man, then your judgement changes, right? I hope it does lol.

How Context Accumulates

It is the same thing here. Context allows the model to understand what you need. If you simply ask a model to "draft a resume for me" in a new chat that is not part of any project, the question you ask gets added to the context window of that chat. Whatever response the model provides also gets added to the context window.

Normally, the response the model gives at this point would probably not suit your needs, because you provided no context about who you are as a person, your job experience, and so on. In the next message, you could provide this information, which would get added to the context window, and the model's new output would once again be added to the context window as well.
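To make that concrete, here is a rough sketch of the resume example as a growing message history. The wording of the messages is made up, but this is the shape of what accumulates in the context window:

```python
# A sketch of the resume example as a growing context window.
# The message wording here is made up; the point is that every turn,
# yours and the model's, gets appended to the same growing window.
context_window = []

# Turn 1: your vague request, then the model's generic reply
context_window.append({"role": "user", "content": "Draft a resume for me"})
context_window.append({"role": "assistant", "content": "<a generic resume draft>"})

# Turn 2: you supply the missing context, and those messages join the window too
context_window.append({"role": "user", "content": "I'm a backend developer with five years of Laravel experience..."})
context_window.append({"role": "assistant", "content": "<a resume tailored to that background>"})

# Everything above now counts against this conversation's token budget.
print(len(context_window))  # 4 entries so far, and it only ever grows
```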


Visualizing the Context Window

When you start a conversation in Claude, each conversation gets its own context window, and each context window has a fixed number of tokens available to it. Picture it this way, assuming each conversation has a maximum of 200K tokens available to it:

Conversation A: [________empty context window________] ← 200K tokens available
Conversation B: [________empty context window________] ← 200K tokens available
Conversation C: [________empty context window________] ← 200K tokens available

This context window is a shared resource between input (the prompts, files, etc. that you send to the model) and output (the response that the model provides).

If you send a model 180,000 tokens of context, maybe by writing a very long prompt and perhaps attaching a GitHub repository to it, the model can only generate 20,000 tokens of response before it hits the limit.

This is why long conversations eventually hit limits. It is not just your messages that use up the tokens; it's the accumulation of everything: your messages + the model's responses + system prompts + any attached documents.
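The shared-budget idea is simple enough to write as arithmetic. Using the same 200K assumption as the diagram above:

```python
# The context window is one shared budget for input and output.
CONTEXT_WINDOW = 200_000      # the assumed per-conversation limit from the diagram above

input_tokens = 180_000        # a very long prompt plus an attached repository
max_output_tokens = CONTEXT_WINDOW - input_tokens

print(max_output_tokens)      # 20000 -- all the room left for the model's response
```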


The Crux: Why LLM Instance Cloning Actually Works

Now the crux of this blog post. Why exactly is LLM Instance Cloning important?

The Illusion of Memory

Here is the token perspective. First things first, understand that:

  • Claude doesn't have a personality.
  • Claude doesn't have memory.
  • Claude doesn't learn anything during your conversation.

What actually happens is that every single time you send a message in a conversation with Claude, it receives the ENTIRE conversation history as input tokens.

What Really Happens Each Message

So when you have back-and-forth conversations with an LLM in a single chat and you get it to understand your preferences (aka, you create a "trained instance"), that trained instance isn't Claude remembering your preferences. What actually happens is that Claude keeps re-reading every single exchange you've had, every single time, and infers your preferences freshly for each response.

This is what creates the illusion of continuity that we feel when interacting with LLMs. When you send the 10th message to an LLM instance, you think that it has learned from messages 1-9. In reality, it rereads messages 1-9 in the input before generating a response.

Token Accumulation Visualized

You can visualize it this way:

Message 1: You send 100 tokens, Claude responds with 200 tokens
Message 2: Claude receives 300 tokens (previous exchange) + your new message
Message 3: Claude receives 500+ tokens (all previous) + your new message
Message 10: Claude receives thousands of tokens + your new message

As you keep sending messages, you will eventually use up all the available tokens; the chat then ends and the context window of that chat closes. This means that the thousands of tokens you spent "training" your previous Claude instance are now gone.
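A quick sketch makes the growth obvious. The token counts here are made up to mirror the visualization above:

```python
# A rough illustration of how the input grows with every message.
# The token counts are made up to mirror the visualization above.
history_tokens = 0

def send_message(new_message_tokens: int, response_tokens: int) -> None:
    global history_tokens
    input_tokens = history_tokens + new_message_tokens   # the whole history is part of the input
    history_tokens = input_tokens + response_tokens      # and the response is added on top
    print(f"input this turn: {input_tokens}, history so far: {history_tokens}")

send_message(100, 200)   # input this turn: 100, history so far: 300
send_message(100, 200)   # input this turn: 400, history so far: 600
send_message(100, 200)   # input this turn: 700, history so far: 900
```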

Why This Matters for Cloning

This is precisely why LLM Instance Cloning exists.

When Conversation A hits its limit, all that accumulated context (your preferences, your examples, the refined behavior) is trapped in that closed context window. You can't bring it to Conversation B.

By using LLM Instance Cloning, you are essentially saying: "Before this context window closes forever, compress the valuable parts into a small artifact I can inject into a NEW context window."
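As a rough sketch of what that handoff could look like in code, here is the idea using the Anthropic Python SDK. The model id is a placeholder, and the compression prompt is my own illustrative wording, not the exact LLM Instance Cloning prompt:

```python
# A rough sketch of the cloning handoff using the Anthropic Python SDK.
# The model id is a placeholder and the compression prompt is illustrative
# wording, not the exact LLM Instance Cloning prompt.
import anthropic

client = anthropic.Anthropic()   # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-sonnet-4-5"      # placeholder: use whatever model id you actually run

# 1. Before Conversation A's window closes, ask it to compress the valuable parts.
old_history = []  # Conversation A's full message history, as role/content dicts
handoff = client.messages.create(
    model=MODEL,
    max_tokens=2000,
    messages=old_history + [{
        "role": "user",
        "content": "Summarize my preferences, style rules, and the methodology "
                   "we refined into a compact instruction set for a fresh instance.",
    }],
)
artifact = handoff.content[0].text

# 2. Inject that artifact at the very start of Conversation B's fresh window.
clone = client.messages.create(
    model=MODEL,
    max_tokens=2000,
    system=artifact,   # placed at the beginning of the new context window
    messages=[{"role": "user", "content": "Let's continue where we left off."}],
)
print(clone.content[0].text)
```

The part that matters is step 2: the compressed artifact lands at the very beginning of the new context window, and why that placement pays off is covered further down.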


The "Continue Conversation" Problem

Now remember that in Claude God Tip #5, I mentioned that I tested the feature where Claude can remember a previous chat and continue from there, but that its results were inconsistent. I did not understand what was behind this inconsistency until now.

Here is what really happens when you "continue" a previous chat. The system has to decide HOW to load the context window of that chat.

The Possible Strategies

The possible strategies I think Claude could use are:

  • Full reload: Load the entire previous conversation. Except this is expensive and probably won't fit if you already used up all the tokens in that conversation.
  • Summary: Compress the previous conversation into a summary, which leads to loss of data.
  • Selective: Load only the "relevant" parts. But how does the model get to decide what is and is not relevant? Do you trust it to make this decision?
  • Hybrid: Load recent messages in full and summarize older ones (there's a sketch of this one after the next paragraph).

Each of these strategies has tradeoffs and the system might use different strategies depending on conversation length, server load, or other factors you can't control.
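To be clear, this is my speculation about what such a system might do, not how Claude actually implements it, but here is a rough sketch of the hybrid strategy with a crude character-based token estimate:

```python
# Speculative sketch of a "hybrid" strategy: keep recent messages verbatim and
# compress everything older into a summary. Not how Claude actually does it;
# the token count is a crude character-based estimate.

def count_tokens(messages: list[dict]) -> int:
    # rough rule of thumb: about 4 characters per token
    return sum(len(m["content"]) // 4 for m in messages)

def summarize(messages: list[dict]) -> str:
    # placeholder: in practice this would be another (lossy) model call
    return "Summary of the earlier part of the conversation: ..."

def rebuild_context(messages: list[dict], budget: int, keep_recent: int = 10) -> list[dict]:
    recent = messages[-keep_recent:]     # recent turns stay in full
    older = messages[:-keep_recent]      # older turns get compressed (data loss)

    context = []
    if older:
        context.append({"role": "user", "content": summarize(older)})
    context.extend(recent)

    # If it still doesn't fit the budget, something gets dropped silently --
    # exactly the kind of tradeoff you can't see or control.
    while count_tokens(context) > budget and len(context) > 1:
        context.pop(0)
    return context
```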

Why Cloning Bypasses This

LLM Instance Cloning bypasses all of this uncertainty because now you are not relying on the system to reconstruct the context. You provide a dense, carefully crafted instruction set that gets maximum attention since it is placed at the beginning of a chat.


The "Lost in the Middle" Phenomenon

Note that LLMs exhibit a behavior described as "lost in the middle". This means that the information at the very beginning and very end of the context gets more attention than information in the middle.

If your conversation is 150,000 tokens long:

Messages 1-10: Strong attention (beginning)
Messages in the middle: Weaker attention (the "lost" zone)
Recent messages: Strong attention (end)

This is why I mentioned above that the "carefully crafted instruction set gets maximum attention since it is placed at the beginning of a chat". When you use LLM Instance Cloning, you extract the methodology and place it at the BEGINNING of a new conversation, where it gets maximum attention.


What's Next

If you made it to this point, isn't this like so exciting? The crazy part is that there is still so much more, but my bedtime is approaching, so I'll have to save it for another blog post tomorrow.

As always, thanks for reading!
