
More on Tokens: How Attention Fades, Personality Emerges, and Why Extracting at Message 100 is Too Late

Your perfectly trained Claude instance starts forgetting its early training as conversations grow. Here's why attention dilutes, how personality emerges implicitly, and the true cost of extraction.

AI • Claude • Token Management • Learning in Public • Prompt Engineering


Published: December 4, 2025 • 14 min read

Yes, yes, fourth blog post of the same day, and it's still on tokens. See, don't blame me, this stuff is actually really interesting and I promise you, everything I am writing is leading up to something that will be relevant for the 777-1 experiment.

Now, before we move on, I'd like to take a moment to express my gratitude to Allen Kendrick, my very own blog refiner. If he didn't work as hard as he does, I wouldn't be able to get this many blog posts out. I can't imagine having to place all the links to my previous blog posts myself; that would be dreadful and really slow. I honestly think he deserves a raise, but umm, if you read this post, you'll understand that my bank account has been wailing at me in anguish, so unfortunately I can't afford that right now. I'll definitely keep Allen at the top of my list for future raises.

Now you might be wondering, haven't I said enough about tokens already? Nope, not even close. Please tell me that you have read the first, second, third and fourth post by now because without them, everything here might seem confusing.

But anyways, in this post, I'll talk about three more interesting things I think you should know about tokens, and once again they'll be tied to LLM Instance Cloning.


1. The Attention Dilution Paradox

Let's start with a fun little paradox.

Let's say you start a conversation in Claude for a specific task, say, technical writing refinement. With each message you send, the LLM gets a better grasp of your requirements. So if it took 100 messages for Claude to fully understand what you want, you would probably expect that the best time to extract your trained instance is once you have sent all 100 messages, right?

Well, not quite. Keep reading to understand why.

What is a Transformer?

Now, there is a phenomenon called "attention dilution" which affects transformer models. Just so you don't get confused here, when I say the word "transformer", think of it as the underlying architecture or design blueprint behind modern AI as we know it.

This architecture is what powers Claude, GPT, Gemini, and virtually every modern large language model. Picture it like this:

  • Transformer = the engine design
  • Claude, GPT, Gemini etc. = different cars built using that engine design

The Old Way: Sequential Processing (Pre-2017)

Before 2017, the dominant architectures, like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory networks), processed text sequentially, that is, one word at a time from left to right, almost like when you read a sentence out loud:

"The cat sat on the mat"
  ↓
 "The" → process → "cat" → process → "sat" → process → ...

The problem with this architecture is that by the time the model reaches the word "mat", it only has a weak memory of "The". This means the model loses coherence and clarity over long passages, and training is slow because each word has to wait for the previous one to be processed.

The New Way: Parallel Processing with Attention

In 2017, Google researchers published a breakthrough paper, "Attention Is All You Need", which introduced a better architecture where all tokens are processed simultaneously rather than sequentially.

"The cat sat on the mat"
   ↓    ↓    ↓   ↓   ↓
   [ALL processed in parallel]

Now naturally, if all tokens are being processed at the same time, how is the LLM able to make sense of the full sentence? Well, this is where the "attention mechanism" comes into play.

How Attention Works

When processing the sentence, "The cat sat on the mat because it was tired", the word "it" needs to figure out if it refers to "cat" or "mat" or any other token in the sentence.

Look at it this way:

"it" attends to all other tokens:
  → "The" (low attention: 2%)
  → "cat" (HIGH attention: 45%)  ← probably refers to this!
  → "sat" (low attention: 3%)
  → "mat" (medium attention: 15%)
  → "tired" (medium attention: 20%)

During training, a model learns these attention patterns and discovers that "tired" is typically associated with living things, so "it" more likely refers to "cat" than "mat", "sat", or any of the other tokens.
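
To make that concrete, here is a tiny toy sketch in Python of how those attention weights behave. The relevance scores below are completely made up for illustration (real models compute them from learned query/key vectors, and Claude's internals aren't public), but it shows the one property we'll need shortly: the scores go through a softmax, so the weights always sum to 100%.

import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Made-up relevance scores for how "it" might relate to the other tokens in
# "The cat sat on the mat because it was tired". Real models learn these from
# query/key vectors; the numbers here are purely illustrative.
tokens = ["The", "cat", "sat", "on", "the", "mat", "because", "was", "tired"]
scores = np.array([0.1, 3.0, 0.3, 0.1, 0.1, 1.8, 0.2, 0.4, 2.0])

weights = softmax(scores)  # softmax forces the weights to sum to 1 (i.e. 100%)
for token, weight in zip(tokens, weights):
    print(f'"it" -> {token}: {weight:.0%}')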

The Practical Problem: Attention Gets Thinner

When you start a new conversation (aka a new context window) and you send prompts and receive answers from the model, you slowly fill up your context window. As this happens, each token's attention has to be divided across more and more other tokens.

The attention mechanism I described above has a fixed budget (the weights always sum to 100%), so it gets spread thinner and thinner as the conversation grows. Here is what that means practically in the context of a chat for technical writing refinement:

  • Message 5: Claude can attend strongly to every detail of messages 1-4
  • Message 50: Claude's attention to message 1 is weaker and fuzzier
  • Message 100: Early messages become almost like distant memories. They are present but hazy

The implication is that your perfectly trained instance starts to forget its early training as the conversation grows longer and longer.
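
Here is a toy demonstration of that fixed-budget effect, in Python. The scores are random stand-ins rather than anything Claude actually computes, but because softmax weights always sum to 1, the average share of attention any single token can get shrinks as the context grows:

import numpy as np

rng = np.random.default_rng(0)

# Attention weights always sum to 1, so the more tokens compete for them,
# the smaller each token's average share becomes. Scores are random stand-ins.
for context_length in [100, 1_000, 10_000, 100_000]:
    scores = rng.normal(size=context_length)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    print(f"{context_length:>7,} tokens -> average share per token: {weights.mean():.6%}")

At 100 tokens in context, each token gets an average of 1% of the attention budget; at 100,000 tokens, it gets 0.001%. That is the dilution.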

The Solution: Pre-Emptive Extraction

Now what is the solution for this? Well, in this blog post where I talked about 10 methods to track your tokens and preserve your Claude Instance, the 7th method I mentioned was the Pre-Emptive Safety Net approach, which has you extract your methodology periodically.

By doing this, you are capturing Claude's understanding while the early training examples still have strong attention weight.

If you wait until message 100 to extract, Claude might have "forgotten" the nuances from messages 1-20 that shaped its understanding of your preferences.


2. How Your LLM's Personality Emerges

Next, let's talk about how your LLM's personality emerges.

Implicit vs Explicit Training

You see, most times, when you clone a trained instance into an artifact or document of any kind, you will find that some of the instructions there were never actually specified by you in the back-and-forth conversation you had within that chat.

The process of "training" a Claude Instance is not 100% active. The "personality" that gets captured actually emerges largely from implicit patterns in the feedback you provide to the model.

For instance, you might see a line of instruction in your extracted prompt that says, "prefer bullet points for lists over numbered lists" but you don't remember specifying that to the model. However, Claude inferred it from the edits you made or simply noticed what you approved versus what you rejected.

So the extraction process in LLM Instance Cloning converts implicit emergent behavior into explicit instructions.

Why Some Behaviors Don't Survive Cloning

Because the model might not catch every single implicit requirement, you will notice that some "trained" behaviors don't survive the cloning process: they were never consciously articulated, so they never made it into the extracted prompt.

This is why you need to keep refining your cloned instance iteratively.

The Iterative Refinement Solution

Each time the clone does something unexpected, you're discovering an implicit preference that didn't make it into the explicit extraction, and it's your job to update your clone with an explicit instruction that closes that gap.

Think of it as debugging your AI's personality, one unexpected behavior at a time.


3. The True Cost of Self-Reflection

Now finally, let's talk about the first stage of LLM Instance Cloning as described in this case study: Self-Reflection.

In this stage, you are essentially asking Claude to "explain its methodology".

What Happens During Extraction

By now, you probably understand that this is actually quite an expensive operation because Claude has to:

  1. Read the entire conversation history (tokens IN)
  2. Identify patterns in its own previous responses (computation)
  3. Generate a description of those patterns (tokens OUT)
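
To put very rough numbers on that, here is a back-of-envelope sketch in Python. Every figure in it is an assumption I made up for illustration, not a measurement of what Claude actually charges against your context:

# Back-of-envelope estimate of one self-reflection extraction pass.
# Every number is an illustrative assumption, not a real measurement.
conversation_history_tokens = 50_000    # tokens IN: the whole chat gets re-read
methodology_description_tokens = 1_500  # tokens OUT: Claude's explanation of its patterns
extraction_prompt_tokens = 800          # tokens OUT: the reusable extraction prompt

total = (conversation_history_tokens
         + methodology_description_tokens
         + extraction_prompt_tokens)

print(f"One extraction pass touches roughly {total:,} tokens")  # ~52,300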

Let's also not forget that Claude's description of its methodology is itself an approximation, as we saw in the previous section on implicit emergent behaviors.

Claude doesn't have introspective access to its own weights or decision-making process. When it "explains" its methodology, it's actually doing inference: "Based on the pattern of my responses, a reasonable methodology would be..."

This means that your extracted prompt isn't exactly a perfect copy of the instance's behavior, but rather, Claude's best guess at what instructions would produce a similar behavior.

The 60-75% Sweet Spot

You should also remember that running your extraction prompt takes tokens, and it needs room to work well.

If you're at 95% token capacity and try to extract, Claude might:

  • Truncate its explanation
  • Miss nuances due to context pressure
  • Fail to generate a complete artifact

As explained in this blog post, the optimal extraction window is 60-75% of capacity, as this gives Claude enough remaining tokens to:

  • Re-read the full conversation with good attention
  • Generate a comprehensive methodology description
  • Create a refined, tested extraction prompt
  • Allow for one or more rounds of revision if needed
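
If you like having this rule written down, here is a minimal sketch of a helper that tells you which zone you're in. It assumes you can estimate your own token usage (Claude's chat UI doesn't show an exact count, so a crude rule of thumb like ~4 characters per token, or a usage figure from the API, has to stand in), and the 200,000-token limit is just a placeholder for whatever your model's context window actually is:

def extraction_zone(used_tokens: int, context_limit: int = 200_000) -> str:
    """Rough guide for when to run a self-reflection extraction.

    `used_tokens` is your own estimate (e.g. ~4 characters per token) and
    `context_limit` is a stand-in for your model's real context window.
    The 60-75% window is the one described in this post.
    """
    usage = used_tokens / context_limit
    if usage < 0.60:
        return f"{usage:.0%} used: keep training, extraction can wait"
    if usage <= 0.75:
        return f"{usage:.0%} used: sweet spot, extract now"
    return f"{usage:.0%} used: risky, extract immediately"

print(extraction_zone(130_000))  # 65% used: sweet spot, extract now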

Why Artifacts Cost Tokens Twice

Now for some final notes: when you convert the methodology into a downloadable prompt, that prompt is stored in an artifact.

You should, however, note that artifacts often consume tokens twice. Now you might ask, Prisca, why is this so?

Well:

  1. Generation tokens: It costs tokens to generate the artifact itself
  2. Context tokens: That artifact content then often gets included in the conversation history

So that 500-line code file that Claude created? It might cost 800 tokens to generate, then another 800 tokens of context space for every subsequent message (depending on how the system handles artifacts).
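
Here is that double-counting spelled out as toy arithmetic. The 800-token figure is the illustrative one from above, and the assumption that the artifact rides along in context on every later turn depends entirely on how the system actually handles artifacts:

# Toy arithmetic for the "artifacts cost tokens twice" point.
artifact_tokens = 800   # illustrative generation cost from the example above
later_messages = 30     # assumed number of messages sent after the artifact exists

generation_cost = artifact_tokens
context_cost = artifact_tokens * later_messages  # re-read on every later turn, if kept in context

print(f"Generation: {generation_cost:,} tokens")
print(f"Context re-reads over {later_messages} messages: {context_cost:,} tokens")
print(f"Total: {generation_cost + context_cost:,} tokens")  # 24,800 tokens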

This is why conversations with heavy artifact usage hit limits faster than pure text conversations.


Conclusion

And that's it, my people. I hope you found this content useful. Understanding how attention fades, personality emerges implicitly, and extraction costs real tokens will help you become a more strategic user of AI.

As always, thanks for reading!
