
XML-Structured AI Agents: Optimizing for How Models Actually Work

I discovered XML structuring could optimize my 7 Context Engineers for token efficiency, minimize attention degradation, and enforce priority hierarchies. Here's the redesign.

December 27, 2025 · 18 min read
777-1 · Context Engineering · XML

Overview

In my blog post about restructuring AI agents with XML tags, I mentioned stumbling upon information that would fundamentally change how the Seven Context Engineers for the 777-1 experiment are defined: specifically, how their agent definition files are structured.

This discovery happened BEFORE I even started implementing the Context Engineers as file-based agents across the 7 projects for this experiment. I was researching prompt engineering best practices and found Anthropic's documentation about XML tags for prompt clarity. The recommendation was straightforward: use XML tags like <instructions>, <example>, and <input> to create clear boundaries in your prompts.

But here's what caught my attention: the WHY behind this recommendation. XML tags don't just make prompts prettier. They address specific AI limitations:

  • Attention degradation in long contexts
  • The "Lost in the Middle" phenomenon where models miss information in middle positions
  • Priority collapse when natural language signals get ambiguous
  • Token waste from transitional phrases

I realized: if I'm building 7 specialized subagents for code review, each with detailed responsibilities, examples, and testing checklists, I need to structure them in a way that OPTIMIZES for how AI models actually process information.

So instead of writing prose-based agent definitions and hoping Claude would follow them consistently, I designed an XML schema from the ground up. Hierarchical. Explicit. Token-efficient.

Specifically, I used XML to structure the system prompt portion of each agent definition—the actual instructions Claude receives. The YAML frontmatter handles agent configuration (name, description, tools), while the XML sections provide hierarchical structure for the instructions themselves.

Now, I haven't found any official documentation that talks about structuring file-based agents using XML. But given the research on AI limitations and the benefits XML provides, I believe this restructuring is necessary.

This case study has 3 primary goals:

  1. Show you what the agent file definitions look like after restructuring them with XML
  2. Explain how this change helps manage known AI limitations (attention degradation, lost in the middle)
  3. Demonstrate why this method is more token-efficient and better for context engineering

Think of this as the architectural blueprint for the 777-1 Context Engineers: designed for how AI actually works, not just how humans read prompts.

The Project

This work is part of the larger 777-1 experiment: Seven Projects, Seven Subagents, Seven Case Studies, One Goal. The goal is to build an algorithm for predicting prompt failures that will power my AI Prompt Engineering Toolkit.

The 7 Context Engineers (Amber Williams, Kristy Rodriguez, Micaela Santos, Lindsay Stewart, Eesha Desai, Daniella Anderson, and Cassandra Hayes) were introduced in the "Meet the Team" case study. Each has a name, personality, and job description built from analyzing 129 code reviews.

But before implementing them as file-based agent definitions, I discovered XML structuring could fundamentally improve how Claude processes their instructions.

The Design Challenge:

How do you define an AI agent in a way that:

  • Maximizes attention on critical items (not everything is equally important)
  • Minimizes token waste (no transitional phrases that add zero information)
  • Prevents "lost in the middle" failures (information buried in prose gets ignored)
  • Enforces explicit priorities (no ambiguity about what's critical vs. optional)
  • Enables random access (Claude can jump to <success_metrics> without parsing everything before it)

The XML Solution:

I designed a hybrid structure for the agent definition files: YAML frontmatter for agent configuration + a 10-section XML schema for structuring the system prompt (the instructions Claude receives):

YAML Frontmatter (Claude Code Requirement):

---
name: agent-name
description: Agent description
priority: CRITICAL/HIGH/MEDIUM/LOW
tools: Read, Edit, Bash
---

10 XML Sections:

  1. <identity> - Who they are, their expertise
  2. <review_process> - Step-by-step workflow
  3. <responsibilities> - Prioritized checklist (critical/important/supplementary)
  4. <common_issues> - Frequent problems to catch
  5. <examples> - Good/bad code patterns
  6. <testing_checklist> - Verification steps
  7. <success_metrics> - Measurable outcomes
  8. <output_format> - How to structure findings
  9. <scope> - What to include/exclude
  10. <focus> - One-line mission statement

Plus domain-specific sections:

  • Amber Williams: <viewport_requirements> for responsive breakpoints
  • Lindsay Stewart: <wcag_requirements> for accessibility standards
  • Kristy Rodriguez: <forbidden_patterns> for fake functionality
  • Cassandra Hayes: <implicit_requirements> for contextual features

Projected Benefits (based on research):

Note: These are theoretical predictions based on attention degradation research and XML structure benefits. Phase 2 of 777-1 will test these predictions with actual project reviews.

  • Token efficiency: Estimated 25-30% reduction by eliminating transitional language
  • Attention management: XML tags act as positional anchors, preventing "lost in the middle" failures
  • Priority adherence: Explicit <critical> tags should improve compliance significantly
  • Ambiguity elimination: Standards tied directly to requirements (e.g., <standard>WCAG AA 4.5:1</standard>)

This case study documents the DESIGN and RATIONALE. Testing happens in Phase 2.

The Challenge

Before showing you the XML transformation, you need to understand the AI limitations that make this restructuring necessary. These aren't theoretical concerns. They're documented research findings that affect ALL large language models.

The "Lost in the Middle" Phenomenon

Research shows that AI models exhibit a 30-50% accuracy drop for information in middle positions of long contexts. This affects Claude, ChatGPT, Gemini, and every transformer-based model.

What this means in practice:

If I write a 1000-token agent definition in prose format and bury critical requirements in the middle (tokens 400-600), there's a significant chance Claude will miss them. Not because the instruction is unclear, but because attention mechanisms degrade with position.

Analogy: Imagine reading a 200-page IKEA manual before building furniture. By page 150, you're not retaining details. You're scanning for pictures and hoping for the best.

That's what happens to AI models processing long prompts. The beginning gets strong attention. The end gets recency bias. The middle? Statistically degraded.

Why this matters: Claude's context window is 200K tokens. But that doesn't mean it pays equal attention to all 200K. Attention is a limited resource, like working memory. The more you load in, the thinner it spreads.

XML creates attention anchors:

If success metrics are buried at token position 500-700 in a prose prompt, they might get degraded attention. But with XML tags, Claude can jump directly to <success_metrics> without processing everything before it. Random access vs. sequential reading.
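For illustration, a <success_metrics> block gives that anchor a concrete shape. The metric names and targets below are placeholders, not the shipped Amber Williams definition; the point is that the tag boundary lets Claude land on the section directly, wherever it sits in the file:

<success_metrics>
  <metric name="horizontal-scroll" target="0 instances across 320px-1920px" />
  <metric name="touch-targets" target="All interactive elements at least 44x44px" />
  <metric name="layout-integrity" target="No overlapping or clipped content at any breakpoint" />
</success_metrics>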

The Constraints

1. XML Is Not a Magic Bullet

XML structure addresses specific AI limitations (attention degradation, positional bias, ambiguity) but does NOT fix hallucinations, teach missing domain knowledge, or bypass context window limits. It makes existing information clearer; it doesn't generate new information. For example, if Claude doesn't know WCAG standards, XML tags won't teach them. You still need to provide the knowledge in the prompt or use RAG to retrieve it.

2. Significant Upfront Design Investment

Creating a comprehensive XML schema requires substantial planning time before writing any agent definitions. You need to identify all necessary sections, design hierarchical relationships, establish naming conventions, and create reusable patterns. For the 777-1 Context Engineers, this took approximately 8-10 hours of design work before implementing the first agent. This is time well spent (it prevents inconsistency later), but it's a real cost that prose-based approaches don't require.

3. Requires Validation Tooling

XML demands proper validation to catch syntax errors like mismatched tags, improper nesting, and invalid CDATA blocks that break parsing. You'll need tools like xmllint, IDE extensions (VS Code XML tools), or online validators. These tools add another dependency to your workflow. Without validation, a single unclosed tag can make an entire agent definition unusable, with errors that are hard to debug in production.
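As a minimal sketch of what that tooling can look like, the script below checks each agent definition for well-formed XML before it ever reaches Claude. It assumes the files live in .claude/agents/ as Markdown files with YAML frontmatter followed by an XML body rooted at <subagent> (matching the file structure shown later in this case study); adjust the path and root tag to your setup.

# check_agents.py - minimal well-formedness check for XML-structured agent files.
# Assumes .claude/agents/*.md with YAML frontmatter followed by a <subagent> XML body.
from pathlib import Path
import xml.etree.ElementTree as ET

for agent_file in Path(".claude/agents").glob("*.md"):
    text = agent_file.read_text(encoding="utf-8")
    start = text.find("<subagent>")
    if start == -1:
        print(f"{agent_file.name}: no <subagent> root element found")
        continue
    try:
        # Raises ParseError on unclosed tags, improper nesting, or broken CDATA blocks
        ET.fromstring(text[start:])
        print(f"{agent_file.name}: OK")
    except ET.ParseError as err:
        print(f"{agent_file.name}: {err}")  # error message includes line and column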

4. Frontend-Focused Schema Only

This XML schema was designed specifically for frontend code review subagents (responsive design, functionality, accessibility, state management). It may not be optimal for backend APIs, database operations, DevOps automation, or data science workflows. Each domain may need specialized sections. For example, API subagents might need <endpoint_validation> or <authentication_checks> sections not present in this schema.
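Purely as a hypothetical illustration (neither section exists in the 777-1 schema), a backend API subagent's extension might look something like this:

<endpoint_validation>
  <requirement>Every route returns explicit status codes for success and failure paths</requirement>
  <requirement>Request bodies are validated against a schema before processing</requirement>
</endpoint_validation>

<authentication_checks>
  <requirement>Protected routes reject requests without a valid session or token</requirement>
</authentication_checks>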

5. Schema Evolution and Maintenance

As requirements change, maintaining XML schema consistency across multiple agents becomes challenging. Adding a new section means updating the schema guide, blank template, and potentially all existing agents. Deprecating sections requires careful migration. With 7 agents, a schema change touches 7+ files. This overhead grows linearly with agent count (50 agents means 50 files to update). Prose definitions are easier to evolve individually, though they sacrifice the benefits of standardization.

6. Model-Specific Effectiveness

This XML approach is designed for Claude (Anthropic's models), which explicitly recommends XML tags in its documentation. Other models may not process XML structure as effectively. GPT-4 treats XML more like formatted text than structural boundaries. Open-source models vary widely in their attention mechanisms. If you're building multi-model agents (switching between Claude, GPT-4, Gemini), you may need different prompting strategies for each, undermining the portability benefit of a standardized schema.

XML structuring is a powerful technique for context engineering, but it requires upfront design time, validation tooling, and ongoing maintenance. It's optimized for Claude and frontend code review, not all models or domains. Complementary techniques (RAG, validation layers, iterative testing) remain necessary. The investment pays off for large-scale agent systems, but simple use cases may not justify the overhead.

My Approach

Step 1: Research Foundation

I didn't randomly choose XML. I studied the research:

  • Anthropic's official documentation on XML tags
  • "Lost in the Middle" phenomenon (Liu et al., 2023)
  • Attention degradation in long-context models
  • How positional embeddings affect information retrieval

Key insight: Structure isn't cosmetic. Models process structured data (XML, JSON) differently than prose. Tags create boundaries that attention mechanisms can use as anchors.

From Anthropic's docs:

"XML tags help Claude distinguish between different parts of your prompt... use tags like <instructions>, <example>, and <input> to clearly delineate sections."

That's the basic principle. I needed to extend it for code review subagents.

Step 2: Schema Design

I extended Anthropic's basic recommendations with a hybrid structure: YAML frontmatter for agent configuration + 10 XML sections for structuring the system prompt, tailored specifically for code review subagents.

YAML Frontmatter (Required by Claude Code):

When testing the agents, I discovered Claude Code requires YAML frontmatter at the beginning of agent definition files. This is a technical requirement of the tool, not something mentioned in Anthropic's documentation.

---
name: agent-name  # Unique identifier
description: Brief description of agent's role and scope
priority: CRITICAL | HIGH | MEDIUM | LOW  # Agent priority level
tools: Read, Edit, Bash  # Tools this agent uses
---

This frontmatter makes the original <metadata> XML section redundant, so I removed it from the schema.

The XML Structure (System Prompt):

The 10 XML sections below structure the system prompt—the actual instructions Claude processes when invoked. These sections use Anthropic's recommended XML tags to create clear boundaries, explicit hierarchies, and attention anchors.

10 XML Sections (Universal across all subagents):

  1. <identity> - Name, role, expertise, persona
  2. <review_process> - Ordered steps with sequence numbers
  3. <responsibilities> - Hierarchical checklist (critical/important/supplementary)
  4. <common_issues> - Frequent problems to catch
  5. <examples> - Good/bad patterns with code
  6. <testing_checklist> - Verification steps
  7. <success_metrics> - Measurable outcomes with targets
  8. <output_format> - Structured reporting template
  9. <scope> - Explicit include/exclude lists
  10. <focus> - One-sentence mission statement

Complete file structure:

---
name: amber-williams
description: Responsive design specialist
priority: CRITICAL
tools: Read, Edit, Bash
---

<?xml version="1.0" encoding="UTF-8"?>
<subagent>
  <identity>
    <name>Amber Williams</name>
    <role>Senior Frontend Developer - Responsive Design Specialist</role>
    ...
  </identity>

  <review_process>
    ...
  </review_process>

  <!-- 8 more sections -->
</subagent>

Design principle: Each tag should answer: "What would Claude need to know to execute this perfectly with zero prior context?"

Step 3: Domain-Specific Extensions

While the base schema is consistent, each subagent gets specialized sections:

Amber Williams (Responsive Design):

<viewport_requirements>
  <viewport name="mobile" range="320px-767px">
    <requirement>Single column layout</requirement>
    <requirement>Hamburger menu</requirement>
    <requirement>No horizontal scroll</requirement>
  </viewport>
</viewport_requirements>

Lindsay Stewart (Accessibility):

<wcag_requirements>
  <requirement type="contrast" level="normal">
    <standard>WCAG AA 4.5:1 minimum</standard>
  </requirement>
</wcag_requirements>

Kristy Rodriguez (Functionality):

<forbidden_patterns>
  <pattern name="fake-functionality">
    <code><![CDATA[
const handleExport = () => {
  toast.success('Exported!'); // ❌ No actual export
};
    ]]></code>
  </pattern>
</forbidden_patterns>

Cassandra Hayes (Integration):

<implicit_requirements>
  <requirement category="auth">User login/logout flow</requirement>
  <requirement category="help">Help documentation or tooltips</requirement>
</implicit_requirements>

Step 4: Token Optimization

For each section, I compared prose vs. XML token counts:

Example - Identity Section:

Prose (~80 tokens):
"Amber Williams is a senior frontend developer who specializes in responsive
design. She has extensive experience with mobile-first development and has
worked on projects ranging from small startups to enterprise applications.
Her main focus is ensuring that applications work across all devices."

XML (~40 tokens, 50% reduction):
<identity>
  <name>Amber Williams</name>
  <role>Senior Frontend Developer - Responsive Design Specialist</role>
  <expertise>Mobile-first design, cross-device compatibility, touch interfaces</expertise>
</identity>

Structure eliminates transitional language. The tags themselves convey hierarchy.

Step 5: Priority Hierarchy Design

The most critical design decision: how to prevent priority collapse.

Solution: Nested Priority Tags

<responsibilities>
  <critical>
    <!-- MUST be addressed first, non-negotiable -->
    <item>Touch targets minimum 44x44px</item>
    <item>Zero horizontal scroll on mobile</item>
  </critical>
  <important>
    <!-- Should be addressed after critical -->
    <item>Breakpoint transitions smooth</item>
  </important>
  <supplementary>
    <!-- Check if present, but not required -->
    <item>Print stylesheets</item>
  </supplementary>
</responsibilities>

This makes priority EXPLICIT. No weak language signals ("important too", "also check"). The model can't misinterpret.

Step 6: Validation Strategy (For Phase 2)

When 777-1 Phase 2 testing begins, I'll measure:

Metrics to track:

  • Coverage completeness: % of in-scope files reviewed
  • Priority adherence: Are critical items addressed first?
  • Output consistency: Do reports follow <output_format>? (see the sketch at the end of this step)
  • Token efficiency: Actual tokens used per review
  • False positive rate: Issues flagged that aren't real

Predicted improvements (to be validated):

  • Coverage: 75% → 95%
  • Priority adherence: 70% → 95%
  • Output format compliance: 60% → 98%
  • Token efficiency: 1500 → 900 tokens per review
  • False positives: 15% → 7%

These predictions are based on research about attention mechanisms and XML structure benefits. Real-world validation happens when the agents review actual 777-1 projects.
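Since output format compliance is one of the tracked metrics, here is an illustrative sketch of the kind of reporting template an <output_format> section can define. The section names and ordering are placeholders, not the shipped definitions:

<output_format>
  <report>
    <section order="1">Summary: one-paragraph verdict</section>
    <section order="2">Critical issues: file, line, description, suggested fix</section>
    <section order="3">Important issues: same structure as critical</section>
    <section order="4">Metrics: result against each success metric target</section>
  </report>
</output_format>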


The result: Seven XML-structured agent definitions designed specifically for how AI models process information. Not just formatted differently—architected differently.

You can see the full schemas in the downloadable resources below.


Key Findings

Predicted 25-30% Token Reduction

Theoretical analysis shows XML can eliminate transitional language waste ('Additionally', 'Furthermore', 'It's also important to note') by making structure speak for itself. Based on comparing prose vs. XML versions of the same content, estimated savings of 25-30% per agent definition. This means ~400 tokens freed per review cycle. These tokens are available for actual code context instead of parsing ambiguity. Phase 2 testing will validate these predictions.

XML Tags as Positional Anchors

Research on the 'Lost in the Middle' phenomenon shows models exhibit a 30-50% accuracy drop for information in middle positions of long contexts. XML tags like <success_metrics> should allow Claude to jump directly to sections regardless of position, bypassing sequential reading. This addresses attention degradation theoretically, but real-world validation requires testing with actual agent execution in Phase 2.

Explicit Hierarchy Over Weak Language Signals

Unstructured prompts use weak priority signals ('very important', 'also check', 'if time permits') that degrade under attention pressure. XML makes priority NON-NEGOTIABLE via nested tags: <critical>, <important>, <supplementary>. Research suggests this could improve adherence from ~70% (prose) to 95%+ (structured), but this is a prediction based on attention mechanism studies, not measured results. Testing will confirm or refute this hypothesis.

Standards Tied Directly to Requirements

Phrases like 'adequate color contrast' are ambiguous. WCAG AA? AAA? What ratio? XML eliminates interpretation: <standard>WCAG AA 4.5:1 for normal text</standard>. This should theoretically reduce false positive rates from ~15-20% (subjective interpretation) to ~5-8% (explicit standards), but actual impact depends on Claude's ability to verify standards during reviews. Validation in Phase 2 will measure real false positive rates.

Context Engineering as Data Design

The key mental shift: stop writing instructions like talking to a human ('Please check responsive design carefully'). Start writing instructions like programming an API (<review_process><step order='1'>Check responsive behavior</step></review_process>). Claude is a transformer model processing tokens, so structure matters as much as content. Think data schema, not essay. This design philosophy guided the XML restructuring before any implementation began.

XML Doesn't Fix Everything

XML makes instructions clearer but doesn't prevent hallucinations, teach missing domain knowledge, or bypass context window limits. If Claude doesn't know WCAG 2.1 standards, <standard>WCAG 2.1 Success Criterion 1.4.3</standard> won't teach it. XML also doesn't fit 50,000-line codebases in a 200K context window. Complementary techniques still needed: RAG for external knowledge, validation layers for accuracy, iterative refinement for complex tasks. XML is ONE tool in the context engineering toolkit, not a complete solution.

Download Resources

Complete XML Bundle (All 7 Context Engineers)

Bundled .zip file containing all 7 XML-structured subagent definitions: Amber Williams (Responsive), Kristy Rodriguez (Functionality), Micaela Santos (Design Systems), Lindsay Stewart (Accessibility), Eesha Desai (State Management), Daniella Anderson (Code Quality), and Cassandra Hayes (Integration). Drop them in your .claude/agents folder and start using them.

ZIP · 68 KB

Blank Subagent Template

A blank, copy-paste ready template following the schema used in 777-1. Includes YAML frontmatter plus 10 XML sections: identity, review_process, responsibilities, common_issues, examples, testing_checklist, success_metrics, output_format, scope, and focus. Start building your own subagents immediately.

MD · 6 KB

XML Schema Reference Guide (PDF)

A comprehensive reference guide explaining the hybrid structure used in the 777-1 Context Engineers. Covers YAML frontmatter requirements and all 10 core XML sections (identity, review_process, responsibilities, etc.), shows code examples, documents naming conventions, and provides the complete schema template. Your blueprint for creating well-structured subagent definitions.

PDF · 420 KB

Related Content


Related Case Studies

Meet the Team: 7 Custom Subagents Built from 129 Code Reviews

I analyzed 129 code reviews and extracted the 7 most common issues. Then I turned each one into a subagent with a name, a personality, and a detailed job description. Here they are.
