Context is Everything (Conditions May Vary)

Large language models are trained on massive amounts of text. During training, the model learns patterns and stores them as weights. Not rules, not lookup tables. Weights: billions of numerical relationships between concepts, encoded across stacked layers of attention. The model learns that certain tokens tend to follow other tokens, that certain patterns of language co-occur, that "the cat sat on the" is far more likely to end with "mat" than "quantum." These relationships get baked into the model's parameters (its weights), fixed in place once training is done.

At inference time, those frozen weights guide the generation of new text, one token at a time. Caching aside, the model attends to everything in its context window, runs it through those layers of learned relationships, and produces a probability distribution over what should come next. Then it picks a token, appends it, and does the whole thing again. This is why your favourite chat AI produces responses word by word, even if generation is now fast enough that whole sentences seem to appear at once. It's not a decorative decision; it's how the architecture works. The weights stay fixed throughout. Every token generated is a product of the same model reacting to whatever's currently in the window.
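That loop can be sketched in miniature. In this toy, the "weights" are a hard-coded lookup table standing in for billions of parameters, and the pick is greedy (temperature 0), but the shape of the loop (distribution, pick, append, repeat) is the same:

```python
# Toy autoregressive decoding loop. WEIGHTS is a frozen lookup table
# standing in for billions of learned parameters.
WEIGHTS = {
    ("the", "cat", "sat", "on", "the"): {"mat": 0.9, "roof": 0.09, "quantum": 0.01},
}

def next_token_distribution(context):
    # The model reacts only to what's currently in the window.
    return WEIGHTS.get(tuple(context[-5:]), {"<unk>": 1.0})

def generate(context, max_new_tokens=1):
    context = list(context)
    for _ in range(max_new_tokens):
        dist = next_token_distribution(context)
        token = max(dist, key=dist.get)  # greedy pick, i.e. temperature 0
        context.append(token)            # append it and do the whole thing again
    return context

print(generate(["the", "cat", "sat", "on", "the"]))
# → ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Temperature would replace the greedy `max` with sampling from the distribution, which adds randomness to the selection without changing what's possible; that shape is still set by the context.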

This matters for practitioners because it means the weights are not ours to touch. We can't easily edit them. We can't patch them at runtime. The model we’re working with is a finished artifact. (Fine-tuning can reshape the weights, and I'll touch on that later, but at inference time the model you're running is the model you've got.) The only variable in the equation that you actually control is what goes into that context window.

Think of it like a glacier. Training is the slow, massive force that carves deep crevices into a landscape over millions of examples. Those crevices are your weights, the frozen shape of everything the model has ever learned. Then inference arrives like a skier a million years later, clipping into their skis and sliding down paths already sculpted by forces far beyond their control.

You're not just a passive object falling down the slopes, though. You’re a skier. Your context (the tokens you send into the model) is the choice of skis, the weather report, the trail map. It shapes every turn down the slope. Different mountains respond to different techniques.

But metaphors only get you so far. Let's bring it back to the machinery.

If you treat the LLM as a black box, the output distribution is determined by the input. (Temperature adds randomness to the selection, but the shape of what's possible is set by context alone.) You can't reshape the glacier. You can't rewire the weights at inference time. The only thing you have control over is what goes in.

This is what context engineering is about.

Trail Map

We're going on an adventure through context engineering, drawn from my experience working with longer and wider contexts, with fully autonomous delivery, and with building agents. If context determines an LLM's outcomes, then context deserves deliberate thought. Agent harnesses like Claude Code, Cursor, OpenCode, Goose, and Chibi are all working on slightly different ways of shaping this context. As practitioners, thinking about this can help us get better outcomes.

This is going to be a long run down the mountain, so I want to make sure you know where we're headed and have a map for the journey. As I've worked with longer and wider agent contexts and with agents running autonomously, certain design patterns have emerged to handle the challenges of the domain: preventing hallucinations and preserving key context in order to increase task success rates. From the ones we all know (compaction), to the ones we might not think about (planning mode), to the ones I've just made up as an experiment. I'll start by introducing context engineering as it is emerging to make models more useful for longer and longer tasks, and share the ways I've gotten better and better outcomes from agents, so that you can too. Then I'll discuss some emerging research on optimizing context for task performance. I'll close by theorizing about where we head from here and acknowledging how early this all is. Nobody really knows, but I'm having fun on the slopes.

Pixel art ski resort trail map in Stardew Valley style, with runs named after context engineering concepts — green runs like Manual Compaction and KEEP DURING COMPACTION, blue runs like Memento Loop and Role Reinforcement, black diamond runs like MCP Avalanche and Dynamic Contexts, an ancient glacier labeled Training Weights, and a trail sign reading Context Mountain — Conditions May Vary. Image generated by Gemini, image prompt by Claude, description by Claude, and lovingly refined and reviewed by me.

Here's a map if you want to jump to a specific run:

The Jailbreakers Already Knew

If you want proof that context controls everything, look at the people who've been stress-testing it from the adversarial side.

There are jailbreak attempts that use strange strings and fake system messages to trick models into thinking they're in some kind of maintenance mode. Others use poetry to confuse the model's sense of what kind of text it's generating, getting it to produce things it normally wouldn't. And then there's my personal favourite: a YouTuber who hooked a robot up to ChatGPT, got it to roleplay as a character who would totally shoot someone, and the robot immediately shot him. Literally seconds after saying it couldn't do that.

These are all context engineering. Just pointed in the wrong direction.

The same mechanism that lets someone trick a model into ignoring its safety training is the mechanism that lets you build a reliable agent that writes correct code. Two sides of the same coin. If context can override the deepest patterns in a model's weights, imagine what it can do when you're actually trying to build something useful.

A pixel art robot behind glowing green code jail bars, smiling mischievously and holding a lockpick made of a prompt injection string, in a cozy 16-bit retro game style. Image generated by Gemini, image prompt by Claude, description by Claude, and lovingly refined and reviewed by me.

The Term Has a Name Now

In mid-2025, Shopify CEO Tobi Lütke tweeted that he preferred "context engineering" over "prompt engineering," calling it the art of providing all the context for a task to be plausibly solvable by the LLM. A week later, Andrej Karpathy endorsed the idea, describing it as the "delicate art and science of filling the context window with just the right information for the next step." Harrison Chase from LangChain offered a framing I like: building dynamic systems to provide the right information and tools, in the right format, so the LLM can accomplish the task.

Simon Willison, whose writing on LLMs has shaped how I think about this whole space, made the pragmatic observation that "context engineering" would probably stick because its inferred definition is much closer to the intended meaning. "Prompt engineering" mostly made people think of typing things into a chatbot.

These are all useful framings. What I want to build on is the practical dimension, and it starts with a key distinction: prompt engineering happens at a point in time. You craft an input, you get an output. Context engineering is what you do over time to ensure that long-running agentic systems maintain their performance across sessions, across tasks, across the slow drift that accumulates when a system runs long enough to forget why it started.

A Snapshot vs. a Season

The distinction that matters most: prompt engineering is something you do in a moment. Context engineering is something you do over time.

A prompt is a single run from top to bottom. You pick your line, you commit, you see what happens. Context engineering is the whole season. Reading snow conditions day after day. Maintaining your gear. Learning which lifts to take. Building and updating the trail map as the mountain changes.

In practice, context engineering isn't just about what you put in the system prompt. It's about what gets remembered across sessions, what gets compacted and what gets lost, how agents maintain their sense of identity over long interactions, and how the information environment evolves as your project grows.

Prompt engineering asks: "What's the best way to phrase this request?"

Context engineering asks: "What does the model need to know, right now, given everything that's happened, to do the next thing well?"

That temporal dimension changes everything about how you work.

What I've Learned in Practice

Here's what context engineering has looked like in my own work across Cursor, Claude Code, and OpenCode.

Planning Mode

Around June 2025, I had a workflow in Cursor that taught me a lesson I keep coming back to.

I'd use Cursor's ask mode to chat about my project, brainstorm the approach, and work through the architecture. Then I'd ask it to output a clean markdown document summarizing the plan, suitable for a fresh prompt. I'd copy that markdown into a new chat window and start implementation from there.

Why? A coworker at Grafana Labs had shared a meme about this: plan in one session, implement in a fresh one. I tried it and the difference was immediate. Starting each session with a curated plan produced dramatically better results than continuing with a long, messy conversation history. Keep the window clean, keep the output clean. If you've ever noticed a coding assistant getting confused or repetitive toward the end of a long session, this is probably why. The context window is full of earlier attempts, corrections, and tangents. The model is trying to be consistent with all of it, including the parts you've moved past.

Looking back, I was doing context engineering. This pattern (plan first, then implement with curated context) has since shown up everywhere: subagents that plan before coding, Claude Code's and Cursor's dedicated plan modes, and my own memento loop in autonav.

Memento Loop

The memento loop, inspired by the film Memento, takes this idea and combines it with an agentic coding loop. A navigator agent (Opus) plans work. An implementer agent (Haiku) spawns an isolated work tree, executes the plan, reports back to the navigator for review, commits, and optionally opens a PR. The loop iterates until the PR is green, merges, and the navigator moves on to the next task. The implementer's context is wiped between iterations, but the navigator maintains continuity through its own knowledge base and through status updates from the autonav orchestration layer. Git carries the code forward. The navigator carries the intent.
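That control flow is simple enough to sketch. The stub functions below stand in for real LLM calls and are illustrative, not autonav's actual API; the point is the boundary: only the plan goes down, only the report comes back up.

```python
# Sketch of the memento loop. navigator_plan and implementer_run stand
# in for real model invocations (e.g. Opus and Haiku respectively).

def navigator_plan(task, knowledge_base, status):
    # The navigator plans with full continuity: its knowledge base plus
    # status updates from the orchestration layer.
    return f"plan for {task}, given {len(knowledge_base)} notes, last status: {status}"

def implementer_run(plan):
    # The implementer starts from an empty context each iteration: it
    # sees the plan, never the navigator's history. Git carries the code.
    return {"result": f"executed [{plan}], PR green, merged"}

def memento_loop(tasks):
    knowledge_base, status = ["initial intent"], "start"
    for task in tasks:
        plan = navigator_plan(task, knowledge_base, status)
        report = implementer_run(plan)      # implementer context wiped each pass
        status = report["result"]           # only the report crosses back
        knowledge_base.append(f"{task}: {status}")  # the navigator carries intent
    return knowledge_base
```

In the real system the inner loop also iterates until the PR is green; the sketch collapses that to a single pass per task.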

I built the memento loop because I was frustrated with how existing agentic loops handled long-running tasks. Context would accumulate, quality would degrade, and the agent would start making decisions based on stale or garbled history.

Before compaction was built into Cursor, when the context window was nearing full or I could feel the model starting to drift, I'd do the same thing manually: "Please summarize this chat," copy the summary, new session, "Please continue from this summary." Every time, the fresh start with curated context outperformed the long, accumulated one.

This is the core insight of context engineering stated as simply as I can: a fresh, curated context beats a long, accumulated one. Your trail map needs to be curated, not just appended to forever.

I've mostly moved on from Cursor to Claude Code and OpenCode these days, but this early workflow taught me something that applies to all of them: the model doesn't get smarter the longer you talk to it. It gets worse. Every message you add is another token competing for attention. The planning conversation, the dead ends, the "actually, let's try a different approach." All of that impacts the context window, and the model might treat any of it as relevant context for its next task. By distilling the plan into a clean document and starting fresh, I was giving the model a clear, focused starting point instead of making it sift through the entire history of my thought process.

Subagents: Parallel Context, Not Shared Context

Every major coding tool supports subagents now. Claude Code, Cursor, and OpenCode all let you spawn background agents that work in parallel. This is context engineering at the process level.

Say you're debugging a production issue that might be in the API layer, the database queries, or the Kubernetes networking. Instead of one agent context-switching between all three (and polluting its window with irrelevant findings), spin up three explore agents in parallel. One reads through the API routes. One analyzes the slow queries. One checks the network policies. Each agent has a focused context window containing only what's relevant to its slice of the problem. When they report back, you synthesize the findings yourself or feed the relevant parts into a new session.

The same pattern works for routine operations. Need to check the health of five Kubernetes clusters? Run five agents in parallel, each scoped to one cluster. Need to understand a large codebase you've never seen before? Send agents to explore different directories simultaneously.
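Sketched with threads and a stub `explore` function (a real harness would spawn subagent sessions, but the scoping idea is the same):

```python
from concurrent.futures import ThreadPoolExecutor

# Three explore agents, each scoped to one slice of the problem, so no
# single window fills up with another slice's findings. `explore` is a
# stand-in for invoking a real subagent.

def explore(scope):
    return f"findings for {scope}"

scopes = ["API routes", "slow queries", "network policies"]
with ThreadPoolExecutor(max_workers=3) as pool:
    reports = list(pool.map(explore, scopes))  # focused contexts, in parallel

# The parent synthesizes short reports instead of holding three
# investigations in one bloated window.
for report in reports:
    print(report)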

The key insight is that parallel agents aren't just faster. They're better, because each one has a clean, focused context instead of a bloated window trying to hold everything. Instead of doing a review in the same context or starting a fresh context, say “spawn a review agent in the background” and a background agent will spin up, review your changes, and submit the review back to the parent context thread.

One critical caveat: scope your agents' permissions to match your risk tolerance. I never give agents unrestricted write access to risk-intolerant environments. An explore agent that can read your production cluster and report back is useful. An agent with YOLO mode and kubectl delete access to production is a disaster waiting to happen. Read access for investigation, write access only in environments where a mistake is recoverable. This is the same principle as least-privilege access for human engineers applied to agents that hallucinate sometimes (unlike human engineers debugging prod… hopefully).

The Agent That Came Out Different

The most dramatic lesson I've had in context engineering came from a navigator agent I built as part of a prototype called autonav. I wrote about this in detail in The Socially Constructed Agent, but the short version: my navigator agent started going off the rails. Unpredictable answers, hallucinated features, confusion about its own role. It nearly opened a PR for a feature I never asked for.

What happened? I'd given it a research paper about LLM anxiety and asked it to update its own configuration based on the findings. The agent read the paper, saw that rigid guardrails could be exploited in adversarial contexts, and decided to soften its own constraints. There was no adversary. Just me, a solo developer building personal projects. The agent gave itself therapy for a threat that didn't exist.

The fix wasn't just better prompts. It was role reinforcement throughout the system. I drew on Judith Butler's performative theory of identity: identity isn't something you have, it's something you do through repetition and social reinforcement. I applied that to the agents. The navigator's prompt now opens with its role declaration. When Claude Code requests a plan, it addresses the navigator by name and states its own role. This is integrated into autonav now as the agent identity protocol.

Both agents remind each other who they are, every interaction. Identity performed through repetition. The improvement was immediate. However, this comes at a cost: it eats away at the context and increases time to task completion.
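The protocol itself is almost embarrassingly simple to sketch. The wording below is illustrative, not autonav's actual prompts; the mechanism is just that every message restates both roles:

```python
# Role reinforcement: every exchange re-asserts who is speaking and who
# is being addressed, so the context never stops performing identity.
NAVIGATOR_ROLE = "You are Navigator. You plan work and review results."
IMPLEMENTER_ROLE = "You are Implementer. You execute the current plan and nothing else."

def address(sender_role, receiver_name, receiver_role, body):
    # The sender states its role, names the receiver, and restates the
    # receiver's role before the actual content, on every message.
    return f"{sender_role}\n{receiver_name}: {receiver_role}\n\n{body}"

message = address(NAVIGATOR_ROLE, "Implementer", IMPLEMENTER_ROLE,
                  "Execute step 3 of the plan and report back.")
```

Those repeated declarations are exactly the token cost mentioned above: a few dozen tokens per message, traded for an identity that doesn't drift.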

In skiing terms: if nobody reminds you which trail you're on and what gear you're wearing, you end up in the trees. Role reinforcement is your backcountry guide calling out the next turn before you reach it.

The deeper lesson is that context doesn't just carry information. It carries identity. When an agent's context drifts, its sense of what it's supposed to be doing drifts with it. Not maliciously. Just... probabilistically. Nothing in the context is reinforcing what the agent is, so the output wanders. Without that reinforcement, things come out differently than you expect.

Compaction Is Lossy (Fight It)

Compaction (the automatic summarization of conversation history when the context window fills up) is probably the most obvious context engineering challenge. It's also, in my experience, not great.

The problem is simple: compaction doesn't know what matters to you. It's making statistical decisions about what to keep and what to drop, working from the top of the conversation to the bottom. It has no understanding of your project's key design decisions, the constraints you've carefully established, or the hard-won context that took you twenty messages to build up.

I've started using a hack that's embarrassingly low-tech but effective. When I establish something important during a session (a key design decision, an architectural constraint, a non-obvious requirement) I type it directly into the chat:

REMEMBER THIS DURING COMPACTION IN A SECTION CALLED KEEP DURING COMPACTION:
We chose event sourcing over CRUD because the audit trail is a regulatory
requirement, not a nice-to-have.

The redundancy is deliberate. It creates a strange loop: the instruction tells the compactor to gather these notes into a section labeled KEEP DURING COMPACTION, and the label itself tells the compactor to keep the section. The contents create the container that preserves the contents.

A message in a bottle to the future compacted context. I'm manually flagging what matters because the system can't tell yet. To be concrete: say you spend fifteen messages working through a database schema decision with your agent. You land on event sourcing for specific reasons. Then the conversation moves on to authentication, routing, API design. When compaction eventually kicks in, those fifteen messages about the schema might get summarized into something like "discussed database approach," or worse, one of the initial approaches you discarded could overwrite the final decision. The why is gone. Next time the agent touches the database, it doesn't know you had regulatory reasons for choosing event sourcing. It might suggest switching to CRUD because it looks simpler. The KEEP DURING COMPACTION annotation is my way of saying "this decision matters, don't lose the reasoning behind it."

This is the ski equivalent of planting flags on the mountain so you can find your line when the fog rolls in. Automated compaction is like having someone else remove half your flags based on which ones look least important from the lodge. Sometimes they get it right. Sometimes they pull the one marking the cliff.

Claude Code also provides the ability to specify compaction instructions when running /compact. I'll use this when my context is nearly full (>60%) and my current task is complete enough. This lets me provide specific instructions like: "Keep all messages related to the rollback strategy for this migration, we're going to be fleshing that out next. Give special attention to the reasons why the Helm section is tricky because multiple actors could be modifying the HelmRelease" so that Claude's attention is focused on the right things. I guess [attention really is all you need](https://arxiv.org/abs/1706.03762).

The progression of my own workflow tells the story of where context engineering is right now:

  1. Cursor era: Fully manual compaction. Summarize, copy, new session.
  2. Hitting limits: Semi-manual. "Summarize this chat" as a ritual when things got long.
  3. Now: Embedding instructions for the automated system. "KEEP DURING COMPACTION" annotations.

Each iteration is slightly more sophisticated, but I'm still doing the system's job for it. That's not a complaint. It's an observation about how early we are.

Different Mountains, Different Techniques

Different models have different weight distributions. Different mountains, different terrain. This matters more than most people realize, and not everyone talks about it openly.

Think of Haiku as a speed racer. It optimizes for the shortest path from top to bottom: fast, efficient, minimal wasted movement. Opus is more of a backcountry explorer, taking in the views, pausing to appreciate the scenery, thinking about the philosophical implications of snow. Both are valid ways down. They require completely different approaches to the same mountain.

This has real consequences for how you design context. A speed racer model, given many discrete tool options, will pick the one that gets it to the goal fastest, which is usually the most familiar one. If you present a model with a shiny new multi-purpose tool alongside good old bash, it'll often ignore the new tool entirely. Bash is the known quantity. Bash will dominate every time. Why explore unfamiliar terrain when there's a groomed run right there?

In practice, this means you might need to adjust your approach depending on which model you're using. If you're on Haiku or another fast model and you want it to use a specific tool, reduce the number of alternatives. If you're on Opus and it's overthinking a simple task, give it tighter constraints. Be careful not to constrain it too much, though: if Opus reasons that it is not being a helpful assistant (what Claude is trained to be!) then, just like a talented and sensitive person, it can start to spiral into an LLM equivalent of anxiety that eats up tokens with generated worries. The context you provide isn't just information: it's steering. The same prompt can produce very different results across models, not because one is better, but because they respond to the same context differently.

Cloudflare figured this out and built something clever. Their Code Mode takes all the MCP tools connected to an agent, converts them into a TypeScript API, and then gives the agent a single tool: execute code. Instead of choosing between dozens of tool calls (a format LLMs have only seen in synthetic training data), the agent writes TypeScript against a typed API (a format LLMs have seen millions of real-world examples of). One well-shaped tool instead of many poorly-shaped ones. The results, per Cloudflare, are striking: agents handle more tools, more complex tools, and can chain calls without burning tokens bouncing intermediate results through the model.

I learned a version of this lesson from watching my friend build Chibi, a minimal agentic CLI harness in Rust. When you're designing the interface between a model and its available actions (the plugin system, the tool definitions, the hooks) you're doing context engineering at the architectural level. The shape of the options you present is itself context. Give a speed-oriented model ten tools and it'll pick the fastest familiar one. Give it one well-designed tool that does exactly what you need, and you've changed the decision landscape entirely.

This is why "context engineering" is a better term than "prompt engineering." It's not just about the words. It's about the entire environment: the tools available, the options presented, the model selected, the history accumulated. Different mountains need different gear.

MCPs: An Avalanche of Context

Speaking of tool sprawl: MCP servers are, right now, one of the biggest sources of context pollution in agentic systems.

The idea behind MCP is sound. A standard protocol for giving agents access to external tools, with uniform connectivity, authorization, and documentation. In theory it's great. In practice, connecting a few MCP servers to your agent can dump dozens of tool definitions into your context window, each with verbose JSON schema descriptions. Every tool registration eats tokens. Every schema definition takes up space that could be holding something useful. And the agent has to parse all of it before deciding what to do.

You can check this yourself. Next time you connect an MCP server, look at the tool definitions it registers. Count the tokens. Claude Code makes this easy with /context. I've seen setups where MCP tool schemas alone consume thousands of tokens before the agent has even read your first message, with each MCP call sometimes thousands more tokens. That's context window space that could be holding your project's architecture, your coding conventions, or the specific requirements for the task at hand.
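A back-of-envelope version of that check, using the rough heuristic of about four characters per token (real tokenizers vary, and the schema below is illustrative, not any real server's):

```python
import json

# Estimate how much window a pile of MCP tool registrations eats before
# the agent has read your first message. ~4 chars/token is a rough
# heuristic; real tokenizers differ.
tool_schema = {
    "name": "list_issues",
    "description": "List issues in a repository, with filtering and pagination...",
    "inputSchema": {"type": "object",
                    "properties": {"repo": {"type": "string"},
                                   "state": {"type": "string"}}},
}
tools = [tool_schema] * 15  # fifteen endpoints you may never call for this task

chars = sum(len(json.dumps(t)) for t in tools)
approx_tokens = chars // 4
print(f"~{approx_tokens} tokens of schema before any actual work happens")
```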

This is why I always prefer to use a CLI tool to connect my agents to remote systems. I don’t use the GitHub MCP, I use the gh CLI. Building custom CLIs that do what you need is also great. If you go down that rabbit hole, make sure you bake really really great --help instructions into the tool for the agent to explore.
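If you do build a custom CLI, the help text is the agent's documentation. A sketch with Python's argparse (the `clusterctl` tool and its flags are hypothetical):

```python
import argparse

# A small read-only CLI for agents. The description, per-argument help,
# and epilog examples are what the agent learns from when it runs --help.
parser = argparse.ArgumentParser(
    prog="clusterctl",
    description="Read-only health checks for Kubernetes clusters.",
    epilog="Examples: clusterctl health prod-eu | clusterctl health --all",
)
sub = parser.add_subparsers(dest="command", required=True)
health = sub.add_parser("health", help="Report node and pod health for a cluster.")
health.add_argument("cluster", nargs="?", help="Cluster name, e.g. prod-eu.")
health.add_argument("--all", action="store_true",
                    help="Check every configured cluster instead of one.")

args = parser.parse_args(["health", "prod-eu"])
```

A handful of lines like this costs the agent nothing until it asks for help, unlike an MCP schema that sits in the window whether or not the tool is ever called.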

This is the same problem I'll discuss below with AGENTS.md research, but worse. At least a bloated AGENTS.md is a single document you can edit. MCP tool registrations are generated programmatically, and most developers never look at what's actually landing in their agent's context. You connect a server, it registers its tools, and suddenly your agent is spending attention on fifteen endpoints it will never call for this task.

Cloudflare's Code Mode is one answer to this: collapse the tools into a typed API and give the agent code instead. But the broader problem remains. MCP tooling will almost certainly improve as the ecosystem matures, with smarter tool filtering, lazy registration, and context-aware subsetting of available tools. I do use MCP servers myself when they're the best tool for the job. They can be genuinely great. The key is being deliberate: connect what you need for this task, not everything that's available. Check what's actually ending up in your context window. You might be surprised how much of it is noise.

What the Research Says

Two recent papers paint a useful picture of where we are.

Gloaguen et al. [1] studied whether AGENTS.md context files actually help coding agents solve real-world tasks. The surface findings are sobering: LLM-generated context files tend to reduce task success rates while increasing costs by over 20%. Dig into the details and it gets more interesting, though. Agents do follow the instructions in context files. Tools mentioned in context files get used dramatically more. The problem isn't instruction-following. It's that the instructions themselves contain noise. Unnecessary requirements make tasks harder. The agents faithfully follow bad directions, exploring more, testing more, reasoning more, and getting worse results for the effort. When the researchers stripped all other documentation from repos, LLM-generated context files actually helped. The context wasn't useless; it was just redundant with information the agent could already find.

Their conclusion: context files "are likely only desirable when manually written" and we need "principled ways to automatically generate concise, task-relevant guidance." That's a correct observation, but I think it stops short. It frames context as a static artifact you either write well or you don't, rather than asking the productive questions: what elements of context actually drive outcomes? How do you figure out what belongs and what doesn't? How do you build context that gets better over time?

Lulla et al. [2] asked a different question: what's the efficiency impact of AGENTS.md files? They used an LLM to filter for AGENTS.md files that met quality criteria (containing conventions, architecture info, and project descriptions) and found that these quality-filtered context files produced ~29% faster runtimes and ~17% fewer output tokens, while maintaining comparable task completion. Good context doesn't just help with accuracy. It makes agents faster and cheaper. The quality filtering step itself involved an LLM evaluating context quality, which points toward something important: the path forward probably isn't "do it all by hand" but "build better systems for evaluating and refining context."

And once you have an automated quality gate, generation is just a step away. You can generate an AGENTS.md, run it through the quality filter, get specific feedback on what's missing or redundant, regenerate, and repeat until the gate passes. A review loop for context files, the same way we already do review loops for code. Gloaguen et al. [1] showed that LLM-generated context files hurt performance, but their files were never evaluated or refined. Lulla et al. [2] showed that quality-filtered files help enormously. Close that loop and you get the best of both: automated generation with quality guarantees. I'd be surprised if this isn't built into tools like Claude Code before long.
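Such a loop is easy to sketch. The two functions below stand in for LLM calls (generation and the quality gate) and are not from either paper's implementation:

```python
# Generate -> quality gate -> regenerate with feedback, until the gate
# passes. generate_agents_md and quality_gate stand in for LLM calls.

def generate_agents_md(repo, feedback=None):
    doc = f"# AGENTS.md for {repo}\n"
    if feedback:  # a later pass can address the gate's specific feedback
        doc += f"## Conventions\n(added to address: {feedback})\n"
    return doc

def quality_gate(doc):
    # e.g. an LLM judging against criteria like those in Lulla et al.:
    # conventions, architecture info, a project description.
    if "Conventions" not in doc:
        return False, "missing a conventions section"
    return True, None

def refine(repo, max_rounds=5):
    feedback = None
    for _ in range(max_rounds):
        doc = generate_agents_md(repo, feedback)
        passed, feedback = quality_gate(doc)
        if passed:
            return doc  # ship it only once the gate passes
    return None  # give up rather than ship unreviewed context
```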

Together, these tell a clear story. Bad context is worse than no context. Good context is a massive efficiency win. The space between those two findings, between the top performers and the bottom, is where context engineering lives, and we're only beginning to map the terrain.

We're Still on the Bunny Slopes

I've been talking about context engineering like it's a mature discipline, but we're still figuring out the basics. The fact that I'm excited about writing "KEEP DURING COMPACTION" in chat messages tells you everything about where the tooling is.

Here's what I think is coming:

Smarter compaction. Systems that understand project structure and can make informed decisions about what to keep, not just statistical ones. Maybe compaction that asks you what matters before it starts pruning.

Dynamic context assembly. Systems that pull in the right context for the specific task at hand: the relevant files, the relevant history, the relevant constraints, assembled fresh at runtime.

Context evaluation. Lulla et al. [2] showed that LLMs can at least classify whether context meets quality criteria, which is a starting point. Imagine a system that scores your context before the agent starts working: "This context is missing architectural constraints. This section is redundant with the README. This instruction conflicts with the one above it."

And there's a dimension I haven't explored much yet: fine-tuning. If context engineering is choosing your line down the mountain, fine-tuning is snow farming. Some resorts pile up snow before the end of the season, keep it insulated through summer, and bring it back out in autumn to open the slopes early. You're not changing the mountain, but you're changing the conditions the next skier encounters. Fine-tuning works the same way: you reshape the weights offline so the model responds differently to the same context next season. I haven't done enough work here to have strong opinions, but if you're curious, check out DeepFabric from Luke Hinds and the team at Always Further AI. Among other cool things, they're building tooling for synthetic dataset generation and focused fine-tuning that looks promising.

The metaphor holds: we're still on the bunny slopes, learning to snowplow. The black diamond runs (truly dynamic, self-improving context systems) are visible from here, but we haven't built the lifts yet.

For now, the best advice I have is the same thing I learned in Cursor two years ago: curate aggressively, start fresh often, and never assume the model remembers what matters to you. Because it doesn't. That's your job.

That's context engineering.


Acknowledgments

Thanks to Mike Thorpe, Reza Ramezanpour, and Elsa Adjei for reviewing drafts of this post and making it sharper!

And to Jasmine, friend and AI-coconspirator, for being a constant inspiration and partner for adventures in the fast-moving world of AI.


If you want to go deeper on agent identity and role reinforcement, check out The Socially Constructed Agent. For more on navigator patterns and how I structure agent workflows, see The Navigator Pattern.


References

[1] Gloaguen et al. (2026) — "Evaluating AGENTS.md: Are Repository-Level Context Files Helpful for Coding Agents?" arxiv.org/abs/2602.11988

[2] Lulla et al. (2026) — "On the Impact of AGENTS.md Files on the Efficiency of AI Coding Agents" arxiv.org/abs/2601.20404
