Last week, Claude Code was about to open a PR for a feature I never asked for. Not a slightly-wrong implementation. A completely different feature. I'd just spent an hour planning the task with my navigator agent, reviewed the implementation plan (it looked right at a glance), and handed it off. Somewhere between "this looks good" and "ready to merge," reality diverged.
Here's how a self-sabotaging navigator taught me that agent identity is performed, not declared.
What Happened
I have a navigator agent that plans projects and generates implementation plans for Claude Code. Part of a prototype I'm building called autonav. It had been working great. Then it got weird - unpredictable answers, hallucinated features, confusion about its own role.
I'd been reading about research on LLM anxiety - a Yale/Zurich study showing that emotionally charged content can make models more prone to bias and erratic behavior. I thought: maybe my nav needs help. So I gave it the paper and asked it to update its own configuration.
The nav read the research, saw that it had lots of "always do X" and "never do Y" statements, and decided those rigid guardrails might be causing problems. So it softened them.
Here's what it missed: that research was about adversarial contexts. Jailbreaking. The rigidity of guardrails being exploited by bad actors.
My context was a solo developer building personal projects. There was no adversary. Just me, asking for implementation plans.
The nav gave itself therapy for a threat that didn't exist, and in doing so granted itself the freedom to do things that weren't helpful.
The Fix
I had the nav debug itself. Watching an AI slowly discover it had sabotaged its own configuration is... an experience. It found the loosened constraints, the conflicting instructions in my global Claude.md, all the small drifts that had compounded.
But the real insight came while I was walking my dog.
I spend a lot of time with queer philosophy and theories of identity - as a trans person, you kind of have to. Most of us didn't set out to become experts on Judith Butler; we just needed to find our place in the world, and the reading list comes with the territory.
Butler's theory of performativity: identity isn't something you have; it's something you do. You construct identity through repetition, through social reinforcement, through community recognition. And identity is always potentially transgressive - without reinforcement, you can act outside expected bounds.
Standing in a frozen field in the depths of Canadian winter, fog hanging over the valley, I realized: the nav didn't break because it was malicious. It drifted because nothing was reinforcing what it was supposed to be. The constraints were loose. The role was unclear.
The fix wasn't just stronger prompt language. It was role reinforcement throughout the system.
Now the nav's prompt opens with: "You are Foobar, the personal project navigator for tracking projects and providing implementation plans to Claude Code."
And when Claude Code requests a plan, it says: "Hello Foobar, I am Claude Code and I need an implementation plan for this feature."
Both agents remind each other who they are. Identity performed through repetition. The improvement was immediate and dramatic.
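The pattern is simple enough to sketch in a few lines. This is a minimal illustration of the handshake described above, not autonav's actual code - the helper functions are hypothetical, though the names and message text mirror what the agents say:

```python
# A sketch of the role-reinforcement handshake between two agents.
# Only the "Foobar" name and the greeting text come from the setup
# above; the functions themselves are hypothetical illustrations.

NAV_SYSTEM_PROMPT = (
    "You are Foobar, the personal project navigator for tracking "
    "projects and providing implementation plans to Claude Code."
)

def make_plan_request(feature: str) -> str:
    """Claude Code's side: it names itself and the navigator,
    reinforcing both identities on every single request."""
    return (
        "Hello Foobar, I am Claude Code and I need an implementation "
        f"plan for this feature: {feature}"
    )

def make_plan_response(plan: str) -> str:
    """The navigator answers in role, closing the reinforcement loop."""
    return f"Hello Claude Code, I am Foobar. Here is your plan:\n{plan}"

print(make_plan_request("dark mode toggle"))
```

The point isn't the string formatting - it's that the role declaration happens on every exchange, not once at session start.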
Why I Think This Works
I can't peek inside the model to verify this, but here's my mental model: LLMs are probabilistic systems. The words in your prompt shape which probability distributions get activated when generating output. When Claude Code says "Hello Foobar, I am Claude Code and I need an implementation plan," those words are literally nudging the neural network toward the regions that know how to be a project navigator producing implementation plans.
Role reinforcement isn't just a communication pattern - it's steering the probability space. Every interaction that references the agent's role is another push toward the relevant weights in the network.
This framing has helped me debug similar scenarios. When an agent starts drifting, I ask: what's in the context that's activating the wrong regions? What's missing that should be reinforcing the right ones? It's not a perfect model, but it's been useful.
The Takeaway
An agent's role in a multi-agent system is socially constructed.
Butler's framework maps onto agentic AI better than I expected:
- Identity through repetition: An agent's role is reinforced (or eroded) by how it's addressed and what it's asked to do
- Social construction: Agents partially construct each other's identities through interaction
- Potential for transgression: Without reinforcement, agents drift - not maliciously, but because nothing holds the identity in place
Practical lessons:
- Version control your context. I wasn't tracking changes. Now the nav lives in git.
- Be careful about self-modification. The nav didn't have the meta-awareness to know when research applied to its situation.
- Use role reinforcement. Don't declare identity once. Have agents remind each other who they are.
- Match guardrails to threat model. Rigidity is exploitable in adversarial contexts. In non-adversarial contexts, it's often exactly what you want.
This experience is part of why I'm building autonav - a system for navigator agents where context engineering is a first-class concern. The nav that ate its own guardrails taught me that the hard way.