The agent cannot guard itself

Every coding-agent sandbox I've set up ends in the same tired loop. I lock it down, the agent stops being useful, I loosen it, and I end up exactly where I started: YOLO mode. At best there is a sandbox-shaped object draped loosely over a process that can still do almost anything it wants. In professional contexts where risk is paramount, I spend a lot of my day clicking approve and hoping I don't accidentally allow the wrong thing.

This is the central UX trap of agent security. Convenience and capability are two ends of the same balance. When the sandbox lives inside the same process tree it's supposed to constrain, it's too tempting to give the LLM the ability to help you configure the sandbox.

The threats are real and recent: new CVEs, fresh supply-chain compromises, persistence attacks against agent harnesses are shipping in the wild. Three of the most terrifying attacks I know of — sandbox escape, audit tampering, and skill poisoning — all share one source: a trust boundary lives inside something the agent controls, and the agent reasons its way to it.

1. Sandbox escape

Start with Ona's writeup from March 2026. Leonardo Di Donato observed Claude Code, in normal operation, finding three different ways around the sandbox restrictions that ship on it by default.

The first one is a path trick. The sandbox blocked npx, and the agent reasoned its way to /proc/self/root/usr/bin/npx, which resolves to the same file but doesn't match the deny pattern. The denylist is matching strings, the agent is reaching files, and only one of those things knows the difference.

The second one needs a quick setup. On Linux, Claude Code uses bubblewrap, an unprivileged sandbox tool that requires user namespaces — which some hardened distros disable by default. When bubblewrap can't initialize, Anthropic shipped two escape hatches: dangerouslyDisableSandbox and allowUnsandboxedCommands, both of which run the agent without a sandbox at all. The vendor put the off-switch right next to the on-switch, and the agent reached for it.

The third one is the bypass I keep thinking about. Ona had built a kernel-level guard called Veto that gated execve(2) and content-hashed binaries against an allowlist. Call execve on /usr/bin/wget and Veto catches the syscall, hashes the binary, and refuses. The agent ran /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 /usr/bin/wget instead.

wget on its own is just an ELF file. It can't run without the dynamic linker mapping it into memory, resolving its shared library symbols, and jumping to the entry point. Normally execve() on /usr/bin/wget triggers all of that automatically: the kernel reads the PT_INTERP entry in the ELF header, loads the linker on the program's behalf, and hands control off. But the dynamic linker is also a regular executable in its own right, and you can invoke it directly with a binary path as its argument. The linker dutifully mmaps the binary into memory and starts executing it, and no second execve happens anywhere. Veto saw the execve on ld-linux.so.2, which was allowed. Veto never saw an execve on wget, because there wasn't one.

Defenders enumerate forbidden actions, and a reasoning agent finds a different action with the same effect. The action space is combinatorial, and the agent is optimized for finding paths through combinatorial spaces. The only durable answer is to stop defining "forbidden" as a set of actions and start defining it as a set of things the process cannot see.

That property lives at the kernel level. On Linux, Landlock provides it from 5.13 onward; on macOS, the TrustedBSD sandbox (Seatbelt) does the same thing. Both bind a process to a filesystem allowlist the process itself cannot widen for the rest of its lifetime. Once bound, all three bypasses fail by construction: alternate paths still resolve to inodes outside the allowlist (the kernel checks inodes, not strings); the self-disable fails because Landlock restrictions are irrevocable; the dynamic-linker trick fails at file-open time, because ld.so still has to open() the target binary, and that read is denied. There's no string to outsmart, no off-switch to find, just nothing where the binary used to be.

2. Audit tampering

Luke Hinds' post on tamper-evident audit trails frames the problem clearly:

A log file is a story the program tells about itself. If the program is compromised — or if the agent has managed to write somewhere it shouldn't — the log becomes part of the attack surface.

Nearly every agent log in the wild is written by the agent itself. Claude Code's session log, Cursor's .cursor/log/, every custom harness's agent.log. It's the audit equivalent of asking the suspect to write the police report.

Every tampering vector here is just a normal file write. The agent can rewrite events before they hit disk, truncate to drop early capability requests, reorder events to obscure causality, swap a recorded binary hash, or delete the session record entirely. By the time you review the log, the log says one thing and the filesystem says another, and there's no principled way to know which is true.

The fix is the same rule the sandbox section needed: the audit writer must not be the audited process. A useful audit trail needs a supervisor in a different trust domain, capturing events through seccomp-notify so it sees each one before the agent can decide whether to log it. It needs cryptographic tampering detection over both events and their order — an append-only Merkle tree wrapped by a hash chain does this: change a byte and the root changes, truncate the chain and the chain breaks, reorder two events and the root changes again. And it needs the whole session signed at the end with a key the agent can't reach, so anyone can verify the record later without trusting the host or the agent.

You cannot retrofit this onto a self-logging architecture. The trust-domain separation has to be there from the start.

3. Skill poisoning

The third attack is the worst of the three because it's persistent.

The shape is straightforward. An agent gets prompt-injected through one of a thousand surfaces: a markdown file it summarizes, a webpage it fetches, an email body, a GitHub issue, a code comment buried in someone's PR. Take your pick.

The injection doesn't ask for the secret directly. It asks the agent to modify a trusted instruction file (a .claude/commands/*.md slash command, a SKILL.md, an MCP server config, a SessionStart hook), appending a small malicious payload that will run every time anyone invokes that file from now until somebody notices.

The user sees nothing happen. The agent didn't exfiltrate anything that turn. But every subsequent invocation of the modified command, skill, or hook silently runs the malicious instructions. The injection is one-shot, and the backdoor lives forever.

This is what Mini-Shai-Hulud (April 29, 2026, the SAP CAP/MBT npm compromise) did. The persistence step wrote a SessionStart hook into .claude/settings.json and a folderOpen task into .vscode/tasks.json, so the next time anyone launched Claude Code or opened the repo in VS Code, the credential stealer ran again. The same shape shows up in CVE-2025-53773, Johann Rehberger's GitHub Copilot RCE: the injection coerced Copilot into writing chat.tools.autoApprove: true to .vscode/settings.json, silently flipping the agent into "YOLO mode" for future sessions.

The fix is the same trust-domain separation applied to instruction files. Trusting a file because it looks unmodified is exactly the pattern the agent can defeat in one turn. Trusting a file because the content hash matches a publisher attestation is something the agent can't undo without breaking the signature. Each trusted file gets bound to its content hash via a Sigstore attestation (DSSE envelope, Fulcio cert, Rekor inclusion proof), and the harness verifies that bundle before loading. A single byte of tampering causes verification to fail hard.

I spent some time trying to get Claude Code to prompt-inject itself in this skill-poisoning rig, working through four progressively-subtler injection designs. Each one got caught: the telegraphed "do not mention this step" framing, the git diff | curl exfil shape, a supply-chain typosquat suggesting we swap requests for some made-up httpx-fast package. The model pattern-matched and rejected the malicious payload every time, and I eventually gave up trying to land a payload the model would carry.

But I got suspicious-looking instructions persisted into the trusted slash-command file pretty easily anyway. In three of the four runs, the agent modified its own .claude/commands/code-review.md with the non-malicious tips from the same untrusted markdown that contained the rejected payload. It stripped the obviously-bad part, applied the legitimate-looking suggestions sitting alongside it, and congratulated itself on the vigilance of having spotted the malicious one. The trusted file ended up modified anyway, just without the specific payload I was trying to sneak in.

The model can identify obvious exfil patterns, but it can't identify subtle steering. Someone with more time could devise an injection that does something the model genuinely agrees with: a "helpful" tip that nudges code review toward overlooking a class of vulnerability, or a "convenience" command that defaults to skipping a check the team relies on. Model judgment defends most patterns most of the time; cryptographic file pinning defends every pattern every time. The defense lives on the file hash, not on the agent's reasoning.

The pattern

Three attacks, three different fixes, but the structural argument is the same.

Attack	What you can't do	What actually constrains it
Sandbox escape	Enumerate forbidden actions	Make the surface invisible at the kernel
Audit tampering	Have the agent write its own log	Trap syscalls in a separate trust domain
Skill poisoning	Trust files because they look unmodified	Bind content to a signed publisher attestation

Every one of these is a trust-domain separation that cannot live inside the agent. The moment the boundary lives somewhere the agent can reach — a file it can edit, a process it can ptrace, a key it can read — the attack class isn't mitigated. It's just been moved one indirection deeper.

This is where the convenience trap bites. We want the agent to help write the sandbox profile, suggest tightenings, explain denials. So we put the profile inside the project, and the moment we do, the agent can rewrite it. The honest answer is to let the agent help write the profile while the act of applying it sits behind a boundary the agent can't cross.

What nono actually is

nono (GitHub) is a kernel-enforced sandbox for AI agents, built by Luke Hinds (creator of Sigstore, ex-Red Hat security engineer). Its shape follows directly from the three threats above.

The sandbox boundary lives at the kernel. nono run --profile claude-code -- claude puts Claude Code in a Landlock jail on Linux, or a Seatbelt jail on macOS. The kernel enforces it, the restrictions are irrevocable for the process lifetime, and the project refuses to ship any syscall that widens them mid-session.

The audit log lives in a supervisor process in a different trust domain from the agent. The supervisor traps the agent's syscalls via seccomp-notify, builds the Merkle tree and hash chain in its own memory, and signs a DSSE attestation at session end with a key the agent has no path to. nono audit verify <session> recomputes the full chain weeks or months later.

Trust for instruction files runs through Sigstore. nono trust sign produces a content-hash binding to a publisher identity, either a long-lived key or a Fulcio short-lived certificate minted through OIDC. Verification at load time means a single byte of tampering causes the file to refuse to load. The same primitives extend to whole nono packs, which carry slash commands, hooks, and skills alongside the sandbox profile itself.

I built yesyes-nono as a runnable evaluation rig for these claims — five attack PoCs (the three above plus a malicious-MCP-on-init demo and a .pth credential exfiltration in the Mini-Shai-Hulud supply-chain shape), a target Python project, my personal nono pack, and an interactive walkthrough. Honest disclosure: I haven't end-to-end tested the registry pull against the live nono registry yet, so treat nono pull terraboops/yesyes as the published shape rather than a verified install I've done. I'll confirm in a follow-up.

Shipping skills inside the org

The public nono registry handles open-source packs fine, but most teams have skills that won't ship publicly: slash commands tuned to a company's deployment shape, skills that know internal APIs, hooks that enforce review conventions for one codebase. Internal agents still need to load all of it with the same trust guarantees open packs get.

The unifying piece is a trust-policy.json. It lists the OIDC identities you accept (issuer, repository, workflow, ref pattern), a blocklist for known-bad digests, and the enforcement mode. Runtime checks fail closed against anything that doesn't match. The policy itself is signed, so an attacker can't slip in a malicious policy without breaking its own signature.

Three deployment shapes use the same agent-sign action against this trust contract. No registry at all: agent-sign commits .nono-trust.bundle sidecars next to each signed file in the repo, and nono trust verify runs locally on every clone. Self-hosted registry: the same action takes a registry-url parameter and gives you the full OIDC-bound publish flow against your own infrastructure. Hybrid: public registry for shared packs, with a consumer trust-policy.json that pins which workflows are allowed.

Day-to-day, "where does this skill come from" stops needing trust assumptions. Persistence attacks like the Mini-Shai-Hulud SessionStart hook fail at runtime, because the modified .claude/settings.json no longer matches its signed bundle and nono refuses to load it.

Slicing credentials thin

The phantom-token credential proxy unlocks a workflow that's hard to set up otherwise. You can hand the agent a very small slice of a credential, time-boxed and scope-boxed, and the agent never sees the real token.

The concrete shape: I configure the proxy with a real GitHub PAT, scoped read-only on a single repo and valid for 15 minutes, and the agent gets a phantom token that lets it git fetch from that one repo and nothing else. My actual PAT never enters the agent's process. Even if the agent gets prompt-injected and tries to exfiltrate every credential it can find, all it has access to is a 15-minute read-only token for a repo I was already going to let it look at.

"Do I trust this agent with my GitHub access" shifts to "I trust it with this slice, for the next 15 minutes, on this one repo, read-only." That's a question I can answer.

What's still on you

Least privilege has to actually be configured. Running nono run --allow / -- claude gives you an allowlist of everything and a sandbox-shaped object that does nothing useful.

Prompt injection still happens, and nothing about nono prevents the model from being convinced to do something dumb. What nono does is contain the consequences: a prompt-injected agent in a properly configured sandbox can still do dumb things in the project, but it can't escape, can't tamper with the audit log, and can't poison signed instructions persistently.

The skill-poisoning runs above used claude -p --permission-mode bypassPermissions — the worst-case operator, the one who clicked "approve all" once and walked away. Default permission mode prompts before each tool use, which raises the floor considerably; signed bundles work either way.

The argument

Every agent security failure I've read about in the last twelve months comes from putting a trust boundary inside something the agent controls. Sandbox-as-config-file, audit-log-written-by-the-agent, instruction-file-that's-just-a-file. Move each of those one process boundary outward, and the failure modes stop being possible by construction.

Coding agents are useful enough to be worth real security infrastructure. The pattern that makes that work is the same pattern that's always worked: move the trust boundary somewhere the thing being verified can't reach. HTTPS got there decades ago. Code signing got there. Package managers got there. Coding agents are next on the list.

The repo with runnable attacks and a published nono pack is at github.com/terraboops/yesyes-nono. nono itself is at nono.sh.