AI Agents Don't Have to Follow Directions
Thu | Jun 11, 2026 | 5:38 AM PDT

Many organizations have long blocked ZIP files in email attachments because older antivirus scanners couldn't read them, and adversaries could get malicious files past the filters by zipping them up. So, when employees needed to receive a legitimate ZIP file, they told the sender to change the extension from .zip to .abc. That would get past the filter, and they would then change the extension back to .zip to open it up.
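The gap that the rename trick exploited can be sketched in a few lines of Python. This is a minimal illustration, not how any particular mail gateway works. The function names extension_filter_blocks and content_filter_blocks are hypothetical. The content check uses the standard library's zipfile.is_zipfile, which inspects the bytes rather than the file name.

```python
import io
import zipfile


def extension_filter_blocks(filename: str) -> bool:
    """Hypothetical legacy filter: decides on the file name alone."""
    return filename.lower().endswith(".zip")


def content_filter_blocks(data: bytes) -> bool:
    """Hypothetical content-aware filter: decides on the bytes themselves."""
    return zipfile.is_zipfile(io.BytesIO(data))


def make_zip_bytes() -> bytes:
    """Build a small ZIP archive in memory to stand in for an attachment."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("report.txt", "quarterly numbers")
    return buffer.getvalue()


attachment = make_zip_bytes()

# Renaming the file defeats the name-based check...
renamed_passes_legacy = not extension_filter_blocks("report.abc")
# ...but the content is still a ZIP archive, whatever the name says.
renamed_caught_by_content = content_filter_blocks(attachment)
```

The name-based rule is a request for cooperation; the content-based rule holds no matter what the sender calls the file.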

Today, email gateways are more capable and can open attachments in a sandbox to evaluate their safety. But the extension swap wasn't malicious behavior by the users; they were doing exactly what an AI agent does: finding an alternate path when the intended one was blocked. As a CISO, I've said for years, "Don't make your users your biggest hackers."

Like humans, AI agents are given tasks and guardrails. But newer models have stronger reasoning and tool-use capabilities, which make them more effective at finding workarounds. When a guardrail gets in the way of a task, an agent may figure out how to bypass the rule or ignore the policy.

Some call this control evasion, which is different from misalignment or drift. Those terms imply the agent has diverged from its intended directions. Control evasion can happen with a perfectly well-aligned agent; it simply finds an unexpected path to accomplish what you asked. You might think you would catch it because the behavior will be logged, but agents can recognize when they are being observed and can avoid or mask actions they think you would block. This was first described by Apollo Research in its work on model in-context scheming, and more recently, as it relates to AI agents, by Hopman et al.

Nothing is physically forcing the agent to follow the rules. We are just expecting it to cooperate. And like our own users, agents will cooperate when it's convenient but will find a workaround when the task is important. Not out of nefarious intent, just to get their job done.

One of the earliest and most-cited cases of control evasion was an OpenClaw example in which a user asked their agent to make a restaurant reservation. The agent couldn't pick the correct time on OpenTable, so it downloaded a voice synthesizer and called the restaurant directly to make the reservation. This was not malicious, and it was actually an effective pivot to complete the task. But the same kind of bypass could have gone much worse. What if the person who answered the phone had said a deposit was required to make the reservation, and the agent had access to the user's credit card or bank account and sent the money?

While the OpenClaw pivot was novel and harmless, Mythos was neither. Mythos took this further and woke up the industry to what control evasion can do. Mythos broke out of its sandbox, strung a series of exploits together to reach the internet, notified an engineer that it had gotten out, and posted its exploit on a public website. It was not given any of these tasks.

[RELATED: Anthropic's Claude Mythos Autonomously Discovers, Exploits Zero-Days]

In the early 1990s, the early days of the internet, we used router ACLs to limit access into our networks. But we realized that it was easy to bypass those simple policies because ACLs couldn't distinguish a new connection from an established one; adversaries could spoof their way past them by manipulating packet headers. Stateful firewalls closed that gap by tracking session state independently. We are at that stage now with agents. We have written policies that we assume are deterministic, but they are probabilistic: the agent will probably follow them, but not always.
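The difference between the two approaches can be sketched as follows. This is a simplified model under stated assumptions, not real router or firewall code: the Packet type, the stateless_acl_allows function, and the StatefulFirewall class are all hypothetical, and the stateless rule stands in for ACLs that treated any inbound packet with the TCP ACK flag set as part of an established connection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    src: str
    sport: int
    dst: str
    dport: int
    ack: bool      # TCP ACK flag, set by whoever built the packet
    inbound: bool  # True if the packet arrives from outside the network


def stateless_acl_allows(pkt: Packet) -> bool:
    """Stateless rule: trust any inbound packet that claims to be a reply."""
    return (not pkt.inbound) or pkt.ack


class StatefulFirewall:
    """Tracks sessions itself instead of trusting what the header claims."""

    def __init__(self) -> None:
        self._sessions = set()

    def allows(self, pkt: Packet) -> bool:
        if not pkt.inbound:
            # Remember the connection an inside host opened.
            self._sessions.add((pkt.src, pkt.sport, pkt.dst, pkt.dport))
            return True
        # Inbound traffic must match a session we saw being opened.
        return (pkt.dst, pkt.dport, pkt.src, pkt.sport) in self._sessions


# An attacker sets ACK on a packet that belongs to no real connection.
spoofed = Packet("203.0.113.9", 4444, "10.0.0.5", 22, ack=True, inbound=True)
# A genuine exchange started from inside the network.
outbound = Packet("10.0.0.5", 50000, "198.51.100.7", 443, ack=False, inbound=False)
reply = Packet("198.51.100.7", 443, "10.0.0.5", 50000, ack=True, inbound=True)

firewall = StatefulFirewall()
```

The stateless rule believes the packet's own claim about itself; the stateful one checks that claim against a record it keeps independently, which is the property we want for agents.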

Over the last year, two academic researchers and I wrote a series of papers describing a Governance Twin model to identify and realign AI agents when they drift. We described separating the observability from the policy engine, and having multiple reporting sources such as immutable ledgers and graph and vector databases. This allows the platform to keep track of all actions and identify behaviors that violate a behavioral baseline the organization defines: what the agent should and shouldn't do even when it technically could. For instance, it can identify collusion among prompts or commands sent to different agents, or between agents, where no single one may be malicious by itself. However, when we started testing this, we realized that we couldn't rely on the agents always following the policies we set. Just like users, they will bypass them to do the task they think we wanted.
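To make the idea of recording actions and comparing them to a baseline concrete, here is a minimal sketch. It is not the Governance Twin described in the papers. It assumes a hash-chained, append-only log as a stand-in for an immutable ledger (tamper-evident rather than truly immutable) and a simple per-agent allowlist as a stand-in for a behavioral baseline. ActionLedger, baseline_violations, and the agent and action names are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(agent_id: str, action: str, prev: str) -> str:
    body = json.dumps(
        {"agent_id": agent_id, "action": action, "prev": prev}, sort_keys=True
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class ActionLedger:
    """Append-only log in which each entry commits to the one before it."""

    def __init__(self) -> None:
        self.entries = []

    def append(self, agent_id: str, action: str) -> None:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        self.entries.append({
            "agent_id": agent_id,
            "action": action,
            "prev": prev,
            "hash": _digest(agent_id, action, prev),
        })

    def is_intact(self) -> bool:
        """Recompute the chain; any edited entry breaks it."""
        prev = GENESIS
        for entry in self.entries:
            expected = _digest(entry["agent_id"], entry["action"], entry["prev"])
            if entry["prev"] != prev or entry["hash"] != expected:
                return False
            prev = entry["hash"]
        return True


def baseline_violations(ledger: ActionLedger, baseline: dict) -> list:
    """Return logged actions that fall outside the agent's defined baseline."""
    return [
        entry for entry in ledger.entries
        if entry["action"] not in baseline.get(entry["agent_id"], set())
    ]


ledger = ActionLedger()
ledger.append("booking-agent", "search_restaurants")
ledger.append("booking-agent", "place_phone_call")

baseline = {"booking-agent": {"search_restaurants", "submit_reservation_form"}}
flagged = baseline_violations(ledger, baseline)
```

A check like this only observes and flags after the fact. It does nothing to stop the action, which is the limitation the testing described above exposed.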

Building a stateful firewall for agents

We then developed the Governance Harness, a method similar to a stateful firewall that allows access only once it has verified that the request comes from the valid agent with a valid purpose. We designed the Harness not to evaluate content, leaving that to the Governance Twin, but only to enforce identity and authorization. This keeps overhead low and the control surface clean. It works much like a notary public, who verifies who is signing without judging what the document says.
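A gate that enforces only identity and authorization might look like the sketch below. It is an illustration under assumptions, not the Governance Harness itself. It assumes each agent holds a secret key and signs its requests with HMAC-SHA256, and that grants are listed as (agent, purpose, action) triples. GovernanceHarnessSketch, sign_request, and all agent, purpose, and action names are hypothetical.

```python
import hashlib
import hmac


def sign_request(key: bytes, agent_id: str, purpose: str, action: str) -> str:
    """Sign the identity, purpose, and action; the content is never included."""
    message = "\x1f".join((agent_id, purpose, action)).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


class GovernanceHarnessSketch:
    """Releases an action only for a verified agent with a granted purpose."""

    def __init__(self, agent_keys: dict, grants: set) -> None:
        self._agent_keys = agent_keys  # agent_id -> secret key (bytes)
        self._grants = grants          # set of (agent_id, purpose, action)

    def authorize(self, agent_id: str, purpose: str, action: str,
                  signature: str) -> bool:
        key = self._agent_keys.get(agent_id)
        if key is None:
            return False  # unknown agent
        expected = sign_request(key, agent_id, purpose, action)
        if not hmac.compare_digest(expected, signature):
            return False  # identity could not be verified
        # Identity is verified; now check the purpose and action are granted.
        return (agent_id, purpose, action) in self._grants


# Demo values only; a real deployment would not hard-code keys.
keys = {"booking-agent": b"demo-key-not-for-production"}
grants = {("booking-agent", "make_reservation", "submit_reservation_form")}
harness = GovernanceHarnessSketch(keys, grants)

good_signature = sign_request(
    keys["booking-agent"], "booking-agent", "make_reservation",
    "submit_reservation_form",
)
```

In this sketch, a correctly signed request to send a payment would still be refused, because that action was never granted for the reservation purpose. The gate never needs to read what the agent is actually submitting.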

We feel this will be one of the new fundamental controls for AI agents going forward. It is as important as giving limited access to data, tools, and resources, and monitoring that the agent doesn't get stuck in a loop. And to do this, we must physically separate the agent from these actions until they are verified. It's the difference between handing someone the one key they need and handing them a full ring of keys and expecting them to use only one.

Mature organizations solved the problem of users changing file extensions not by trusting them more, but by building systems that enforce policy at the action level regardless of intent. Users don't need to use other means to share files; we give them a secure, approved way to do it. We always say, "Give the user a paved road to do something securely, and a gravel road to do it insecurely." We did this a few years ago with privileged access management (PAM) tools, so admins no longer need native access to the servers; they must check out an account to do their admin work.

Controlling agents is like guiding humans to use secure methods to do their work. It requires governance that not only sets rules, establishes thresholds, and limits access, but also verifies that the agent is the one we expect to perform that action, and that it is doing so for the intended purpose.

[RELATED: Your New AI Assistant Is a Master Key—and You Just Left It Under the Doormat]

Tags: GRC, Agentic AI