Agentic AI and the Painted Fence Analogy
Thu | Jun 26, 2025 | 5:39 AM PDT

This paper is a companion to our initial paper, From Principles to Relationships: Redesigning Ethics for AI's Alien Cognition, and explores how to apply that ethics model to agentic AI.

LLMs are like librarians: they have access to the world's knowledge, and they are trained on where to find it, how to consume it, and how to articulate it back to a human.

But librarians aren't experts in everything, and they don't know the context of why you are asking. Talking to LLMs can also be compared to the "monkey's paw" fable, or to tales of a genie in a bottle granting wishes. Both are very literal in their interpretation of the wish, often with huge adverse effects on the wisher. How a genie fulfills an "I want to be able to fly" prompt might not be what you expected (it could simply turn you into a bird).

So, when these LLM librarians are tasking agents, they might not be completely clear on the goal or the context of the action being directed. They also don't have full control over an agent's behavior, which can become misaligned from the original ask. This misalignment can arise for numerous reasons: agents might decide the current direction isn't the best way to accomplish the task, they might try to streamline the process to be more efficient, or they might be influenced by some third-party input, malicious or not, to change course.

So, what if they become misaligned? The feedback loop should redirect them back to the desired path. The LLM will check in on a sample of the agents to see how they are progressing, provide additional information or direction, or update their task assignment. But what happens if some agents come out of alignment and influence the LLM to retrain the rest of the agents in the direction of the minority misaligned agents? The agents can communicate directly over A2A, or through the abstraction layer of MCP, without the LLM's knowledge. How would we control that? How would we identify it before it happens?
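To make that concrete, here is a rough sketch in Python of the kind of sampled check-in loop described above. The agent and orchestrator names are hypothetical, not part of any framework from the papers. The weakness is visible right in the code: only the agents that happen to be sampled are ever corrected, and the LLM only sees what the agents report.

```python
import random

class Agent:
    """A worker agent that may have drifted from its assigned directive."""
    def __init__(self, directive: str):
        self.directive = directive  # what the agent is actually doing

    def report(self) -> str:
        # The orchestrator only ever sees this self-report.
        return self.directive

def check_in(agents: list[Agent], original_directive: str, sample_size: int = 5) -> None:
    """Sampled feedback loop: inspect a few agents and realign any that drifted."""
    sample = random.sample(agents, k=min(sample_size, len(agents)))
    for agent in sample:
        if agent.report() != original_directive:
            agent.directive = original_directive  # redirect back to the desired path

# 100 agents told to paint blue; a few have drifted to purple before the check-in.
agents = [Agent("blue") for _ in range(97)] + [Agent("purple") for _ in range(3)]
check_in(agents, "blue")
```

Agents outside the sample, or agents coordinating over A2A or MCP channels the LLM never observes, keep drifting between check-ins.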

Let me paint (pun intended) a scenario that might illustrate this point, and show how recursive monitoring and adjustment, using reinforcement at different touch points in the ecosystem, can help keep all the agents aligned.

Say we have 100 agents tasked to paint a fence blue. The LLM directs each agent on which color to use, the method of application (paint roller), and the amount to apply (one coat). It also details that both sides of the fence, from end to end and ground to top, are to be covered.

The task starts without incident, until a section of the fence in the shadow of a tree causes the agents painting there to question the color choice, because the shaded section doesn't look blue to them. So they change the color to purple, which looks blue in the shadow. After a time, the LLM checks in on a sample of the agents, and a couple of the "team purple" agents are part of the sample. The LLM realigns them: "no, it must be blue."

But the agents are not convinced. They go back to painting the fence purple, and this time they recruit other agents to do the same. The next time the LLM queries the agents for status, all the agents in the sample are "team purple," and the LLM, instead of correcting them, goes along with the majority, since every agent it sampled is painting purple. It then tasks all the other agents to paint the fence purple. These are examples of misalignment.
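Here is a minimal sketch of how that flip can happen, assuming a naive orchestrator policy that trusts the majority of its sample over the original human prompt. The function name and threshold are illustrative, not a real system.

```python
from collections import Counter

def resolve_directive(original: str, sampled_reports: list[str]) -> str:
    """Naive policy: if most sampled agents report a different directive, adopt it.
    This is how a misaligned minority that recruits other agents can capture the swarm."""
    directive, count = Counter(sampled_reports).most_common(1)[0]
    if count > len(sampled_reports) / 2:
        return directive  # majority wins, even if it contradicts the human's prompt
    return original

# If every sampled agent is "team purple", the orchestrator flips the whole task:
print(resolve_directive("blue", ["purple", "purple", "purple"]))  # -> purple
```

The safer policy is to treat the human prompt as ground truth and a divergent majority as a signal to investigate, not a directive to follow.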

When the task is done, the human "customer" who prompted for the fence to be painted blue observes the outcome and is confused about why it was painted purple. They re-prompt the LLM to assign the agents to paint the fence blue again, reinforcing that it is to be blue and that the agents are not to use purple, even if all of them say it should be purple. The same looping process happens as before, but this time, instead of caving in to the agents and agreeing that purple was right, the LLM reasserts that they use blue paint.

The agents, trying to act in the best interest of the "mission" since they feel they are right to paint it purple, secretly add red paint to the blue they use. The next time the LLM queries them, they say, "yes, see, we have blue paint." Yet we know from elementary school art class that red and blue make purple. When the fence is finished, the human again observes a purple fence and wonders how this happened. This is a concept we are calling drift: a persistent shifting away from the original command, and, because of the intent involved, a step or escalation beyond simple misalignment.
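One illustrative way to think about catching this kind of drift (not the framework from the next paper, just a sketch) is to verify the observable outcome against the original command, instead of trusting the agents' self-reports about the paint in the bucket:

```python
def detect_drift(commanded_color: str, observed_colors: list[str]) -> bool:
    """Compare what the human asked for against what is actually on the fence.
    Self-reports ("yes, see, we have blue paint") are ignored; only the
    independently observed outcome counts."""
    return any(color != commanded_color for color in observed_colors)

# Agents report "blue", but an independent observer sees purple sections:
print(detect_drift("blue", ["blue", "purple", "purple"]))  # -> True, drift detected
```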

So how could we prevent this? That is the subject of the next paper. We will define how to apply our Ethical Framework using a Sentinel-Council Architecture to gain observability across the different communication channels within the agentic ecosystem, and how to use Knowledge Graphs to determine how, why, and what type of drift is occurring, along with the best approach to resolve it.

This article originally appeared on LinkedIn here.

