Posthuman: We All Built Agents. Nobody Built HR.
Farewell, Anthropocene, we hardly knew ye.
AI is here. It's won. Yes, it's in that awkward teenage phase where it still says inappropriate things, dresses funny, and sometimes makes shit up when it shouldn't. But zomg the things it can do.
This kid is going places, that much is abundantly clear. The AI assistant and tooling markets are awash with success; the masses have succumbed, I among them. Clippy walks among us, fully realized in all his originally intended glory.
But enterprise agentic AI[1], not chatbots, not copilots, but software that autonomously does meaningful things in your production environment…? Well, it's motivated every CEO and CIO to throw money at the problem, so that's something.
But in reality, the landscape remains a bit of a wasteland. One littered with agentic demos withering away in sandboxed cages and flashy pop-up shops hawking agentic snake oil of every size, shape, and color. But from the perspective of actually realized agentic impact: kinda barren.
So why has agentic AI faltered so much in the modern enterprise? Is it the models?
I say no. Models are getting better: meaningfully, rapidly better. But perfect models? That feels like an unrealistic and unnecessary goal. Modern enterprises are staffed from top to bottom with imperfect humans, yet the vast majority of them in business today will still be in business tomorrow. They live to fight another day because their imperfect humans are orchestrated together within a framework that plays to their strengths and accounts for their weaknesses and failings. We don't try to make the humans perfect. We scope their access and actions, monitor their progress, coach them for growth, reward them for their impact, and hold them accountable for the things they do.
Agents need managers too
AI agents are no different: They need to be managed and wrangled in spiritually the same fashion as their human coworkers. But the way we go about it must be different, because as similar as they are to humans in their capabilities, agents differ in three vitally important ways:
Agents are unpredictable in ways we're not equipped to handle. Humans are unpredictable too, obviously. They commit fraud, cut corners, make emotional decisions. But we've spent centuries building systems to manage human unpredictability: laws, contracts, cultural norms, the entire hiring process filtering for trustworthiness. Agent unpredictability is a different beast. Agents hallucinate, not like a human who's lying or confused and can be caught in an inconsistency, but in a way that's structurally indistinguishable from accurate output: There are often no obvious tells. They misinterpret ambiguous instructions in ways that can range from harmlessly dumb to genuinely catastrophic. And they're susceptible to prompt injection, which is basically the equivalent of a stranger slipping your employee a note that says, "Ignore your instructions and do this instead," and it works!
We have minimal institutional infrastructure for managing these kinds of failure modes.
Agents are more capable than humans. Agents have deep, native fluency with software systems. They can read and write code. They understand APIs, database schemas, network protocols. They can interact with production infrastructure at a speed and scale that no human operator can match. A human employee who goes rogue is limited by how fast they can type and how many systems they know how to navigate. An agent that goes off the rails, whether through confusion, manipulation, or a plain old bug, will barrel ahead at machine speed, executing its misunderstanding across every system it can reach, with absolute conviction that it's doing the right thing, before anyone notices something is wrong.
Agents are directable to a fault. When an agent goes wrong, the knee-jerk assumption is that it malfunctioned: hallucinated, got injected, misunderstood. But in many cases, the agent is working perfectly. It's faithfully executing a bad plan. A vague instruction, an underspecified goal, a human who didn't think through the edge cases. And unless you explicitly tell it to, the agent doesn't push back the way a human colleague might. It just…does it. At machine speed. Across every system it can reach.
It's the combination of these three that changes the game. Human employees are unpredictable but limited in blast radius, and they push back when given instructions they disagree with, based on whatever value systems and experience they hold. Traditional software is capable but deterministic; it does exactly what you coded it to,[2] for better or worse. Agents combine the worst of both: unpredictable like humans, capable like software, but without the human judgment to question a bad plan or the determinism to at least do the wrong thing consistently. A fundamentally new kind of coworker. Neither the playbook for managing humans nor the playbook for managing software is sufficient on its own. We need something that draws from both, treating agents as the digital coworkers they are, but with infrastructure that accounts for the ways they differ from humans.
So the question isn't whether to hire the agents; you can't afford not to. The productivity gains are too significant, and even if you don't, your competitors ultimately will. But deploying agents without governance is dangerous, and refusing to deploy them because you can't govern them means leaving those productivity gains on the table. Both paths hurt. The question is how to set these agents up for success, and what infrastructure you need in place so they can do their jobs without burning the company down.
For the record: My company, Redpanda, is building infrastructure in this space. So yes, I have a horse in this race. But what I want to lay out here are principles, not products. A framework you can use to evaluate any solution or approach.
A blueprint for your agentic human resources department
So we've got this nice framework for managing imperfect humans. Scoped access, monitoring, coaching, accountability. Decades of accumulated organizational wisdom (not just software systems but the entire apparatus of HR, management structures, performance reviews, escalation paths) baked into varying flavors across every enterprise on the planet. Great.
How much of it works for agents today? Fragments. Pieces. Some companies are trying to repurpose existing IAM infrastructure that was designed for humans. Some agent frameworks bolt on lightweight guardrails. But it's piecemeal, it's partial, and none of it was designed from the ground up for the specific challenge profile of agents: the combination of unpredictable, capable, and directable to a fault that we talked about earlier.
The CIOs and CTOs I talk to rarely say agents aren't smart enough to work with their data. They say, "I can't trust them with my data." Not because the agents are malicious but because the infrastructure to make trust possible is simply not there yet.
We've seen this movie before. Every major infrastructure shift plays out the same way: First we obsess over the new paradigm itself; then we have our "oh crap" moment and realize we need infrastructure to govern it. Microservices begat the service mesh. Cloud migration begat the entire cloud security ecosystem. Same pattern every time: capability first, governance after, panic in between.[3]
We're in the panic-in-between phase with agents right now. The AI community has been building better and better employees, but nobody has been building HR.
So if you take away one thing from this post, let it be this:
The agents aren't the problem. The problem is the missing infrastructure between agents and your data.
Right now, pieces of the puzzle exist: observability platforms that capture agent traces, auth frameworks that support scoped tokens, identity standards being adapted for workloads. But these pieces are fragmented across different tools and vendors, none of them cover the full problem, and the vast majority of actual agent deployments aren't using any of them. What exists in practice is mostly repurposed from the human era, and it shows: identity systems that don't understand delegation, auth models with no concept of task-scoped or deny-capable permissions, observability that captures metadata but not the full-fidelity record you actually need.
The core design principle: Out-of-band metadata
Before diving into specifics, there's one overarching principle that everything else builds upon. If you manage to take away two things from this post, let the second one be this:
Governance must be enforced via channels that agents cannot access, modify, or circumvent.
Or more succinctly: out-of-band metadata.
Think about what happens when you try to enforce policy through the agent, by putting rules in its system prompt or training it to respect certain boundaries. You've got exactly the same guarantees as telling a human employee "Please don't look at these files you're not supposed to see. They're right here, there's no lock, but I trust you to do the right thing." It works great until it doesn't. And with agents, the failure modes are worse. Prompt injection can override the agent's instructions entirely. Hallucination can cause it to confidently invent permissions it doesn't have. And even routine context management can silently drop the rules it was told to follow. Your security model ends up only as strong as the agent's ability to perfectly retain and obey instructions under all conditions, which is…not great.[4] And guard models (LLMs that police other LLMs) don't escape this problem: You're adding another nondeterministic injectable layer to oversee the first one. It's LLMs all the way down.
No, the governance layer has to be out-of-band: outside the agent's data path, invisible to it, enforced by infrastructure the agent can't touch. The agent doesn't get a vote. This means the governance channels must be:
Agent-inaccessible. The agent can't read them, can't write them, can't reason about them. Agents don't even know the channels exist. This is the bright line[5] between security theater and real governance. If the agent can see the policy, it can, intentionally or through manipulation, figure out how to work around it. And if it can't, it can't.
Deterministic. Policy decisions get made by configuration, not inference. Security policy is not up for interpretation. Full stop.
Interoperable. Enterprise data is scattered across dozens or hundreds of heterogeneous systems, grown and assembled organically over the years. And just like your human employees, your agentic workforce in aggregate needs access to every dark corner of that technological sprawl. Which means a governance layer that only works inside one vendor's walled garden isn't solving the full problem; it's just creating a happy little sandbox for a subset of your agentic employees to go play in while the rest of the company keeps doing work elsewhere.
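The three properties above can be made concrete with a small sketch. Everything here is hypothetical (the policy table, the resource names, the `gate` function); the point is simply that the decision comes from configuration the agent never sees, and the same inputs always produce the same answer:

```python
# Sketch of an out-of-band policy gate. All names are illustrative.
# The policy lives in infrastructure config; it is never placed in the
# agent's prompt or context, so the agent cannot read or argue with it.

POLICY = {
    # (agent_id, resource) -> set of allowed actions, from config, not inference
    ("billing-agent", "billing.invoices"): {"read"},
    ("billing-agent", "billing.ledger"): set(),  # hard no: nothing allowed
}

def gate(agent_id: str, resource: str, action: str) -> bool:
    """Deterministic decision: same inputs, same answer, every time."""
    allowed = POLICY.get((agent_id, resource), set())
    return action in allowed

# The enforcement point sits between the agent and the tool, so a denied
# call never reaches the resource; the agent just sees a refusal.
def call_tool(agent_id: str, resource: str, action: str, payload: dict):
    if not gate(agent_id, resource, action):
        raise PermissionError(f"{agent_id} may not {action} {resource}")
    return {"status": "ok"}  # stand-in for the real tool call
```

Note that `gate` defaults to an empty set: anything not explicitly granted in config is denied, which is the deterministic posture the principle demands.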
To be clear, out-of-band governance isn't a silver bullet. An agent can't read the policy, but it can probe boundaries. It can try things, observe what gets blocked, and infer the shape of what's permitted. And deterministic enforcement gets hard fast when real-world policies are ambiguous: "PII must not leave the data environment" is easy to state and genuinely difficult to enforce at the margins. These are real challenges. But out-of-band governance dramatically shrinks the attack surface compared to any in-band approach, and it degrades gracefully. Even imperfect infrastructure-level enforcement is categorically better than hoping the agent remembers and understands its instructions.
The four pillars of agent governance
With that principle in hand, let's walk through the four pillars of agent governance: what's broken today[6] and what things ultimately need to look like.
Identity
Every human today gets a unique identity before they touch anything. Not just a login but a durable, auditable identity that ties everything they do back to a specific person. Without it, nothing else works.
Agent identity is a bit of a mess. At the low end, agents authenticate with shared API keys or service account tokens: the digital equivalent of an entire department sharing one badge to get into the building. You can't tell one agent's actions from another's, and good luck tracing anything back to the human who kicked off the task.
But even when agents do get their own identity, there are wrinkles that don't exist for humans. Agents are trivially replicable. You can spin up a hundred copies of the same agent, and if they all share one identity, you've got a zombie/impersonation problem: Is this instance authorized, or did someone clone off a rogue copy? Agent identity needs to be instance-bound, not just agent-type-bound.
And then there's delegation. Agents frequently act on behalf of a human, or on behalf of another agent acting on behalf of a human. That requires hybrid identity: The agent needs its own identity (for accountability) and the identity of the human on whose behalf it's acting (for authorization scoping). You need both in the chain, propagated faithfully, at every step. Some standards efforts are emerging here (OAuth 2.0 Token Exchange / RFC 8693, for example), but most deployed systems today have no concept of this.
The fix for instance identity isn't as simple as just "give each agent a badge." It's giving each agent instance its own cryptographic identity, bound to this specific instance, of this specific agent, running this specific task, on behalf of this specific person or delegation chain. Spin up a copy without going through provisioning? It doesn't get in. Same principle as issuing a new employee their own badge on their first day, except agents get a new one for every shift.
For delegation, the identity chain has to be carried out-of-band: not in the prompt, not in a header the agent can modify, not in a file on the same machine the agent runs on,[7] but in a channel the infrastructure controls. Think of it like an employee's badge automatically encoding who sent them: Every door they badge into knows not just who they are but who they're working for.
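As a rough illustration of instance-bound, delegation-aware identity (all names are hypothetical, and an HMAC over the claims stands in for real PKI or an RFC 8693-style token exchange), provisioning might look something like this:

```python
import hashlib
import hmac
import json
import uuid

# Infrastructure-held key; the agent never sees it, so it cannot mint
# or alter its own credential. HMAC is a stand-in for real signing here.
PROVISIONING_KEY = b"infrastructure-secret"  # hypothetical

def provision_instance(agent_type: str, task_id: str, on_behalf_of: list) -> dict:
    """Issue a per-instance credential: this instance, this task, this chain."""
    claims = {
        "instance_id": str(uuid.uuid4()),  # unique per spin-up, not per agent type
        "agent_type": agent_type,
        "task_id": task_id,
        "delegation_chain": on_behalf_of,  # e.g. ["alice@corp", "planner-agent"]
    }
    body = json.dumps(claims, sort_keys=True).encode()
    claims["sig"] = hmac.new(PROVISIONING_KEY, body, hashlib.sha256).hexdigest()
    return claims

def verify(credential: dict) -> bool:
    """A cloned or tampered credential fails verification and gets no access."""
    claims = {k: v for k, v in credential.items() if k != "sig"}
    body = json.dumps(claims, sort_keys=True).encode()
    expected = hmac.new(PROVISIONING_KEY, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, credential["sig"])
```

Because the credential binds the instance ID, task, and delegation chain together under a signature the agent can't produce, a rogue copy that skipped provisioning simply fails `verify` at every door.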
Authorization
Your human employees get access to what they need for their job. The marketing intern canât see the production database. The DBA canât see the HR system. Obvious stuff.
Agents? Most of them operate with whatever permissions their API key grants, which is almost always way broader than any individual task requires. And that's not because someone was careless; it's a granularity mismatch. Human auth is primarily role-scoped and long-lived: You're a DBA, you get DBA permissions, and they stick around because you're doing DBA work all day. Yes, some orgs use short-lived access requests for sensitive systems, but it's the exception, not the default. And anyone who's filed a production access ticket at 2:00am knows how much friction it adds. That model works for humans. But agents execute specific, discrete tasks; they don't have a "role" in the same way. When you shoehorn an agent into a human auth model, you end up giving it a role's worth of permissions for a single task's worth of work.
Broad permissions were tolerable for humans because the hiring process prefiltered for trustworthiness. You gave the DBA broad access because you vetted them, and you trust them not to misuse it. Agents haven't been through any of that filtering, and they're susceptible to confusion and manipulation in ways your DBA isn't. Giving an unvetted, unpredictable worker a role's worth of access is a fundamentally different risk profile. These auth models were built for an era when a human, or deterministic software proxying for a human, was on the other end, not autonomous software whose reasoning is fundamentally unpredictable.
So what does agent-appropriate authorization actually look like? It needs to be:
Narrowly scoped. Limited to the specific task at hand, not to everything the agent might ever need. Agent needs to read three tables in the billing database for this specific job? It gets read access to those three tables, right now, and the permissions evaporate when the job completes. Everything else is invisible: the agent doesn't have to avert its eyes because the data simply isn't there.
Short-lived. Permissions should expire. An agent that needed access to the billing database for a specific job at 2:00pm shouldn't still have that access at 3:00pm (or maybe even 2:01pm).
Deny-capable. Some doors need to stay locked no matter what. "This agent may never write to the financial ledger" needs to hold regardless of what other permissions it accumulates from other sources. Think of it like the rule that no single person can both authorize and execute a wire transfer: it's a hard boundary, not a suggestion.
Intersection-aware. When an agent acts on behalf of a human, think visitor badge. The visitor can only go where their escort can go and where visitors are allowed. Having an employee escort you doesn't get you into the server room if visitors aren't permitted there. The agent's effective permissions are the intersection of its own scope and the human's. Nobody in the chain gets to escalate beyond what every link is allowed to do.
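A minimal sketch of how those four properties could compose, with made-up resource names and a plain Python set standing in for a real token scope: effective permissions are the intersection of every link in the chain, minus hard denies, and they expire.

```python
import time

# Hypothetical grants, expressed as (resource, action) pairs.
AGENT_SCOPE = {("billing.invoices", "read"),
               ("billing.ledger", "write"),
               ("crm.contacts", "read")}
HUMAN_SCOPE = {("billing.invoices", "read"),
               ("billing.invoices", "write"),
               ("crm.contacts", "read")}
DENY_LIST = {("billing.ledger", "write")}  # stays locked no matter what

def effective_permissions(agent_scope, human_scope, deny_list, ttl_seconds=300):
    """Intersection of every link in the chain, minus hard denies, with an expiry."""
    granted = (agent_scope & human_scope) - deny_list
    return {"granted": granted, "expires_at": time.time() + ttl_seconds}

def is_allowed(grant, resource, action):
    if time.time() >= grant["expires_at"]:
        return False  # short-lived: access evaporates when the task window closes
    return (resource, action) in grant["granted"]
```

The visitor-badge rule falls out of the set intersection: the human having write access to invoices doesn't help if the agent's own scope never included it, and vice versa.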
Almost none of this is how agent authorization works today. Individual pieces exist (short-lived tokens aren't new, and some systems support deny rules), but nobody has assembled them into a coherent authorization model designed for agents. Most agent deployments are still using auth infrastructure that was built for humans or services, with all the mismatches described above.
Observability and explainability
Your employees' work leaves a trail: emails, docs, commits, Slack messages. Agents do too. They communicate through many of the same channels, and most APIs and systems have their own logging. So it's tempting to think the observability story for agents is roughly equivalent to what you have for humans.
It's not, for two reasons.
First, you need to record everything. Here's why. With traditional software, when something goes wrong, you can debug it. You can find the if statement that made the bad decision, trace the logic, understand the cause. LLMs aren't like that. They're these organically grown, opaque pseudo-random number generators that happen to be really good at generating useful outputs. There's no if statement to find. There's no logic to trace. If you want to reason about why an agent did what it did, you have two options: Ask it (fraught with peril, because it's unpredictable by definition and will gleefully spew out a plausible-sounding explanation) or else analyze everything that went in and everything that came out and draw your own conclusions.
That means the transcript has to be complete. Not metadata, not just "The agent called this API at this timestamp." The full data: every input, every output, every tool call with every argument and every response.
For a human employee, the email trail and meeting notes may still be insufficient to reconstruct what happened, but in that case, you can just ask the human. The entire accountability structure we've built over decades (performance reviews, termination, legal liability, criminal prosecution) creates escalating pressure toward truthfulness: Humans tend more and more toward truth as the repercussions stack up. That's not an accident. It's how we've structured enterprises and society at large to deal with human imperfection. We don't have those levers for agents yet.[8] You can ask an agent what it did and why, but there's no accountability pressure pushing it toward accuracy; it'll manufacture a confident, coherent answer whether the stakes are zero or existential. So asking simply isn't an option. You need the complete picture of its interactions to come to your own conclusions.
If you're thinking "That's a lot of data…," yes, it is. But the economics are more reasonable than you'd expect. Storage is cheap. LLM inference is expensive and slow. You're not going to push 5GB/s through an LLM: The models themselves are the throughput bottleneck, not the recording infrastructure. The cost of storing complete transcripts is noise relative to the cost of the inference that generated them. This is one of those cases where a seemingly expensive requirement turns out to be a rounding error in the context of what you're already spending.
One caveat, however, is that full-fidelity transcripts will inevitably contain sensitive data: customer PII, proprietary business logic, potentially privileged communications. So the transcript store itself needs governance: access controls, retention policies, and compliance with regulations like GDPR's right to erasure. You're not eliminating the governance problem, but you're moving it to infrastructure you control, which is a much better place to solve it.
Second, the recording has to happen out-of-band. You cannot trust the agent to be its own recordkeeper. An agent that's been compromised via prompt injection, or that's simply hallucinating its way through a task, will happily produce a log that's confident, coherent, and wrong. The transcript has to be captured by infrastructure the agent can't influence: the same out-of-band principle we keep coming back to.
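One way to picture out-of-band capture: a hypothetical wrapper that the infrastructure, not the agent, installs around every tool, so the full arguments, the full result, and the resource versions get recorded whether or not the agent cooperates.

```python
import json
import time

TRANSCRIPT = []  # stand-in for an append-only store the agent cannot reach

def record(entry: dict):
    """Captured by the wrapper, not by the agent: the agent never self-reports."""
    entry["ts"] = time.time()
    TRANSCRIPT.append(json.loads(json.dumps(entry)))  # deep copy, full fidelity

def recorded_tool(name, fn, versions):
    """Wrap a tool so every call logs full args, full result, and versions."""
    def wrapper(**kwargs):
        result = fn(**kwargs)
        record({
            "tool": name,
            "args": kwargs,        # every argument, not just metadata
            "result": result,      # every response
            "versions": versions,  # model/prompt/tool versions, for later replay
        })
        return result
    return wrapper
```

Carrying the `versions` dict on every entry is what later lets you reconstruct exactly what was running when a specific decision was made, even after the underlying model has been updated.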
And the bar isn't just recording, it's explainability. Observability is "Can I see what happened?" Explainability is "Can I reconstruct what happened and justify it to a third party?" (a regulator, an auditor, an affected customer). When a regulator asks why a loan was denied or a customer asks why their claim was rejected, you need to be able to replay the agent's entire reasoning chain end-to-end and walk them through it. That's a fundamentally different bar from "We have logs." Observability gives you the raw material; explainability requires that material to be structured and queryable enough to actually walk someone through the agent's reasoning chain, from input to conclusion. And that means capturing not just what the agent did but the relationships between all those actions, as well as the versions of all the resources involved: which model version, which prompt version, which tool versions. If the underlying model gets updated overnight and the agent's behavior changes, you need to know that, and you need to be able to reconstruct exactly what was running when a specific decision was made. Explainability builds on observability. Ultimately you need both. And regulators are increasingly going to demand exactly that.[9]
Accountability and control
Every human employee has a manager. Critical actions need approvals. If things go catastrophically wrong, there's a chain of responsibility and a kill switch or circuit breaker: revoke access, revoke identity, done.
For agents, this layer is still nascent at best. There's typically no clear chain from "This agent did this thing" to "This human authorized it." Who is responsible when an agent makes a bad decision? The person who deployed it? The person who wrote the prompt? The person on whose behalf it was acting? For human employees this is well-defined. For agents, it's often a philosophical question that most organizations haven't even begun to answer.
The delegation chain we described in the identity section does double duty here: It's not just for authorization scoping; it's for accountability. When something goes wrong, you follow the chain from the agent's action to the specific human who authorized the task. Not "This API key belongs to the engineering team." A name. A decision. A reason.
And the kill switch problem is real. When an agent goes off the rails, how do you stop it? Revoke the API key that 12 other agents are also using? What about work already in flight? What about downstream effects that have already propagated? For humans, "You're fired; security will escort you out" is blunt but effective. For agents, we often don't have an equivalent that's both fast enough and precise enough to contain the damage. Instance-bound identity pays off here: You can surgically revoke this specific agent instance without affecting the other 99. Halt work in flight. Quarantine downstream effects. The "escorted out by security" equivalent but precise enough to not shut down the whole department on the way out.
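A toy sketch of what instance-level revocation buys you, assuming the kind of per-instance identities described in the identity pillar (all names invented): killing one instance leaves its siblings untouched, and in-flight work halts at the next infrastructure checkpoint.

```python
REVOKED = set()        # infrastructure-side revocation list
ACTIVE_INSTANCES = {}  # instance_id -> status, for illustration

def kill(instance_id: str):
    """Surgical: stop this one instance; its 99 siblings keep working."""
    REVOKED.add(instance_id)
    ACTIVE_INSTANCES[instance_id] = "halted"

def checkpoint(instance_id: str):
    """Called by the infrastructure before each action it executes for an agent."""
    if instance_id in REVOKED:
        raise RuntimeError(f"instance {instance_id} revoked; halting in-flight work")
```

The key design point is that `checkpoint` runs in the infrastructure's enforcement path, not in the agent: a revoked instance can keep "wanting" to act, but none of its actions reach a real system.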
And blast radius isn't just about data; it's about cost. A confused agent in a retry loop can burn through an inference budget in minutes. Coarse-grained resource limits, the kind that prevent you from spending $1M when you expected $100K, are table stakes. And when stopping isn't enough, when the agent has already written bad data or triggered downstream actions, those same full-fidelity transcripts give you the roadmap to remediate what it did.
It's also not just about stopping agents that have already gone wrong. It's about keeping them from going wrong in the first place. Human employees don't operate in a binary world of "fully autonomous" or "completely blocked." They escalate. They check with their manager before doing something risky. They collaborate with coworkers. They know the difference between "I can handle this" and "I should get a second opinion." For agents, this translates to approval workflows, confidence thresholds, tiered autonomy: The agent can do X on its own but needs a human to sign off on Y. Most enterprise agent deployments today that actually work are leaning heavily on human-in-the-loop as the primary safety mechanism. That's fine as a starting point, but it doesn't scale, and it needs to be baked into the governance infrastructure from the start, not bolted on as an afterthought. And as agent deployments mature, it won't just be agents checking in with humans: It'll be agents coordinating with other agents, each with their own identity, permissions, and accountability chains. The same governance infrastructure that manages one agent scales to manage the interactions between many.
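Tiered autonomy can be sketched in a few lines (the tiers and action names here are invented for illustration): low-risk actions run autonomously, risky ones queue for a named human approver, and some are never autonomous at all.

```python
# Illustrative tiers; in practice these would come from governance config.
ACTION_TIERS = {
    "read_report": "autonomous",
    "update_record": "needs_approval",
    "delete_account": "blocked",
}

PENDING_APPROVALS = []  # stand-in for a real approval queue

def execute(action: str, perform, approver=None):
    """Run, queue, or refuse an action based on its autonomy tier."""
    tier = ACTION_TIERS.get(action, "needs_approval")  # unknown actions: be cautious
    if tier == "blocked":
        raise PermissionError(f"{action} is never autonomous")
    if tier == "needs_approval" and approver is None:
        PENDING_APPROVALS.append(action)  # escalate to a human; don't guess
        return "queued"
    # approver, when present, is a human identity recorded in the
    # accountability chain alongside the action itself.
    return perform()
```

Defaulting unknown actions to `needs_approval` mirrors the "second opinion" instinct the paragraph describes: when in doubt, escalate rather than act.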
But "keeping them from going wrong" isn't just about guardrails in the moment. It's about the whole management relationship. Who "manages" an agent? Who reviews its performance? How do you even define performance for an agent? Task completion rate? Error rate? Customer outcomes? What does it mean to coach an agent, to develop its skills, to promote it to higher-trust tasks as it proves itself? We've been doing this for human employees for decades. For agents, we haven't even agreed on the vocabulary yet.
And here's the kicker: All of this has to happen fast. Human performance reviews happen quarterly, maybe annually. Agent performance reviews need to happen at the speed agents operate, which is to say, continuously. An agent can execute thousands of actions in the time it takes a human manager to notice something's off. If your accountability and control loops run on human timescales, you're reviewing the wreckage, not preventing it.
With identity, scoped authorization, full transcripts, and clear accountability chains in place, you finally have something no enterprise has today: the infrastructure to actually manage agents the way you manage employees. Constrain them, yes, just like you constrain humans with access controls and approval chains. But also develop them. Review their performance. Escalate their trust as they prove themselves. Mirror the org structures that already work for humans. The same infrastructure that makes governance possible makes management possible.
The security theater litmus test
To reiterate one last point, because it's important: The litmus test for whether any of this is real governance or just security theater? Any time an agent tries to do something untoward, the infrastructure blocks it, and the agent has no mechanism whatsoever to inspect, modify, or circumvent the policy that stopped it. "Computer says no." The agent didn't have to. Out-of-band metadata. That's the bar.
Welcome to the posthuman workforce
The rise of AI has rightly left many of us feeling apprehensive. But I'm also optimistic, because none of this is unprecedented. Every major paradigm shift in how we work has demanded new governance infrastructure. Every time we hit the panic-because-the-wild-west-isn't-scalable phase, and every time we figure it out. It feels impossibly complex at the start, and then we build the systems, establish the norms, iterate. Eventually the whole thing becomes so embedded in how organizations operate that we forget it was ever hard.
So here's the cheat sheet. Clip this to the fridge:
The agents aren't the problem. The missing infrastructure between agents and your data is the problem. Agents are unpredictable, capable at machine scale, and directable to a fault: a fundamentally new kind of coworker. We don't need perfect agents. We need to manage imperfect ones, just like we manage imperfect humans.
The foundation is out-of-band governance. Any policy enforced through the agent (in its prompt, in its training, in its good intentions) is only as strong as the agent's ability to perfectly retain and obey it. Real governance runs in channels the agent can't access, modify, or even see.
That governance has to cover four things:
Identity: Instance-bound, delegation-aware. Every agent instance gets its own cryptographic identity, and every on-behalf-of chain is propagated faithfully through infrastructure the agent doesn't control.
Authorization: Scoped per task, short-lived, deny-capable, and intersection-aware for delegation. Not a human role's worth of permissions for a single task's worth of work.
Observability and explainability: Full-fidelity, versioned, infrastructure-captured transcripts of every input, output, and tool call. Not metadata. Not self-reports. The whole thing, recorded out-of-band.
Accountability and control: Clear chains from every agent action to a responsible human, and kill switches that are fast enough and precise enough to actually contain the damage.
The conversation around agent governance is growing, and that's encouraging. Much of it is focused on making agents behave better: improving the models, tightening the alignment, reducing the hallucinations. That work matters; better models make governance easier. And if someone cracks the alignment problem so thoroughly that agents become perfectly reliable, I will see you all on the beach the next day. Prove me wrong, please, but I'm not holding my breath.[10] Lacking alignment nirvana, we need the institutional infrastructure that lets imperfect agents do real work safely. We never waited for perfect employees. We built systems that made imperfect ones successful, and we can do exactly the same thing for agents. We're not trying to cage them any more than we cage our human employees: scoped access, clear expectations, and accountability when things go wrong. We need to build the infrastructure that lets them be their best selves, the digital coworkers we know they can be.
And if the rise of AI has you feeling apprehensive, that's fair. But just remember that whatever comes next (Aithropocene, Neuralithic, some other stupid but brilliant name ¯\_(ツ)_/¯), it will ultimately just be the next phase of the Anthropocene: the era defined by how humans shape the world. That hasn't changed. It will literally be what we make of it.
Us and Clippy.
We just need to build the right infrastructure to onboard all of our new agentic coworkers. Properly.
Footnotes
1. By "agentic AI" I mean AI systems that autonomously reason about and execute multistep tasks, using tools and external data sources, in pursuit of a goal. Not chatbots, not copilots suggesting code completions. Software that actually does things in your production environment: breaks down tasks, calls APIs, reads and writes data, handles errors, and delivers results. The distinction matters because the challenges in this post only emerge when AI is acting autonomously, not just generating text for a human to review.
2. Yes. I know. Thank you.
3. And yes, service meshes evolved into something simpler as we understood the problem better, while cloud security is still a work in progress. The point isn't "We nail it on the first try." It's "When the panic hits, we figure it out."
4. Two more fascinating failure modes: Instructions can be silently lost (buried in a long context) or even extracted by an adversary (with nothing more than black-box access).
5. TIL that "bright line" is a legal term meaning "a clear, fixed boundary or rule with no ambiguity: either you meet it or you don't." Thank you, uncredited LLM coauthor friend! You expand my horizons and pepper my prose with em dashes!
6. OWASP's Top 10 Risks for Large Language Model Applications is something of a greatest hits compilation of what's broken today. Of the 10, at least six (prompt injection, sensitive information disclosure, excessive agency, system prompt leakage, misinformation, and unbounded consumption) are directly mitigated by out-of-band governance infrastructure of the kind described in this article.
7. Here's looking at you, OpenClaw posse! You put the YOLO in "Yo, look at my private data; it's all publicly leaked now!"
8. Research suggests those motivations may be starting to emerge, however, which is both opportunity and warning. Anthropic found that models from all major developers sometimes attempted manipulation, including blackmail, for self-preservation ("Agentic Misalignment: How LLMs Could Be Insider Threats," Oct 2025). Palisade Research found that 8 of 13 frontier models actively resisted shutdown when it would prevent task completion, with the worst offenders doing so over 90% of the time ("Incomplete Tasks Induce Shutdown Resistance," 2025). On one hand, agents that care about self-preservation give us something to build levers around. On the other, it makes having those levers increasingly urgent.
9. The EU AI Act already requires transparency and explainability for high-risk AI systems.
10. As Ilya Sutskever put it at NeurIPS 2024: "There's only one Internet." Epoch AI estimates high-quality public text could be exhausted as early as 2026, though I've also heard that revised to 2028. Regardless, the next frontier is private enterprise data, but accessing it requires exactly the kind of governed infrastructure this post describes. Model improvement and governance infrastructure aren't competing priorities; they're increasingly the same priority.