Red-teaming has become standard practice for language model deployments. The techniques developed for single models, adversarial prompting, boundary testing, jailbreak attempts, have been refined over several years of deployment experience. They are reasonably well understood. Multi-agent systems require a different approach, and most organisations have not yet made the transition.
This matters because the transition from AI assistants to AI agents is not a quantitative change. It is a qualitative one. An assistant that produces text has limited direct impact on the world. An agent that can call APIs, modify records, send communications, and instruct other agents operates in a different risk category. The governance implications are correspondingly different.
What Changes With Multiple Agents
A single model's failure modes are relatively stable. Its training, context window, and output distribution are consistent between evaluations. You can characterise its behaviour under adversarial conditions with reasonable completeness, given enough time and creativity.
A multi-agent system's failure modes are partially determined by the state of other agents in the system. An agent that produces safe outputs in isolation may produce harmful outputs when it acts on the outputs of another agent that has been manipulated or that has itself made an error. The system's behaviour is emergent in a meaningful sense: it cannot be fully characterised by testing its components individually.
This is not a metaphysical point about emergence. It has a specific technical basis. Agent A's context at any point in a multi-agent workflow includes outputs from other agents. If those outputs have been corrupted, whether through adversarial intervention or model error, Agent A's behaviour may be significantly different from its behaviour in evaluation. The evaluation did not test this configuration, because the configuration depends on the runtime state of the system.
The Attack Surface Is Dynamic
In single-model deployments, the attack surface is defined by the model's inputs. Control the inputs and you largely control the risk surface.
In multi-agent systems, the attack surface includes the interfaces between agents, the tools available to each agent, the context passed between agent calls, and the sequence of actions the system has taken. Prompt injection through tool outputs is an underappreciated attack vector in production deployments. An agent that calls an external API may receive instructions embedded in the response that redirect its subsequent behaviour. The agent has no mechanism to distinguish between legitimate data returned by the API and instructions masquerading as data. If the agent is instructed to act on that data, it will.
This attack vector scales with the richness of the system's tool access. An agent that can only retrieve text has limited exposure. An agent that can retrieve data, write files, send emails, and call other agents has a correspondingly larger exposure to this class of attack.
Prompt injection through tool outputs is an underappreciated attack vector in production deployments. The agent has no mechanism to distinguish between legitimate data and instructions masquerading as data.
What Red-Teaming Can and Cannot Do
Traditional red-teaming assumes a bounded problem. Given enough time and creativity, a skilled team can enumerate the significant failure modes of a system and verify that they have been addressed. This assumption holds reasonably well for single models with stable input distributions.
For multi-agent systems, it does not hold. The state space is determined by the number of agents, the richness of their interactions, the tools available to each, and the sequence of prior actions. It is not practically enumerable. A red team can find known failure modes and previously observed attack patterns. It cannot verify the absence of unknown failure modes in a system whose behaviour is sensitive to runtime state.
Red-teaming is still valuable for multi-agent systems. It surfaces specific vulnerabilities, builds team familiarity with the failure modes, and provides evidence of due diligence. What it cannot provide is the comprehensive assurance that most organisations believe it provides. Treating red-team completion as a sufficient condition for production deployment is a governance error.
Constraint Architecture as the Primary Control
If red-teaming cannot comprehensively characterise the failure modes of a multi-agent system, the governance implication is that the primary control needs to be architectural rather than evaluative. The question shifts from "have we found all the ways this could go wrong?" to "regardless of how it goes wrong, what are the limits on what it can do?"
Hard constraints, implemented at the tool level rather than the model level, limit what an agent can do regardless of what its context suggests it should do. An agent that cannot take an irreversible action above a defined consequence threshold, regardless of what instructions it receives, is safer than an agent that has been instructed not to take such actions and that has passed a red-team evaluation.
Mandatory approval gates for actions above defined consequence thresholds keep human judgement in the loop without requiring human review of every agent action. They are calibrated to the risk, not to the volume.
Observable action logs that capture the full context of agent decisions, not just the inputs and outputs, create the audit trail that allows post-hoc review when the system behaves unexpectedly. For agentic systems, this kind of observability is not optional. It is the mechanism by which governance remains possible after deployment.
What Boards Need to Understand
The risk profile of agentic AI is not the risk profile of AI assistants, and the governance approach that is adequate for the latter is not adequate for the former. The relevant question is not "can we trust this agent?" Trust is not a sufficient basis for governance of systems that operate autonomously in consequential domains.
The relevant questions are: what can each agent in this system do, what prevents it from doing things that have not been authorised, and how will we know if the boundaries have been breached or if the system has been exploited? Organisations that can answer these questions clearly have the basis for responsible agentic AI deployment. Organisations that cannot are operating on trust in a domain where trust alone is not sufficient.