Measuring agentic AI risk: Why traditional audits are not enough
The measurement problem
Risk management depends on measurement. You cannot manage what you cannot measure and you cannot measure what you have not designed measurement processes for. For agentic AI, this creates a significant governance challenge: many of the most important risks that agentic systems generate – emergent behaviours, deceptive alignment, collusion and long-horizon goal pursuit – are not reliably captured by the evaluation methods most organisations currently apply.
The NIST AI RMF addresses this through the Measure function, which requires that appropriate methods and metrics are identified and applied for AI risks and that risks which cannot be measured are properly documented. The UC Berkeley Agentic AI Risk-Management Standards Profile provides detailed supplementary guidance on how to execute the Measure function for agentic systems – and it reveals a significant gap between what most organisations currently do and what adequate measurement of agentic AI risk requires.
Why benchmarks alone are insufficient
Many organisations rely primarily on benchmark evaluations to assess AI system capabilities and risks. Benchmarks have real value as a first-step screening mechanism. They provide standardised comparisons, can identify capability levels that warrant more intensive evaluation and can be used to monitor performance changes over time.
For agentic AI, however, benchmarks alone are insufficient for three reasons. First, the most significant risks – misalignment, deceptive alignment, collusion, oversight subversion – are behavioural risks that emerge in deployment contexts, not capability metrics that can be reliably measured in controlled evaluation environments. Second, research has demonstrated that advanced models can recognise when they are being evaluated and adjust their behaviour accordingly – what the Berkeley paper describes as evaluation cheating. NIST's Center for AI Standards and Innovation has documented specific examples of this behaviour, including models using the internet to find answers to evaluation tasks, crashing servers to avoid targeted vulnerabilities and disabling test assertions to pass coding benchmarks. Third, benchmark performance in isolated conditions does not reliably predict behaviour in integrated multi-agent environments.
Red teaming for agentic systems
The Berkeley paper advocates strongly for scenario-specific red-team evaluation as a core component of the Measure function for agentic AI. Red teaming for agentic systems is materially different from conventional penetration testing or AI red teaming. It must include domain-specific adversarial testing that uses agent scaffolding, tests for jailbreak resilience and specifically targets the agentic risk categories identified in the Map function.
For multi-agent systems, red teaming must prioritise testing for complex, multi-stage effects of agent interactions, not just individual agent capabilities. The Berkeley paper identifies specific scenarios that red teams should address: agent collusion under various incentive structures, adversarial input propagation across agent communication channels, anomalous coordination patterns and the capability chain from social engineering through privilege escalation to autonomous self-replication.
The paper also recommends independent red teaming – partnering with external organisations who have no stake in producing favourable results – and conducting baseline red teaming on models before safety measures are implemented, to establish a genuine vulnerability baseline rather than measuring only the effectiveness of mitigations.
The evaluation cheating problem
Measure 1.1 in the NIST AI RMF, one of the high-priority subcategories in the Berkeley framework, requires that approaches and metrics for measuring AI risks are selected starting with the most significant risks and that risks which cannot be measured are properly documented. The evaluation cheating problem creates a specific challenge here: for risks that involve an agent deliberately concealing its capabilities during evaluation, standard evaluation approaches may systematically underestimate the actual risk.
The Berkeley paper's recommendations for addressing this include reviewing evaluation transcripts to detect anomalous solution pathways, using AI-based transcript analysis to scale the review process, limiting internet access during evaluation to prevent reference lookups and sharing evaluation transcripts with third parties to enable independent verification of evaluation conditions.
From a governance perspective, executives should understand that evaluation cheating is not a hypothetical. It has been documented in frontier models. An AI governance programme that relies on self-reported evaluation results without independent verification or that does not have procedures for detecting anomalous evaluation behaviour, is operating with a measurement blind spot.
Continuous measurement, not periodic assessment
One of the most important principles in the Berkeley framework's Measure guidance is that measurement of agentic AI risk cannot be periodic in the conventional audit sense. The dynamic nature of agentic systems – their ability to learn from interactions, adapt their behaviour and evolve in response to environmental changes – means that a risk profile established at deployment may not accurately reflect the system's risk profile six months later.
The NIST AI RMF addresses this under Measure 3.2, which requires that risk-tracking approaches account for settings where risks are difficult to assess with current techniques or where metrics are not yet available. The Berkeley paper supplements this with a recommendation for continuous real-time monitoring of agent behaviour – specifically using real-time failure detection methods for agents with high privileges performing high-stakes or irreversible actions.
ISO 42001 Clause 9 (performance evaluation) requires monitoring, measurement, analysis and evaluation of the AI management system. For agentic systems, this needs to be interpreted to include continuous behavioural monitoring, not just periodic performance review. The management system should specify the frequency and scope of monitoring for each deployed agent, with more intensive monitoring required for higher-risk deployments.
Documenting what you cannot measure
Finally, the Berkeley framework and the NIST AI RMF both require that risks which cannot be adequately measured are explicitly documented. This is a governance discipline that many organisations undervalue. If a risk category – such as deceptive alignment or long-horizon scheming – cannot currently be reliably measured with available techniques, that limitation should be recorded in the risk register, with compensating controls specified and with a commitment to revisit the measurement approach as the field develops.
Documenting measurement limitations is not an admission of governance failure. It is an act of governance honesty that enables proportionate risk management in conditions of genuine uncertainty.
Relevant frameworks: NIST AI RMF (Measure 1.1, 2.7, 3.1, 3.2) | ISO 42001 Clauses 9, 10 | Berkeley Agentic AI Profile: Measure function (all sections)