“From Prompts to Plans: Security and Safety Testing for Agentic AI” by Jason Stanley:
The Core Problem: AI is Evolving Faster Than Our Testing Methods
Stanley begins by highlighting the massive surge in AI adoption across enterprises. However, the nature of AI systems is fundamentally changing. We are moving away from simple, stateless “chat” interfaces (where a user inputs a prompt and gets a single reply) toward complex Agentic AI. These new agents have memory, access to external tools, complex architectures, and the ability to take multi-step actions in real environments.
Consequently, the attack surface has expanded drastically, but testing has not kept up. Current testing methodologies are flawed because they:
- Only check the “front door”: They focus on the initial prompt, ignoring the “seams” (like tools, memory, or external data injections).
- Are stateless: They test single interactions rather than complex, multi-step trajectories.
- Are context-unaware: Teams rely on generic, public benchmarks rather than testing the actual risks specific to their unique deployment.
- Ignore the security vs. utility trade-off: Security tests are often run in isolation. Putting heavy “guardrails” on an agent might make it secure, but it often destroys its ability to successfully complete its intended tasks.
The Solution: A 3-Step Framework (Map, Test, Promote)
To address these shortcomings, Stanley proposes a structured methodology for securing Agentic AI:
1. MAP (Build a Specific Threat Model)
Instead of using generic internet taxonomies, teams must map the risks specific to their system. This involves defining:
- Outcomes: What should the agent do, and what would be a disastrous outcome?
- Architecture & Surfaces: What tools, databases, and external connections does the agent touch?
- Invariants: What are the absolute “never-do” rules (e.g., “never issue a refund over $200 without human approval”)?
2. TEST (Dual-Track Testing)
Testing must be stateful and measure both risk and utility simultaneously. Stanley advocates for a two-pronged approach:
- Context-Aware Benchmarks: To test for known vulnerabilities specific to your system.
- Exploratory Search (Red Teaming): To hunt in the “dark corners” of the system for unknown vulnerabilities.
To facilitate this, ServiceNow open-sourced a testing framework called DoomArena. It allows developers to test their specific agents in their specific environments, measuring both the Attack Success Rate (ASR) and the Task Success Rate (TSR) side-by-side. This ensures that a security patch doesn’t accidentally break the agent’s usefulness.
3. PROMOTE (From Findings to Automated Tests)
When red teaming uncovers a new vulnerability, it must be “promoted” into an automated regression test suite so the system can be continuously tested against it. However, Stanley warns against overfitting—simply copy-pasting the exact malicious prompt into a test suite. Instead, teams should extract the pattern of the attack and build a generator that creates hundreds of variations of that attack. This prevents the defense from becoming “brittle.”
Key Takeaways:
- Test your specific system, not a generic internet benchmark.
- Formalize and automate exploratory red teaming.
- Always measure security (risk) and utility (task success) together.
- Promote discovered vulnerabilities into automated tests carefully to ensure robust, generalized defenses.