How I Think About Agentic Risks

Fully aware that this might be obsolete in half a year, this is my current mental model to reason about AI Agents risks.

This is based largely on my experience in assessing AI systems in the last two years, and applying prior art into my day to day work.

Of everything I digested in current literature the two most influential pieces have been the Google AI Agent security framework and the Lethal Trifecta and the public discourse around it.

I think the lethal trifecta is biasad towards data exfiltration, which is surely the main scare, but there is a lot of damage AI Agents can do without touching any sensitive data.

Alright, so what are the risks? There are two buckets:

Data Exfiltration: where the agent expose sensitive data
Rogue Activity: where the agent perform damaging actions

There are three things that amplify those risks:

Capabilities
Data access
Untrusted input

Fundamentally Agents are unsafe because the underlying LLM has no understanding of what piece of the context is trusted or not. That part can only be delegated to the Agentic wrapper, but in practice this is unsolved despite some design patterns to mitigate it. The classic application security concept of sanitizing and validating your inputs does not apply anymore. This systemic issue is exploited with prompt injection.

Capabilities are what the agent can do. These are the tools in the agentic loop. Any new capability is a potential venue for data exfiltration or a mean for rogue activity. It’s also a potential entry point for new sensitive data and untrusted input.

Data access is all data that lands in the underlying LLM context. Once in there, there is no deterministic assurance that it cannot be pushed in output in some form.

Risk is a function of impact and the probability of it to happen. Capabilities and Data Access amplify the impact, while untrusted input increase the probability. The non deterministic nature of LLM ensure the probability is never zero, with foundational model companies improving new models reliability in staying on task and not hallucinate.

What are the scenarios? I try to map scenarios based on the risk amplifiers: what are the available capabilities, what is the available data, from where can untrusted input land in the context. The way I do is to graph a path of agent activities and take note of what data is in the context at every step.

For example:

The agent reads a github issue
The issue content lands in context. This content could be attacked controlled (untrusted input).
The agent then writes a pull request based on the issue content (with a write_pr capability)
At this point, if the issue contained malicious instructions (prompt injection), the PR could exfiltrate the data that was in context, or perform rogue activity (e.g., introduce a backdoor).

To systematize this, I model the agent’s context as a state defined by two things: what data is present, and whether untrusted input has entered the context. Then I explore all reachable states through a search over capability invocations, flagging risk scenarios along the way.

Of course being the agent a loop, the set of potential states can grow exponentially so I am usually explore up to 2-3 levels deep. One can also appreciate how adding a new capability will just explode the realm of possibilities.

In the end I get a bunch of state combinations that can map to a risk scenario, which I evaluate by impact and probability. This whole thing is a sort of threat model for the agent’s behavior.

Once I nail down a set of realistic risk scenarios I can reason about mitigations.

In general, what can we do to mitigate?

Proactive:

human in the loop
reduce the capabilities
- reduce their impact (yes this agent can handle refunds but only up to 10 coins)
reduce data access
“sanitize” untrusted inputs, somehow

Reactive:

auditability
the usual monitoring and alerting (somehow, with LLM gateways maybe?)

I put sanitize in quote because it’s not really a sanitization step but more of a filtering/vetting. Most mitigations here are design pattern at best, cumbersome to implement most of the time. Things like having an isolated LLM call to thumbs up / down any input (like ChatGPT Agent does), filter tool responses from getting back to the context, etc, etc, etc this is a new frontier to explore.

The rest is: put a human in the loop if possible, then consider revoking data access, reducing capabilities, and at the very least leave an audit trail of the agent run so that we can at least react effectively if something bad happens.

On the reactive side I am looking to learn if something can be done by always monitoring the agent LLM calls, perhaps from a choke point like a LLM gateway, and have it fire up alerts when something suspicious is detected.

Fun stuff