
The title has been editorialized for marketing purposes. My original title would have been “Teach Your Agents to Do Code Security Reviews”.

Coding agents are now involved in the majority of the code shipped at Synthesia. The volume of code changes has gone up, but the time humans spend reading those changes has not. Code security review is especially exposed to this pressure because it depends on careful analysis. To address this, we’ve built an agent skill that we believe approaches Mythos-level performance in uncovering complex security issues, at a fraction of the cost of running such a model.

We previously wrote about scaling vulnerability management after issues have merged or shipped. Since then, we have extended our application security practice to implementation time: coding agents now provide security coverage before changes get merged.

The original idea was to build something engineers can self-serve to give their coding agent a feedback loop on the quality of the code it generates. We ended up building an agent skill that orchestrates an autonomous multi-agent security review pipeline, tuned to our stack and our common pitfalls.

This post describes how it’s structured, the principles we settled on after iteration, and the operational realities that shaped the design.

Find security issues, make no mistakes.

The first thing anyone tries is the simplest one: pipe a diff into Claude, ask for security issues, share the findings. It can surface real issues, but it also surfaces a lot of noise.

The core failure mode of this approach is that a generic prompt produces generic output. You get an OWASP top-ten checklist applied to your code, regardless of what the code actually does.

This is because without specific guidance the agent has to build its own model of the code from scratch, and it has no obvious way to understand which abstractions in this codebase are trustworthy and which aren’t.

Two problems compound from there: false positives and run-to-run variance. If the same diff produces different findings on different runs, the tool is hard to evaluate, hard to tune, and engineers will stop trusting the output.

A working AI security review system has to fix both. It has to be tuned to the codebase it’s reviewing, and it has to be engineered against the noise.

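We settled on three pillars: a deterministic map of the code in scope, a dedicated security context, and a multi-phase review pipeline.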

The first two pillars close the gap between the agent and the code. The third closes the gap between raw findings and useful ones.

Building a map

The unlock was realizing that good security review is mostly orientation. When an agent starts with only a diff, it spends too much time wandering around the codebase to gather context. Different runs take different paths through the codebase, which makes the output noisier and less consistent (and increases cost).

We started by doing the orientation deterministically, before the actual hunt for vulnerabilities starts.

What we wanted in principle was full taint analysis: every path from source to sink, every transformation. In practice, packing deterministic taint analysis into a skill that works across our stack (Python, JavaScript, TypeScript, multiple frameworks) and runs on an engineer’s laptop with minimal setup was too cumbersome. We wanted a frictionless self-serve experience, not a new platform.

So we flipped the problem: instead of doing full taint analysis, we deterministically map every input entry point, then for each entry point we delegate mapping the code flow to smaller subagents.

There are two steps:

1) Enumerate entry points. We wrote Semgrep rules, one set per framework in our stack, that identify where untrusted input enters the application: HTTP route handlers, GraphQL resolvers, websocket endpoints, CLI commands, queue consumers. Entry points are where security risk concentrates, so finding them precisely is most of the orientation work. The rules are embedded in the skill and Semgrep is on every engineer’s machine already, so we paid no install cost to add it. The skill always starts by running Semgrep against the code in scope.

2) Cartographer phase. For each entry point found, a small Haiku subagent (we call them Cartographers) traces the call graph through application code to the sinks it reaches: database, shell, filesystem, network, template. It’s coarse taint analysis done by a cheap LLM: not perfect, but good enough. The output is a flat, factual map of entry point, path, and sinks.
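To make the first step concrete, here is a minimal sketch of turning Semgrep's JSON report into a stable list of entry points for the Cartographers. It is an illustration, not the skill's actual code: `EntryPoint` and `parse_entry_points` are hypothetical names, and the sketch relies only on the `check_id`, `path`, and `start.line` fields that `semgrep --json` emits for each result.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryPoint:
    rule_id: str  # Semgrep rule that matched (one rule set per framework)
    path: str     # file containing the handler
    line: int     # first line of the match


def parse_entry_points(semgrep_json: str) -> list[EntryPoint]:
    """Turn `semgrep --json` output into a sorted, de-duplicated entry-point list."""
    report = json.loads(semgrep_json)
    found = {
        EntryPoint(r["check_id"], r["path"], r["start"]["line"])
        for r in report.get("results", [])
    }
    # Sorting keeps the list identical across runs on the same code,
    # so the downstream subagents always start from the same map.
    return sorted(found, key=lambda e: (e.path, e.line, e.rule_id))
```

Sorting and de-duplicating up front is a small design choice in the spirit of doing the orientation deterministically: any variance should come from the LLM phases, not from the order in which results happen to be reported.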

This gives us a good-enough map of the code in scope without adding too many dependencies. With this we can now be precise with our prompting: “here is an entry point, here is what it reaches, find this kind of vulnerability on this code path.” On a large codebase, this is the difference between an agent staying inside its scope and wandering through unrelated files polluting its context window.

Security Context

The second pillar was figuring out what kind of context helped the agent do the review. We ran some experiments and learned that code-generation context hurt security review quality and, above all, increased variance.

Files like CLAUDE.md, cursor rules, and repo-level coding instructions are written to help an agent complete a code generation task: follow conventions, trust existing abstractions, fit the codebase. Security review needs the opposite prior: distrust abstractions, question conventions, and assume framework patterns can be misused.

We now keep security context separate from codegen context. The system reads SECURITY.md files distilled from our threat models, past bug patterns, and framework misuses we’ve already had to fix once.

What goes in those files is information that disambiguates findings: the tenant model and where the isolation boundary lives, the blessed authorization primitives so the reviewer recognizes drift, an ID risk taxonomy, explicit anti-false-positive notes for patterns that look wrong but aren’t, and the historical vulnerability classes this piece of the codebase has actually shipped. None of it would be inferable from just reading the code in scope.

The delivery mechanism piggybacks on the agent harness’s progressive discovery of AGENTS.md context files. When the reviewer opens a file deep in service/core/auth/, the nearest SECURITY.md loads automatically, scoped to that subsystem. The threat model isn’t shoved into every prompt; it’s loaded only when relevant to the code in scope.
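The effect of that scoping can be sketched as a nearest-ancestor lookup. This is only an illustration of the behaviour described above, not how any particular agent harness implements it; `nearest_security_md` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def nearest_security_md(file_path: Path, repo_root: Path) -> Optional[Path]:
    """Return the closest SECURITY.md at or above file_path's directory.

    The search never leaves repo_root; returns None when nothing is found.
    """
    root = repo_root.resolve()
    current = file_path.resolve().parent
    # Refuse to search for files that live outside the repository.
    if current != root and root not in current.parents:
        return None
    while True:
        candidate = current / "SECURITY.md"
        if candidate.is_file():
            return candidate
        if current == root:
            return None
        current = current.parent
```

With a SECURITY.md in both service/ and service/core/auth/, a file under service/core/auth/ resolves to the deeper one, which is what keeps the loaded threat model scoped to the subsystem being reviewed.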

With proper security context the variance of findings dropped: we started finding the same issues consistently across multiple runs, while noticeably reducing the number of issues discarded as false positives.

We will share more about our approach to building a security context in a future blog post.

The security review pipeline

The third pillar wraps the orientation work and turns vulnerability analysis output into findings an engineer can act on. This is where we did most of the engineering against false positives and run-to-run variance.

The skill orchestrates six phases. The first and the last run in the main agent; each phase in between is delegated to subagents with a clear task and a model sized to that task. Every phase produces a single artifact that is handed to the next one. Apart from preparation and aggregation, the main agent is only responsible for orchestration.

Engineers can self-serve by running:

/synthesia-security-review [scope]

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌───────────┐
│  PREP    │──▶│   MAP    │──▶│   HUNT   │──▶│  DEDUP   │──▶│ VALIDATE │──▶│ AGGREGATE │
│  main    │   │  haiku   │   │  opus    │   │  sonnet  │   │  opus    │   │   main    │
│          │   │ parallel │   │ parallel │   │  single  │   │ parallel │   │           │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └───────────┘
     │              │              │              │              │              │
     ▼              ▼              ▼              ▼              ▼              ▼
  scope +     attack-surface   raw findings   deduped         validated     FINDINGS.md
  arch         map (per-entry  (per-hunter)   findings        verdicts      (final report)
  summary      call graphs)

Step 1: Preparation. The main agent resolves the given scope (PR, staged changes, or path), detects the language, summarizes the architecture, and writes that high-level context to .security/architecture.md.

Any code-generation context is dropped if present, and the security-context layer is overlaid on the codebase.

Step 2: Build a Map. Map all entry points with Semgrep, then run one Haiku cartographer subagent per entry point in parallel. Each traces application-level paths from source to sinks and writes its segment of .security/attack-surface.md.

If Semgrep finds no entry points, we fall back to a single Sonnet subagent to identify untrusted input sources by reading the code. That output is also useful for learning whether we need to write additional Semgrep rules for future iterations.
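The fan-out and its fallback can be sketched as a small planning function. This is a sketch under assumptions: `MapTask` and `plan_map_phase` are hypothetical names, and the model labels are just the tier names used in this post, not real model identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapTask:
    model: str      # model tier label as used in this post
    target: str     # what the subagent is asked to trace
    parallel: bool  # whether it runs alongside sibling tasks


def plan_map_phase(entry_points: list[str]) -> list[MapTask]:
    """One Haiku cartographer per entry point, or a single Sonnet fallback."""
    if not entry_points:
        # Semgrep found nothing: read the code to find untrusted inputs instead.
        return [MapTask("sonnet", "identify untrusted input sources in scope", False)]
    return [MapTask("haiku", ep, True) for ep in entry_points]
```

Whenever the plan comes back as the single fallback task, that run is also a signal that a Semgrep rule may be missing for the framework in scope.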

Step 3: Hunting for Vulnerabilities. We give architecture.md and attack-surface.md to three subagents that run in parallel to find vulnerabilities. We call them Hunters.

Hunters are not prompted as specialized personas; each is tasked with finding a specific class of vulnerability within the specific code paths mapped in the previous phase. We landed on running searches for injection, authorization, and business logic issues.
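As an illustration of what a task-scoped brief (rather than a persona) might look like, here is a minimal sketch that builds one brief per vulnerability class from the mapped paths. The wording of the brief and the names `MappedPath`, `hunter_brief`, and `plan_hunters` are assumptions made for this example, not the prompts the skill actually uses.

```python
from dataclasses import dataclass

VULN_CLASSES = ("injection", "authorization", "business logic")


@dataclass(frozen=True)
class MappedPath:
    entry_point: str
    sinks: tuple[str, ...]


def hunter_brief(vuln_class: str, paths: list[MappedPath]) -> str:
    """Build one task brief: a single vulnerability class, explicit code paths."""
    lines = [f"Find {vuln_class} vulnerabilities. Stay on the code paths listed below."]
    for p in paths:
        sinks = ", ".join(p.sinks) if p.sinks else "no sinks identified"
        lines.append(f"- entry point: {p.entry_point} -> reaches: {sinks}")
    return "\n".join(lines)


def plan_hunters(paths: list[MappedPath]) -> dict[str, str]:
    """One brief per vulnerability class, i.e. three hunters."""
    return {vuln_class: hunter_brief(vuln_class, paths) for vuln_class in VULN_CLASSES}
```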

Each hunter follows a playbook:

Step 4: Deduplication. At this stage a single Sonnet pass reads all the findings and merges the ones with the same root cause.

We do deduplication before validation, because validation is one-agent-per-finding and therefore expensive.

Findings can overlap meaningfully (the same issue can appear in two hunters’ output, despite their different focus). Deduping first means we don’t pay to validate the same issue three times.
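The cost argument is easy to see in a toy version. In the pipeline the merge is a judgment call made by a Sonnet pass; the sketch below swaps in a deterministic stand-in that treats "same file and same sink line" as "same root cause", which is an assumption made purely for illustration. `RawFinding` and `dedupe` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RawFinding:
    hunter: str     # which hunter reported it
    file: str
    sink_line: int  # stand-in for the root cause location
    title: str


def dedupe(findings: list[RawFinding]) -> list[list[RawFinding]]:
    """Group findings that share a root-cause key; one group = one validator."""
    groups: dict[tuple[str, int], list[RawFinding]] = defaultdict(list)
    for finding in findings:
        groups[(finding.file, finding.sink_line)].append(finding)
    return [group for _, group in sorted(groups.items())]
```

With one validator per group instead of one per raw finding, two hunters flagging the same sink cost one expensive validation run rather than two.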

Step 5: Validate. One subagent per deduped finding, all running in parallel. Each one re-reads the code, checks whether the abuse scenario actually holds, and classifies the finding as a false or true positive. For true positives we also check whether it is realistically exploitable.

Validator agents are deliberately prompted to be stricter than hunters; their job is to push back. They follow this playbook:

Step 6: Aggregate. After validation we discard false positives and low-impact findings without exploit paths. We rank what remains based on our internal guidance and write a final report for the end user in .security/FINDINGS.md.
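A minimal sketch of that filter-and-rank step follows. Two things here are assumptions: "low-impact" is modelled as severity "low", and ranking by severity stands in for our internal guidance, which this post does not spell out. `Verdict` and `aggregate` are hypothetical names.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


@dataclass(frozen=True)
class Verdict:
    title: str
    severity: str        # one of SEVERITY_ORDER's keys
    true_positive: bool  # validator's classification
    exploitable: bool    # validator found a realistic exploit path


def aggregate(verdicts: list[Verdict]) -> list[Verdict]:
    """Drop false positives and low-impact findings with no exploit path, then rank."""
    kept = [
        v for v in verdicts
        if v.true_positive and (v.exploitable or v.severity != "low")
    ]
    return sorted(kept, key=lambda v: (SEVERITY_ORDER[v.severity], v.title))
```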

At this point the review is done, the report is fed back to the user’s coding agent, and the user can simply tell it to “fix this”.

Two design choices are worth pulling out, because they’re where most of the engineering went:

Shared context files, not inlined prompts. Subagents read .security/architecture.md and .security/attack-surface.md themselves. Task context is passed between subagents via files. This makes them easy to prompt and gives us a paper trail to inspect each run.

Right-sized models per phase. The expensive models are reserved for the two phases where judgment actually matters: hunting and validation. Most of the cost-vs-quality tradeoff in an agent pipeline is decided by how narrow we can make each step’s task, not by which model we pick. So if the orchestration is sound, we don’t need to wait for Mythos or other new frontier models with cybersecurity training; once they become available, we can switch the critical phases over to them.

How we built this

Agentic systems are non-deterministic. The same skill, run twice on the same code, will not produce identical findings. Every change you make to a prompt, a phase, or a model assignment lands on top of inherent run-to-run variance, and “it looked better this time” is not evidence of anything.

We started by building something very simple and iterated on it, guided by a basic benchmarking strategy.

We keep a reference codebase from our own product with a known set of issues. Every iteration of the skill (a new prompt, a new phase, a model swap, a rule change) runs against it. We track three dimensions: cost, wall time, and the findings produced. We run each iteration multiple times to get a basic read on variance, not just a single point estimate.

The rule is: no dimension is allowed to regress. A change that improves recall but doubles cost doesn’t ship. A change that looks better on one run but is unstable across three doesn’t ship. A change that finds new true positives but reintroduces a false positive class we’d already eliminated doesn’t ship.
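The rule is simple enough to write down as a gate. The sketch below is one possible reading of it, not our actual tooling: `Run`, `summarize`, and `may_ship` are hypothetical names, "stable" is taken to mean "reported in every run", anything outside the known issue set is counted as a false positive, and `tolerance` is an optional knob added for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Run:
    cost_usd: float
    wall_minutes: float
    found: frozenset  # IDs of the findings this run reported


def summarize(runs: list[Run], known_issues: frozenset) -> dict:
    """Collapse several runs of one skill version into comparable numbers."""
    in_every_run = frozenset.intersection(*(r.found for r in runs))
    in_any_run = frozenset.union(*(r.found for r in runs))
    return {
        "cost": mean(r.cost_usd for r in runs),
        "wall": mean(r.wall_minutes for r in runs),
        "stable_true_positives": in_every_run & known_issues,
        "false_positives": in_any_run - known_issues,
    }


def may_ship(baseline: list[Run], candidate: list[Run],
             known_issues: frozenset, tolerance: float = 0.0) -> bool:
    """True only if no dimension regresses relative to the baseline runs."""
    b = summarize(baseline, known_issues)
    c = summarize(candidate, known_issues)
    return (
        c["cost"] <= b["cost"] * (1 + tolerance)
        and c["wall"] <= b["wall"] * (1 + tolerance)
        # Every issue the baseline found reliably must still be found reliably.
        and c["stable_true_positives"] >= b["stable_true_positives"]
        # No false positive may appear that the baseline did not already have.
        and c["false_positives"] <= b["false_positives"]
    )
```

Comparing the intersection across runs, rather than any single run, is what turns "looked better this time" into a failed check when a finding only shows up some of the time.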

This isn’t sophisticated. It’s just refusing to make decisions on vibes. It’s a useful habit we’ve adopted for iterating on agentic systems, not because the methodology is clever, but because the alternative, which is “this run looked good,” is how we end up spending too much time making something subtly worse.

The numbers

One last step of the skill reports the results back to the security team so we can analyze how it behaves over time.

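We group runs into cohorts by scope size (number of lines of code changed). For each cohort we measure mean cost, mean duration, the average number of findings per severity, and the average number of discarded and valid findings per review.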

These are our numbers so far:

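Scope size | Mean cost | Mean duration | Avg critical | Avg high | Avg medium | Avg low | Avg discarded | Avg valid
-----------|-----------|---------------|--------------|----------|------------|---------|---------------|----------
small      | $2.72     | 7.4m          | 0.027        | 0.405    | 0.432      | 0.182   | 1.804         | 1.047
medium     | $4.30     | 10.0m         | 0.040        | 0.566    | 0.833      | 0.369   | 2.813         | 1.808
large      | $6.15     | 13.7m         | 0.051        | 0.718    | 1.051      | 0.487   | 4.051         | 2.308
ALL        | $3.88     | 9.4m          | 0.036        | 0.519    | 0.701      | 0.309   | 2.551         | 1.566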

A few things stand out:

The pipeline discards about 60% of what the hunters surface. Across all reviews, the validators and the aggregator together throw away roughly three findings for every five the hunters produce. That’s the operational story mentioned earlier: hunters flag anything plausible by design, and most of the engineering in the pipeline is what filters their output down to something safe to act on.

The cost-per-actionable-finding is in the low single digits. A large review averages $6.15 and surfaces about 2.3 valid findings, which works out to roughly $2.70 per finding an engineer actually reads. A small review averages $2.72 for about one finding. We don’t think these numbers are particularly impressive on their own. The point is that they’re cheap enough that running unattended on every pull request won’t trigger a budget conversation.
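For readers who want to check the arithmetic behind the last two points, the short script below recomputes both ratios from the table. It assumes that discarded plus valid findings add up to everything that reached validation and aggregation; the function names are ours for this example.

```python
# (mean cost in USD, avg discarded, avg valid) per cohort, copied from the table
COHORTS = {
    "small": (2.72, 1.804, 1.047),
    "medium": (4.30, 2.813, 1.808),
    "large": (6.15, 4.051, 2.308),
    "ALL": (3.88, 2.551, 1.566),
}


def discard_rate(discarded: float, valid: float) -> float:
    """Share of findings thrown away before the final report."""
    return discarded / (discarded + valid)


def cost_per_valid(cost: float, valid: float) -> float:
    """Mean cost of a review divided by the mean number of valid findings."""
    return cost / valid


for name, (cost, discarded, valid) in COHORTS.items():
    print(
        f"{name:<6} discard rate {discard_rate(discarded, valid):.0%}, "
        f"cost per valid finding ${cost_per_valid(cost, valid):.2f}"
    )
```

The discard rate stays between roughly 61% and 64% in every cohort, and the cost per valid finding stays under $3 regardless of scope size.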

The severity mix is realistic. Critical findings show up in roughly one review out of thirty. Most of the actionable output is in the medium and high bands, which is what we’d expect from a system reviewing code that already passes the rest of our pre-merge checks. We watched closely for a “flood of criticals” failure mode in the early iterations and didn’t see it.

From self-serve to CI, and what’s next

We built the skill self-serve first with a clear value proposition: any engineer, any time, can run a security review on the code they’re about to push, without involving the security team.

Predictably, adoption was slow and concentrated. The strongest adopters were the already security-minded engineers who needed it the least. The engineers we most wanted to reach (fast-moving, agent-assisted, shipping a high volume of generated code) were hard to convert.

So we moved it into CI, attached to every pull request, but non-blocking. The cost per review made this affordable and the design of the pipeline made it possible to run unattended in a remote sandbox. Findings are posted on the PR for the author to triage and into the Security backlog for further processing.

Non-blocking is a deliberate choice given the slowness of a full review run: despite parallelizing where possible, a review still takes nearly ten minutes on average.

This is a best-effort system. If the review lands a comment on a pull request before it gets merged and it’s acted on: great.

For everything else, the bet we’re making is post-merge patching: when the skill identifies findings on a merged PR, another agentic system opens a follow-up PR with a proposed fix, ready for the original author to review. This treats the speed limit honestly (security review is slower than the merge) and converts a finding from a piece of feedback the engineer has to act on into a piece of work that’s already half done. We’ll report back on how it goes.

Build your own

You can replicate our approach. What’s worth copying are the design principles:

Point a coding agent at this post and have it draft a version of this skill tuned to your stack, then iterate against your own benchmarks. Make no mistakes!