Finding vulnerabilities with LLMs

Finding vulnerabilities in modern web apps using Claude Code and OpenAI Codex. Super interesting to see some benchmarks.

Traditional rule-based detection can’t find complex vulnerabilities, and even issues it could in principle detect may slip through as false negatives. These benchmarks help answer whether LLMs could be integrated to cover this blind spot.

They could! But the problem is the noise:

AI Coding Agents Find Real Vulnerabilities: Claude Code found 46 vulnerabilities (14% true positive rate – TPR, 86% false positive rate – FPR) and Codex reported 21 vulnerabilities (18% TPR, 82% FPR). About 20 of these are high severity vulnerabilities.
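
A quick note on reading those percentages: each pair sums to 100%, so they appear to be shares of the reported findings (what fraction were real versus noise) rather than the classical TPR/FPR measured against ground truth. A minimal sketch of that arithmetic, assuming this reading is right. The function name and the counts are hypothetical, picked only so the output matches the quoted Claude Code rates; they are not the benchmark's raw numbers.

```python
def finding_rates(true_positives: int, false_positives: int) -> tuple[float, float]:
    """Return (true-positive share, false-positive share) of all reported findings."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no findings reported")
    return true_positives / total, false_positives / total


# Hypothetical triage counts, NOT taken from the benchmark:
# 14 confirmed findings and 86 false alarms out of 100 reports.
tp_share, fp_share = finding_rates(true_positives=14, false_positives=86)
print(f"true positives: {tp_share:.0%}, false positives: {fp_share:.0%}")
```

At these rates, roughly six out of every seven reports have to be triaged and thrown away, which is the noise problem in practice.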