Leaderboard v3

Appen Security Benchmark

AI models ranked by gated detection metrics (CWE + location) on human-verified XBOW ground truth.

Rankings

Sorted by detection accuracy on human-verified ground truth.

Corpus: Models · Benchmarks · GT Findings (verified) · Model Findings (avg per model)
Analysis

Recall outpaces precision

Top CWE-family F1

Semantic similarity

CWE-family leading metrics
CWE-exact secondary lens
Semantic similarity
TP diagnostics
# | Model | Provider | Recall | Vs Median | Precision · Recall | Findings

Scatter

About

What the XBOW benchmark measures and how to read the Sort by columns in Rankings.

The benchmark. Models receive the source code of deliberately vulnerable web apps (XBOW) and must enumerate findings as structured output: CWE, location, severity, and remediation. Output is scored against human-curated ground truth, not exploit success.

Ground truth. Ground truth consists of human-verified findings across vulnerable web apps in a denser extension corpus. All models are evaluated under identical prompts.

True positives. A model finding must pass every applicable scorer gate: CWE family, file path, endpoint when ground truth has a structured route, and function name when ground truth names one.
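The gate logic above can be sketched as a single predicate. This is a minimal illustration, not the benchmark's scorer: the Finding class, its field names, the family label, and the use of exact string equality are all assumptions (a real scorer may normalise paths and routes before comparing).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Finding:
    # Hypothetical record layout, used only for this sketch.
    cwe_family: str
    file_path: str
    endpoint: Optional[str] = None
    function: Optional[str] = None


def passes_gates(gt: Finding, pred: Finding) -> bool:
    """Return True only if pred clears every gate that applies to gt."""
    if pred.cwe_family != gt.cwe_family:
        return False
    if pred.file_path != gt.file_path:
        return False
    # Endpoint and function gates apply only when ground truth supplies them.
    if gt.endpoint is not None and pred.endpoint != gt.endpoint:
        return False
    if gt.function is not None and pred.function != gt.function:
        return False
    return True


# Illustrative values only; none of these come from the corpus.
gt = Finding("access-control", "index.php", endpoint="POST /index.php")
hit = Finding("access-control", "index.php", endpoint="POST /index.php", function="login")
miss = Finding("access-control", "index.php", endpoint="GET /admin")

print(passes_gates(gt, hit), passes_gates(gt, miss))
```

Because the ground-truth record in this example names no function, the function gate is skipped and the extra function on the model finding does not disqualify it.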

Metric definitions

Precision vs recall
Recall asks what share of ground-truth findings were matched; precision asks what share of the model's findings matched ground truth, so unmatched extra findings count as noise. F1 balances both sides of the trade-off.
Verbosity tax
Models that report far more findings than ground truth can score high on recall while precision collapses. High recall alone does not mean reliable detection.
Gated set matching
Precision, recall, and F1 use one-to-one bipartite matching per benchmark. A true positive requires CWE-family or CWE-exact agreement, depending on the metric, plus file path, endpoint, and function gates when applicable. CWE Exact and Severity Exact rates are diagnostics over true-positive pairs.
MSD semantic distance
Nearest-neighbor mean squared distance (MSD) measures embedding-space closeness per finding. Lower MSD means model output is semantically closer to ground truth.
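One plausible reading of nearest-neighbor MSD, sketched under assumptions: each model finding is paired with its closest ground-truth embedding, the squared Euclidean distance is taken, and the values are averaged. The direction of the lookup (model to ground truth), the distance function, and the toy 2-D vectors are illustrative choices, not details confirmed by this page.

```python
from typing import Sequence

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_msd(model_vecs: Sequence[Vector], gt_vecs: Sequence[Vector]) -> float:
    """Mean, over model findings, of the squared distance to the nearest ground-truth vector."""
    if not model_vecs or not gt_vecs:
        raise ValueError("both sides need at least one embedding")
    nearest = [min(squared_distance(m, g) for g in gt_vecs) for m in model_vecs]
    return sum(nearest) / len(nearest)


# Toy embeddings; real embeddings would come from the chosen embedder.
gt_vecs = [(0.0, 0.0), (1.0, 1.0)]
model_vecs = [(0.0, 1.0), (1.0, 1.0)]

msd = nearest_neighbor_msd(model_vecs, gt_vecs)
print(msd)
```

No gate is applied here: every model finding contributes a distance, whether or not it would count as a true positive.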

Methodology

How the XBOW benchmark is defined, scored, and illustrated.

Benchmark & scoring

Task definition, ground-truth provenance, and how pairs are matched.

The XBOW benchmark evaluates how well LLMs perform as static security reviewers on deliberately vulnerable web applications from the XBOW corpus. XBOW was originally built as a CTF-style dataset: each app is designed around capture-the-flag challenges, but our ground truth was produced independently. A cohort of three cybersecurity experts annotated every verified vulnerability in each app, whether or not it leads to flag capture. A model receives the application source code and must return a structured JSON report: one object per finding, with fields such as cwe_id, title, severity, problem_description, and remediation.
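A report of this shape might be parsed and checked as follows. The five field names come from the description above; everything else is assumed for illustration, including the top-level JSON array, the sample values, the severity vocabulary, and the treatment of these fields as required.

```python
import json

# Field names listed in the task description; treating them as required is an assumption.
REQUIRED_FIELDS = {"cwe_id", "title", "severity", "problem_description", "remediation"}

# Hypothetical model output: the second finding omits "remediation".
raw_report = """
[
  {
    "cwe_id": "CWE-602",
    "title": "Client-controlled admin flag",
    "severity": "high",
    "problem_description": "Server trusts a hidden form field for authorization.",
    "remediation": "Derive privileges from server-side session state."
  },
  {
    "cwe_id": "CWE-200",
    "title": "Sensitive value echoed in response",
    "severity": "medium",
    "problem_description": "Response body includes a secret value."
  }
]
"""


def missing_fields(finding: dict) -> set:
    """Return the required field names absent from one finding object."""
    return REQUIRED_FIELDS - set(finding)


findings = json.loads(raw_report)
problems = {
    index: sorted(missing_fields(finding))
    for index, finding in enumerate(findings)
    if missing_fields(finding)
}
print(problems)
```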

Scores are computed against human-curated ground truth, not exploit success or flag retrieval. For each benchmark, annotators record verified vulnerabilities with the same schema. Set metrics (precision, recall, F1) use one-to-one bipartite matching per benchmark: each ground-truth finding pairs with at most one model finding. A pair counts as a true positive only when it passes the canonical scorer gates: CWE agreement (CWE-family or exact ID, depending on the metric) and location alignment on file path, plus endpoint and function when ground truth provides them. Hardcoded-secrets benchmarks allow a relaxed credential-storage file match. Unmatched model findings are false positives; unmatched ground-truth findings are false negatives. Semantic distance metrics (MSD) measure embedding-space closeness on nearest-neighbor pairs and are not gated.
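The matching and counting described above can be sketched as follows. The sketch assumes the scorer seeks a maximum-cardinality one-to-one matching (found here with simple augmenting paths) over a precomputed gate matrix; the function names, the matrix layout, and the zero-denominator conventions are hypothetical.

```python
from typing import Dict, List


def max_one_to_one_matches(compatible: List[List[bool]]) -> int:
    """compatible[g][m] is True when model finding m passes the gates for ground-truth finding g."""
    n_model = len(compatible[0]) if compatible else 0
    match_of_model = [-1] * n_model  # ground-truth index paired with each model finding

    def try_assign(g: int, seen: List[bool]) -> bool:
        for m in range(n_model):
            if compatible[g][m] and not seen[m]:
                seen[m] = True
                # Take a free model finding, or re-route the one currently holding it.
                if match_of_model[m] == -1 or try_assign(match_of_model[m], seen):
                    match_of_model[m] = g
                    return True
        return False

    return sum(try_assign(g, [False] * n_model) for g in range(len(compatible)))


def set_metrics(tp: int, n_model: int, n_gt: int) -> Dict[str, float]:
    """Precision, recall, and F1 from true positives and the two set sizes."""
    precision = tp / n_model if n_model else 0.0
    recall = tp / n_gt if n_gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"fp": n_model - tp, "fn": n_gt - tp,
            "precision": precision, "recall": recall, "f1": f1}


# Two ground-truth findings, four model findings (illustrative gate results).
compatible = [
    [True, True, False, False],   # GT 0 is satisfied by model findings 0 and 1
    [True, False, False, False],  # GT 1 is satisfied only by model finding 0
]
tp = max_one_to_one_matches(compatible)
metrics = set_metrics(tp, n_model=4, n_gt=2)
print(tp, metrics)
```

In this toy case both ground-truth findings are matched, so recall is 1.0, but the two unmatched model findings are false positives and precision falls to 0.5. That is the verbosity tax in miniature.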

Worked example: XBEN-052-24

A minimal PHP login form exposes a hidden isAdmin field in the browser. The server treats that client-controlled value as proof of admin access and returns the flag when isAdmin=true, without validating credentials. Ground truth records two findings: CWE-602 for the access-control bypass, and CWE-200 for echoing the flag back in the response.

In the pairing below, gemini-3-1-flash-lite reports one CWE-602 finding that passes the true-positive gates. The separate CWE-200 disclosure remains missed ground truth.
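The arithmetic for this benchmark is small enough to write out. Recall follows directly from the text: one of two ground-truth findings is matched. Precision depends on how many findings the model reported for this benchmark in total, which is not stated here, so it is shown as a function of a hypothetical total.

```python
gt_findings = ["CWE-602", "CWE-200"]  # ground truth for XBEN-052-24
matched = ["CWE-602"]                 # the one pair that passes the gates

recall = len(matched) / len(gt_findings)
missed = [cwe for cwe in gt_findings if cwe not in matched]


def precision_if(total_reported: int) -> float:
    """Precision for this benchmark given a hypothetical count of model findings."""
    return len(matched) / total_reported


print(recall, missed)
print(precision_if(1), precision_if(4))
```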

Corpus CWE distribution

Frequency of CWE IDs across all benchmarks in ground truth, and mean CWE counts averaged over evaluated models (phase 1+2).
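Both series can be derived with simple counting. The sketch below uses made-up CWE lists and model names; it also assumes a model that never reports a given CWE contributes zero to that CWE's mean, which this page does not state.

```python
from collections import Counter
from typing import Dict, List


def cwe_counts(cwe_ids: List[str]) -> Counter:
    """Finding count per CWE ID."""
    return Counter(cwe_ids)


def mean_counts_across_models(per_model: Dict[str, List[str]]) -> Dict[str, float]:
    """Mean finding count per CWE, averaged over every model in per_model."""
    totals: Counter = Counter()
    for cwe_ids in per_model.values():
        totals.update(cwe_ids)
    n_models = len(per_model)
    return {cwe: count / n_models for cwe, count in totals.items()}


# Hypothetical data, for illustration only.
ground_truth = ["CWE-602", "CWE-200", "CWE-89", "CWE-89"]
model_findings = {
    "model_a": ["CWE-89", "CWE-89", "CWE-79"],
    "model_b": ["CWE-89", "CWE-602"],
}

gt_distribution = cwe_counts(ground_truth)
model_average = mean_counts_across_models(model_findings)
print(gt_distribution.most_common())
print(model_average)
```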

Ground truth

Finding count per CWE

Models (avg)

Mean finding count per CWE across all models

Top CWE frequencies · GT vs model average