The best AI models in the world can only find 3.8% of malicious events in a real-world security log.
April 23, 2026
Original Paper
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv · 2604.19533
The Takeaway
Frontier LLMs fail dramatically at unsupervised threat hunting over massive, noisy datasets. Despite the marketing around AI security analysts, the reasoning gap remains enormous: these models struggle to find the needle in the haystack when the data is not perfectly cleaned and formatted. The benchmark shows we are still far from AI that can autonomously defend corporate networks. Security teams should treat AI as a helper for specific tasks, not a replacement for human judgment.
From the abstract
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each ep
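To make the setup concrete, here is a minimal sketch of what an episode in such an environment might look like. This is an illustrative assumption, not the paper's actual implementation: the class name, reward scheme, and guess budget are invented, and the Gymnasium `reset()`/`step()` API shape is mimicked without importing the `gymnasium` package itself.

```python
# Hypothetical sketch of one threat-hunting episode, mimicking the
# Gymnasium reset()/step() contract. All names (ThreatHuntEnv, budget,
# malicious_ts) are assumptions for illustration, not the benchmark's API.

class ThreatHuntEnv:
    """One episode = one attack procedure replayed over a raw event log.

    The agent sees the unlabeled log and submits candidate timestamps;
    a guess scores 1.0 if it matches a malicious event, 0.0 otherwise.
    """

    def __init__(self, events, malicious_ts, budget=10):
        self.events = events                  # raw log records (dicts)
        self.malicious_ts = set(malicious_ts) # ground-truth timestamps
        self.budget = budget                  # max guesses per episode

    def reset(self):
        self.found = set()
        self.steps = 0
        # Observation: the full unlabeled log the agent must hunt through.
        return self.events, {}

    def step(self, guess_ts):
        self.steps += 1
        hit = guess_ts in self.malicious_ts and guess_ts not in self.found
        if hit:
            self.found.add(guess_ts)
        terminated = self.found == self.malicious_ts  # all events found
        truncated = self.steps >= self.budget          # budget exhausted
        return self.events, (1.0 if hit else 0.0), terminated, truncated, {}


def recall(env):
    """Fraction of malicious timestamps the agent found; the headline
    3.8% figure is a detection rate of this kind."""
    return len(env.found) / max(len(env.malicious_ts), 1)
```

Under this sketch, a model that guesses one of many malicious timestamps scores partial recall, which is how an aggregate figure like 3.8% across episodes can arise.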