The best AI models in the world can only find 3.8% of malicious events in a real-world security log.
April 23, 2026
Original Paper
Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps
arXiv · 2604.19533
The Takeaway
Frontier LLMs fail dramatically at unsupervised threat hunting over massive, noisy datasets. Despite the marketing around AI security analysts, the reasoning gap remains enormous: these models struggle to find the needle in the haystack when the data is not perfectly cleaned and formatted. The benchmark shows we are still far from AI that can autonomously defend corporate networks. Security teams should treat AI as a helper for specific tasks, not a replacement for human judgment.
From the abstract
We introduce the Cyber Defense Benchmark, a benchmark for measuring how well large language model (LLM) agents perform the core SOC analyst task of threat hunting: given a database of raw Windows event logs with no guided questions or hints, identify the exact timestamps of malicious events. The benchmark wraps 106 real attack procedures from the OTRF Security-Datasets corpus - spanning 86 MITRE ATT&CK sub-techniques across 12 tactics - into a Gymnasium reinforcement-learning environment. Each ep
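To make the setup concrete, here is a minimal sketch of what an episode in such an environment might look like. This is an illustrative assumption, not the paper's actual implementation: the class name, reward scheme, and guess budget are invented, and the Gymnasium `reset()`/`step()` API shape is mimicked without importing the `gymnasium` package itself.

```python
# Hypothetical sketch of one threat-hunting episode, mimicking the
# Gymnasium reset()/step() contract. All names (ThreatHuntEnv, budget,
# malicious_ts) are assumptions for illustration, not the benchmark's API.

class ThreatHuntEnv:
    """One episode = one attack procedure replayed over a raw event log.

    The agent sees the unlabeled log and submits candidate timestamps;
    a guess scores 1.0 if it matches a malicious event, 0.0 otherwise.
    """

    def __init__(self, events, malicious_ts, budget=10):
        self.events = events                  # raw log records (dicts)
        self.malicious_ts = set(malicious_ts) # ground-truth timestamps
        self.budget = budget                  # max guesses per episode

    def reset(self):
        self.found = set()
        self.steps = 0
        # Observation: the full unlabeled log the agent must hunt through.
        return self.events, {}

    def step(self, guess_ts):
        self.steps += 1
        hit = guess_ts in self.malicious_ts and guess_ts not in self.found
        if hit:
            self.found.add(guess_ts)
        terminated = self.found == self.malicious_ts  # all events found
        truncated = self.steps >= self.budget          # budget exhausted
        return self.events, (1.0 if hit else 0.0), terminated, truncated, {}


def recall(env):
    """Fraction of malicious timestamps the agent found; the headline
    3.8% figure is a detection rate of this kind."""
    return len(env.found) / max(len(env.malicious_ts), 1)
```

Under this sketch, a model that guesses one of many malicious timestamps scores partial recall, which is how an aggregate figure like 3.8% across episodes can arise.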