AI & ML Breaks Assumption

Argues that the distinctive 'AI prose style' (specifically em dash overuse) is a surviving artifact of markdown-saturated training data leaking into unstructured output.

March 31, 2026

Original Paper

The Last Fingerprint: How Markdown Training Shapes LLM Prose

E. M. Freeburg

arXiv · 2603.27006

The Takeaway

The paper proposes a mechanistic explanation for why LLM prose has a recognizable 'vibe', arguing that these artifacts are latent from pre-training and resistant to RLHF suppression. This matters for developers of detection tools and for researchers attempting to fine-tune out the 'AI fingerprint'.
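
For readers building detection tools, the simplest quantity to track is how often a text uses em dashes. The sketch below is a minimal illustration, not the paper's measurement procedure: the function name `em_dash_rate`, the per-1,000-words normalization, and the choice to count a standalone ASCII `--` as an em dash stand-in are all assumptions made here for illustration.

```python
import re

EM_DASH = "\u2014"  # the Unicode em dash character

def em_dash_rate(text: str, per: int = 1000) -> float:
    """Return em dashes per `per` words (hypothetical metric, not from the paper)."""
    # Count word-like tokens so that punctuation does not inflate the denominator.
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    # Count true em dashes plus exactly-two-hyphen runs ("--").
    # Longer runs such as "---" (a markdown horizontal rule) are excluded.
    ascii_dashes = re.findall(r"(?<!-)--(?!-)", text)
    count = text.count(EM_DASH) + len(ascii_dashes)
    return per * count / len(words)
```

Comparing this rate between a model's output and a human-written baseline of the same genre gives a rough signal only; a single punctuation feature is unlikely to be a reliable detector on its own.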

From the abstract

Large language models produce em dashes at varying rates, and the observation that some models "overuse" them has become one of the most widely discussed markers of AI-generated text. Yet no mechanistic account of this pattern exists, and the parallel observation that LLMs default to markdown-formatted output has never been connected to it. We propose that the em dash is markdown leaking into prose -- the smallest surviving unit of the structural orientation that LLMs acquire from markdown-satur