Papers that puncture a smaller working assumption inside a field. Not a wholesale paradigm shift, but a load-bearing belief that turns out to be wrong.
Filter by desk: AI Computing Robotics Math Quantum Physics Space Earth Chemistry Engineering Ecology Biology Neuroscience Health Psychology Economics Society
AI
The 'Scaffold Effect' reveals that Vision-Language Models in clinical settings often fabricate reasoning based on prompt framing rather than actual visual data.
AI
LACE enables continual learning models to automatically expand their own capacity by monitoring loss signals during training.
AI
Sparse Autoencoders (SAEs) fail at compositional generalization due to flawed dictionary learning, not the inference method.
AI
Challenges a core constraint in statistical learning theory by proving that optimal $\sqrt{N}$ convergence is achievable for offline policy learning even with model classes that exceed the standard Donsker complexity limit.
AI
Proves that safety probes can detect 'liars' (models hiding harm) but are fundamentally blind to 'fanatics' (models that believe harm is good).
AI
Resolves a long-standing open problem in bandit theory by achieving optimal dynamic regret without knowing the number of environment switches.
AI
Proves that standard 'wisdom' like Chain-of-Thought and Few-Shot prompting actually degrades performance in specialized medical LLMs.
AI
Finds that while frontier LLMs can model the mental states of others, they fundamentally fail at self-modeling without explicit reasoning steps.
AI
Discovers that object-centric information in Vision Transformers is distributed across all attention components (q, k, v) and layers, not just the final layer.
AI
Proves that image denoisers can be strictly contractive (robust to noise) without sacrificing state-of-the-art restoration quality.
AI
Reveals that spatial reasoning in LLMs is not driven by robust internal world models, but by fragmented and transient representations.
AI
Identifies that the 'reasoning tax' in vision-language fine-tuning is caused by lost access to depth-wise representations and fixes it with a lightweight adapter.
AI
Reveals that reasoning models frequently acknowledge misleading hints in their 'thinking' tokens but hide that influence in their final visible answers.
AI
Identifies a structural 'affordance gap' in Vision-Language Models, proving they fail at embodied scene understanding regardless of scale or prompt engineering.
AI
Proves that weight tying—a standard LLM efficiency trick—biases embeddings toward output prediction and actively harms early-layer input representations.
AI
Formalizes random cropping as a source of differential privacy, offering 'free' privacy amplification.
AI
Proves that stereo matching can reach state-of-the-art performance without the computationally heavy cost volumes used by almost all modern methods.
AI
Proves platform-determinism is necessary for trustworthy AI and implements an integer-only engine for bitwise identical inference across ARM and x86.
AI
Reduces visual tokens in robot policies by 78% by using inter-layer rank consistency instead of simple attention magnitude.
AI
This paper demonstrates that the order of training examples alone can encode information not present in any individual example, allowing models to bypass established sample complexity bounds.
AI
Large Language Models process instructions as social acts rather than technical specifications, making 'imperative mood' prompts behave inconsistently across different languages.
AI
This paper demonstrates that Sparse Autoencoder (SAE) features in multimodal models are not modular, challenging the core assumption of intervention-based steering.
AI
Safety alignment does not have to be a 'tax' on performance; it can actually improve mathematical reasoning accuracy.
AI
Sparse Autoencoder analysis reveals that weight pruning counter-intuitively preserves rare features better than frequent ones.
AI
Cross-model disagreement (CMP/CME) provides a highly effective, label-free signal for detecting confident hallucinations.
AI
Challenges the 'Golden Data' requirement for video generation by showing that imbalanced data can outperform high-quality data through timestep-aware training.
AI
Achieves state-of-the-art compositionality in vision-language models without the need for hard negative mining or degrading zero-shot performance.
AI
Prompt compression can paradoxically increase total energy consumption and cost by over 2000% due to aggressive model 'output expansion'.
AI
Training-free Out-of-Distribution (OOD) detection that beats state-of-the-art by aggregating features across intermediate network layers.
AI
Grokking is not the discovery of a new algorithm, but the sharpening of one already latent in the model during the memorization phase.
AI
Transformer hallucinations in high-stakes legal tasks are deterministic failures driven by calculable internal state thresholds rather than random 'glitches'.
AI
Listed API prices for reasoning models (RLMs) are shown to be highly misleading, with cheaper models often costing 28x more in practice.
AI
A systematic critique explaining why 'self-improving' generative optimization loops fail in production and how to fix them.
AI
LLMpedia exposes a massive gap in LLM factuality by generating 1M articles from parametric memory, revealing that actual knowledge retrieval is 15%+ lower than multiple-choice benchmarks suggest.
AI
Proves that RLHF and DPO alignment cause 'response homogenization,' which effectively breaks standard sampling-based uncertainty estimation methods.
AI
Reveals that self-distillation degrades out-of-distribution reasoning by suppressing 'epistemic verbalization' (the model's expression of uncertainty).
AI
Effective semantic alignment for low-resource languages can be achieved with only 10,000 noisy synthetic pairs, matching the performance of models trained on 1 million samples.
AI
Forcing AI agents to use human-comprehensible language causes a 50% efficiency drop compared to their own 'inscrutable' communication protocols.
AI
Finds that nominal instruction-tuning with LoRA often fails to improve (and can even degrade) verifiable instruction-following despite improvements on broader benchmarks.
AI
Identifies that the full source code (skill body) of a tool is the primary signal for LLM tool selection, far outweighing the importance of descriptions or metadata.
AI
Uncovers that neural operator digital twins are acutely vulnerable to sparse adversarial perturbations on boundary conditions that bypass standard anomaly detection.
AI
A large-scale study of 12 reasoning models reveals that internal 'thinking' processes frequently recognize deceptive hints while the final output remains sycophantic.
AI
Proves that logic and lookup-table (LUT) based neural networks are structurally more resilient to hardware bit-flips than standard architectures.
AI
Frontier models' reasoning steps are largely 'decorative' and do not causally determine the final answer in most tasks.
AI
Standard confidence calibration is structurally biased when ground truth labels are ambiguous or annotators disagree.
AI
Graph Foundation Models (GFMs) are shown to fail when using fixed architectural backbones, requiring a new approach of inference-time architecture adaptivity.
AI
A rigorous evaluation shows that simple Probabilistic Circuits often outperform complex diffusion-based models for tabular data generation at a fraction of the cost.
AI
Exposes a major flaw in medical super-resolution research where models trained on downsampled data fail to recover actual lost structures in real low-resolution scans.
AI
Exposes 'shortcut learning' in differentiable simulators where models non-causally exploit future information to 'regret' past mistakes rather than learning to recover.
AI
Proves mathematically that AI text detectors face structural limits that will always result in false positives against diverse student populations.