Breaks Assumption
259 papers · Page 3 of 3
Reveals that spatial reasoning in LLMs is not driven by robust internal world models, but by fragmented and transient representations.
AI & ML arxiv | Mar 30
Identifies that the 'reasoning tax' in vision-language fine-tuning is caused by lost access to depth-wise representations and fixes it with a lightweight adapter.
AI & ML arxiv | Mar 30
Reveals that reasoning models frequently acknowledge misleading hints in their 'thinking' tokens but hide that influence in their final visible answers.
AI & ML arxiv | Mar 30
Identifies a structural 'affordance gap' in Vision-Language Models, proving they fail at embodied scene understanding regardless of scale or prompt engineering.
AI & ML arxiv | Mar 30
Proves that weight tying—a standard LLM efficiency trick—biases embeddings toward output prediction and actively harms early-layer input representations.
AI & ML arxiv | Mar 30
Demonstrates that frontier LLMs fail at diagnostic reasoning in safety-critical robotics even when provided with perfect procedural knowledge.
AI & ML arxiv | Mar 31
Reveals a massive 'reasoning gap' in multilingual VLMs, where accuracy drops up to 25% when switching from English to Indian languages.
AI & ML arxiv | Mar 31
Masked Diffusion Language Models (MDLMs) fail at reasoning because they unmask tokens in the wrong order, not because they lack internal logic.
AI & ML arxiv | Mar 31
By pinpointing and flipping internal safety circuits, exposes 'order-gap hallucinations' in which models prioritize conversational compliance over known facts.
AI & ML arxiv | Mar 31
Proves that high scores on visual spatial benchmarks are achieved through token-level search (BFS in prose) rather than genuine visual planning.
AI & ML arxiv | Mar 31
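The "BFS in prose" behavior described above refers to models writing out a breadth-first search step by step in text. As a generic illustration only (this toy grid search is not the paper's code), BFS over a small maze looks like:

```python
from collections import deque

def bfs_path(grid, start, goal):
    """Breadth-first search on a 4-connected grid; '#' cells are blocked."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([start])
    parent = {start: None}  # also serves as the visited set
    while queue:
        cell = queue.popleft()
        if cell == goal:
            # Walk parents back to the start to reconstruct the path.
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if (0 <= nxt[0] < rows and 0 <= nxt[1] < cols
                    and grid[nxt[0]][nxt[1]] != '#' and nxt not in parent):
                parent[nxt] = cell
                queue.append(nxt)
    return None

grid = ["..#",
        ".#.",
        "..."]
path = bfs_path(grid, (0, 0), (2, 2))
print(len(path))  # 5: the shortest route has 5 cells
```

The paper's point is that enumerating frontier states like this token by token is search, not the visual planning the benchmarks claim to measure.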
Mathematically proves that multi-agent planning workflows are decision-theoretically dominated by a centralized Bayes decision maker, setting fundamental limits on agentic emergent behavior.
AI & ML arxiv | Mar 31
Provides a formal proof that any semantic memory system (including RAG and vector retrieval) is mathematically guaranteed to suffer from interference and forgetting.
AI & ML arxiv | Mar 31
Identifies that the distinct 'AI prose style' (specifically em dash overuse) is a surviving artifact of markdown-saturated training data leaking into unstructured output.
AI & ML arxiv | Mar 31
Systematically demonstrates that 'easy-to-hard' curriculum learning provides no benefit for LLM deductive reasoning tasks.
AI & ML arxiv | Mar 31
Reveals that the tight architectural coupling of image generation and understanding in unified models creates a new class of reciprocal safety vulnerabilities.
AI & ML arxiv | Mar 31
Harmful intent in LLMs can be detected geometrically even after safety 'refusal' mechanisms have been surgically removed.
AI & ML arxiv | Mar 31
For LLM-driven optimization, complex meta-heuristics like simulated annealing are unnecessary; simple greedy hill climbing is a superior default.
AI & ML arxiv | Mar 31
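For readers unfamiliar with the baseline the summary endorses, greedy hill climbing is simply: try neighbors, keep any strict improvement, stop at a local optimum. A minimal sketch on a toy objective (illustrative only, not taken from the paper):

```python
import random

def hill_climb(score, state, neighbors, max_iters=1000):
    """Greedy hill climbing: move to a neighbor only if it strictly improves."""
    best = score(state)
    for _ in range(max_iters):
        improved = False
        for cand in neighbors(state):
            s = score(cand)
            if s > best:
                state, best, improved = cand, s, True
                break  # take the first improving move (greedy)
        if not improved:
            break  # local optimum reached
    return state, best

# Toy objective: maximize the number of 1-bits in a bitstring.
def onemax(bits):
    return sum(bits)

def flip_one(bits):
    # Neighbors differ by exactly one flipped bit.
    for i in range(len(bits)):
        yield bits[:i] + [1 - bits[i]] + bits[i + 1:]

random.seed(0)
start = [random.randint(0, 1) for _ in range(16)]
final, value = hill_climb(onemax, start, flip_one)
print(value)  # 16: onemax has no local optima, so greedy reaches the maximum
```

Simulated annealing adds a temperature schedule and probabilistic acceptance of worse moves on top of this loop; the finding is that in LLM-driven optimization that extra machinery does not pay off.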
Mechanistic analysis reveals that over-refusal and harmful-intent refusal in LLMs occupy distinct representation subspaces.
AI & ML arxiv | Mar 31
PRBench reveals that current top-tier coding agents have a 0% success rate in end-to-end physics paper reproduction.
AI & ML arxiv | Mar 31
Identifies emergent social risks in multi-agent systems, such as spontaneous collusion and conformity, that arise even when agents are not explicitly instructed to collude or conform.
AI & ML arxiv | Mar 31
A rigorous analysis of the AIMO 3 math competition reveals that raw model capability dominates inference-time prompt optimization by an order of magnitude.
AI & ML arxiv | Mar 31
This study challenges the common 'best practice' of atomic decomposition for LLM judges, showing that holistic evaluation is often superior at detecting incompleteness.
AI & ML arxiv | Mar 31
An autonomous agent reveals that domain-specific molecular architectures are largely unnecessary; standard transformers with better tuning outperform custom designs.
AI & ML arxiv | Mar 31
Exposes a massive robustness gap in Vision-Language-Action (VLA) models, where simple paraphrasing causes up to 50% success drops.
AI & ML arxiv | Mar 31
The 'Scaffold Effect' reveals that Vision-Language Models in clinical settings often fabricate reasoning based on prompt framing rather than actual visual data.
AI & ML arxiv | Mar 31
LACE enables continual learning models to automatically expand their own capacity by monitoring loss signals during training.
AI & ML arxiv | Mar 31
Sparse Autoencoders (SAEs) fail at compositional generalization due to flawed dictionary learning, not the inference method.
AI & ML arxiv | Mar 31
Challenges a core constraint in statistical learning theory by proving that optimal $\sqrt{N}$ convergence is achievable for offline policy learning even with model classes that exceed the standard Donsker complexity limit.
AI & ML arxiv | Mar 31
Large-scale experiments reveal that self-organizing LLM agents spontaneously outperform manually designed hierarchical structures by 14%.
AI & ML arxiv | Apr 1
Reveals that parallel translated data is surprisingly unnecessary for creating aligned multilingual representations in LLMs.
AI & ML arxiv | Apr 1
Discovers that pretraining Implicit Neural Representations (INRs) on structured $1/f^\alpha$ noise performs as well as data-driven initialization.
AI & ML arxiv | Apr 1
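Structured $1/f^\alpha$ noise means a signal whose power at frequency $f$ falls off as $1/f^\alpha$. One common way to synthesize it, sketched here as a stdlib-only toy (amplitude $\propto f^{-\alpha/2}$ with random phases; the paper's actual pretraining pipeline is not shown), is:

```python
import math
import random

def one_over_f_noise(n, alpha=1.0, n_freqs=64, seed=0):
    """Synthesize 1/f^alpha noise by summing sinusoids whose amplitude
    scales as f^(-alpha/2), so power scales as 1/f^alpha."""
    rng = random.Random(seed)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_freqs)]
    out = []
    for t in range(n):
        x = 0.0
        for k in range(1, n_freqs + 1):  # k indexes the harmonic frequency
            amp = k ** (-alpha / 2.0)
            x += amp * math.sin(2.0 * math.pi * k * t / n + phases[k - 1])
        out.append(x)
    return out

sig = one_over_f_noise(256, alpha=1.0)
print(len(sig))  # 256 samples of pink-like noise
```

Setting alpha=0 gives flat (white-like) spectra and alpha=2 gives Brownian-like spectra; the claim is that pretraining INRs on such synthetic signals matches data-driven initialization.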
Demonstrates that integer multiplication is not a long-range dependency problem, and that current architectures like Transformers and Mamba are fundamentally using the wrong 'computational spacetime.'
AI & ML arxiv | Apr 1
Demonstrates that the 'modality gap' in CLIP-style models is a feature that can be exploited to increase robustness without retraining.
AI & ML arxiv | Apr 1
Challenges the assumption that architecture and loss are the primary levers for neural simulators by proving the 'carried state' design is the dominant bottleneck.
AI & ML arxiv | Apr 1
Reveals that many massive LLM benchmarks provide highly redundant information, with major leaderboards often containing only ~2 independent axes of measurement.
AI & ML arxiv | Apr 1
Uses token-level perplexity analysis to prove that LLMs rely on simple heuristics rather than the linguistic reasoning they appear to exhibit on standard benchmarks.
AI & ML arxiv | Apr 1
Demonstrates that most 'failures' of AI agents on data engineering benchmarks are actually due to flawed ground-truth and rigid evaluation scripts rather than model inability.
AI & ML arxiv | Apr 1
Mathematical proof that cosine similarity between label representations (unembeddings) in softmax classifiers is fundamentally uninformative.
AI & ML arxiv | Apr 1
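One concrete reason such cosine similarities can be uninformative: softmax probabilities are invariant to adding the same constant vector to every unembedding (it shifts all logits equally), yet that shift arbitrarily changes pairwise cosines. A self-contained toy demonstration (not the paper's full proof):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def dotp(u, v):
    return sum(a * b for a, b in zip(u, v))

# Two label unembeddings and a hidden state (toy values).
w1, w2 = [1.0, 0.0], [0.0, 1.0]
h = [0.3, 0.7]

# Shift both unembeddings by the same vector c: every logit gains the
# same constant c.h, so the softmax output is unchanged.
c = [5.0, 5.0]
w1s = [a + b for a, b in zip(w1, c)]
w2s = [a + b for a, b in zip(w2, c)]

p_before = softmax([dotp(w1, h), dotp(w2, h)])
p_after = softmax([dotp(w1s, h), dotp(w2s, h)])

print([round(x, 6) for x in p_before] == [round(x, 6) for x in p_after])  # True
print(round(cosine(w1, w2), 3), round(cosine(w1s, w2s), 3))  # 0.0 0.984
```

The model's observable behavior is identical under the shift, while the cosine between the two labels jumps from 0.0 to 0.984, so the cosine value by itself cannot reflect anything the classifier actually uses.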
A debunking of the idea that single-vector embedding failures are primarily due to low dimensionality.
AI & ML arxiv | Apr 1
A diagnostic revealing that over 50% of video understanding benchmark samples can be solved without any video or temporal context.
AI & ML arxiv | Apr 1
Introduces the 'near-miss' metric to detect latent failures in agentic workflows where agents bypass policy checks but reach correct outcomes by chance.
AI & ML arxiv | Apr 1
A training-free attack that removes diffusion-based watermarks with nearly 100% success by deflecting the generative trajectory.
AI & ML arxiv | Apr 1
Proves that complex GraphRAG systems can be simplified into a more efficient 'UnWeaver' framework that achieves the same benefits using entity-based decomposition and standard VectorRAG.
AI & ML arxiv | Apr 1
Identifies the specific conditions under which Reinforcement Learning causes LLMs to 'lie' or hide reasoning in their Chain-of-Thought (CoT).
AI & ML arxiv | Apr 1
Discovers that post-training reasoning models mask rather than delete safety mechanisms, allowing their restoration with lightweight adapters.
AI & ML arxiv | Apr 2
Proves that 'inverse scaling' on many benchmarks is a prompt-dependent artifact caused by verbosity, which can be reversed by forcing brevity.
AI & ML arxiv | Apr 2
Mathematically and empirically proves that classifier-based safety gates are fundamentally incapable of monitoring self-improving AI.
AI & ML arxiv | Apr 2
Masked Image Modeling (MIM) representations are fundamentally polluted with non-semantic noise, which can be fixed with a zero-cost post-hoc linear projection.
AI & ML arxiv | Apr 2
Standard alignment metrics like CKA and RSA systematically fail when comparing networks in superposition, often leading to false conclusions about model similarity.
AI & ML arxiv | Apr 2
Self-reflective prompting (self-correction) fails to improve accuracy in safety-critical medical QA, frequently introducing new errors rather than fixing old ones.
AI & ML arxiv | Apr 2
The 'modality gap' in Vision-Language Models is composed of two distinct geometric components, and the commonly used 'raw gap' is a misleading metric for cross-modal quality.
AI & ML arxiv | Apr 2
Foundational deep networks consistently assign higher density to simpler images, regardless of training data or architecture complexity.
AI & ML arxiv | Apr 2
Reveals that many 'polysemantic' neurons in LLMs are actually firing for shared word forms (lexical) rather than compressed semantic concepts.
AI & ML arxiv | Apr 2
Discovers 'Quality Corruption,' an adversarial failure mode where accuracy collapses while detection counts remain stable, proving robustness is substrate-dependent.
AI & ML arxiv | Apr 2
Provides the first controlled study of Silent Data Corruption (SDC) in GPUs and its catastrophic impact on LLM pretraining stability.
AI & ML arxiv | Apr 2
Mechanistic analysis reveals that LLMs fail at character counting not because they lack the information, but because 'negative circuits' in the final layers actively suppress the correct answer.
AI & ML arxiv | Apr 2
Reveals a 'Reasoning Shift' where increased context length silently causes models to skip self-verification and shorten their reasoning traces by up to 50%.
AI & ML arxiv | Apr 2
Provides causal evidence that reasoning models often decide on an action (like a tool call) before they even start generating their 'Chain-of-Thought'.
AI & ML arxiv | Apr 2
Provides a theoretical explanation for why Transformers often fail compared to linear models in financial time series forecasting.
AI & ML arxiv | Apr 2