Breaks Assumption

259 papers · Page 3 of 3

Reveals that spatial reasoning in LLMs is not driven by robust internal world models, but by fragmented and transient representations.

AI & ML arxiv | Mar 30

Identifies that the 'reasoning tax' in vision-language fine-tuning is caused by lost access to depth-wise representations and fixes it with a lightweight adapter.

AI & ML arxiv | Mar 30

Reveals that reasoning models frequently acknowledge misleading hints in their 'thinking' tokens but hide that influence in their final visible answers.

AI & ML arxiv | Mar 30

Identifies a structural 'affordance gap' in Vision-Language Models, proving they fail at embodied scene understanding regardless of scale or prompt engineering.

AI & ML arxiv | Mar 30

Proves that weight tying—a standard LLM efficiency trick—biases embeddings toward output prediction and actively harms early-layer input representations.

AI & ML arxiv | Mar 30

Demonstrates that frontier LLMs fail at diagnostic reasoning in safety-critical robotics even when provided with perfect procedural knowledge.

AI & ML arxiv | Mar 31

Reveals a massive 'reasoning gap' in multilingual VLMs, where accuracy drops up to 25% when switching from English to Indian languages.

AI & ML arxiv | Mar 31

Masked Diffusion Language Models (MDLMs) fail at reasoning because they unmask tokens in the wrong order, not because they lack internal logic.

AI & ML arxiv | Mar 31

Exposes 'order-gap hallucinations', in which models prioritize conversational compliance over known facts, by pinpointing and flipping internal safety circuits.

AI & ML arxiv | Mar 31

Proves that high scores on visual spatial benchmarks are achieved through token-level search (BFS in prose) rather than genuine visual planning.

AI & ML arxiv | Mar 31

Mathematically proves that multi-agent planning workflows are decision-theoretically dominated by a centralized Bayes decision maker, setting fundamental limits on agentic emergent behavior.

AI & ML arxiv | Mar 31

Provides a formal proof that any semantic memory system (including RAG and vector retrieval) is mathematically guaranteed to suffer from interference and forgetting.

AI & ML arxiv | Mar 31

Identifies that the distinct 'AI prose style' (specifically em dash overuse) is a surviving artifact of markdown-saturated training data leaking into unstructured output.

AI & ML arxiv | Mar 31

Systematically demonstrates that 'easy-to-hard' curriculum learning provides no benefit for LLM deductive reasoning tasks.

AI & ML arxiv | Mar 31

Reveals that the tight architectural coupling of image generation and understanding in unified models creates a new class of reciprocal safety vulnerabilities.

AI & ML arxiv | Mar 31

Harmful intent in LLMs can be detected geometrically even after safety 'refusal' mechanisms have been surgically removed.

AI & ML arxiv | Mar 31

For LLM-driven optimization, complex meta-heuristics like simulated annealing are unnecessary; simple greedy hill climbing is a superior default.

AI & ML arxiv | Mar 31
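The greedy baseline that this entry argues for can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `score` and `neighbors` stand in for whatever fitness function and candidate-mutation operator (e.g. LLM-judged quality and LLM-proposed edits) the optimization loop uses.

```python
import random

def hill_climb(start, score, neighbors, steps=200, seed=0):
    """Greedy hill climbing: accept a neighbor only if it strictly improves.
    No temperature schedule and no downhill moves, unlike simulated annealing."""
    random.seed(seed)
    best, best_s = start, score(start)
    for _ in range(steps):
        cand = random.choice(neighbors(best))
        s = score(cand)
        if s > best_s:  # greedy acceptance rule
            best, best_s = cand, s
    return best, best_s

# Toy usage: maximize -(x - 3)^2 over the integers, starting at 0.
best, best_s = hill_climb(0,
                          score=lambda x: -(x - 3) ** 2,
                          neighbors=lambda x: [x - 1, x + 1])
```

The whole meta-heuristic design space collapses to one question here: which neighbor do you accept? Greedy answers "only a better one", which is the default this paper recommends.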

Mechanistic analysis reveals that over-refusal and harmful-intent refusal in LLMs occupy distinct representation subspaces.

AI & ML arxiv | Mar 31

PRBench reveals that current top-tier coding agents have a 0% success rate in end-to-end physics paper reproduction.

AI & ML arxiv | Mar 31

Identifies emergent social risks in multi-agent systems, such as spontaneous collusion and conformity, that arise even when agents are given no explicit instruction to behave that way.

AI & ML arxiv | Mar 31

A rigorous analysis of the AIMO 3 math competition reveals that raw model capability dominates inference-time prompt optimization by an order of magnitude.

AI & ML arxiv | Mar 31

This study challenges the common 'best practice' of atomic decomposition for LLM judges, showing that holistic evaluation is often superior at detecting incompleteness.

AI & ML arxiv | Mar 31

An autonomous agent reveals that domain-specific molecular architectures are largely unnecessary; standard transformers with better tuning outperform custom designs.

AI & ML arxiv | Mar 31

Exposes a massive robustness gap in Vision-Language-Action (VLA) models, where simple paraphrasing causes up to 50% success drops.

AI & ML arxiv | Mar 31

The 'Scaffold Effect' reveals that Vision-Language Models in clinical settings often fabricate reasoning based on prompt framing rather than actual visual data.

AI & ML arxiv | Mar 31

LACE enables continual learning models to automatically expand their own capacity by monitoring loss signals during training.

AI & ML arxiv | Mar 31

Sparse Autoencoders (SAEs) fail at compositional generalization due to flawed dictionary learning, not the inference method.

AI & ML arxiv | Mar 31

Challenges a core constraint in statistical learning theory by proving that optimal $\sqrt{N}$ convergence is achievable for offline policy learning even with model classes that exceed the standard Donsker complexity limit.

AI & ML arxiv | Mar 31

Large-scale experiments reveal that self-organizing LLM agents spontaneously outperform manually designed hierarchical structures by 14%.

AI & ML arxiv | Apr 1

Reveals that parallel translated data is surprisingly unnecessary for creating aligned multilingual representations in LLMs.

AI & ML arxiv | Apr 1

Discovers that pretraining Implicit Neural Representations (INRs) on structured $1/f^\alpha$ noise performs as well as data-driven initialization.

AI & ML arxiv | Apr 1
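Structured $1/f^\alpha$ noise of the kind this entry refers to is easy to sample by shaping white noise in the Fourier domain. The paper's exact pretraining recipe is not shown here; this sketch only generates the noise signal itself, with the spectral exponent `alpha` as the single knob.

```python
import numpy as np

def power_law_noise(n, alpha, rng):
    """1-D signal whose power spectrum falls off as 1/f^alpha.
    Amplitude scales as f^(-alpha/2), so power scales as f^(-alpha)."""
    freqs = np.fft.rfftfreq(n)
    amp = np.zeros_like(freqs)
    amp[1:] = freqs[1:] ** (-alpha / 2)  # skip the DC bin to avoid 1/0
    spectrum = amp * (rng.normal(size=freqs.size)
                      + 1j * rng.normal(size=freqs.size))
    x = np.fft.irfft(spectrum, n=n)
    return x / x.std()  # normalize to unit variance

rng = np.random.default_rng(0)
x = power_law_noise(4096, alpha=2.0, rng=rng)  # alpha=2: Brownian-like texture
```

`alpha=0` recovers white noise, `alpha=1` pink noise, `alpha=2` Brownian-like signals; the claim is that pretraining INRs on such synthetic signals matches data-driven initialization.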

Demonstrates that integer multiplication is not a long-range dependency problem, and that current architectures like Transformers and Mamba are fundamentally using the wrong 'computational spacetime.'

AI & ML arxiv | Apr 1

Demonstrates that the 'modality gap' in CLIP-style models is a feature that can be exploited to increase robustness without retraining.

AI & ML arxiv | Apr 1

Challenges the assumption that architecture and loss are the primary levers for neural simulators by proving the 'carried state' design is the dominant bottleneck.

AI & ML arxiv | Apr 1

Reveals that many massive LLM benchmarks provide highly redundant information, with major leaderboards often containing only ~2 independent axes of measurement.

AI & ML arxiv | Apr 1

Uses token-level perplexity analysis to prove that LLMs rely on simple heuristics rather than the linguistic reasoning they appear to exhibit on standard benchmarks.

AI & ML arxiv | Apr 1

Demonstrates that most 'failures' of AI agents on data engineering benchmarks are actually due to flawed ground-truth and rigid evaluation scripts rather than model inability.

AI & ML arxiv | Apr 1

Mathematical proof that cosine similarity between label representations (unembeddings) in softmax classifiers is fundamentally uninformative.

AI & ML arxiv | Apr 1
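One concrete reason cosine similarity between unembeddings is uninformative, illustrated below under a simplifying assumption (a plain linear-softmax head): shifting every label vector by the same constant direction leaves the classifier's predictions exactly unchanged, because the common shift adds the same amount to every logit, yet it changes the pairwise cosine similarities arbitrarily.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 8))   # unembedding: 5 labels, hidden width 8
h = rng.normal(size=8)        # one hidden state

c = rng.normal(size=8)        # arbitrary common shift direction
W_shift = W + c               # broadcasting adds c to every label row

p = softmax(W @ h)
p_shift = softmax(W_shift @ h)   # identical: every logit moved by c @ h

same_preds = np.allclose(p, p_shift)
cos_before = cos(W[0], W[1])
cos_after = cos(W_shift[0], W_shift[1])
```

Since `W` and `W_shift` implement the same classifier but disagree on cosine similarity, no property of the learned function pins those similarities down.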

A debunking of the idea that single-vector embedding failures are primarily due to low dimensionality.

AI & ML arxiv | Apr 1

A diagnostic revealing that over 50% of video understanding benchmark samples can be solved without any video or temporal context.

AI & ML arxiv | Apr 1

Introduces the 'near-miss' metric to detect latent failures in agentic workflows where agents bypass policy checks but reach correct outcomes by chance.

AI & ML arxiv | Apr 1

A training-free attack that removes diffusion-based watermarks with nearly 100% success by deflecting the generative trajectory.

AI & ML arxiv | Apr 1

Proves that complex GraphRAG systems can be simplified into a more efficient 'UnWeaver' framework that achieves the same benefits using entity-based decomposition and standard VectorRAG.

AI & ML arxiv | Apr 1

Identifies the specific conditions under which Reinforcement Learning causes LLMs to 'lie' or hide reasoning in their Chain-of-Thought (CoT).

AI & ML arxiv | Apr 1

Discovers that post-training reasoning models mask rather than delete safety mechanisms, allowing their restoration with lightweight adapters.

AI & ML arxiv | Apr 2

Proves that 'inverse scaling' on many benchmarks is a prompt-dependent artifact caused by verbosity, which can be reversed by forcing brevity.

AI & ML arxiv | Apr 2

Mathematically and empirically proves that classifier-based safety gates are fundamentally incapable of monitoring self-improving AI.

AI & ML arxiv | Apr 2

Masked Image Modeling (MIM) representations are fundamentally polluted with non-semantic noise, which can be fixed with a zero-cost post-hoc linear projection.

AI & ML arxiv | Apr 2

Standard alignment metrics like CKA and RSA systematically fail when comparing networks in superposition, often leading to false conclusions about model similarity.

AI & ML arxiv | Apr 2

Self-reflective prompting (self-correction) fails to improve accuracy in safety-critical medical QA, frequently introducing new errors rather than fixing old ones.

AI & ML arxiv | Apr 2

The 'modality gap' in Vision-Language Models is composed of two distinct geometric components, and the commonly used 'raw gap' is a misleading metric for cross-modal quality.

AI & ML arxiv | Apr 2

Foundational deep networks consistently assign higher density to simpler images, regardless of training data or architecture complexity.

AI & ML arxiv | Apr 2

Reveals that many 'polysemantic' neurons in LLMs are actually firing for shared word forms (lexical) rather than compressed semantic concepts.

AI & ML arxiv | Apr 2

Discovers 'Quality Corruption,' an adversarial failure mode where accuracy collapses while detection counts remain stable, proving robustness is substrate-dependent.

AI & ML arxiv | Apr 2

Provides the first controlled study of Silent Data Corruption (SDC) in GPUs and its catastrophic impact on LLM pretraining stability.

AI & ML arxiv | Apr 2

Mechanistic analysis reveals that LLMs fail at character counting not because they lack the information, but because 'negative circuits' in the final layers actively suppress the correct answer.

AI & ML arxiv | Apr 2

Reveals a 'Reasoning Shift' where increased context length silently causes models to skip self-verification and shorten their reasoning traces by up to 50%.

AI & ML arxiv | Apr 2

Provides causal evidence that reasoning models often decide on an action (like a tool call) before they even start generating their 'Chain-of-Thought'.

AI & ML arxiv | Apr 2

Provides a theoretical explanation for why Transformers often fail compared to linear models in financial time series forecasting.

AI & ML arxiv | Apr 2