Scaling Insight
101 papers
Speculative Decoding Scaling Laws (SDSL) provides a theoretical framework to predict throughput-optimal hyperparameters for LLM inference systems before pre-training.
AI & ML arxiv | Mar 13
Cyber-attack capabilities of AI models scale log-linearly with inference-time compute, with no plateau in sight.
AI & ML arxiv | Mar 13
Adversarial prompt injection causes jailbreak success rates to transition from polynomial to exponential scaling with inference-time samples.
AI & ML arxiv | Mar 13
Applying Rotary Positional Embeddings (RoPE) to only 10% of hidden dimensions is sufficient for full model convergence, enabling 10x memory savings in positional caches.
AI & ML arxiv | Mar 13
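The partial-RoPE recipe above is concrete enough to sketch: rotate only the leading fraction of each feature vector and pass the rest through, so the sin/cos cache covers only that slice. A minimal numpy sketch, assuming the rotated dims are the first 10% (function names and the slicing choice are illustrative, not the paper's code):

```python
import numpy as np

def partial_rope(x, positions, rope_frac=0.10, base=10000.0):
    """Apply rotary embeddings to only the first `rope_frac` of feature dims.

    x: (seq_len, dim) activations. The remaining dims pass through unrotated,
    so the positional sin/cos cache only needs rope_frac * dim entries.
    """
    dim = x.shape[-1]
    rot_dim = int(dim * rope_frac) // 2 * 2   # rotated dims come in pairs
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]

    # Standard RoPE frequencies, but only over the rotated slice.
    inv_freq = 1.0 / (base ** (np.arange(0, rot_dim, 2) / rot_dim))
    angles = np.outer(positions, inv_freq)    # (seq_len, rot_dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)

    x1, x2 = x_rot[:, 0::2], x_rot[:, 1::2]
    rotated = np.empty_like(x_rot)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return np.concatenate([rotated, x_pass], axis=-1)

# e.g. partial_rope(np.random.randn(16, 64), np.arange(16)) rotates 6 of 64 dims
```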
Provides a learning-theoretic characterization of model collapse, proving exactly when replaying past outputs destroys model diversity.
AI & ML arxiv | Mar 13
Exhaustive circuit mapping of a biological foundation model reveals massive redundancy and annotation bias.
AI & ML arxiv | Mar 13
Establishes scaling laws for sampling compute in LLM Reinforcement Learning, providing a playbook for optimal parallel rollout and batch allocation.
AI & ML arxiv | Mar 13
Discovers that as LLMs scale, their complex non-linear depth dynamics converge into accurate, low-order linear surrogates.
AI & ML arxiv | Mar 16
Longitudinal evidence reveals that successive ChatGPT versions are converging in output diversity, suggesting potential model collapse from synthetic data saturation.
AI & ML arxiv | Mar 16
Adversarial test case evolution improves code reinforcement learning by creating harder, more discriminative verification signals that drive better model performance.
AI & ML arxiv | Mar 16
Proves the existence of a 'distributional simplicity bias' in diffusion models, where low-order statistics are learned linearly while high-order correlations require cubic sample complexity.
AI & ML arxiv | Mar 16
Factual selection in LLMs is driven by rotational dynamics on a hypersphere rather than scalar magnitude shifts, with the behavior emerging suddenly at the 1.6B parameter mark.
AI & ML arxiv | Mar 17
Grokking arises from a norm-driven representational phase transition with a predictable scaling law.
AI & ML arxiv | Mar 17
Challenges the monotonic 'bigger is better' scaling paradigm by proving that institutional fitness peaks at an environment-dependent scale.
AI & ML arxiv | Mar 17
Proposes spectral clipping to stabilize LLM training by addressing 'spectral spikes' in stochastic gradient noise that adaptive optimizers like AdamW fail to handle.
AI & ML arxiv | Mar 17
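Read literally, 'spectral clipping' suggests capping the singular values of each gradient matrix before the optimizer consumes it, targeting exactly the spiked directions that coordinate-wise AdamW cannot see. A hedged sketch (the threshold and its placement in the update are assumptions):

```python
import numpy as np

def spectral_clip(grad, max_sv=1.0):
    """Clip the singular values of a gradient matrix at `max_sv`.

    Unlike global norm clipping, this damps only the spiked directions
    of the gradient noise spectrum and leaves the rest untouched.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return (u * np.minimum(s, max_sv)) @ vt
```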
Introduces Matrix-to-Matrix RNNs (M²RNN) with matrix-valued hidden states that outperform hybrid Transformers while using 3x smaller state sizes.
AI & ML arxiv | Mar 17
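The teaser gives the key design choice (a matrix-valued hidden state) but not the recurrence. One plausible form, purely as an assumption to make the idea concrete: project the input to a rank-1 matrix and mix the state from both sides, so a d x d state is maintained with only O(d²) transition parameters.

```python
import numpy as np

def m2rnn_step(H, x, A, B, U, V):
    """One step of a hypothetical matrix-to-matrix recurrence.

    H: (d, d) matrix hidden state; x: (d_in,) input token features.
    A, B: (d, d) left/right state-mixing maps; U, V: (d, d_in) input maps.
    This exact update rule is a guess, not the paper's recurrence.
    """
    return np.tanh(A @ H @ B + np.outer(U @ x, V @ x))
```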
The Infinite Problem Generator (IPG) uses executable code to synthesize and verify 100% accurate physics reasoning data, overcoming LLM hallucination in data scaling.
AI & ML arxiv | Mar 17
Determines the optimal compute distribution for retrieval agents, showing that re-ranking depth is far more critical than query expansion strength.
AI & ML arxiv | Mar 17
Provides the first theoretical proof that dataset distillation efficiently encodes the low-dimensional structure of non-linear tasks.
AI & ML arxiv | Mar 17
Attention Residuals replace fixed-weight residual connections with softmax attention over preceding layers to prevent hidden-state dilution in deep LLMs.
AI & ML arxiv | Mar 17
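The mechanism named above replaces the fixed update h_l = h_{l-1} + f(h_{l-1}) with a softmax mixture over all preceding layers' states. A minimal sketch with learned per-layer logits (the paper's attention may instead be content-based; shapes and names here are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_residual(layer_outputs, scores):
    """Mix ALL preceding layers' hidden states into the input of layer l,
    so early-layer signal is not diluted as depth grows.

    layer_outputs: list of (seq, dim) arrays from layers 0..l-1.
    scores: (l,) learned logits; softmax turns them into mixture weights.
    """
    w = softmax(np.asarray(scores, dtype=float))  # (l,)
    stacked = np.stack(layer_outputs)             # (l, seq, dim)
    return np.tensordot(w, stacked, axes=1)       # (seq, dim)
```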
This paper proves that increasing test-time compute via beam search can actually hurt LLM reasoning performance due to overestimation bias.
AI & ML arxiv | Mar 17
Sparsity (MoE and GQA) is found to act as a critical regulator for variance propagation, mitigating the 'curse of depth' in LLMs.
AI & ML arxiv | Mar 17
A factorial study on EHR foundation models reveals that joint encoding of code-attribute pairs (local binding) is the primary driver of performance and efficiency.
AI & ML arxiv | Mar 18
Spectral Edge Dynamics (SED) provides an early-warning signal for grokking, predicting generalization up to 1,700 steps before it occurs.
AI & ML arxiv | Mar 18
Demonstrates that massive scaling of diverse simulator resets can replace manual curriculum engineering for complex dexterous manipulation.
AI & ML arxiv | Mar 18
Derives closed-form power-law scaling for hyperparameters like learning rate and batch size using modern optimization theory rather than expensive empirical sweeps.
AI & ML arxiv | Mar 18
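The entry above replaces sweeps with closed-form power laws of the form lr(N) = lr0 * (N/N0)^a and B(N) = B0 * (N/N0)^b. A toy transfer rule with placeholder exponents (the paper derives its own constants; everything numeric here is an assumption):

```python
def scaled_hparams(n_params, lr0=3e-4, n0=1e8, lr_exp=-0.25,
                   bs0=256, bs_exp=0.5):
    """Power-law hyperparameter transfer from a reference model of n0
    parameters tuned at (lr0, bs0). Exponents are illustrative only.
    """
    ratio = n_params / n0
    return lr0 * ratio ** lr_exp, int(bs0 * ratio ** bs_exp)

# e.g. scaled_hparams(7e9) -> (~1.04e-4, 2141) for a 7B model
```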
Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.
AI & ML arxiv | Mar 18
The study provides a formal link showing that internal 'world model' representations in transformers are a direct byproduct of the predictive geometry of the training data.
AI & ML arxiv | Mar 18
Shows that 'Mid-Training' on high-quality reasoning data is the primary driver of model capability, whereas RL only succeeds as a sparse refinement step.
AI & ML arxiv | Mar 19
Video fine-tuning consistently degrades static image understanding in multimodal LLMs, revealing a zero-sum trade-off between spatial and temporal capabilities.
AI & ML arxiv | Mar 19
Mechanistic probing reveals a directional asymmetry in how LLMs encode hierarchy: hypernymy is redundant and resilient, while hyponymy is fragile and compact.
AI & ML arxiv | Mar 19
Provides the first theoretical proof that Graph Transformers structurally prevent the 'oversmoothing' failure mode inherent to deep GCNs.
AI & ML arxiv | Mar 19
Extreme neural network sparsification causes a catastrophic interpretability collapse even when global accuracy remains stable.
AI & ML arxiv | Mar 20
This paper provides theoretical proof that autocurriculum, where a model selects its own training problems, requires exponentially fewer reasoning demonstrations.
AI & ML arxiv | Mar 20
The 'Progressive Intensity Hypothesis' establishes that weaker perturbations (pruning) should precede stronger ones (quantization) for optimal joint model compression.
AI & ML arxiv | Mar 20
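The proposed ordering (weak perturbation first, strong second) is easy to express as a two-stage pipeline. A sketch with illustrative thresholds (the sparsity level and bit-width are assumptions):

```python
import numpy as np

def compress(weight, sparsity=0.5, n_bits=8):
    """Progressive-intensity compression: magnitude pruning (weaker)
    before uniform quantization (stronger), per the ordering above.
    """
    # 1. Weak perturbation: zero out the smallest-magnitude weights.
    thresh = np.quantile(np.abs(weight), sparsity)
    pruned = np.where(np.abs(weight) >= thresh, weight, 0.0)
    # 2. Strong perturbation: symmetric uniform quantization of survivors.
    scale = np.abs(pruned).max() / (2 ** (n_bits - 1) - 1)
    return np.round(pruned / scale) * scale
```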
Mechanistic analysis of 'counting circuits' in VLMs allows for lightweight interventions that improve general visual reasoning performance.
AI & ML arxiv | Mar 20
Synthetic data scaling reaches a new level by moving from simple rephrasing to creating 'megadocs' through rationale insertion and stitching.
AI & ML arxiv | Mar 20
Discovers how uncertainty estimation signals like self-consistency and verbalized confidence scale and complement each other in reasoning models.
AI & ML arxiv | Mar 20
Establishes scaling laws to determine the optimal compute split between general pretraining and domain-specific specialization.
AI & ML arxiv | Mar 20
Discovers a multiplicative scaling law governing how LLMs revise their beliefs during iterative reasoning (CoT, reflection).
AI & ML arxiv | Mar 23
A massive controlled study reveals that post-training algorithm rankings (DPO, SimPO, etc.) completely invert as models scale.
AI & ML arxiv | Mar 23
Researchers identify a 'selection bottleneck' that mathematically determines when diverse agent teams outperform homogeneous self-consistency teams.
AI & ML arxiv | Mar 24
This work formalizes why 'human' mathematics is distinct from the space of all valid deductions, using information-theoretic compression measurements on Mathlib.
AI & ML arxiv | Mar 24
Discovers that language-centric training in Multimodal LLMs actively degrades their internal visual representation quality.
AI & ML arxiv | Mar 24
Identifies that in-context reasoning over pretraining knowledge only emerges after specific types of fine-tuning, not from pretraining alone.
AI & ML arxiv | Mar 24
Sensitivity to compression in Transformers spans five orders of magnitude, with early-layer MLP up-projections identified as catastrophic failure points.
AI & ML arxiv | Mar 24
Context-aware Visual Fine-tuning (CoVFT) allows a 7B MLLM to outperform its 13B counterpart by resolving optimization conflicts in vision encoders.
AI & ML arxiv | Mar 24
Introduces 'Mixture of Chapters' to scale Transformer memory to 262K tokens without the quadratic cost of standard attention.
AI & ML arxiv | Mar 24
Restores monotonic scaling in LLM tree search by replacing standard MCTS selection with Gumbel sampling and Sequential Halving.
AI & ML arxiv | Mar 24
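Gumbel root sampling plus Sequential Halving is the selection scheme known from Gumbel MuZero; whether this paper follows it exactly is not stated in the teaser. A generic sketch, with `simulate(a)` standing in for an LLM rollout that returns a value estimate:

```python
import numpy as np

def gumbel_sequential_halving(prior_logits, simulate, budget, top_k=8, seed=0):
    """Pick a root action by Gumbel-Top-k, then halve the candidate set
    each round, spending the simulation budget on the survivors."""
    rng = np.random.default_rng(seed)
    logits = np.asarray(prior_logits, dtype=float)
    gumbel = rng.gumbel(size=len(logits))
    candidates = list(np.argsort(logits + gumbel)[::-1][:top_k])
    values = {a: [] for a in candidates}
    rounds = max(1, int(np.log2(len(candidates))))

    while len(candidates) > 1:
        sims = max(1, budget // (rounds * len(candidates)))
        for a in candidates:
            values[a] += [simulate(a) for _ in range(sims)]
        # Keep the better half by perturbed logit + mean rollout value.
        def score(a):
            return logits[a] + gumbel[a] + np.mean(values[a])
        candidates = sorted(candidates, key=score, reverse=True)[:len(candidates) // 2]
    return candidates[0]
```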
Introduces the Neural Zeroth-order Kernel (NZK) to provide a theoretical foundation for training models without backpropagation.
AI & ML arxiv | Mar 24
Proves that structured retrieval is exponentially more efficient than sequential context scanning for agentic reasoning.
AI & ML arxiv | Mar 24
Discovers 'silent commitment failure,' where some model architectures produce confident, incorrect outputs with zero detectable warning signals before execution.
AI & ML arxiv | Mar 24
Provides a causal explanation for 'embedding collapse' in Transformers, linking it to semantic shift rather than text length alone.
AI & ML arxiv | Mar 24
Depth-Recurrent Transformers decouple computational depth from parameter count, revealing a 'computational frontier' where performance on reasoning tasks snaps from zero to perfect based on iteration steps.
AI & ML arxiv | Mar 24
Identifies structured table data as a primary driver for scaling long-context reasoning in LLMs.
AI & ML arxiv | Mar 24
Introduces a robust framework for optimal Mixture-of-Experts (MoE) architecture design across six orders of magnitude in compute.
AI & ML arxiv | Mar 24
Provides a strictly controlled comparison of autoregressive vs. masked diffusion language models on identical compute budgets.
AI & ML arxiv | Mar 24
Hidden states in LLMs occupy a Riemannian submanifold where tokens correspond to Voronoi regions, revealing a universal 'hourglass' intrinsic-dimension profile across all tested models.
AI & ML arxiv | Mar 25
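Tracing an intrinsic-dimension profile across layers needs a per-layer ID estimator; TwoNN (Facco et al.) is a standard choice, though the entry above does not say which estimator the paper uses. A sketch:

```python
import numpy as np

def two_nn_id(points):
    """TwoNN intrinsic-dimension estimate from the ratio of each point's
    two nearest-neighbor distances: d = N / sum(log(r2 / r1)).

    points: (N, dim) hidden states sampled from one layer; run per layer
    to trace the 'hourglass' profile described above.
    """
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = np.sort(dists, axis=1)[:, :2]   # r1, r2 for every point
    mu = nn[:, 1] / nn[:, 0]
    return len(mu) / np.sum(np.log(mu))
```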
The standard 'Chinchilla Approach 2' for fitting scaling laws is systematically biased, potentially leading to millions of dollars in wasted compute at frontier scales.
AI & ML arxiv | Mar 25
Reveals that RLVR-driven reasoning improvements in LLMs are the result of highly sparse changes to a tiny fraction of 'critical' token distributions.
AI & ML arxiv | Mar 25
The mass of bipedal robots scales with the square of leg length, rather than cubically as in biological systems.
AI & ML arxiv | Mar 25
A quantitative model that predicts the performance gain of merging independent LLM specialists before committing compute.
AI & ML arxiv | Mar 25
Identifies the 'Caterpillar Tree' as the theoretically optimal structure for test-time computation and backtracking in LLMs.
AI & ML arxiv | Mar 25
Persistent structural memory in neural networks is fundamentally limited by the instability of jointly-learned coordinate systems.
AI & ML arxiv | Mar 25
Theoretical analysis reveals that the efficiency benefits of low-dimensional data structures for diffusion models diminish significantly when the data manifold is non-linear.
AI & ML arxiv | Mar 25
Access to conversational memory allows an 8B model to outperform a 235B model on user-specific queries while reducing inference costs by 96%.
AI & ML arxiv | Mar 25
Synthetic Mixed Training allows an 8B model to finally outperform RAG on long-document comprehension by combining synthetic QAs with rewritten documents.
AI & ML arxiv | Mar 26
Newer LLM architectures like MoE and SSMs are making 'early-exit' decoding significantly less effective than in previous generations.
AI & ML arxiv | Mar 26
Diffusion models can be proven to generalize by capturing manifold geometry long before they achieve density estimation or memorization.
AI & ML arxiv | Mar 26
Provides a systematic blueprint for scaling Reinforcement Learning (RL) in LLMs using multi-turn synthetic data generation and difficulty-based curricula.
AI & ML arxiv | Mar 26
Identifies a 'critical threshold' in human-AI symbiosis beyond which human capability collapses abruptly and irreversibly due to over-delegation.
AI & ML arxiv | Mar 26
Reveals that synthetic rewriting is a quality multiplier for high-grade data, but fails to fix low-quality source data.
AI & ML arxiv | Mar 27
A systematic study reveals that grokking is not an architectural property of Transformers but an interaction between weight decay and optimization stability.
AI & ML arxiv | Mar 27
MSRL scales multimodal reward modeling by transferring reasoning capabilities from text to vision-language tasks without requiring new multimodal preference data.
AI & ML arxiv | Mar 27
Uses the Minimum Description Length principle to predict exactly when neural networks will transition from simple 'spurious' shortcuts to complex features.
AI & ML arxiv | Mar 30
A billion-scale time-series benchmark that identifies a 'context-length crossover' beyond which foundation models decisively outperform deep learning baselines.
AI & ML arxiv | Mar 30
Challenges the assumption that 'background' pixels are useless in GUI agents and identifies a 'recency effect' for optimal token pruning.
AI & ML arxiv | Mar 30
An 800 Hz data glove reveals that human hand dexterity contains critical high-frequency motion energy (>100 Hz) previously invisible to standard sensors.
AI & ML arxiv | Mar 30
Provides the first sharp theoretical characterization of why spectral optimizers like Muon drastically outperform SGD in storage capacity and scaling for language models.
AI & ML arxiv | Mar 30
Proves that causal representation learning is possible with far fewer environments and unknown intervention targets than previously assumed.
AI & ML arxiv | Mar 30
Scales multi-agent path finding to 1000 agents with near-linear runtime by decoupling geometric planning from execution-time conflict resolution.
AI & ML arxiv | Mar 31
Synthetic multi-view generation breaks the performance ceiling of single-view robotic datasets.
AI & ML arxiv | Mar 31
Formalizes the 'Observability Gap' to explain why coding agents plateau: humans can only provide feedback on visible outputs, while bugs reside in invisible execution states.
AI & ML arxiv | Mar 31
Provides a high-dimensional theoretical foundation for why two-phase optimizers like DiLoCo are mathematically superior to standard SGD in specific noise regimes.
AI & ML arxiv | Mar 31
Shows that standard task-completion benchmarks fail to distinguish agent capabilities and proposes 'Working Memory Fidelity' as a more predictive metric.
AI & ML arxiv | Mar 31
Mathematical proof that LayerNorm structurally reduces model complexity compared to RMSNorm due to its mean-centering geometry.
AI & ML arxiv | Mar 31
Provides empirical evidence and a mechanistic explanation for why LoRA drastically reduces catastrophic forgetting in sequential fine-tuning compared to full fine-tuning.
AI & ML arxiv | Mar 31
A controlled study proving that the temporal organization (curriculum) of multimodal data is a first-order variable in balancing reasoning vs. OCR capabilities.
AI & ML arxiv | Mar 31
The eigenvalue tail index of a neural network's weight matrices serves as a near-perfect (R^2 = 0.984) diagnostic for label noise in the training data.
AI & ML arxiv | Mar 31
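A heavy-tailed spectrum diagnostic like the one above is typically computed by fitting a power-law exponent to the top eigenvalues of W^T W. A sketch using the Hill estimator (the estimator and tail fraction are standard choices, not necessarily the paper's procedure):

```python
import numpy as np

def tail_index(weight, k_frac=0.1):
    """Hill estimate of the eigenvalue tail index of one weight matrix."""
    eigs = np.sort(np.linalg.eigvalsh(weight.T @ weight))[::-1]
    k = max(2, int(len(eigs) * k_frac))          # top-k tail only
    tail = eigs[:k]
    return 1.0 / np.mean(np.log(tail[:-1] / tail[-1]))
```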
Discovers that LLM hidden states undergo geometric 'warping' at digit-count boundaries, mimicking human psychological perception.
AI & ML arxiv | Mar 31
This paper establishes the formal information-theoretic limits and conditions under which self-improving AI systems can be safely verified.
AI & ML arxiv | Mar 31
HyperP provides the first hyperparameter transfer laws for hypersphere optimization, ensuring stable scaling for models using the Muon optimizer.
AI & ML arxiv | Mar 31
Identifies a 'dual-capability bottleneck' where low-rated training data is essential for state tracking while high-rated data is needed for decision quality.
AI & ML arxiv | Apr 1
Provides a computationally efficient 'early warning' system for emergent capabilities like grokking and induction head formation using 2-datapoint reduced density matrices.
AI & ML arxiv | Apr 1
Identifies 'label leakage' from limited task diversity as the primary bottleneck for relational foundation models, rather than raw data volume.
AI & ML arxiv | Apr 1
Discovers that video diffusion models commit to high-level plans in the first few denoising steps, enabling a new inference-time scaling technique called ChEaP.
AI & ML arxiv | Apr 1
Neural collapse is triggered by a predictable 'feature-norm threshold' (fn*) that is invariant to training conditions, serving as a new diagnostic for training progress.
AI & ML arxiv | Apr 2
Gradient-based data valuation (TracIn) outperforms all human-crafted metadata heuristics for ordering curriculum learning in motion planners.
AI & ML arxiv | Apr 2
Demonstrates that LLM judge panels follow power-law discovery curves, where panel size and persona diversity are critical for uncovering edge-case failures.
AI & ML arxiv | Apr 2
Establishes a three-dimensional scaling law for RAG-pretraining, modeling the optimal data budget allocation between model parameters, tokens, and retrieval store size.
AI & ML arxiv | Apr 2
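A Chinchilla-style ansatz extended with a retrieval-store term is the natural functional form for a three-dimensional law like the one above; every coefficient below is a placeholder, not a fitted value from the paper:

```python
def rag_loss(n_params, n_tokens, n_store, e=1.7, a=400.0, b=400.0, c=40.0,
             alpha=0.34, beta=0.28, gamma=0.10):
    """Hypothetical L(N, D, R) = E + a/N^alpha + b/D^beta + c/R^gamma,
    adding a retrieval-store size R to the usual parameter/token terms.
    """
    return e + a / n_params**alpha + b / n_tokens**beta + c / n_store**gamma
```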