Scaling Insight

101 papers · Page 1 of 2

Speculative Decoding Scaling Laws (SDSL) provides a theoretical framework to predict optimal throughput hyperparameters for LLM inference systems before pre-training.

AI & ML arxiv | Mar 13

Cyber-attack capabilities of AI models scale log-linearly with inference-time compute, with no plateau in sight.

AI & ML arxiv | Mar 13

Adversarial prompt injection causes jailbreak success rates to transition from polynomial to exponential scaling with inference-time samples.

AI & ML arxiv | Mar 13

Applying Rotary Positional Embeddings (RoPE) to only 10% of hidden dimensions is sufficient for full model convergence, enabling 10x memory savings in positional caches.

AI & ML arxiv | Mar 13
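
A minimal sketch of the idea of rotating only a small fraction of hidden dimensions, with the rest passing through untouched (so a positional cache need only cover the rotary slice). The function name and the 10% default are illustrative, not the paper's implementation:

```python
import numpy as np

def partial_rope(x, positions, rotary_frac=0.1, base=10000.0):
    """Apply rotary embeddings to only the first rotary_frac of dims.

    x: (seq_len, dim) activations; positions: (seq_len,) token indices.
    The remaining dims pass through unrotated, so the positional cache
    only needs to store rotary_frac * dim dimensions per position.
    """
    seq_len, dim = x.shape
    rot_dim = int(dim * rotary_frac) // 2 * 2   # even count of rotary dims
    x_rot, x_pass = x[:, :rot_dim], x[:, rot_dim:]
    half = rot_dim // 2
    inv_freq = base ** (-np.arange(half) / half)       # (half,)
    angles = positions[:, None] * inv_freq[None, :]    # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x_rot[:, :half], x_rot[:, half:]
    rotated = np.concatenate([x1 * cos - x2 * sin,
                              x1 * sin + x2 * cos], axis=-1)
    return np.concatenate([rotated, x_pass], axis=-1)
```

At position 0 all rotation angles are zero, so the output equals the input; non-rotary dimensions are identical at every position.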

Provides a learning-theoretic characterization of model collapse, proving exactly when replaying past outputs destroys model diversity.

AI & ML arxiv | Mar 13

Exhaustive circuit mapping of a biological foundation model reveals massive redundancy and annotation bias.

AI & ML arxiv | Mar 13

Establishes scaling laws for sampling compute in LLM Reinforcement Learning, providing a playbook for optimal parallel rollout and batch allocation.

AI & ML arxiv | Mar 13

Discovers that as LLMs scale, their complex non-linear depth dynamics converge into accurate, low-order linear surrogates.

AI & ML arxiv | Mar 16

Longitudinal evidence reveals that successive ChatGPT versions produce increasingly similar, less diverse outputs, suggesting potential model collapse from synthetic data saturation.

AI & ML arxiv | Mar 16

Adversarial test case evolution improves code reinforcement learning by creating harder, more discriminative verification signals that drive better model performance.

AI & ML arxiv | Mar 16

Proves the existence of a 'distributional simplicity bias' in diffusion models, where low-order statistics are learned linearly while high-order correlations require cubic sample complexity.

AI & ML arxiv | Mar 16

Factual selection in LLMs is driven by rotational dynamics on a hypersphere rather than scalar magnitude shifts, with the behavior emerging suddenly at the 1.6B parameter mark.

AI & ML arxiv | Mar 17

Grokking is traced to a norm-driven representational phase transition with a predictable scaling law.

AI & ML arxiv | Mar 17

Challenges the monotonic 'bigger is better' scaling paradigm by proving that institutional fitness peaks at an environment-dependent scale.

AI & ML arxiv | Mar 17

Proposes spectral clipping to stabilize LLM training by addressing 'spectral spikes' in stochastic gradient noise that adaptive optimizers like AdamW fail to handle.

AI & ML arxiv | Mar 17
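
One way to clip a gradient's spectrum, sketched here via SVD (the paper's exact procedure may differ): cap each singular value, which bounds the update's spectral norm and damps the rare spikes that coordinate-wise optimizers like AdamW do not see.

```python
import numpy as np

def spectral_clip(grad, max_sv=1.0):
    """Clip the singular values of a gradient matrix at max_sv.

    Bounds the spectral norm of the update; directions whose singular
    values are already below max_sv pass through unchanged.
    """
    u, s, vt = np.linalg.svd(grad, full_matrices=False)
    return (u * np.minimum(s, max_sv)) @ vt
```

A full SVD per step is expensive at scale; practical variants typically approximate the top of the spectrum rather than decompose every gradient exactly.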

Introduces Matrix-to-Matrix RNNs (M²RNN) with matrix-valued hidden states that outperform hybrid Transformers while using 3x smaller state sizes.

AI & ML arxiv | Mar 17

The Infinite Problem Generator (IPG) uses executable code to synthesize and verify 100% accurate physics reasoning data, overcoming LLM hallucination in data scaling.

AI & ML arxiv | Mar 17

Determines the optimal compute distribution for retrieval agents, showing that re-ranking depth is far more critical than query expansion strength.

AI & ML arxiv | Mar 17

Provides the first theoretical proof that dataset distillation efficiently encodes the low-dimensional structure of non-linear tasks.

AI & ML arxiv | Mar 17

Attention Residuals replace fixed-weight residual connections with softmax attention over preceding layers to prevent hidden-state dilution in deep LLMs.

AI & ML arxiv | Mar 17
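
The mechanism can be sketched as a learned softmax mixture over all preceding layer outputs in place of a fixed identity skip; the score logits here are stand-ins for whatever the paper learns per depth:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_residual(layer_outputs, scores):
    """Combine all preceding layer outputs with softmax weights
    instead of a fixed-weight residual connection.

    layer_outputs: list of (dim,) hidden states from layers 0..k
    scores: (k+1,) logits for the current depth (hypothetical here)
    """
    w = softmax(np.asarray(scores, dtype=float))
    return sum(w_i * h_i for w_i, h_i in zip(w, layer_outputs))
```

With one dominant logit this recovers a hard skip to a single layer; uniform logits average all layers, so the model can interpolate between the two regimes rather than diluting the hidden state by fixed accumulation.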

This paper proves that increasing test-time compute via beam search can actually hurt LLM reasoning performance due to overestimation bias.

AI & ML arxiv | Mar 17

Sparsity (MoE and GQA) is found to act as a critical regulator for variance propagation, mitigating the 'curse of depth' in LLMs.

AI & ML arxiv | Mar 17

A factorial study on EHR foundation models reveals that joint encoding of code-attribute pairs (local binding) is the primary driver of performance and efficiency.

AI & ML arxiv | Mar 18

Spectral Edge Dynamics (SED) provides an early-warning signal for grokking, predicting generalization up to 1,700 steps before it occurs.

AI & ML arxiv | Mar 18

Demonstrates that massive scaling of diverse simulator resets can replace manual curriculum engineering for complex dexterous manipulation.

AI & ML arxiv | Mar 18

Derives closed-form power-law scaling for hyperparameters like learning rate and batch size using modern optimization theory rather than expensive empirical sweeps.

AI & ML arxiv | Mar 18
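
The functional form of such a closed-form transfer rule can be illustrated as below; the exponents and reference values here are placeholders, not the paper's derived constants:

```python
def scaled_hparams(n_params, lr_ref=3e-4, bs_ref=256, n_ref=1e8,
                   lr_exp=-0.25, bs_exp=0.5):
    """Power-law transfer of learning rate and batch size with model size.

    lr(N) = lr_ref * (N / n_ref) ** lr_exp, and similarly for batch
    size. All exponents are hypothetical; the point is that a single
    closed-form curve replaces an empirical hyperparameter sweep.
    """
    ratio = n_params / n_ref
    return lr_ref * ratio ** lr_exp, bs_ref * ratio ** bs_exp
```

Under this form, scaling the model 16x shrinks the learning rate by 16^0.25 = 2x and grows the batch size by 16^0.5 = 4x.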

Provides a geometric 'manifold envelopment' framework to explain why unsupervised RL for mathematical reasoning often collapses and how to stabilize it.

AI & ML arxiv | Mar 18

The study provides a formal link showing that internal 'world model' representations in transformers are a direct byproduct of the predictive geometry of the training data.

AI & ML arxiv | Mar 18

Shows that 'Mid-Training' on high-quality reasoning data is the primary driver of model capability, whereas RL only succeeds as a sparse refinement step.

AI & ML arxiv | Mar 19

Video fine-tuning consistently degrades static image understanding in multimodal LLMs, revealing a zero-sum trade-off between spatial and temporal capabilities.

AI & ML arxiv | Mar 19

Mechanistic probing reveals a directional asymmetry in how LLMs encode hierarchy: hypernymy is redundant and resilient, while hyponymy is fragile and compact.

AI & ML arxiv | Mar 19

Provides the first theoretical proof that Graph Transformers structurally prevent the 'oversmoothing' failure mode inherent to deep GCNs.

AI & ML arxiv | Mar 19

Extreme neural network sparsification causes a catastrophic interpretability collapse even when global accuracy remains stable.

AI & ML arxiv | Mar 20

This paper provides theoretical proof that autocurriculum—where a model selects its own training problems—requires exponentially fewer reasoning demonstrations.

AI & ML arxiv | Mar 20

The 'Progressive Intensity Hypothesis' establishes that weaker perturbations (pruning) should precede stronger ones (quantization) for optimal joint model compression.

AI & ML arxiv | Mar 20

Mechanistic analysis of 'counting circuits' in VLMs allows for lightweight interventions that improve general visual reasoning performance.

AI & ML arxiv | Mar 20

Synthetic data scaling reaches a new level by moving from simple rephrasing to creating 'megadocs' through rationale insertion and stitching.

AI & ML arxiv | Mar 20

Discovers how uncertainty estimation signals like self-consistency and verbalized confidence scale and complement each other in reasoning models.

AI & ML arxiv | Mar 20

Establishes scaling laws to determine the optimal compute split between general pretraining and domain-specific specialization.

AI & ML arxiv | Mar 20

Discovers a multiplicative scaling law governing how LLMs revise their beliefs during iterative reasoning (CoT, reflection).

AI & ML arxiv | Mar 23

A massive controlled study reveals that post-training algorithm rankings (DPO, SimPO, etc.) completely invert as models scale.

AI & ML arxiv | Mar 23

Researchers identify a 'selection bottleneck' that mathematically determines when diverse agent teams outperform homogeneous self-consistency teams.

AI & ML arxiv | Mar 24

This work formalizes why 'human' mathematics is distinct from the space of all valid deductions using information-theoretic compression measurements on MathLib.

AI & ML arxiv | Mar 24

Discovers that language-centric training in Multimodal LLMs actively degrades their internal visual representation quality.

AI & ML arxiv | Mar 24

Identifies that in-context reasoning over pretraining knowledge only emerges after specific types of fine-tuning, not from pretraining alone.

AI & ML arxiv | Mar 24

Sensitivity to compression in Transformers spans five orders of magnitude, with early-layer MLP up-projections identified as catastrophic failure points.

AI & ML arxiv | Mar 24

Context-aware Visual Fine-tuning (CoVFT) allows a 7B MLLM to outperform its 13B counterpart by resolving optimization conflicts in vision encoders.

AI & ML arxiv | Mar 24

Introduces 'Mixture of Chapters' to scale Transformer memory to 262K tokens without the quadratic cost of standard attention.

AI & ML arxiv | Mar 24

Restores monotonic scaling in LLM tree search by replacing standard MCTS selection with Gumbel sampling and Sequential Halving.

AI & ML arxiv | Mar 24
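
Sequential Halving itself (the selection rule swapped in for UCB) is a standard bandit algorithm and can be sketched independently of the paper's MCTS integration; the interface below is illustrative:

```python
import math

def sequential_halving(arms, pull, budget):
    """Select the best arm by repeatedly halving the candidate set.

    arms: candidate actions; pull(arm) returns a sampled reward.
    The budget is split evenly across rounds, and the better-scoring
    half of the candidates survives each round.
    """
    candidates = list(arms)
    rounds = max(1, math.ceil(math.log2(len(candidates))))
    per_round = budget // rounds
    while len(candidates) > 1:
        pulls = max(1, per_round // len(candidates))
        means = {a: sum(pull(a) for _ in range(pulls)) / pulls
                 for a in candidates}
        candidates.sort(key=lambda a: means[a], reverse=True)
        candidates = candidates[: math.ceil(len(candidates) / 2)]
    return candidates[0]
```

Unlike UCB-style selection, the number of simulations given to each surviving arm is fixed in advance, which is what makes the compute-vs-quality curve predictable as the budget grows.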

Introduces the Neural Zeroth-order Kernel (NZK) to provide a theoretical foundation for training models without backpropagation.

AI & ML arxiv | Mar 24

Proves that structured retrieval is exponentially more efficient than sequential context scanning for agentic reasoning.

AI & ML arxiv | Mar 24

Discovers 'silent commitment failure,' where some model architectures produce confident, incorrect outputs with zero detectable warning signals before execution.

AI & ML arxiv | Mar 24

Provides a causal explanation for 'embedding collapse' in Transformers, linking it to the concept of semantic shift rather than just text length.

AI & ML arxiv | Mar 24

Depth-Recurrent Transformers decouple computational depth from parameter count, revealing a 'computational frontier' where performance on reasoning tasks snaps from zero to perfect based on iteration steps.

AI & ML arxiv | Mar 24

Identifies structured table data as a primary driver for scaling long-context reasoning in LLMs.

AI & ML arxiv | Mar 24

Introduces a robust framework for optimal Mixture-of-Experts (MoE) architecture design across six orders of magnitude in compute.

AI & ML arxiv | Mar 24

Provides a strictly controlled comparison of autoregressive vs. masked diffusion language models on identical compute budgets.

AI & ML arxiv | Mar 24

Hidden states in LLMs occupy a Riemannian submanifold where tokens are Voronoi regions, revealing a universal 'hourglass' intrinsic dimension profile across all tested models.

AI & ML arxiv | Mar 25

The standard 'Chinchilla Approach 2' for fitting scaling laws is systematically biased, potentially leading to millions of dollars in wasted compute at frontier scales.

AI & ML arxiv | Mar 25

Reveals that RLVR-driven reasoning improvements in LLMs are the result of highly sparse changes to a tiny fraction of 'critical' token distributions.

AI & ML arxiv | Mar 25

Bipedal robot mass scales with the square of leg length rather than the cubic scaling found in biological systems.

AI & ML arxiv | Mar 25

A quantitative model that predicts the performance gain of merging independent LLM specialists before committing compute.

AI & ML arxiv | Mar 25

Identifies the 'Caterpillar Tree' as the theoretically optimal structure for test-time computation and backtracking in LLMs.

AI & ML arxiv | Mar 25

Persistent structural memory in neural networks is fundamentally limited by the instability of jointly-learned coordinate systems.

AI & ML arxiv | Mar 25

Theoretical analysis reveals that the efficiency benefits of low-dimensional data structures for diffusion models diminish significantly when the data manifold is non-linear.

AI & ML arxiv | Mar 25

Access to conversational memory allows an 8B model to outperform a 235B model on user-specific queries while reducing inference costs by 96%.

AI & ML arxiv | Mar 25

Synthetic Mixed Training allows an 8B model to finally outperform RAG on long-document comprehension by combining synthetic QAs with rewritten documents.

AI & ML arxiv | Mar 26

Newer LLM architectures like MoE and SSMs are making 'early-exit' decoding significantly less effective than in previous generations.

AI & ML arxiv | Mar 26

Diffusion models can be proven to generalize by capturing manifold geometry long before they achieve density estimation or memorization.

AI & ML arxiv | Mar 26

Provides a systematic blueprint for scaling Reinforcement Learning (RL) in LLMs using multi-turn synthetic data generation and difficulty-based curricula.

AI & ML arxiv | Mar 26

Identifies a 'critical threshold' in human-AI symbiosis beyond which human capability collapses abruptly and irreversibly due to over-delegation.

AI & ML arxiv | Mar 26

Reveals that synthetic rewriting is a quality multiplier for high-grade data, but fails to fix low-quality source data.

AI & ML arxiv | Mar 27

A systematic study reveals that grokking is not an architectural property of Transformers but an interaction between weight decay and optimization stability.

AI & ML arxiv | Mar 27

MSRL scales multimodal reward modeling by transferring reasoning capabilities from text to vision-language tasks without requiring new multimodal preference data.

AI & ML arxiv | Mar 27

Uses the Minimum Description Length principle to predict exactly when neural networks will transition from simple 'spurious' shortcuts to complex features.

AI & ML arxiv | Mar 30

A billion-scale time-series benchmark that identifies a 'context-length crossover' where foundation models start to crush deep learning baselines.

AI & ML arxiv | Mar 30

Challenges the assumption that 'background' pixels are useless in GUI agents and identifies a 'recency effect' for optimal token pruning.

AI & ML arxiv | Mar 30

An 800 Hz data glove reveals that human hand dexterity contains critical high-frequency motion energy (>100 Hz) previously invisible to standard sensors.

AI & ML arxiv | Mar 30

Provides the first sharp theoretical characterization of why spectral optimizers like Muon drastically outperform SGD in storage capacity and scaling for language models.

AI & ML arxiv | Mar 30

Proves that causal representation learning is possible with far fewer environments and unknown intervention targets than previously assumed.

AI & ML arxiv | Mar 30

Scales multi-agent path finding to 1000 agents with near-linear runtime by decoupling geometric planning from execution-time conflict resolution.

AI & ML arxiv | Mar 31

Synthetic multi-view generation breaks the performance ceiling of single-view robotic datasets.

AI & ML arxiv | Mar 31

Formalizes the 'Observability Gap' to explain why coding agents plateau: humans can only provide feedback on visible outputs, while bugs reside in invisible execution states.

AI & ML arxiv | Mar 31

Provides a high-dimensional theoretical foundation for why two-phase optimizers like DiLoCo are mathematically superior to standard SGD in specific noise regimes.

AI & ML arxiv | Mar 31

Shows that standard task-completion benchmarks fail to distinguish agent capabilities and proposes 'Working Memory Fidelity' as a more predictive metric.

AI & ML arxiv | Mar 31

Mathematical proof that LayerNorm structurally reduces model complexity compared to RMSNorm due to its mean-centering geometry.

AI & ML arxiv | Mar 31
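
The geometric relationship at issue is easy to state in code: LayerNorm is RMSNorm composed with mean-centering, so its outputs live on the zero-mean hyperplane while RMSNorm preserves the all-ones direction. A minimal sketch (learnable scale/shift omitted):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale by the root-mean-square; the mean direction survives."""
    return x / np.sqrt(np.mean(x ** 2) + eps)

def layer_norm(x, eps=1e-6):
    """Mean-center first, then RMS-normalize: output has zero mean."""
    centered = x - x.mean()
    return centered / np.sqrt(np.mean(centered ** 2) + eps)
```

The identity layer_norm(x) == rms_norm(x - mean(x)) makes the paper's framing concrete: LayerNorm's mean-centering projects out one dimension, which is the structural complexity reduction being analyzed.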

Provides empirical evidence and a mechanistic explanation for why LoRA drastically reduces catastrophic forgetting in sequential fine-tuning compared to full fine-tuning.

AI & ML arxiv | Mar 31

A controlled study proving that the temporal organization (curriculum) of multimodal data is a first-order variable in balancing reasoning vs. OCR capabilities.

AI & ML arxiv | Mar 31

The eigenvalue tail index of a neural network's weight matrices serves as a near-perfect (R^2 = 0.984) diagnostic for label noise in the training data.

AI & ML arxiv | Mar 31

Discovers that LLM hidden states undergo geometric 'warping' at digit-count boundaries, mimicking human psychological perception.

AI & ML arxiv | Mar 31

This paper establishes the formal information-theoretic limits and conditions under which self-improving AI systems can be safely verified.

AI & ML arxiv | Mar 31

HyperP provides the first hyperparameter transfer laws for hypersphere optimization, ensuring stable scaling for models using the Muon optimizer.

AI & ML arxiv | Mar 31

Identifies a 'dual-capability bottleneck' where low-rated training data is essential for state tracking while high-rated data is needed for decision quality.

AI & ML arxiv | Apr 1

Provides a computationally efficient 'early warning' system for emergent capabilities like grokking and induction head formation using 2-datapoint reduced density matrices.

AI & ML arxiv | Apr 1

Identifies 'label leakage' from limited task diversity as the primary bottleneck for relational foundation models, rather than raw data volume.

AI & ML arxiv | Apr 1

Discovers that video diffusion models commit to high-level plans in the first few denoising steps, enabling a new inference-time scaling technique called ChEaP.

AI & ML arxiv | Apr 1

Neural collapse is triggered by a predictable 'feature-norm threshold' (fn*) that is invariant to training conditions, serving as a new diagnostic for training progress.

AI & ML arxiv | Apr 2

Gradient-based data valuation (TracIn) outperforms all human-crafted metadata heuristics for ordering curriculum learning in motion planners.

AI & ML arxiv | Apr 2
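
TracIn's first-order influence score (the valuation signal referenced above) is a published formula: the learning-rate-weighted sum over checkpoints of the inner product between a training example's gradient and a test example's gradient. A minimal sketch with flattened gradient vectors:

```python
import numpy as np

def tracin_score(train_grads, test_grads, lrs):
    """First-order TracIn influence of one training example on one
    test example: sum over checkpoints of lr * <g_train, g_test>.

    train_grads / test_grads: per-checkpoint flattened gradient
    vectors; lrs: the learning rate in effect at each checkpoint.
    """
    return sum(lr * float(np.dot(g_tr, g_te))
               for lr, g_tr, g_te in zip(lrs, train_grads, test_grads))
```

Ranking training examples by this score (against a held-out target set) is one way to order a curriculum without hand-crafted metadata heuristics.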

Demonstrates that LLM judge panels follow power-law discovery curves, where panel size and persona diversity are critical for uncovering edge-case failures.

AI & ML arxiv | Apr 2

Establishes a three-dimensional scaling law for RAG-pretraining, modeling the optimal data budget allocation between model parameters, tokens, and retrieval store size.

AI & ML arxiv | Apr 2