Efficiency Breakthrough

Recovers short-text performance in context-extended LLMs using roughly 60x less data than current state-of-the-art distillation methods.

Context extension typically degrades short-context performance. This paper shows how to restore those capabilities by aligning attention distributions using a linear-memory kernel, requiring only 4M tokens compared to the standard 256M.

By SeriesFusion Editorial Board · April 2, 2026

Introduces entropy-guided adaptive decoding that gives small models reasoning performance comparable to frontier models at a fraction of the cost.

Proposes a 'no-backprop' stochastic process memory for edge agents that solves the retention-forgetting tradeoff with fixed compute.

MAC-Attention achieves 14x attention-phase speedups and reduces KV cache accesses by 99% for long-context LLMs by reusing computation from semantically similar queries.

A modified 110M-parameter ColBERT model can identify fine-grained evidence spans as accurately as a 27B-parameter LLM at a fraction of the cost.

A lightweight framework for triaging agentic trajectories post-deployment without the cost of human review or auxiliary LLM calls.

A cross-graph tuning-free prompting framework for GNNs that achieves massive gains on unseen graphs without retraining.

Self-Routing removes the need for learned routers in Mixture-of-Experts (MoE) by using hidden states directly for expert assignment.

Improves Qwen2.5-7B performance on AIME2024 by 137% through test-time iterative rethinking and majority-voted pseudo-labels.
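For readers unfamiliar with the term, a majority-voted pseudo-label is the answer that most of a model's own sampled attempts agree on, used in place of a ground-truth label. The sketch below illustrates only that voting step; the sample answers and the `majority_vote` helper are hypothetical and are not taken from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers sampled from one model for a single problem.
samples = ["204", "204", "113", "204", "96"]

# The winning answer acts as the pseudo-label; the agreement rate is a rough confidence signal.
pseudo_label, agreement = majority_vote(samples)
print(pseudo_label, agreement)
```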

Automates mathematical optimization modeling using reinforcement learning with solver-derived rewards instead of human process supervision.

Optimizes LLM inference scheduling by treating output length as a heavy-tailed distribution rather than a point estimate.

Introduces negative early exit and adaptive boosting to make Monte Carlo Tree Search (MCTS) practical for real-time LLM inference.

Reaches 60% accuracy on ImageNet-1K in dataset distillation using only a handful of synthetic images.

Enables 'Elastic Inference' where a single trained model can be converted to multiple lower-precision formats on-the-fly without retraining.

Scales imitation learning data efficiency by generating synthetic 'multi-view' demonstrations from a single expert trajectory.

Proposes Physical Imitation Learning (PIL) to offload up to 87% of a control policy's mechanical power to passive robotic joints.

CircuitProbe identifies reasoning circuits in Transformers 1000x faster than brute-force methods and predicts the efficacy of layer duplication.

Spectral Compact Training (SCT) enables training 70B-parameter architectures on consumer hardware like the Steam Deck (8GB RAM) via permanent SVD factors.

This paper achieves O(1) complexity for multimillion-class classification by leveraging predefined vector systems in the latent space.

Molecular Memory allows MoE systems to recover previously learned domain expertise 9-11x faster by utilizing cost-penalized fitness metrics that preserve dormant experts.

OBD-LLM uses second-order Hessian information to achieve 20-40% better low-rank decomposition accuracy than the current state-of-the-art SVD-LLM.

PixelPrune identifies and removes pixel-level redundancy before the Vision Transformer encoder, delivering up to 4.2x inference speedup for high-resolution VLM tasks.

EmbedPart achieves a 100x speedup over Metis for graph partitioning by clustering node embeddings rather than operating on raw graph structures.

A lightweight probing method predicts LLM downstream task performance from internal representations during training, reducing evaluation latency from one hour to three minutes.

Canonical Correlation Analysis (CCA) can reduce image representation dimensionality by 75% while improving downstream performance through cross-model agreement.

Decouples weather forecasting from spatial resolution by using Flow Matching to super-resolve coarse trajectories as a post-processing step.

Introduces S0 tuning for hybrid RNN-attention models, outperforming LoRA by 10.8% with zero inference overhead.

Reduces the compute cost of LLM test-time scaling by up to 67% using conformal prediction to calibrate reasoning paths.

Combines the YOCO architecture with recursive computation to scale representational depth without inflating the KV cache.

Solves the long-standing trade-off in low-rank matrix recovery by achieving both optimal sample complexity and fast convergence.

Enables Gaussian Processes to scale on modern parallel hardware by removing the need for Cholesky decompositions.

Decouples data mixture ratio selection from continual pre-training by optimizing distribution vectors post-hoc with 15-35x lower compute cost.

Combines differentiable optimization with exact ILP solvers to achieve a 10x performance gain in solving NP-hard combinatorial scheduling problems.

A fabricated 16nm SoC that performs real-time 3D occupancy mapping under 6 mW, reducing query energy by over 80%.

Generates complete, simulatable analog circuits in milliseconds, outperforming search-based methods by over 600x.

Introduces PolarQuant, a quantization method that uses Hadamard rotation to make LLM weights near-lossless at 5-bit without calibration data.
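As background on why a Hadamard rotation can help quantization, the toy sketch below shows an orthonormal Hadamard matrix spreading a single outlier evenly across all coordinates, which shrinks the range a quantizer has to cover. It illustrates the general rotation idea only, under the assumption of a power-of-two dimension; the `hadamard` and `rotate` helpers and the toy vector are illustrative and do not reproduce PolarQuant itself.

```python
import math

def hadamard(n):
    """Sylvester construction of an orthonormal n x n Hadamard matrix (n must be a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    scale = 1.0 / math.sqrt(n)
    return [[x * scale for x in row] for row in h]

def rotate(matrix, vector):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

# Toy weight vector with one large outlier (hypothetical values).
weights = [8.0, 0.0, 0.0, 0.0]
H = hadamard(4)

rotated = rotate(H, weights)   # the outlier's energy is spread over all entries
restored = rotate(H, rotated)  # the Sylvester matrix is symmetric, so H is its own inverse

print(max(abs(w) for w in weights), max(abs(r) for r in rotated))
```

Because the rotation is orthogonal, it can be undone exactly after dequantization, so any accuracy loss comes from the quantizer and not from the rotation.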

Scales curvature-aware bilevel optimization to BERT-sized models using KFAC, significantly outperforming standard gradient unrolling.

Enables infinite-length video understanding on a single consumer GPU (RTX 3090) through a training-free visual memory mechanism.

Obtains epistemic and aleatoric uncertainty from a single forward-backward pass of an unmodified pretrained LLM.

A vector-wise sparse attention mechanism that accelerates long-context video inference by 2.6x with zero loss in accuracy.

A unified quantization and runtime framework for deploying multiple LoRA-adapted generative models on edge devices simultaneously.

A 1D continuous image tokenizer that uses semantic masking to achieve a 64x reduction in token usage without sacrificing generation fidelity.

A compiler approach to agent logs that reduces token consumption by 50-66% while improving context learning performance.

A stabilization mechanism for adapting LLMs to time-series tasks that reduces memory footprint by up to 1,776x.

Applies Shapley values from cooperative game theory to solve the 'free-rider' problem in GRPO-based reinforcement learning post-training.
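For context, a Shapley value credits each participant with its average marginal contribution over every order in which a group could be assembled, so a participant that adds nothing receives nothing. The sketch below computes exact Shapley values for a toy three-player game; the players, the `toy_value` function, and the brute-force enumeration are illustrative assumptions and are not the paper's GRPO procedure.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values by averaging marginal contributions over all player orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

def toy_value(coalition):
    """Hypothetical game: the reward is earned only when both 'a' and 'b' take part."""
    return 1.0 if {"a", "b"} <= coalition else 0.0

# 'c' never changes the outcome, so it plays the role of a free rider.
phi = shapley_values(["a", "b", "c"], toy_value)
print(phi)
```

Enumerating all orderings grows factorially with the number of players, so this form is only practical for small groups; larger settings generally rely on sampling.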

Produces high-fidelity SHAP explanations for tabular data 1000x faster than traditional methods by integrating them directly into the model architecture.

Proposes a unified tensor-factorization view of attention that encompasses MHA, GQA, and MLA while reducing parameter counts by an order of magnitude.

Achieves competitive continual learning accuracy with a 90% reduction in memory cost.

Batch-level query routing for LLMs allows for strict cost and capacity control that per-query methods cannot achieve.

Achieves high-fidelity LiDAR densification in just 156ms while strictly enforcing sensor physics to prevent 'ghost points'.

Demonstrates that Liquid Neural Networks can outperform Diffusion Policies in imitation learning with half the parameters and nearly 2x faster inference.