AI & ML

1625 papers · Page 9 of 17

BenchBench shifts the focus from evaluating model performance to evaluating 'benchmark designer' capability by benchmarking automated benchmark generation.

New Capability arxiv | Mar 24

An open-source family of language models for Kazakh that outperforms much larger multilingual models by using a language-specific tokenizer.

Open Release arxiv | Mar 24

Proposes 'semantic sections' as a replacement for global feature vectors to interpret LLMs in complex, non-linear representation spaces.

Paradigm Shift arxiv | Mar 24

A routing framework that uses internal prefill activations to select the optimal LLM for a task, capturing 45% of the oracle accuracy gap with 74% cost savings.

Efficiency Breakthrough arxiv | Mar 24
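The general shape of activation-based routing can be sketched as follows. This is a minimal illustration, not the paper's architecture: the linear router, its training, and the pooling choice are all assumptions.

```python
import numpy as np

class ActivationRouter:
    """Hypothetical router: a linear scorer over mean-pooled prefill
    activations picks which model in a pool should serve the request."""

    def __init__(self, n_models, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # In practice W would be trained on (activation, best-model) pairs.
        self.W = rng.normal(scale=0.01, size=(hidden_dim, n_models))

    def route(self, prefill_activations):
        # prefill_activations: (seq_len, hidden_dim) from a cheap probe pass
        pooled = prefill_activations.mean(axis=0)
        scores = pooled @ self.W          # one score per candidate model
        return int(np.argmax(scores))     # index of the selected model

router = ActivationRouter(n_models=3, hidden_dim=16)
choice = router.route(np.ones((5, 16)))
```

The appeal is that the prefill pass is paid for anyway, so routing adds near-zero marginal cost.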

Introduces Bayesian scattering as a mathematically grounded, non-learned baseline for image uncertainty quantification.

Paradigm Shift arxiv | Mar 24

Demonstrates that direct supervised alignment outperforms self-supervised pretraining for clinical outcome prediction in healthcare.

Breaks Assumption arxiv | Mar 24

A red-teaming protocol that uses RL-driven 'profit' objectives to find structural exploits in AI agents instead of just prompt-injection vulnerabilities.

Paradigm Shift arxiv | Mar 24

Contrastive Association Learning (CAL) successfully recovers functional gene associations from expression data where standard similarity metrics fail.

New Capability arxiv | Mar 24

Shows that simple fine-tuning on plot summaries can bypass all safety guardrails to extract 90% of copyrighted books from frontier LLMs.

Breaks Assumption arxiv | Mar 24

Identifies that in-context reasoning over pretraining knowledge only emerges after specific types of fine-tuning, not from pretraining alone.

Scaling Insight arxiv | Mar 24

Consistency under paraphrase in medical VLMs is a false proxy for reliability, masking models that ignore their visual inputs entirely.


Breaks Assumption arxiv | Mar 24

Pretrained Diffusion Transformers (DiTs) possess an intrinsic 'synchronization gap' where different features commit at specific, depth-localized layers.

Paradigm Shift arxiv | Mar 24

Sensitivity to compression in Transformers spans five orders of magnitude, with early-layer MLP up-projections identified as catastrophic failure points.

Scaling Insight arxiv | Mar 24

The 'routing paradox' proves that selective attention requires the very pairwise computations it aims to replace, explaining why pure recurrent models fail at associative recall.

Paradigm Shift arxiv | Mar 24

CLT-Forge democratizes mechanistic interpretability by providing an end-to-end library for training Cross-Layer Transcoders and generating feature attribution graphs.

Open Release arxiv | Mar 24

Dream Diffusion Policy enables robots to survive severe OOD disturbances by detecting reality-imagination discrepancies and switching to an internal world model.

New Capability arxiv | Mar 24

Cortical Policy introduces a dual-stream view transformer inspired by the human brain's dorsal and ventral pathways to solve complex robotic manipulation.

New Capability arxiv | Mar 24

LongCat-Flash-Prover is a 560B MoE model that sets a new SOTA for open-weights formal reasoning, achieving a 97.1% pass rate on MiniF2F-Test.

Open Release arxiv | Mar 24

Context-aware Visual Fine-tuning (CoVFT) allows a 7B MLLM to outperform its 13B counterpart by resolving optimization conflicts in vision encoders.

Scaling Insight arxiv | Mar 24

VAE tokenizers in Latent Diffusion Models create 'overly compact' manifolds that cause variance collapse, leading to unstable generative sampling.

Paradigm Shift arxiv | Mar 24

Introduces 'Mixture of Chapters' to scale Transformer memory to 262K tokens without the quadratic cost of standard attention.

Scaling Insight arxiv | Mar 24

CounterScene endows generative world models with explicit counterfactual reasoning for safety-critical driving evaluation.

Paradigm Shift arxiv | Mar 24

A training-free visual token pruning framework for Large Vision-Language Models that preserves geometric structure through subspace reconstruction.

Efficiency Breakthrough arxiv | Mar 24

Free Sinewich enables parameter-efficient multi-task learning using frequency-based weight modulation with near-zero overhead.

Efficiency Breakthrough arxiv | Mar 24

Reveals that state-of-the-art MLLMs fail to maintain stable spatial representations under simple counterfactual viewpoint changes.

Breaks Assumption arxiv | Mar 24

LiFR-Seg achieves high-frame-rate semantic segmentation using low-frame-rate cameras by propagating features through asynchronous event streams.

New Capability arxiv | Mar 24

Proposes multi-cluster memory for test-time adaptation, proving that a single unstructured memory pool is fundamentally insufficient for non-i.i.d. data streams.

Paradigm Shift arxiv | Mar 24

ORACLE uses symbolic reasoning engines to verify intermediate reasoning steps in synthetic data generation, moving beyond simple answer-correctness filtering.

New Capability arxiv | Mar 24

AlphaAdj uses a VLM to dynamically adjust Control Barrier Function parameters in real-time for safe and efficient robotic navigation.

New Capability arxiv | Mar 24

BadGraph demonstrates that LLMs can generate universal adversarial attacks that exploit vulnerabilities in both GNN and PLM architectures on graph data.

Breaks Assumption arxiv | Mar 24

SPECTRE-G2 is a unified anomaly detector that uses eight complementary signals to detect 'unknown unknown' structural anomalies.

New Capability arxiv | Mar 24

Restores monotonic scaling in LLM tree search by replacing standard MCTS selection with Gumbel sampling and Sequential Halving.

Scaling Insight arxiv | Mar 24
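The Gumbel-top-k plus Sequential Halving recipe (familiar from Gumbel AlphaZero) can be sketched in a few lines; the simulation function, budget split, and scoring here are illustrative assumptions, not the paper's exact procedure.

```python
import math
import random

def gumbel():
    # Standard Gumbel(0, 1) noise via inverse transform sampling.
    u = max(random.random(), 1e-12)
    return -math.log(-math.log(u))

def sequential_halving(logits, simulate, budget):
    """Select an action by perturbing logits with Gumbel noise, then
    allocating a fixed simulation budget over halving rounds."""
    g = {a: logits[a] + gumbel() for a in range(len(logits))}
    survivors = sorted(g, key=g.get, reverse=True)
    rounds = max(1, math.ceil(math.log2(len(survivors))))
    values = {a: 0.0 for a in survivors}
    counts = {a: 0 for a in survivors}
    while len(survivors) > 1:
        per = max(1, budget // (rounds * len(survivors)))
        for a in survivors:
            for _ in range(per):          # running-mean value estimate
                counts[a] += 1
                values[a] += (simulate(a) - values[a]) / counts[a]
        # Keep the top half by Gumbel score plus empirical value.
        survivors.sort(key=lambda a: g[a] + values[a], reverse=True)
        survivors = survivors[: max(1, len(survivors) // 2)]
    return survivors[0]

random.seed(0)
best = sequential_halving([0.0, 0.0, 0.0, 0.5],
                          simulate=lambda a: 10.0 if a == 3 else 0.0,
                          budget=64)
```

Unlike UCB-style MCTS selection, this scheme spends a fixed budget and cannot starve a promising arm, which is one intuition for why it restores monotonic scaling.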

A training-free system for 3D scene reconstruction and editing from sparse RGB images using 3D-aware diffusion models to fill geometric gaps.

New Capability arxiv | Mar 24

Introduces the Neural Zeroth-order Kernel (NZK) to provide a theoretical foundation for training models without backpropagation.

Scaling Insight arxiv | Mar 24

Shows that a simple pruned adaptation module (PAM) outperforms complex SOTA foundation-model-based continual learning methods.

Breaks Assumption arxiv | Mar 24

Demonstrates that entropy-based uncertainty is insufficient for safe selective prediction and proposes combining it with correctness probes.

Breaks Assumption arxiv | Mar 24
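The combined gate can be sketched as requiring both signals to clear a threshold; the probe, its calibration, and the thresholds below are illustrative assumptions rather than the paper's method.

```python
import math

def entropy(probs):
    # Shannon entropy (nats) of a predictive distribution.
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_answer(probs, probe_correct_prob,
                  max_entropy=1.0, min_probe=0.5):
    """Abstain unless BOTH signals look safe: low predictive entropy
    AND a correctness probe (a classifier over hidden states; thresholds
    here are illustrative) rates the answer likely correct."""
    return entropy(probs) <= max_entropy and probe_correct_prob >= min_probe

# A confident (low-entropy) answer the probe flags as likely wrong:
# entropy alone would let it through; the combined gate abstains.
risky = should_answer([0.9, 0.05, 0.05], probe_correct_prob=0.2)
safe = should_answer([0.9, 0.05, 0.05], probe_correct_prob=0.9)
```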

Reframes plasticity loss in Reinforcement Learning as an optimization problem where networks get trapped in local optima of previous tasks.

Paradigm Shift arxiv | Mar 24

Introduces Reward Sharpness-Aware Fine-Tuning (RSA-FT) to mitigate reward hacking in diffusion models without retraining reward models.

New Capability arxiv | Mar 24

GIDE enables precise, training-free image editing for discrete Diffusion LLMs by introducing a novel Discrete Noise Inversion mechanism.

New Capability arxiv | Mar 24

Prompt Replay speeds up GRPO training by selectively reusing 'medium difficulty' prompts to maximize learning signal in RL rollouts.

Efficiency Breakthrough arxiv | Mar 24
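The selection criterion rests on a known property of GRPO: when all rollouts for a prompt pass or all fail, the group-relative advantage is zero and no gradient flows. A minimal sketch of the filtering step (band thresholds are illustrative assumptions):

```python
def select_replay_prompts(pass_rates, low=0.2, high=0.8):
    """Keep prompts whose rollout pass rate falls in a 'medium' band:
    all-pass and all-fail prompts yield near-zero group-relative
    advantage in GRPO, so they carry little learning signal."""
    return [p for p, rate in pass_rates.items() if low <= rate <= high]

rates = {"easy": 1.0, "hard": 0.0, "medium_a": 0.5, "medium_b": 0.25}
replay = select_replay_prompts(rates)
```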

Repurposes a 2B-parameter latent video transformer as a differentiable physics simulator for urban wind flow optimization.

Paradigm Shift arxiv | Mar 24

Provides the first empirical evidence of a 'Quality-Homogenization Tradeoff' where AI-assisted writing strips structural diversity from human thinking.

Breaks Assumption arxiv | Mar 24

Challenges the widespread assumption that auxiliary dynamics supervision creates useful latent structures for robotics.

Breaks Assumption arxiv | Mar 24

Proves that structured retrieval is exponentially more efficient than sequential context scanning for agentic reasoning.

Scaling Insight arxiv | Mar 24

Proposes replacing flat conversation histories with a tree-based architecture to solve 'logical context poisoning.'

Paradigm Shift arxiv | Mar 24

Breaks the massive compute barrier for medium-range weather forecasting, training on a single consumer-grade GPU.

Efficiency Breakthrough arxiv | Mar 24

Enables multimodal models to self-evolve their reasoning without human labels or external reward models.

New Capability arxiv | Mar 24

Replaces self-attention with Reaction-Diffusion PDEs as the predictive engine for world models.

Paradigm Shift arxiv | Mar 24

Identifies architectural 'stream separation' as the key to making linear safety interventions effective.

Breaks Assumption arxiv | Mar 24

An autonomous agent loop that optimizes GPU kernels to outperform human-expert and compiler-generated baselines.

Efficiency Breakthrough arxiv | Mar 24

Reconceptualizes human-agent interaction as dynamically generated software rather than just chat.

Paradigm Shift arxiv | Mar 24

Exposes that LLMs solve complex puzzles via 'reduction' to known patterns rather than true epistemic reasoning.

Breaks Assumption arxiv | Mar 24

Introduces AgentHER, a framework that salvages 'failed' agent trajectories by relabeling them as successful demonstrations for alternative goals.

Efficiency Breakthrough arxiv | Mar 24
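The relabeling idea echoes Hindsight Experience Replay: a trajectory that missed its original goal is a valid demonstration for whatever it did achieve. A minimal sketch (field names are illustrative, not the paper's schema):

```python
def hindsight_relabel(trajectory):
    """Reinterpret a failed trajectory as a successful demonstration
    for the goal it actually reached (HER-style relabeling)."""
    achieved = trajectory["final_state"]
    if trajectory["success"]:
        return trajectory            # already a success; keep as-is
    relabeled = dict(trajectory)
    relabeled["goal"] = achieved     # pretend this outcome was intended
    relabeled["success"] = True
    return relabeled

traj = {"goal": "open_report", "final_state": "listed_directory",
        "success": False, "actions": ["ls"]}
demo = hindsight_relabel(traj)
```

The payoff is data efficiency: failed rollouts, normally discarded, become free supervised training signal.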

ADARUBRIC generates task-specific evaluation rubrics on the fly, significantly outperforming static rubrics in human correlation and agent training outcomes.

Paradigm Shift arxiv | Mar 24

TIDE is a post-training early-exit system that allows individual tokens to skip unnecessary layers, improving throughput by up to 8% with minimal calibration.

Efficiency Breakthrough arxiv | Mar 24
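A generic per-token early-exit loop looks like the following; the convergence criterion (cosine similarity between consecutive layer outputs) and the threshold are illustrative assumptions, not TIDE's actual exit rule.

```python
import numpy as np

def forward_with_early_exit(layers, h, tau=0.999):
    """Stop applying layers once the token's hidden state has
    effectively stopped changing direction between layers."""
    used = 0
    for layer in layers:
        h_next = layer(h)
        used += 1
        cos = float(h @ h_next /
                    (np.linalg.norm(h) * np.linalg.norm(h_next) + 1e-9))
        h = h_next
        if cos >= tau:               # token has 'committed': skip the rest
            break
    return h, used

# Toy stack: the first layer rotates the state, the rest are
# near-identity, so the token exits after two of eleven layers.
layers = [lambda h: h + np.array([1.0, -1.0, 0.0, 0.0])] + \
         [lambda h: h * 1.0001] * 10
h, used = forward_with_early_exit(layers, np.ones(4))
```

Because exits are decided per token, easy tokens pay for fewer layers while hard tokens keep the full stack, which is where the throughput gain comes from.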

PivotRL identifies 'pivot' turns in agent trajectories where actions matter most, enabling compute-efficient reinforcement learning that matches end-to-end RL at 4x lower cost.

Efficiency Breakthrough arxiv | Mar 24

Discovers 'silent commitment failure,' where some model architectures produce confident, incorrect outputs with zero detectable warning signals before execution.

Scaling Insight arxiv | Mar 24

Provides a causal explanation for 'embedding collapse' in Transformers, linking it to the concept of semantic shift rather than just text length.

Scaling Insight arxiv | Mar 24

KG-Hopper enables 7B-parameter models to outperform 70B systems on complex Knowledge Graph reasoning by embedding the entire multi-hop process into a single 'thinking' stage.

Efficiency Breakthrough arxiv | Mar 24

Introduces Cross-Context Verification (CCV) to detect benchmark contamination, finding that contamination is binary: models either recall solutions perfectly or lack reasoning entirely.

Breaks Assumption arxiv | Mar 24

DSPA performs preference alignment at inference time by steering Sparse Autoencoder (SAE) features, bypassing the need for expensive weight-update training.

Paradigm Shift arxiv | Mar 24
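The core intervention in SAE-based steering is simple: add a scaled decoder direction for a preference-relevant feature into the residual stream at inference time. A schematic sketch (the scale, layer, and feature choice are illustrative assumptions, not DSPA's):

```python
import numpy as np

def steer_with_sae_feature(residual, decoder_dir, alpha=4.0):
    """Add a normalized SAE decoder direction, scaled by alpha,
    to the residual stream at a chosen layer."""
    d = decoder_dir / (np.linalg.norm(decoder_dir) + 1e-9)
    return residual + alpha * d

resid = np.zeros(8)
direction = np.eye(8)[3]   # stand-in for a learned SAE feature direction
steered = steer_with_sae_feature(resid, direction)
```

No weights change, so the alignment behavior can be toggled or re-tuned per request.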

DRTriton uses large-scale synthetic data and curriculum RL to automatically generate highly optimized Triton kernels, significantly outperforming top-tier LLMs.

New Capability arxiv | Mar 24

Introduces git-inspired primitives to enable truly asynchronous and non-interfering multi-agent software engineering collaboration.

New Capability arxiv | Mar 24

Demonstrates that learning systems can stably converge to incorrect solutions when feedback reliability is unobservable.

Breaks Assumption arxiv | Mar 24

Achieves state-of-the-art open-vocabulary segmentation using a training-free, purely geometric projection and propagation method.

Efficiency Breakthrough arxiv | Mar 24

Reveals that 'erasing' concepts from video diffusion models only suppresses output rather than removing the underlying representations.

Breaks Assumption arxiv | Mar 24

Solves the 'recursive drift' problem in self-improving LLMs by using symbolic verification to gate training data quality.

New Capability arxiv | Mar 24

Introduces a counterfactual framework for precise individual credit assignment in collaborative multi-agent LLM systems.

Paradigm Shift arxiv | Mar 24

Provides the first unified theoretical formalism for hierarchical memory systems used by long-context language agents.

Paradigm Shift arxiv | Mar 24

Proves an information-theoretic lower bound showing that embedding hidden payloads in LLM text must increase its Kolmogorov complexity.

Breaks Assumption arxiv | Mar 24
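The flavor of such a bound can be sketched from first principles (notation mine, not the paper's): if a fixed decoder recovers the payload from the text, the text is a description of the payload.

```latex
% A fixed decoder D recovers payload m from stego text x',
% so a description of x' (plus the constant-size D) describes m:
K(m) \;\le\; K(x') + O(1)
% For an incompressible payload, K(m) \approx |m|, hence
K(x') \;\ge\; |m| - O(1)
% i.e., hiding more payload forces the text's Kolmogorov complexity up.
```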

Transitions MLLMs from reactive planning to 'mental navigation' by forcing the construction of hierarchical cognitive maps from egocentric video.

New Capability arxiv | Mar 24

Enables merging independently trained specialist models (e.g., Vision-LLM and Audio-LLM) into a single multimodal model without any paired training data.

Efficiency Breakthrough arxiv | Mar 24

Standard entropy-based uncertainty quantification (UQ) fails in RAG because the 'induction heads' that copy correct answers also trigger 'entropy neurons', causing false uncertainty signals.

Breaks Assumption arxiv | Mar 24

Rule-State Inference (RSI) inverts the standard ML paradigm by treating known regulatory rules as priors and inferring the latent state of compliance and drift, rather than approximating rules from noisy data.

Paradigm Shift arxiv | Mar 24

GSB-PPO lifts proximal policy optimization from discrete action steps to full generation trajectories by framing it as a Generalized Schrödinger Bridge.

Paradigm Shift arxiv | Mar 24

Auditing 'Silicon Bureaucracy' reveals that LLM benchmark scores are often inflated by contamination-related memory reactivation rather than genuine generalization.

Breaks Assumption arxiv | Mar 24

SparseVoxelDet is the first fully sparse object detector for event cameras that never instantiates a dense tensor, achieving 858x GPU memory compression.

Efficiency Breakthrough arxiv | Mar 24

HumanOmni-Speaker achieves end-to-end speaker diarization and lip-reading by compressing high-frequency motion residuals into just 6 tokens per frame.

New Capability arxiv | Mar 24

PRM-as-a-Judge shifts robotic evaluation from binary success/failure to a dense, potential-based progress metric system.

Paradigm Shift arxiv | Mar 24

Depth-Recurrent Transformers decouple computational depth from parameter count, revealing a 'computational frontier' where performance on reasoning tasks snaps from zero to perfect based on iteration steps.

Scaling Insight arxiv | Mar 24

The 'Mirage' study demonstrates that frontier MLLMs generate detailed reasoning traces and clinical findings for images they were never actually shown.

Breaks Assumption arxiv | Mar 24

Confidence-Evidence Bayesian Gain (CEBaG) provides deterministic hallucination detection for medical VQA without requiring 10-20 stochastic generations.

Efficiency Breakthrough arxiv | Mar 24

FIM-Merging provides a theoretical framework for layer-adaptive model merging using the Fisher Information Matrix to bound merging error.

Paradigm Shift arxiv | Mar 24
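As background, the classical Fisher-weighted merge (which frameworks like this typically generalize; the paper's layer-adaptive error bound is not reproduced here) averages each parameter coordinate across models, weighted by its diagonal Fisher information:

```latex
\theta^{\mathrm{merged}}_j \;=\;
\frac{\sum_i F_{i,j}\,\theta_{i,j}}{\sum_i F_{i,j}},
\qquad
F_{i,j} \;=\;
\mathbb{E}_{x \sim \mathcal{D}_i}\!\left[
  \left(\partial_{\theta_j} \log p_{\theta_i}(y \mid x)\right)^{2}
\right]
```

Intuitively, each model votes hardest on the parameters it is most certain about.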

Challenges the gold standard of Upper Confidence Bound (UCB) exploration in diversity-aware bandit tasks.

Breaks Assumption arxiv | Mar 24

Identifies structured table data as a primary driver for scaling long-context reasoning in LLMs.

Scaling Insight arxiv | Mar 24

Achieves zero-shot, zero-training collaborative navigation between humanoid and quadruped robots.

New Capability arxiv | Mar 24

Enables high-performance Zeroth-Order (ZO) fine-tuning of LLMs by leveraging online curvature signals.

Efficiency Breakthrough arxiv | Mar 24
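Zeroth-order fine-tuning typically starts from the two-point SPSA gradient estimate (as in MeZO); a minimal sketch follows, with the paper's curvature-aware refinement deliberately omitted.

```python
import numpy as np

def zo_grad_estimate(loss_fn, theta, eps=1e-3, seed=0):
    """Two-point zeroth-order (SPSA-style) gradient estimate:
    probe the loss along one shared random direction z and project
    the finite-difference slope back onto z."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=theta.shape)      # shared random perturbation
    slope = (loss_fn(theta + eps * z) - loss_fn(theta - eps * z)) / (2 * eps)
    return slope * z                      # rank-one gradient estimate

# Quadratic toy loss: the true gradient at theta is 2 * theta.
loss = lambda t: float(t @ t)
theta = np.array([1.0, -2.0, 0.5])
g_hat = zo_grad_estimate(loss, theta)
```

Only forward passes are needed, which is why ZO methods fine-tune LLMs in inference-level memory; the cost is high estimator variance, which curvature signals aim to tame.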

Reduces token consumption in interleaved multimodal reasoning by over 72% using dynamic visual thoughts.

Efficiency Breakthrough arxiv | Mar 24

Introduces a training-free method to visualize and validate the invariances of any feature extractor using diffusion priors.

New Capability arxiv | Mar 24

Hypothesizes and demonstrates a unified Gaussian latent geometry connecting vision encoders and generative models.

Paradigm Shift arxiv | Mar 24

Eliminates the need for strictly aligned image pairs in infrared and visible image fusion.

Efficiency Breakthrough arxiv | Mar 24

Solves the structural redundancy problem in symbolic regression by collapsing expression DAG isomorphisms.

Paradigm Shift arxiv | Mar 24

Reduces human annotation requirements for NLP model testing by up to 95%.

Efficiency Breakthrough arxiv | Mar 24

Reveals that frozen LLMs contain person-specific 'neural signatures' that can predict individual brain activity.

New Capability arxiv | Mar 24

Introduces a robust framework for optimal Mixture-of-Experts (MoE) architecture design across six orders of magnitude in compute.

Scaling Insight arxiv | Mar 24

Synergizes prompt optimization with policy optimization to overcome the 'sparse reward' problem in complex reasoning tasks.

Paradigm Shift arxiv | Mar 24

Demonstrates that the two standard mathematical interpretations of Temporal Difference (TD) error diverge in deep reinforcement learning.

Breaks Assumption arxiv | Mar 24

Identifies the 'golden subspace' for test-time adaptation, enabling extreme efficiency in online model updates.

Paradigm Shift arxiv | Mar 24

Uses the chronological visitation order of medical scans as a self-supervised signal for disease progression modeling.

New Capability arxiv | Mar 24

Achieves a 50x reduction in visual tokens for Video-LLMs while preserving over 90% of baseline performance.

Efficiency Breakthrough arxiv | Mar 24