p-less Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
p-less sampling dynamically sets the truncation threshold using information theory, enabling hyperparameter-free LLM decoding with robust quality at high temperatures.
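To make the idea concrete, here is a minimal sketch of an entropy-derived truncation rule in the spirit of the summary above. The threshold exp(-H(p)) keeps exactly the tokens whose surprisal is at most the distribution's entropy; it is hyperparameter-free, but it is an illustrative choice, not necessarily the paper's exact criterion, and the function name is hypothetical.

```python
import numpy as np

def entropy_truncate_sample(logits, temperature=1.0, rng=None):
    """Sample with an entropy-derived truncation threshold (sketch).

    Keeps tokens with p_i >= exp(-H(p)), i.e. surprisal at most the
    entropy. Since H >= -log(max_i p_i), the kept set is never empty.
    Illustrative of information-theoretic truncation, not necessarily
    the paper's exact rule.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()                        # numerical stability
    p = np.exp(z) / np.exp(z).sum()
    h = -(p * np.log(p + 1e-12)).sum()     # Shannon entropy (nats)
    tau = np.exp(-h)                       # dynamic, hyperparameter-free
    q = np.where(p >= tau, p, 0.0)
    q = q / q.sum()                        # renormalize kept mass
    return int(rng.choice(len(p), p=q))
```

A rule of this shape tightens the threshold as the distribution sharpens and relaxes it as the distribution flattens, so it degrades gracefully at high temperatures, which is the robustness property the summary highlights.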
PhyWorldBench evaluates text-to-video models on physics adherence across fundamental, composite, and anti-physics scenarios.
Representer theorem for Hawkes processes shows dual coefficients are analytically fixed to unity via penalized least squares.
FFDP framework scales image registration to 100μm human brain MRI volumes using IO-aware kernels and distributed tensor sharding.
Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.
AdAEM dynamically generates value-assessment questions for LLMs by probing internal value boundaries using in-context optimization.
Study of causal interventions showing they produce out-of-distribution representations, proposing Counterfactual Latent loss to mitigate harmful divergences.
ADP, a lightweight protocol, unifies 13 heterogeneous agent datasets into a single training schema, achieving a 20% average performance gain over base models.
Presents unified RL framework for training LLM agents on long-horizon decision-making with staged interaction scaling.
AnyUp inference-time feature upsampler generalizes across different feature types and resolutions without encoder-specific retraining.
Presents AstaBench, comprehensive benchmark suite with production-grade tools for rigorous evaluation of AI agents on scientific research tasks.
AutoEP uses LLM reasoning with real-time landscape analysis to dynamically control metaheuristic algorithms without training.
Benchmarks practical privacy risks in differential privacy-adapted LLMs, revealing distribution shifts and model choice impact effectiveness.
Framework detects self-initiated deception in LLMs via statistical metrics showing both deceptive intention and behavior correlate with task difficulty.
BioX-Bridge enables parameter-efficient cross-modal knowledge transfer across biosignals using lightweight prototype-based bridge networks between foundation models.
BIRD-INTERACT benchmark evaluates LLMs on dynamic multi-turn text-to-SQL tasks with function-driven user simulator and dual interaction settings.
Generates diverse synthetic time series for pretraining foundation models with clear scaling laws.
Develops causal structure learning framework for Hawkes processes identifying latent confounder subprocesses.
Theoretical bounds on polyhedral complex connectivity and diameter reveal fundamental ReLU network geometry properties.
WebDevJudge benchmark reveals significant LLM-as-judge gaps due to failures in functional equivalence and feasibility verification.
CoCo framework captures compactness and consistency in graph neural network representations for improved deep graph clustering.
Introduces CDGS integrating compositional diffusion with guided search for coherent long-horizon plan generation.
CRC optimizes prediction set construction under explicit robustness constraints instead of coverage for more efficient robust decisions.
CounselBench large-scale benchmark with 2000 expert evaluations and 120 adversarial questions for evaluating LLMs in mental health question answering.
Expert-Router Coupling loss tightly couples MoE router decisions with expert capabilities by treating router embeddings as proxy tokens.
Cross-domain lossy compression unifies rate and classification constraints via optimal transport framework.
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities discovering 34 zero-days, showing top models achieve only 22% success on PoC generation.
Distills AlphaFold3 into single-step sampler with temporal geodesic matching achieving 15x inference acceleration.
CoTAR replaces transformer attention with centralized MLP module for efficient medical time series modeling, reducing complexity to linear.
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using plain transformer and depth-ray targets.
DepthLM shows VLMs can match pure vision models in metric depth estimation with text-based supervised finetuning and visual prompting without architecture changes.
DiffMPC provides GPU-accelerated differentiable MPC solver leveraging problem structure for efficient parallelization.
WGM-based methods provide efficient domain discovery with near-optimal guarantees for missing mass on Zipfian data.
EBTs frame System 2 thinking as energy minimization enabling inference-time reasoning emergence across modalities.
Prophet identifies early answer convergence in diffusion language models to accelerate decoding by 3.4x on reasoning tasks.
DiffusionNFT enables efficient online reinforcement learning for diffusion models via forward process optimization with up to 25x efficiency gains.
Proposes Discount Model Search for quality diversity optimization in high-dimensional measure spaces.
Characterizes distributional equivalence for linear non-Gaussian latent-variable cyclic causal models without structural assumptions.
DTO-KD uses multi-objective optimization to dynamically balance task and distillation losses at gradient level for better knowledge distillation.
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.
PAPL aligns discrete diffusion training with planning-based inference via planned ELBO for improved text and protein generation.
WASI applies subspace-based training to transformer models reducing memory by 62x and FLOPs by 2x while maintaining accuracy on edge devices.
EigenBench measures language model value alignment using model ensemble judgments aggregated with EigenTrust without ground truth labels.
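EigenTrust itself is a classic reputation algorithm, so the aggregation step is easy to sketch: the global trust vector is the principal left eigenvector of the row-normalized peer-judgment matrix, found by power iteration. A generic version follows (not EigenBench's code; the judge-on-judge score matrix is a hypothetical stand-in):

```python
import numpy as np

def eigentrust(local_trust: np.ndarray, iters: int = 100) -> np.ndarray:
    """Classic EigenTrust aggregation by power iteration (sketch).

    local_trust[i, j] = how favorably judge i rates judge j's outputs
    (nonnegative; each row assumed to have positive sum). Returns
    global trust weights summing to 1.
    """
    C = local_trust / local_trust.sum(axis=1, keepdims=True)
    t = np.full(C.shape[0], 1.0 / C.shape[0])  # uniform prior
    for _ in range(iters):
        t = C.T @ t                            # t_{k+1} = C^T t_k
    return t
```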
EmotionThinker reformulates speech emotion recognition as deep reasoning with prosody enhancement and specialized reinforcement learning.
Common Corpus releases 2 trillion permissively-licensed tokens for open-science LLM pre-training covering diverse languages.
AIGB-Pearl enhances generative auto-bidding with trajectory evaluator and KL-Lipschitz-constrained optimization for safe exploration beyond offline data.
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Graph embeddings exhibit exchangeability property, enabling efficient graph retrieval via transport-based similarity approximation with locality-sensitive hashing.
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
Proposes ExDM using diffusion models for exploration and policy learning in unsupervised reinforcement learning.
ReaSyn iteratively refines synthetic pathways bidirectionally with discrete flow models for synthesizable molecular design.
Reveals long sequence modeling degrades gene expression prediction; proximal epigenomic signals with confounding mitigation suffice.
Triple-BERT addresses order dispatching via centralized SARL with action decomposition and BERT-based attention.
Analyzes phase retrieval learning dynamics with anisotropic data, deriving explicit scaling laws and three-phase trajectories.
Frozen-PINNs employ space-time separation with random features for fast, accurate PDE solving without gradient descent.
FIRE balances stability-plasticity tradeoff using Frobenius error and isometry deviation constraints without heavy hyperparameter tuning.
Accelerates video LLMs via training-free spatiotemporal token merging, retaining 99.1% performance with 10% of tokens.
Proposes FlashWorld generating high-quality 3D scenes in seconds using dual-mode diffusion with cross-mode distillation.
UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.
Characterizes in-context learning capabilities of Mamba, showing it learns optimal Laplacian smoothing estimator.
RNN models of hippocampus reveal how locomotor development statistics shape emergence of spatial neural representations.
Gaia2 benchmarks LLM agents in asynchronous dynamic environments with action-level verification for RL training.
Analyzes machine unlearning in high dimensions, showing a single noisy Newton step with Gaussian noise suffices for the privacy-accuracy trade-off.
EditVerse unifies image and video generation/editing via token sequences enabling cross-modal knowledge transfer.
Introduces distribution-over-distribution model combining geometry distributions with two-stage flow matching for human 3D generation.
OmniVerifier provides universal visual verification for multimodal reasoning and introduces sequential test-time scaling for image generation and editing.
GEPA uses genetic-Pareto selection with natural language reflection to outperform RL-based prompt optimization with 35x fewer rollouts.
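The "Pareto" half of that selection scheme is a standard non-dominated filter, worth seeing once. A generic sketch (not GEPA's implementation), assuming each candidate prompt carries a vector of per-task scores where higher is better:

```python
from typing import Sequence

def pareto_front(scores: Sequence[Sequence[float]]) -> list[int]:
    """Return indices of non-dominated candidates (higher is better).

    Candidate i is dominated if some j scores >= on every objective
    and strictly > on at least one. Generic sketch, not GEPA's code.
    """
    front = []
    for i, si in enumerate(scores):
        dominated = any(
            all(a >= b for a, b in zip(sj, si)) and
            any(a > b for a, b in zip(sj, si))
            for j, sj in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# e.g. per-task accuracies of three candidate prompts:
# pareto_front([[0.8, 0.2], [0.5, 0.9], [0.4, 0.1]]) -> [0, 1]
```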
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Solves optimal multi-draft speculative sampling via convex optimization achieving 90% acceptance rates.
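For background, the single-draft acceptance rule that multi-draft work generalizes is short enough to state in code: accept a drafted token with probability min(1, p_target/p_draft), and resample from the clipped residual on rejection. This preserves the target distribution exactly; the paper's contribution is choosing the optimal acceptance rule when several drafts are available, which this sketch does not cover.

```python
import numpy as np

def accept_or_resample(p_tgt, p_drf, draft_tok, rng):
    """Standard single-draft speculative sampling step (background,
    not the paper's multi-draft method). Preserves the target
    distribution exactly regardless of draft quality. Assumes the
    drafted token has nonzero draft probability (true by construction).
    """
    if rng.random() < min(1.0, p_tgt[draft_tok] / p_drf[draft_tok]):
        return draft_tok                       # accept drafted token
    residual = np.maximum(p_tgt - p_drf, 0.0)  # clipped residual
    residual = residual / residual.sum()
    return rng.choice(len(p_tgt), p=residual)  # resample on rejection
```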
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Gradient-aware diagnostic tool using saliency to identify hallucination patterns, proposing SGRS and LocoRE interventions to reduce output errors.
HATSolver uses hierarchical attention transformers to compute Gröbner bases for multivariate polynomial systems more efficiently than flat attention models.
Demonstrates that a covariance matching procedure yields better synthetic data for training neural networks than mean shift or other approaches.
Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.
Study reveals incompatibility between ascending quality curriculum and decaying learning rate in LLM pretraining, proposing moderated decay and model averaging solutions.
Work establishes meta-evaluation measures showing many micro-benchmarks cannot reliably rank similar-performing models.
Releases Hubble suite of open-source LLMs with controlled perturbed variants to systematically study memorization risks.
HGM identifies metaproductivity-performance mismatch and uses clade-based lineage metrics to guide self-improving coding agents.
Hyperparameter Trajectory Inference uses conditional Lagrangian optimal transport to reconstruct neural network outputs across hyperparameter spectra without expensive retraining.
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
In-Place TTT framework enables LLMs to perform test-time training by adapting MLP projection matrices with alignment to next-token prediction.
AgentFlow trainable in-the-flow agentic system using Flow-GRPO for on-policy learning with long-horizon sparse rewards.
Shows InfoNCE loss induces Gaussian distribution in contrastive representations, providing principled explanation for observed Gaussianity.
Proposes information-theoretic Lagrangian formulation to balance simplicity and expressiveness in Koopman representation learning for dynamical systems.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.
Theory of context length scaling via Intrinsic Entropy, explaining the relationship between optimal context length and training dataset size.
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
einx is a universal notation for tensor operations using vectorization, reducing large APIs to small, consistent operation sets.
LatentFT provides frequency-domain controls for generative music via diffusion autoencoder with latent-space Fourier transform enabling timescale-based manipulation.
LPWM enables self-supervised object-centric world modeling with latent action module for stochastic video generation and control.
Aggregates speech tokens into latent patches for efficient speech-text modeling with cross-modal alignment.
Systematic study reveals LLMs acquire visual perception priors from diverse data and reasoning priors from code/math corpora.
L2Seg accelerates vehicle routing solvers 2-7x by learning to identify stable and unstable solution segments.
Proposes framework to handle noisy entity-attribute and inter-graph correspondences in multi-modal entity alignment.
HyCa uses hybrid ODE solvers with dimension-wise caching strategies to accelerate diffusion transformers by 5-6x without retraining.
LLM DNA low-dimensional functional representation reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
Study showing LLMs exhibit 39% average performance drop in multi-turn conversations, failing to recover from wrong contextual assumptions.
Introduces parallel decoding for autoregressive image generation with flexible ordering achieving 3.4x latency reduction.
LongWriter-Zero applies RL from scratch to achieve ultra-long text generation without synthetic training data.
LoongRL uses emergent plan-retrieve-reason-recheck pattern trained on long-context tasks to generalize beyond training length.
Mamba-3 achieves 1.8 percentage point accuracy gain over Mamba-2 via expressive recurrence, complex-valued state updates, and MIMO formulation.
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
MC-Search benchmark evaluates multimodal agentic RAG with step-wise reasoning chains and introduces Search-Align for improved planning.
mCLM uses modular chemical language combining natural language and molecular building blocks for function-aware synthesis.
MVP achieves fastest one-step action generation with instantaneous velocity constraint providing high expressiveness for robotic control.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
MetaEmbed uses learnable meta tokens with matryoshka training to enable test-time scaling for multimodal retrieval balancing quality and efficiency.
MoEs with optimal activation rates surpass dense LLMs under equal resource constraints (parameters, compute, data) when paired with a data reuse strategy.
MF-GIA framework enables graph neural networks to perform in-context learning across heterogeneous domains without modality assumptions using gradient fingerprints.
MomaGraph learns unified task-oriented scene representations integrating spatial-functional relationships for embodied agents to perform planning and manipulation.
RoSE estimates surface normals via shading sequence prediction, addressing 3D misalignment in monocular normal estimation.
Introduces MotionStream enabling sub-second latency motion-controlled infinite-length video generation via causal diffusion.
MrRoPE generalizes RoPE-extension via radix system conversion, achieving train-short-test-long with doubled effective context window.
GraphGlue uses Riemannian geometry to merge multi-domain graphs into unified manifolds, enabling knowledge transfer across graph domains.
MASK aligns semantic knowledge between images and text using word embeddings as bridges to match out-of-distribution words in unpaired matching.
MNPO extends Nash learning to multiplayer regime for aligning LLMs with heterogeneous human preferences via n-player game formulation.
Interprets neural autoencoders as dynamical systems with latent vector fields to analyze generalization, memorization, and out-of-distribution detection.
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
NextStep-1 achieves state-of-the-art autoregressive text-to-image generation by modeling continuous image tokens with lightweight flow matching instead of diffusion.
Provides first finite-confidence analysis of Track-and-Stop and Sticky Track-and-Stop algorithms for pure exploration problems.
Develops efficient federated optimization algorithm with cost-aware client selection achieving best communication and local complexity.
Omni-Reward addresses modality imbalance and preference rigidity with omni-modal reward modeling framework.
Camera-Aware MLLM framework improves spatial reasoning by injecting camera parameters and using geometric augmentation.
Theoretical analysis shows difficult examples hurt unsupervised contrastive learning generalization more than supervised settings.
Shows decentralized learning with single global merging achieves convergence rates matching parallel SGD under data heterogeneity.
Geodesic PCA for probability distributions using Wasserstein geometry with neural network parametrization for continuous distributions.
Unified framework for imbalanced graph classification using dynamic balanced prototypes and prototype load-balancing optimization.
MRT systematically stress-tests LLM agent monitoring, revealing that agent awareness dominates and hybrid scaffolding enables weak-to-strong monitoring.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
OpenThoughts releases open-source datasets and models for training reasoning tasks, achieving state-of-the-art on AIME and code benchmarks.
MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.
OpTI-BFM uses optimistic decision criterion modeling uncertainty over reward functions to enable efficient task inference for behavior foundation models.
Hierarchical Speculative Decoding uses lossless verification to maximize accepted tokens while preserving target distribution fidelity.
Analyzes how overparametrization shifts BBP transition point in loss landscape, bending geometric properties.
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
P-GenRM transforms user preferences into adaptive personas and scoring rubrics with test-time scaling for personalized reward modeling.
Enables parallel training of nonlinear RNNs via Newton's method achieving 665x speedup over sequential application.
Pareto-Conditioned Diffusion formulates offline multi-objective optimization as conditional sampling problem avoiding explicit surrogate models.
Partition Generative Models replace masking with partitioning for efficient parallel generation, achieving higher throughput than masked generative models.
PATEGAIL++ privacy-preserving trajectory generation framework using sensitivity-aware noise allocation for improved privacy-utility trade-off.
Enforces convex output constraints via operator splitting enabling fast parametric optimization solving.
Theoretical characterization shows MDMs are expressively equivalent to padded looped transformers, more efficient for parallel problems.
Proposes CompSLOT framework extracting interpretable concepts from vision transformers to enhance continual learning.
Shows optimal weight decay is 30x larger than standard practice; ensembling achieves lower loss asymptote enabling data-efficient pre-training at scale.
LeanHammer combines neural premise selection with symbolic automation for first end-to-end hammer in Lean proof assistant.
Proposes probabilistic kernel functions for angle testing enabling efficient approximate nearest neighbor search.
Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.
Quantitative bounds show training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.
Quotient-space diffusion models reduce learning difficulty for molecular structure generation via SE(3) symmetry handling.
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
Proposes RAIN-Merging to merge instruction-tuned and reasoning models while preserving structured thinking format.
RealPDEBench first benchmark integrating real-world measurements with paired simulations across five physical systems for scientific ML evaluation.
RALI framework aligns images to text representations from reasoning MLLMs using contrastive learning, achieving comparable image quality assessment performance with <5% of the parameters.
Power sampling algorithm elicits strong reasoning from base models at inference time via MCMC without additional training.
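The core trick is sketchable: to sample from pi(x) proportional to p(x)^alpha over whole completions, run Metropolis-Hastings with fresh base-model samples as independence proposals, so the acceptance ratio collapses to (p(x')/p(x))^(alpha-1). The paper's actual kernel likely differs (e.g., block-wise resampling); `sample_and_logprob` below is a hypothetical stub standing in for the base model.

```python
import math
import random

def power_sample(sample_and_logprob, alpha=2.0, steps=50):
    """Metropolis-Hastings targeting pi(x) ~ p(x)**alpha (sketch).

    With independence proposals q = p, the MH acceptance ratio
    simplifies to (p(x')/p(x))**(alpha - 1). `sample_and_logprob`
    is a hypothetical stub: it draws a full completion from the
    base model and returns (text, log p(text)).
    """
    x, logp_x = sample_and_logprob()
    for _ in range(steps):
        x_new, logp_new = sample_and_logprob()
        log_accept = (alpha - 1.0) * (logp_new - logp_x)
        if math.log(random.random()) < log_accept:
            x, logp_x = x_new, logp_new
    return x
```

Raising alpha sharpens the sequence-level distribution toward high-likelihood completions without touching the weights, which is why no additional training is needed; note that sequence-level sharpening is not the same as lowering the per-token temperature.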
Introduces RedTeamCUA framework with hybrid web-OS sandbox for adversarial testing of computer-use agents.
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.
MetamerGen generates scene metamers aligned with human perception using foveal/peripheral features and latent diffusion.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
Rodrigues Networks inject kinematics-aware inductive biases for improved action learning in articulated robot tasks.
SafeDPO reformulates safety alignment as closed-form objective, achieving strong safety-helpfulness trade-offs without auxiliary models.
SGF unifies negative guidance in safe generation via MMD potentials and control barrier analysis with time-critical guidance windows.
Generates minute-long high-resolution videos efficiently with linear attention and constant-memory KV cache for block autoregression.
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
Proteina-Complexa unifies generative modeling and hallucination for atomistic binder design via pretraining on Teddymer and test-time optimization.
Analyzes scaling laws for shallow networks with feature learning via sparse estimation and matrix compression theory.
PRISM framework projects fMRI signals into structured text space for visual stimulus reconstruction with object-centric diffusion and attribute search modules.
SSPO achieves data efficiency in preference optimization by pseudo-labeling unpaired data using theoretically-grounded reward thresholds.
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
Proves length-generalizable softmax transformers with chain-of-thought and relative positional encoding are Turing-complete.
Speculative Actions accelerates agent systems by predicting and executing likely future actions in parallel.
Watermarks diffusion models losslessly via spherical mapping preserving Gaussian prior up to third-order moments.
Generates ultra-long videos by actively correcting self-generated errors through error-recycling fine-tuning.
Framework studying strategic control of social learning by algorithmic information mediators with theoretical analysis and LLM-based simulations.
Structured Flow Autoencoders integrate flow matching with graphical models for structured representation learning.
SwingArena evaluates LLMs on GitHub issue solving via adversarial framework modeling submitter-reviewer collaboration with retrieval-augmented code generation.
TabStruct benchmark evaluates tabular data generators on structural fidelity and conventional dimensions using global utility metric without ground-truth causal structures.
LoRA-Pre low-rank optimizer reduces momentum matrix memory via online linear learner decomposition while maintaining optimization performance.
ABOM performs task-free adaptive meta black-box optimization using online parameter adaptation without predefined task distributions.
Learns zero-shot RL representations via temporal difference latent prediction recovering successor factorization.
Temporal Sparse Autoencoders incorporate contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
Studies temporal superposition in RNNs showing how memory demands affect representational geometry and RNNs learn different encoding strategies.
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
ScaleRL provides principled framework for predicting RL compute scaling in LLMs through 400,000 GPU-hour study.
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
Polar Express computes polar decomposition with minimax-optimized update rules for efficient GPU-friendly training.
Uses persistent homology to characterize topological compression in LLM latent spaces induced by adversarial inputs.
Spacetime perspective views diffusion latent spaces as Fisher-Rao metric manifolds enabling efficient geodesic computation without simulation.
Compresses KV cache in reasoning models via thought-adaptive quantization and eviction achieving near-lossless accuracy.
VC-STaR mitigates visual hallucinations through contrastive VQA pairs for self-improving visual reasoning.
TileLang enables hardware-aware fused kernel programming with tile inference and recommendation achieving 5-6x speedup.
Shows tool-use enables state space models to achieve length generalization previously limited by fixed-size memory.
Proposes token-importance guided DPO with gradient attribution weighting and triplet loss for fine-grained LLM alignment.
TRACE reveals diffusion models encode hidden instance boundary priors and leverages them for unsupervised instance segmentation without dense annotations.
Proposes train-before-test approach showing model potential rankings transfer across benchmarks better than direct evaluation.
Proves transformers with unique-hard attention are exponentially more succinct than finite automata and LTL formulas but verification is EXPSPACE-complete.
Characterizes online learning with ranking feedback showing sublinear regret impossible in general, possible with variation bounds.
TROLL replaces PPO clip objective with differentiable trust region projection for more stable and efficient LLM reward fine-tuning.
Presents XFactor, first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.
TTSDS2 metric robustly correlates with human judgments for TTS evaluation across diverse speech domains maintaining >0.5 Spearman correlation.
UALM unified audio language model handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
Proposes CorreGen, generative framework for multi-view clustering under noisy correspondence using EM algorithm.
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.
Veritas deepfake detector uses pattern-aware reasoning via MLLMs to achieve superior generalization across unseen forgery techniques and data domains.
Presents VibeVoice for zero-shot expressive long-form multi-speaker podcast generation using next-token diffusion.
Vid-LLM is a video-based 3D multimodal LLM that extracts geometric cues from videos without external 3D data for 3D scene understanding.
Proposes visual planning paradigm using purely visual representations for reasoning in spatially-grounded tasks.
VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.
WAFT replaces cost volumes with high-resolution warping for optical flow, ranking first on Spring, Sintel, and KITTI with 1.3-4.1x faster inference.
FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.
Creates first unified audio-visual embedding space for text, audio, and video with hierarchical fusion and prompt-awareness.
FALCON enables few-step flow-based sampling with accurate likelihoods for efficient Boltzmann distribution sampling.
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.
AuxDPO introduces auxiliary variables mitigating DPO misspecification and moving toward RLHF solutions.
Analyzes low-precision flash attention training failure caused by low-rank representations and biased BF16 rounding errors.
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.
WSM establishes theoretical connection between LR decay and model merging for improved LLM pre-training.