$p\textrm{-less}$ Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
$p$-less sampling dynamically sets the truncation threshold using information theory, yielding hyperparameter-free LLM decoding with robust quality at high temperatures.
Language models, chain-of-thought, reasoning, RLHF, alignment post-training, and evaluation of LLM capabilities.
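A minimal sketch of the core idea, assuming the truncation threshold is derived from the Shannon entropy of the next-token distribution (keep only tokens whose surprisal does not exceed the entropy); the paper's exact information-theoretic rule may differ:

```python
import numpy as np

def pless_sample(logits, temperature=1.0, rng=None):
    # Temperature-scaled softmax over the vocabulary.
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    # Shannon entropy (nats) of the next-token distribution.
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    # Keep only "typical" tokens: surprisal no larger than the entropy.
    # The highest-probability token always satisfies this, so the mask is non-empty.
    keep = -np.log(probs + 1e-12) <= entropy
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=probs)
```

Because the cutoff is computed from the distribution itself, no fixed $p$ or $k$ hyperparameter is needed, and flatter (high-temperature) distributions automatically admit more tokens.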
FFDP framework scales image registration to 100μm human brain MRI volumes using IO-aware kernels and distributed tensor sharding.
Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.
AdAEM dynamically generates value-assessment questions for LLMs by probing internal value boundaries using in-context optimization.
ADP lightweight protocol unifies 13 heterogeneous agent datasets into single training schema achieving 20% average performance gain over base models.
AutoEP uses LLM reasoning with real-time landscape analysis to dynamically control metaheuristic algorithms without training.
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Framework detects self-initiated deception in LLMs via statistical metrics, showing that both deceptive intention and deceptive behavior correlate with task difficulty.
BIRD-INTERACT benchmark evaluates LLMs on dynamic multi-turn text-to-SQL tasks with function-driven user simulator and dual interaction settings.
WebDevJudge benchmark reveals significant LLM-as-judge gaps due to failures in functional equivalence and feasibility verification.
CounselBench large-scale benchmark with 2000 expert evaluations and 120 adversarial questions for evaluating LLMs in mental health question answering.
Expert-Router Coupling loss tightly couples MoE router decisions with expert capabilities by treating router embeddings as proxy tokens.
CoTAR replaces transformer attention with a centralized MLP module for efficient medical time-series modeling, reducing complexity to linear.
DepthLM shows VLMs can match pure vision models in metric depth estimation with text-based supervised finetuning and visual prompting without architecture changes.
Prophet identifies early answer convergence in diffusion language models to accelerate decoding by 3.4x on reasoning tasks.
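One hedged sketch of what such early commitment could look like: monitor per-position confidence during iterative denoising and stop once every position clears a margin. `step_fn` is a hypothetical single denoising step, and Prophet's actual convergence criterion may differ.

```python
import numpy as np

def decode_with_early_commit(step_fn, x, num_steps, margin=2.0):
    # Iteratively denoise; commit the answer once all positions look converged.
    for step in range(num_steps):
        x, logits = step_fn(x, step)               # hypothetical denoising step
        top2 = np.sort(logits, axis=-1)[..., -2:]  # two largest logits per position
        gap = top2[..., 1] - top2[..., 0]          # top-1 vs top-2 confidence margin
        if np.all(gap >= margin):                  # every position is confident
            break                                  # answer has converged: stop early
    return logits.argmax(axis=-1)
```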
Characterizes distributional equivalence for linear non-Gaussian latent-variable cyclic causal models without structural assumptions.
EmotionThinker reformulates speech emotion recognition as deep reasoning with prosody enhancement and specialized reinforcement learning.
Common Corpus releases 2 trillion permissively-licensed tokens for open-science LLM pre-training covering diverse languages.
ReaSyn iteratively refines synthetic pathways bidirectionally with discrete flow models for synthesizable molecular design.
Reveals that long-sequence modeling degrades gene expression prediction; proximal epigenomic signals with confounding mitigation suffice.
FIRE balances stability-plasticity tradeoff using Frobenius error and isometry deviation constraints without heavy hyperparameter tuning.
Accelerates video LLMs via training-free spatiotemporal token merging, retaining 99.1% performance with 10% of tokens.
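A hedged sketch of the merging primitive, repeatedly averaging the most cosine-similar pair of adjacent tokens; the paper's spatiotemporal pairing across frames is more structured than this generic version.

```python
import torch

def merge_adjacent_tokens(x, num_merges):
    # x: (num_tokens, dim); requires num_merges < num_tokens.
    for _ in range(num_merges):
        sim = torch.cosine_similarity(x[:-1], x[1:], dim=-1)  # adjacent-pair similarity
        i = int(sim.argmax())                                  # most redundant pair
        merged = (x[i] + x[i + 1]) / 2                         # average the two tokens
        x = torch.cat([x[:i], merged[None], x[i + 2:]], dim=0)
    return x
```

Because merging only averages existing features, it needs no retraining, matching the training-free claim.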
Proposes FlashWorld generating high-quality 3D scenes in seconds using dual-mode diffusion with cross-mode distillation.
UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.
Characterizes in-context learning capabilities of Mamba, showing it learns optimal Laplacian smoothing estimator.
Gaia2 benchmarks LLM agents in asynchronous dynamic environments with action-level verification for RL training.
OmniVerifier provides universal visual verification for multimodal reasoning and introduces sequential test-time scaling for image generation and editing.
GEPA uses genetic-Pareto selection with natural language reflection to outperform RL-based prompt optimization with 35x fewer rollouts.
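As a hedged sketch of the Pareto-selection step only (the natural-language reflection and mutation machinery is omitted): keep every prompt candidate that no other candidate dominates on the per-task scores.

```python
def pareto_front(scores):
    # scores: dict mapping candidate name -> list of per-task scores.
    def dominates(a, b):
        # a dominates b: at least as good everywhere, strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [c for c, s in scores.items()
            if not any(dominates(t, s) for o, t in scores.items() if o != c)]
```

Selecting the whole front, rather than a single best average scorer, preserves candidates that excel on different task subsets.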
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Gradient-aware diagnostic tool using saliency to identify hallucination patterns, proposing SGRS and LocoRE interventions to reduce output errors.
HATSolver uses hierarchical attention transformers to compute Gröbner bases for multivariate polynomial systems more efficiently than flat attention models.
Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.
Study reveals incompatibility between ascending quality curriculum and decaying learning rate in LLM pretraining, proposing moderated decay and model averaging solutions.
Work establishes meta-evaluation measures showing many micro-benchmarks cannot reliably rank similar-performing models.
HGM identifies metaproductivity-performance mismatch and uses clade-based lineage metrics to guide self-improving coding agents.
In-Place TTT framework enables LLMs to perform test-time training by adapting MLP projection matrices with alignment to next-token prediction.
AgentFlow trainable in-the-flow agentic system using Flow-GRPO for on-policy learning with long-horizon sparse rewards.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.
Theory of context length scaling through Intrinsic Entropy explaining optimal context length and training dataset size relationship.
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
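A hedged sketch of the measurement: truncate the chain of thought at increasing fractions and re-score the final answer. `answer_prob_fn` is a hypothetical wrapper returning P(answer | prompt + partial CoT); a flat curve (the answer fixed before the reasoning finishes) would indicate low reasoning effort.

```python
def reasoning_effort_curve(answer_prob_fn, cot_tokens,
                           fractions=(0.25, 0.5, 0.75, 1.0)):
    # Probe how much of the chain of thought the answer actually depends on.
    curve = {}
    for f in fractions:
        prefix = cot_tokens[: int(len(cot_tokens) * f)]  # keep a CoT prefix
        curve[f] = answer_prob_fn(prefix)                # confidence given the prefix
    return curve
```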
Systematic study reveals LLMs acquire visual perception priors from diverse data and reasoning priors from code/math corpora.
HyCa uses hybrid ODE solvers with dimension-wise caching strategies to accelerate diffusion transformers by 5-6x without retraining.
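A hedged sketch of the caching primitive; HyCa's contribution is choosing a caching strategy per feature dimension (which this uniform version omits), but the reuse pattern looks roughly like:

```python
def cached_block(block_fn, x, step, cache, refresh_every=5):
    # Recompute the transformer block only on refresh steps; otherwise
    # reuse the cached features from the last full computation.
    if step % refresh_every == 0 or "out" not in cache:
        cache["out"] = block_fn(x, step)
    return cache["out"]
```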
LLM DNA, a low-dimensional functional representation, reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
Study showing LLMs exhibit 39% average performance drop in multi-turn conversations, failing to recover from wrong contextual assumptions.
LongWriter-Zero applies RL from scratch to achieve ultra-long text generation without synthetic training data.
LoongRL uses emergent plan-retrieve-reason-recheck pattern trained on long-context tasks to generalize beyond training length.
Mamba-3 achieves 1.8 percentage point accuracy gain over Mamba-2 via expressive recurrence, complex-valued state updates, and MIMO formulation.
MC-Search benchmark evaluates multimodal agentic RAG with step-wise reasoning chains and introduces Search-Align for improved planning.
mCLM uses modular chemical language combining natural language and molecular building blocks for function-aware synthesis.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
MoEs with optimal activation rates surpass dense LLMs under equal resource constraints (parameters, compute, data) when paired with a data-reuse strategy.
MF-GIA framework enables graph neural networks to perform in-context learning across heterogeneous domains without modality assumptions using gradient fingerprints.
MNPO extends Nash learning to multiplayer regime for aligning LLMs with heterogeneous human preferences via n-player game formulation.
Camera-Aware MLLM framework improves spatial reasoning by injecting camera parameters and using geometric augmentation.
Theoretical analysis shows difficult examples hurt unsupervised contrastive learning generalization more than supervised settings.
OpenThoughts releases open-source datasets and models for reasoning, achieving state-of-the-art results on AIME and code benchmarks.
MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.
P-GenRM transforms user preferences into adaptive personas and scoring rubrics with test-time scaling for personalized reward modeling.
Enables parallel training of nonlinear RNNs via Newton's method achieving 665x speedup over sequential application.
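To make the idea concrete, here is a hedged sketch that refines all hidden states simultaneously; for clarity it uses plain fixed-point (Picard) sweeps, whereas the paper applies Newton's method to the same residual for far faster convergence.

```python
import numpy as np

def parallel_rnn_states(f, x, h0, num_iters=50):
    # Solve H_t = f(H_{t-1}, x_t) for all t at once by iterative refinement.
    # f must be vectorized over the time axis.
    T = len(x)
    H = np.zeros((T,) + np.shape(h0))          # initial guess for all states
    for _ in range(num_iters):
        prev = np.concatenate([[h0], H[:-1]])  # shifted states, built in parallel
        H = f(prev, x)                          # update every time step at once
    return H
```

Each sweep evaluates `f` on every time step simultaneously, so the work parallelizes across the sequence instead of running the recurrence serially; e.g. with the contractive map `f = lambda h, x: np.tanh(0.5 * h + x)` the sweeps converge to the same states as a sequential loop.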
Partition Generative Models replace masking with partitioning for efficient parallel generation, achieving higher throughput than masked generative models.
Shows optimal weight decay is 30x larger than standard practice; ensembling achieves lower loss asymptote enabling data-efficient pre-training at scale.
LeanHammer combines neural premise selection with symbolic automation for first end-to-end hammer in Lean proof assistant.
Proposes RAIN-Merging to merge instruction-tuned and reasoning models while preserving structured thinking format.
RALI framework aligns images to text representations from reasoning MLLMs using contrastive learning, achieving comparable image quality assessment performance with <5% of the parameters.
Power sampling algorithm elicits strong reasoning from base models at inference time via MCMC without additional training.
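A hedged Metropolis-Hastings sketch targeting $p(x)^\alpha$: with a proposal that resamples a random suffix from the base model, the acceptance ratio collapses to $(p_\text{new}/p_\text{old})^{\alpha-1}$ on that suffix. `sample_suffix` and `suffix_logprob` are hypothetical wrappers around the base LM, and the paper's proposal may differ.

```python
import random

def power_sample(sample_suffix, suffix_logprob, x, alpha=2.0, steps=200):
    # x: list of token ids; suffix_logprob(seq, i) = log p(seq[i:] | seq[:i]).
    for _ in range(steps):
        i = random.randrange(1, len(x))               # random suffix start
        proposal = x[:i] + sample_suffix(x[:i])       # resample suffix from base LM
        delta = suffix_logprob(proposal, i) - suffix_logprob(x, i)
        # Accept with prob min(1, exp((alpha-1)*delta)); -Exp(1) ~ log(Uniform).
        if -random.expovariate(1.0) < (alpha - 1.0) * delta:
            x = proposal
    return x
```

Raising the base distribution to a power $\alpha > 1$ concentrates mass on high-likelihood completions without any additional training.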
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
SafeDPO reformulates safety alignment as closed-form objective, achieving strong safety-helpfulness trade-offs without auxiliary models.
SSPO achieves data efficiency in preference optimization by pseudo-labeling unpaired data using theoretically-grounded reward thresholds.
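A hedged sketch of the pseudo-labeling step; `tau_pos` and `tau_neg` stand in for the paper's theoretically derived thresholds.

```python
def pseudo_pairs(responses, reward_fn, tau_pos, tau_neg):
    # Split unpaired responses by reward threshold, then pair them up.
    chosen = [r for r in responses if reward_fn(r) >= tau_pos]
    rejected = [r for r in responses if reward_fn(r) <= tau_neg]
    return [(c, r) for c in chosen for r in rejected]  # pairs for preference optimization
```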
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
Proves length-generalizable softmax transformers with chain-of-thought and relative positional encoding are Turing-complete.
SwingArena evaluates LLMs on GitHub issue solving via adversarial framework modeling submitter-reviewer collaboration with retrieval-augmented code generation.
LoRA-Pre low-rank optimizer reduces momentum matrix memory via online linear learner decomposition while maintaining optimization performance.
ScaleRL provides principled framework for predicting RL compute scaling in LLMs through 400,000 GPU-hour study.
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
Polar Express computes polar decomposition with minimax-optimized update rules for efficient GPU-friendly training.
Uses persistent homology to characterize topological compression in LLM latent spaces induced by adversarial inputs.
Compresses KV cache in reasoning models via thought-adaptive quantization and eviction achieving near-lossless accuracy.
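A hedged sketch of the quantization half only (eviction and the thought-adaptive precision schedule are omitted): uniform per-channel quantization of cached keys/values.

```python
import torch

def quantize_kv(kv, num_bits=4):
    # kv: (..., seq_len, head_dim); quantize each channel over the sequence axis.
    levels = 2 ** num_bits - 1
    lo = kv.amin(dim=-2, keepdim=True)             # per-channel min
    hi = kv.amax(dim=-2, keepdim=True)             # per-channel max
    scale = (hi - lo).clamp(min=1e-8) / levels
    q = ((kv - lo) / scale).round().clamp(0, levels).to(torch.uint8)
    return q, scale, lo                             # dequantize as q * scale + lo
```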
VC-STaR mitigates visual hallucinations through contrastive VQA pairs for self-improving visual reasoning.
Shows tool-use enables state space models to achieve length generalization previously limited by fixed-size memory.
Proposes token-importance guided DPO with gradient attribution weighting and triplet loss for fine-grained LLM alignment.
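A hedged sketch of how importance weights could enter the DPO objective (the triplet-loss term and the gradient-attribution computation of the weights are omitted); all tensor names are illustrative.

```python
import torch.nn.functional as F

def weighted_dpo_loss(lp_c, lp_c_ref, lp_r, lp_r_ref, w_c, w_r, beta=0.1):
    # lp_*: per-token log-probs under the policy / frozen reference model;
    # w_*: per-token importance weights (gradient attribution in the paper).
    margin_c = (w_c * (lp_c - lp_c_ref)).sum(-1)  # weighted chosen log-ratio
    margin_r = (w_r * (lp_r - lp_r_ref)).sum(-1)  # weighted rejected log-ratio
    return -F.logsigmoid(beta * (margin_c - margin_r)).mean()
```

Weighting the per-token log-ratios lets alignment pressure concentrate on the tokens that actually drive the preference, rather than spreading it uniformly over the sequence.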
Proposes train-before-test approach showing model potential rankings transfer across benchmarks better than direct evaluation.
Proves transformers with unique-hard attention are exponentially more succinct than finite automata and LTL formulas but verification is EXPSPACE-complete.
Veritas deepfake detector uses pattern-aware reasoning via MLLMs to achieve superior generalization across unseen forgery techniques and data domains.
Vid-LLM is a video-based 3D multimodal LLM that extracts geometric cues from videos without external 3D data for 3D scene understanding.
Proposes visual planning paradigm using purely visual representations for reasoning in spatially-grounded tasks.
VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.
FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.
Creates first unified audio-visual embedding space for text, audio, and video with hierarchical fusion and prompt-awareness.
FALCON enables few-step flow-based sampling with accurate likelihoods for efficient Boltzmann distribution sampling.
AuxDPO introduces auxiliary variables that mitigate DPO misspecification and move its solutions closer to those of RLHF.
WSM establishes theoretical connection between LR decay and model merging for improved LLM pre-training.