Themes identified by Claude Sonnet 4.6 across the author-stated future-work and limitations sections of every oral paper. Each theme cites the papers that contribute to it.
Across papers on reasoning, code generation, and instruction following, authors consistently identify RL-based fine-tuning—GRPO, PPO variants, trust-region methods—as the key next step beyond supervised learning. Multiple papers note current experiments are limited to small models (≤14B) or narrow task distributions and call for scaling RL to larger architectures, multi-turn settings, and agentic interaction. The convergence on RL as the dominant lever for eliciting complex behaviors marks a field-wide shift away from pure SFT pipelines.
“Apply framework to multi-turn RL, agentic interaction, and long-form reasoning”
“Scale TROLL to larger models and Mixture-of-Experts architectures”
“Scaling method to larger architectures with adequate compute”
ScaleRL provides a principled framework for predicting RL compute scaling in LLMs through a 400,000 GPU-hour study.
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
SparseRL leverages deep RL and pretrained models to generate high-performance CUDA code for sparse matrix operations.
TROLL replaces PPO clip objective with differentiable trust region projection for more stable and efficient LLM reward fine-tuning.
DECS framework reduces reasoning model overthinking by decoupling necessary from redundant tokens via curriculum scheduling.
Proposes token-importance guided DPO with gradient attribution weighting and triplet loss for fine-grained LLM alignment.
Triple-BERT addresses order dispatching via centralized SARL with action decomposition and BERT-based attention.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
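Several of the papers in this theme revolve around the PPO family of objectives that TROLL's trust-region projection replaces. As background, here is a minimal numpy sketch of the standard clipped surrogate (illustrative only, not any cited paper's implementation):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: clip the probability ratio to [1-eps, 1+eps]
    and take the pessimistic (elementwise min) value, so updates that move
    the policy too far receive no additional reward."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped)

# A ratio of 1.5 on a positive advantage is capped at 1.2 * advantage:
print(ppo_clip_objective(np.array([1.5]), np.array([2.0])))  # [2.4]
```

With eps = 0.2, a probability ratio of 1.5 earns no more than a ratio of 1.2 would; trust-region methods like TROLL enforce the constraint by projection rather than by this hard clip.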
A recurring limitation across language, vision, and audio models is the mismatch between training-time sequence lengths and deployment demands. Authors frequently cite the need for improved positional encodings, KV-cache compression, and memory architectures to handle longer inputs without retraining. Several papers also note the theoretical gap between results proved for short sequences and real-world long-context behaviors, and call for both architectural and theoretical work to bridge it.
“Develop more advanced memory architectures and training strategies for enhancing long-context capabilities of LLMs”
“Characterize how different positional embedding schemes affect minimum training length”
“Call on LLM builders to prioritize multi-turn reliability, as known remediations for simpler settings prove ineffective”
MemAgent uses RL-trained memory modules to enable LLMs to extrapolate from 8K to 3.5M token contexts with minimal performance degradation.
Compresses KV cache in reasoning models via thought-adaptive quantization and eviction achieving near-lossless accuracy.
MrRoPE generalizes RoPE-extension via radix system conversion, achieving train-short-test-long with doubled effective context window.
Quantitative bounds show training length required for length generalization depends on periodicity, locality, alphabet size, and model norms.
Study showing LLMs exhibit 39% average performance drop in multi-turn conversations, failing to recover from wrong contextual assumptions.
Shows tool-use enables state space models to achieve length generalization previously limited by fixed-size memory.
Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.
Theory of context length scaling through Intrinsic Entropy explaining optimal context length and training dataset size relationship.
Studies temporal superposition in RNNs, showing how memory demands affect representational geometry and how RNNs learn different encoding strategies.
SwingArena evaluates LLMs on GitHub issue solving via adversarial framework modeling submitter-reviewer collaboration with retrieval-augmented code generation.
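Much of the positional-encoding work in this theme builds on rotary embeddings (RoPE). The sketch below shows the basic rotation plus position interpolation, one common train-short-test-long trick; it is illustrative background only, not MrRoPE's radix-conversion method:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even dimension d:
    each pair (x[2i], x[2i+1]) is rotated by angle pos * base**(-2i/d)."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_interpolated(x, pos, train_len, test_len, base=10000.0):
    """Position interpolation: squeeze unseen positions back into the
    trained range by rescaling pos, instead of extrapolating."""
    return rope_rotate(x, pos * train_len / test_len, base)
```

Two properties make RoPE attractive: rotations preserve vector norms, and the inner product of two rotated vectors depends only on their relative position, which is exactly what attention scores need.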
Papers working on language, image, or audio systems uniformly name video understanding and generation as an outstanding next step, and many additionally flag missing modalities such as 3D, tabular, radar, or multilingual data. The pattern is consistent: benchmarks, reward models, and generation pipelines built for text or static images are noted as needing systematic generalization to richer temporal and sensory modalities. This reflects a convergence on omni-modal systems as the target architecture.
“Incorporate video understanding and generation tasks into evaluation system”
“Extend ADP beyond text to images, screen recordings, and multimodal data”
“Build unified audio representation for more scalable joint training”
UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.
Omni-Reward addresses modality imbalance and preference rigidity with omni-modal reward modeling framework.
RALI framework aligns images to text representations from reasoning MLLMs using contrastive learning, achieving comparable image quality assessment performance with <5% parameters.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
Aggregates speech tokens into latent patches for efficient speech-text modeling with cross-modal alignment.
UALM, a unified audio language model, handles understanding, text-to-audio generation, and multimodal reasoning in a single model, with UALM-Reason for cross-modal generative reasoning.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
ADP lightweight protocol unifies 13 heterogeneous agent datasets into single training schema achieving 20% average performance gain over base models.
Systematic study reveals LLMs acquire visual perception priors from diverse data and reasoning priors from code/math corpora.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
mCLM uses modular chemical language combining natural language and molecular building blocks for function-aware synthesis.
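The reward-modeling thread above (e.g. Omni-Reward) ultimately rests on a pairwise preference objective, typically Bradley-Terry, whatever the modality of the inputs. A minimal sketch of that loss, illustrative and not taken from any cited paper:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss used to fit reward models from
    preference data: -log sigmoid(r_chosen - r_rejected), written
    via log1p for numerical stability."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    return float(np.mean(np.log1p(np.exp(-margin))))

# A larger reward margin for the chosen response yields a lower loss:
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

Extending such a reward model to video or audio changes the encoder producing the scores, not this objective, which is part of why modality imbalance in the preference data matters so much.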
Multiple papers building evaluation suites or agent training environments note that current benchmarks are limited to short, simple tasks and that simulation fidelity, coverage of realistic adversarial conditions, and scalable automated evaluation remain open problems. Authors call for environments that support RL fine-tuning loops, multi-step interaction, and adversarial injection, pointing toward a community-wide infrastructure effort to support the next generation of agent research.
“Use OpenApps for scaling agent training pipelines via RL fine-tuning”
“Developing fresh benchmark problems using latest scientific knowledge, contamination-resistant and past training cutoff dates”
“Enable agents to learn through trial and error inside simulator rather than imitating recorded examples”
CyberGym benchmarks AI agents on 1,507 real-world vulnerabilities discovering 34 zero-days, showing top models achieve only 22% success on PoC generation.
Presents AstaBench, comprehensive benchmark suite with production-grade tools for rigorous evaluation of AI agents on scientific research tasks.
OpenApps testbed reveals UI agent reliability varies drastically across app variations despite stable within-environment performance.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
Introduces RedTeamCUA framework with hybrid web-OS sandbox for adversarial testing of computer-use agents.
ScaleCUA scales open-source computer use agents with cross-platform dataset and dual-loop data pipeline.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
Introduces EditBench benchmark for real-world LLM code editing with 545 problems from actual developer usage.
BIRD-INTERACT benchmark evaluates LLMs on dynamic multi-turn text-to-SQL tasks with function-driven user simulator and dual interaction settings.
CounselBench large-scale benchmark with 2000 expert evaluations and 120 adversarial questions for evaluating LLMs in mental health question answering.
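The environments in this theme share a reset/step interaction loop that RL fine-tuning plugs into. Below is a toy sketch of that interface with a scripted policy standing in for the LLM agent; all names are illustrative, not any benchmark's actual API:

```python
import random

class ToyTaskEnv:
    """Minimal sandbox-style environment: the agent must find a hidden
    integer; observations are 'higher'/'lower' feedback, mirroring the
    multi-step interaction these benchmarks evaluate."""
    def __init__(self, lo=0, hi=15, seed=0):
        self.lo, self.hi = lo, hi
        self.rng = random.Random(seed)

    def reset(self):
        self.target = self.rng.randint(self.lo, self.hi)
        self.steps = 0
        return "start"

    def step(self, action):
        self.steps += 1
        if action == self.target:
            return "correct", 1.0, True          # obs, reward, done
        obs = "higher" if action < self.target else "lower"
        return obs, 0.0, self.steps >= 10        # budget of 10 steps

def binary_search_agent(env):
    """A scripted policy standing in for an LLM agent."""
    obs, lo, hi = env.reset(), env.lo, env.hi
    done, reward = False, 0.0
    while not done:
        guess = (lo + hi) // 2
        obs, reward, done = env.step(guess)
        if obs == "higher":
            lo = guess + 1
        elif obs == "lower":
            hi = guess - 1
    return reward
```

Trial-and-error learning inside a simulator, as the quoted future-work item asks for, amounts to replacing the scripted policy here with one updated from the (obs, reward, done) stream.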
Papers on backdoor attacks, steganographic triggers, reward hacking, and model fingerprinting share a common structure: a threat is demonstrated at small scale and authors call for mitigation strategies, extension to larger models, and evaluation against adaptive adversaries. The repeated limitation that experiments are confined to models ≤3B–13B reveals a gap between the studied threat surface and the deployed model scale, and points toward scalable detection and defense as a pressing research direction.
“Develop technical mitigations for finetuning-activated attacks”
“Evaluate TRACE on more realistic, heterogeneous loopholes”
“Redesign deception benchmarks using statistical methods for detecting deception rather than assuming correctness of LLM responses”
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.
Framework detects self-initiated deception in LLMs via statistical metrics showing both deceptive intention and behavior correlate with task difficulty.
Introduces RedTeamCUA framework with hybrid web-OS sandbox for adversarial testing of computer-use agents.
MRT systematically stress-tests LLM agent monitoring, revealing that agent awareness dominates and that hybrid scaffolding enables weak-to-strong monitoring.
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
Ellipse signatures function as forgery-resistant model output identifiers based on high-dimensional geometric constraints.
Introduces semantically conditioned watermarks for stealthy LLM fingerprinting that remains robust across deployment scenarios.
LLM DNA low-dimensional functional representation reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
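As a toy illustration of the covert channel that steganographic attacks exploit, the sketch below hides bits in word-length parity. Everything here is hypothetical; real attacks embed payloads far more subtly inside fluent model outputs:

```python
EVEN_WORDS = ["data", "code", "text", "plan"]     # length % 2 == 0 -> bit 0
ODD_WORDS = ["model", "token", "agent", "input"]  # length % 2 == 1 -> bit 1

def hide_bits(bits):
    """Emit an innocuous-looking word sequence whose word-length parity
    encodes the secret bits."""
    return [(ODD_WORDS if b else EVEN_WORDS)[i % 4] for i, b in enumerate(bits)]

def recover_bits(words):
    """Decode: the receiver needs only the parity rule, not the word lists."""
    return [len(w) % 2 for w in words]

secret = [1, 0, 1, 1, 0]
print(recover_bits(hide_bits(secret)) == secret)  # True
```

A safety filter reading the cover text sees only benign words, which is the core difficulty these papers raise: the harmful payload is invisible to content-level inspection.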
Authors working on sparse autoencoders, causal interventions, computational graphs, and topological analysis of representations consistently note that current tools are insufficiently precise, do not scale to large models, and rely on unvalidated assumptions about linear representations. Future work directions coalesce around more faithful attribution methods, semantic grounding of identified features, and automated verification pipelines, suggesting that interpretability is moving from exploratory to engineering-grade.
“Improve interpretability tools such as more faithful sparse autoencoders and more precise attribution methods”
“Use learned features as state trackers for detecting significant changes in model behavior”
“Extract concepts and features from low-rank representation space for model-agnostic interpretability”
CRV uses attribution graphs as execution traces to verify chain-of-thought reasoning with white-box mechanistic analysis of computation failures.
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Temporal Sparse Autoencoders incorporate contrastive loss encouraging consistent feature activations over adjacent tokens to discover semantic concepts.
Uses sparse autoencoders and foundation models to discover unknown causal effects in scientific trials.
WIMHF uses sparse autoencoders to extract human-interpretable features from preference data, enabling better understanding and curation of human feedback.
Uses persistent homology to characterize topological compression in LLM latent spaces induced by adversarial inputs.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
LLM DNA low-dimensional functional representation reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.
Study of causal interventions showing they produce out-of-distribution representations, proposing Counterfactual Latent loss to mitigate harmful divergences.
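Several of these papers build on sparse autoencoders. The sketch below shows the core construction only, an overcomplete ReLU code with an L1 sparsity penalty; real SAEs for LLM activations add bias terms, decoder-norm constraints, and large-scale training, none of which is modeled here:

```python
import numpy as np

class SparseAutoencoder:
    """Minimal SAE forward pass: project activations into an overcomplete
    dictionary, keep a nonnegative (ReLU) code, and penalize its L1 norm
    so that only a few features fire per input."""
    def __init__(self, d_in, d_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(0, 0.1, (d_in, d_hidden))
        self.W_dec = rng.normal(0, 0.1, (d_hidden, d_in))

    def encode(self, x):
        return np.maximum(x @ self.W_enc, 0.0)  # sparse nonnegative code

    def loss(self, x, l1=1e-3):
        z = self.encode(x)
        recon = z @ self.W_dec
        mse = ((recon - x) ** 2).mean()
        sparsity = np.abs(z).mean()
        return mse + l1 * sparsity, z
```

The faithfulness complaints quoted above target exactly this trade-off: the L1 term buys interpretable, sparse features at the cost of reconstruction error, and it silently assumes concepts are linear directions in activation space.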
Papers on latent world models, compositional planning, and robot learning share a common roadblock: models trained on narrow domains (robotics, video games) must be generalized to diverse real-world settings, and current approaches lack reliable dynamics modeling and multi-modal conditioning. Authors point toward physics-guided training, integration of language and reward signals, and evaluation on real robotic hardware as the required next steps, reflecting a convergence on interactive simulation as foundational infrastructure for embodied AI.
“Enable unified multi-modal conditioning with simultaneous action, language and image signals”
“Physics-guided motion generation and physics-aware reinforcement post-training for precise dynamics modeling”
“Benchmark on large-scale real robotic datasets”
LPWM enables self-supervised object-centric world modeling with latent action module for stochastic video generation and control.
Learns zero-shot RL representations via temporal difference latent prediction recovering successor factorization.
Introduces closed-loop benchmark evaluating generative world models on embodied task performance rather than visual quality.
Rodrigues Networks inject kinematics-aware inductive biases for improved action learning in articulated robot tasks.
MVP achieves fastest one-step action generation with instantaneous velocity constraint providing high expressiveness for robotic control.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
Introduces CDGS integrating compositional diffusion with guided search for coherent long-horizon plan generation.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
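The "reliable dynamics modeling" this theme calls for can be grounded in a toy example: fit latent dynamics from observed transitions, then roll the learned model forward to imagine trajectories. The linear system below is purely illustrative; no cited paper uses it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "world": a 2-D latent state evolves linearly under a 1-D action.
# A_true and B_true are hypothetical ground-truth dynamics.
A_true = np.array([[0.9, 0.1], [0.0, 0.95]])
B_true = np.array([[0.0], [0.5]])

states = rng.normal(size=(500, 2))
actions = rng.normal(size=(500, 1))
next_states = states @ A_true.T + actions @ B_true.T

# Fit [A B] jointly by least squares: next = [s a] @ [A B]^T.
X = np.hstack([states, actions])
W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
A_hat, B_hat = W[:2].T, W[2:].T

def rollout(s, acts):
    """Imagine a trajectory by iterating the learned dynamics."""
    traj = [s]
    for a in acts:
        traj.append(A_hat @ traj[-1] + B_hat @ a)
    return np.stack(traj)
```

Real world models replace the least-squares fit with learned nonlinear (often stochastic) latent dynamics, which is where the papers above locate the open problems: multi-modal conditioning, physics guidance, and transfer to real hardware.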