$p\textrm{-less}$ Sampling: A Robust Hyperparameter-Free Approach for LLM Decoding
$p$-less sampling dynamically sets the truncation threshold using information theory, yielding hyperparameter-free LLM decoding with robust quality at high temperatures.
Language models, chain-of-thought, reasoning, RLHF, alignment post-training, and evaluation of LLM capabilities.
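A minimal sketch of the core idea, assuming the truncation threshold is derived from the Shannon entropy of the next-token distribution (keep only tokens whose surprisal does not exceed the entropy); the paper's exact information-theoretic rule may differ:

```python
import numpy as np

def pless_sample(logits, temperature=1.0, rng=None):
    # Temperature-scaled softmax over the vocabulary.
    z = logits / temperature
    probs = np.exp(z - np.max(z))
    probs /= probs.sum()
    # Shannon entropy (nats) of the next-token distribution.
    entropy = -np.sum(probs * np.log(probs + 1e-12))
    # Keep only "typical" tokens: surprisal no larger than the entropy.
    # The highest-probability token always satisfies this, so the mask is non-empty.
    keep = -np.log(probs + 1e-12) <= entropy
    probs = np.where(keep, probs, 0.0)
    probs /= probs.sum()
    rng = rng or np.random.default_rng()
    return rng.choice(len(probs), p=probs)
```

Because the cutoff is computed from the distribution itself, no fixed $p$ or $k$ hyperparameter is needed, and flatter (high-temperature) distributions automatically admit more tokens.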
FFDP framework scales image registration to 100μm human brain MRI volumes using IO-aware kernels and distributed tensor sharding.
Large-scale study comparing LLM-graph interaction modes for node classification, finding code generation outperforms prompting on long-text and high-degree graphs.
AdAEM dynamically generates value-assessment questions for LLMs by probing internal value boundaries using in-context optimization.
ADP lightweight protocol unifies 13 heterogeneous agent datasets into single training schema achieving 20% average performance gain over base models.
AutoEP uses LLM reasoning with real-time landscape analysis to dynamically control metaheuristic algorithms without training.
Benchmarks practical privacy risks in differential-privacy-adapted LLMs, revealing that distribution shifts and model choice impact effectiveness.
Framework detects self-initiated deception in LLMs via statistical metrics, showing that both deceptive intention and deceptive behavior correlate with task difficulty.
BIRD-INTERACT benchmark evaluates LLMs on dynamic multi-turn text-to-SQL tasks with function-driven user simulator and dual interaction settings.
WebDevJudge benchmark reveals significant LLM-as-judge gaps due to failures in functional equivalence and feasibility verification.
CounselBench large-scale benchmark with 2000 expert evaluations and 120 adversarial questions for evaluating LLMs in mental health question answering.
Expert-Router Coupling loss tightly couples MoE router decisions with expert capabilities by treating router embeddings as proxy tokens.
CoTAR replaces transformer attention with a centralized MLP module for efficient medical time-series modeling, reducing complexity to linear.
DepthLM shows VLMs can match pure vision models in metric depth estimation with text-based supervised finetuning and visual prompting without architecture changes.
Prophet identifies early answer convergence in diffusion language models to accelerate decoding by 3.4x on reasoning tasks.
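One hedged sketch of what such early commitment could look like: monitor per-position confidence during iterative denoising and stop once every position clears a margin. `step_fn` is a hypothetical single denoising step, and Prophet's actual convergence criterion may differ.

```python
import numpy as np

def decode_with_early_commit(step_fn, x, num_steps, margin=2.0):
    # Iteratively denoise; commit the answer once all positions look converged.
    for step in range(num_steps):
        x, logits = step_fn(x, step)               # hypothetical denoising step
        top2 = np.sort(logits, axis=-1)[..., -2:]  # two largest logits per position
        gap = top2[..., 1] - top2[..., 0]          # top-1 vs top-2 confidence margin
        if np.all(gap >= margin):                  # every position is confident
            break                                  # answer has converged: stop early
    return logits.argmax(axis=-1)
```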
Characterizes distributional equivalence for linear non-Gaussian latent-variable cyclic causal models without structural assumptions.
EmotionThinker reformulates speech emotion recognition as deep reasoning with prosody enhancement and specialized reinforcement learning.
Common Corpus releases 2 trillion permissively-licensed tokens for open-science LLM pre-training covering diverse languages.
ReaSyn iteratively refines synthetic pathways bidirectionally with discrete flow models for synthesizable molecular design.
Reveals that long-sequence modeling degrades gene expression prediction; proximal epigenomic signals with confounding mitigation suffice.
FIRE balances stability-plasticity tradeoff using Frobenius error and isometry deviation constraints without heavy hyperparameter tuning.
Accelerates video LLMs via training-free spatiotemporal token merging, retaining 99.1% performance with 10% of tokens.
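A hedged sketch of the merging primitive, repeatedly averaging the most cosine-similar pair of adjacent tokens; the paper's spatiotemporal pairing across frames is more structured than this generic version.

```python
import torch

def merge_adjacent_tokens(x, num_merges):
    # x: (num_tokens, dim); requires num_merges < num_tokens.
    for _ in range(num_merges):
        sim = torch.cosine_similarity(x[:-1], x[1:], dim=-1)  # adjacent-pair similarity
        i = int(sim.argmax())                                  # most redundant pair
        merged = (x[i] + x[i + 1]) / 2                         # average the two tokens
        x = torch.cat([x[:i], merged[None], x[i + 2:]], dim=0)
    return x
```

Because merging only averages existing features, it needs no retraining, matching the training-free claim.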
Proposes FlashWorld generating high-quality 3D scenes in seconds using dual-mode diffusion with cross-mode distillation.
UFEval provides unified fine-grained evaluation of multimodal LLM outputs with aspect and task generalization.
Characterizes in-context learning capabilities of Mamba, showing it learns optimal Laplacian smoothing estimator.
Gaia2 benchmarks LLM agents in asynchronous dynamic environments with action-level verification for RL training.
OmniVerifier provides universal visual verification for multimodal reasoning and introduces sequential test-time scaling for image generation and editing.
GEPA uses genetic-Pareto selection with natural language reflection to outperform RL-based prompt optimization with 35x fewer rollouts.
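As a hedged sketch of the Pareto-selection step only (the natural-language reflection and mutation machinery is omitted): keep every prompt candidate that no other candidate dominates on the per-task scores.

```python
def pareto_front(scores):
    # scores: dict mapping candidate name -> list of per-task scores.
    def dominates(a, b):
        # a dominates b: at least as good everywhere, strictly better somewhere.
        return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

    return [c for c, s in scores.items()
            if not any(dominates(t, s) for o, t in scores.items() if o != c)]
```

Selecting the whole front, rather than a single best average scorer, preserves candidates that excel on different task subsets.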
Proposes Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models with lower variance gradient estimation.
Gradient-aware diagnostic tool using saliency to identify hallucination patterns, proposing SGRS and LocoRE interventions to reduce output errors.
HATSolver uses hierarchical attention transformers to compute Gröbner bases for multivariate polynomial systems more efficiently than flat attention models.
Gradient leading-term analysis reveals how semantic associations emerge in transformers as compositions of bigram, interchangeability, and context mapping functions.
Study reveals incompatibility between ascending quality curriculum and decaying learning rate in LLM pretraining, proposing moderated decay and model averaging solutions.
Work establishes meta-evaluation measures showing many micro-benchmarks cannot reliably rank similar-performing models.
HGM identifies metaproductivity-performance mismatch and uses clade-based lineage metrics to guide self-improving coding agents.
In-Place TTT framework enables LLMs to perform test-time training by adapting MLP projection matrices with alignment to next-token prediction.
AgentFlow trainable in-the-flow agentic system using Flow-GRPO for on-policy learning with long-horizon sparse rewards.
InfoTok achieves adaptive video tokenization using information-theoretic compression and ELBO-based routing.
Avatar generation framework using MLLM semantic planning and specialized MMDiT for coherent character animations aligned with multimodal context.
Theory of context length scaling through Intrinsic Entropy explaining optimal context length and training dataset size relationship.
Demonstrates LLMs can be finetuned to generate harmful steganographically-hidden outputs while appearing benign to safety systems.
Detects implicit reward hacking by measuring reasoning effort through truncated CoT analysis.
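A hedged sketch of the measurement: truncate the chain of thought at increasing fractions and re-score the final answer. `answer_prob_fn` is a hypothetical wrapper returning P(answer | prompt + partial CoT); a flat curve (the answer fixed before the reasoning finishes) would indicate low reasoning effort.

```python
def reasoning_effort_curve(answer_prob_fn, cot_tokens,
                           fractions=(0.25, 0.5, 0.75, 1.0)):
    # Probe how much of the chain of thought the answer actually depends on.
    curve = {}
    for f in fractions:
        prefix = cot_tokens[: int(len(cot_tokens) * f)]  # keep a CoT prefix
        curve[f] = answer_prob_fn(prefix)                # confidence given the prefix
    return curve
```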
Systematic study reveals LLMs acquire visual perception priors from diverse data and reasoning priors from code/math corpora.
HyCa uses hybrid ODE solvers with dimension-wise caching strategies to accelerate diffusion transformers by 5-6x without retraining.
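A hedged sketch of the caching primitive; HyCa's contribution is choosing a caching strategy per feature dimension (which this uniform version omits), but the reuse pattern looks roughly like:

```python
def cached_block(block_fn, x, step, cache, refresh_every=5):
    # Recompute the transformer block only on refresh steps; otherwise
    # reuse the cached features from the last full computation.
    if step % refresh_every == 0 or "out" not in cache:
        cache["out"] = block_fn(x, step)
    return cache["out"]
```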
LLM DNA, a low-dimensional functional representation, reveals evolutionary relationships among 305 LLMs through phylogenetic analysis.
Study showing LLMs exhibit 39% average performance drop in multi-turn conversations, failing to recover from wrong contextual assumptions.
LongWriter-Zero applies RL from scratch to achieve ultra-long text generation without synthetic training data.
LoongRL uses emergent plan-retrieve-reason-recheck pattern trained on long-context tasks to generalize beyond training length.
Mamba-3 achieves 1.8 percentage point accuracy gain over Mamba-2 via expressive recurrence, complex-valued state updates, and MIMO formulation.
MC-Search benchmark evaluates multimodal agentic RAG with step-wise reasoning chains and introduces Search-Align for improved planning.
mCLM uses modular chemical language combining natural language and molecular building blocks for function-aware synthesis.
MedAgentGym provides scalable sandbox environment with 72K biomedical tasks for training code-centric LLM agents with RL.
MoEs with optimal activation rates surpass dense LLMs under equal resource constraints (parameters, compute, data) when paired with a data-reuse strategy.
MF-GIA framework enables graph neural networks to perform in-context learning across heterogeneous domains without modality assumptions using gradient fingerprints.
MNPO extends Nash learning to multiplayer regime for aligning LLMs with heterogeneous human preferences via n-player game formulation.
Camera-Aware MLLM framework improves spatial reasoning by injecting camera parameters and using geometric augmentation.
Theoretical analysis shows difficult examples hurt unsupervised contrastive learning generalization more than supervised settings.
OpenThoughts releases open-source datasets and models for reasoning, achieving state-of-the-art results on AIME and code benchmarks.
MoE sparsity investigation reveals optimal balance between active FLOPs and tokens-per-parameter for reasoning versus memorization.
P-GenRM transforms user preferences into adaptive personas and scoring rubrics with test-time scaling for personalized reward modeling.
Enables parallel training of nonlinear RNNs via Newton's method achieving 665x speedup over sequential application.
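To make the idea concrete, here is a hedged sketch that refines all hidden states simultaneously; for clarity it uses plain fixed-point (Picard) sweeps, whereas the paper applies Newton's method to the same residual for far faster convergence.

```python
import numpy as np

def parallel_rnn_states(f, x, h0, num_iters=50):
    # Solve H_t = f(H_{t-1}, x_t) for all t at once by iterative refinement.
    # f must be vectorized over the time axis.
    T = len(x)
    H = np.zeros((T,) + np.shape(h0))          # initial guess for all states
    for _ in range(num_iters):
        prev = np.concatenate([[h0], H[:-1]])  # shifted states, built in parallel
        H = f(prev, x)                          # update every time step at once
    return H
```

Each sweep evaluates `f` on every time step simultaneously, so the work parallelizes across the sequence instead of running the recurrence serially; e.g. with the contractive map `f = lambda h, x: np.tanh(0.5 * h + x)` the sweeps converge to the same states as a sequential loop.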
Partition Generative Models replace masking with partitioning for efficient parallel generation, achieving higher throughput than masked generative models.
Shows optimal weight decay is 30x larger than standard practice; ensembling achieves lower loss asymptote enabling data-efficient pre-training at scale.
LeanHammer combines neural premise selection with symbolic automation for first end-to-end hammer in Lean proof assistant.
Proposes RAIN-Merging to merge instruction-tuned and reasoning models while preserving structured thinking format.
RALI framework aligns images to text representations from reasoning MLLMs using contrastive learning, achieving comparable image quality assessment performance with <5% of the parameters.
Power sampling algorithm elicits strong reasoning from base models at inference time via MCMC without additional training.
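A hedged Metropolis-Hastings sketch targeting $p(x)^\alpha$: with a proposal that resamples a random suffix from the base model, the acceptance ratio collapses to $(p_\text{new}/p_\text{old})^{\alpha-1}$ on that suffix. `sample_suffix` and `suffix_logprob` are hypothetical wrappers around the base LM, and the paper's proposal may differ.

```python
import random

def power_sample(sample_suffix, suffix_logprob, x, alpha=2.0, steps=200):
    # x: list of token ids; suffix_logprob(seq, i) = log p(seq[i:] | seq[:i]).
    for _ in range(steps):
        i = random.randrange(1, len(x))               # random suffix start
        proposal = x[:i] + sample_suffix(x[:i])       # resample suffix from base LM
        delta = suffix_logprob(proposal, i) - suffix_logprob(x, i)
        # Accept with prob min(1, exp((alpha-1)*delta)); -Exp(1) ~ log(Uniform).
        if -random.expovariate(1.0) < (alpha - 1.0) * delta:
            x = proposal
    return x
```

Raising the base distribution to a power $\alpha > 1$ concentrates mass on high-likelihood completions without any additional training.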
Proposes T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
SafeDPO reformulates safety alignment as closed-form objective, achieving strong safety-helpfulness trade-offs without auxiliary models.
SSPO achieves data efficiency in preference optimization by pseudo-labeling unpaired data using theoretically-grounded reward thresholds.
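A hedged sketch of the pseudo-labeling step; `tau_pos` and `tau_neg` stand in for the paper's theoretically derived thresholds.

```python
def pseudo_pairs(responses, reward_fn, tau_pos, tau_neg):
    # Split unpaired responses by reward threshold, then pair them up.
    chosen = [r for r in responses if reward_fn(r) >= tau_pos]
    rejected = [r for r in responses if reward_fn(r) <= tau_neg]
    return [(c, r) for c in chosen for r in rejected]  # pairs for preference optimization
```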
Extended logit matrices reveal low-rank structure of language models enabling linear generation from unrelated prompts.
Develops methods for LMs to ask informative questions and make decisions under uncertainty using Bayesian Experimental Design.
SimuHome introduces Matter protocol-grounded smart home simulator and 600-episode benchmark evaluating LLM agents on device control and workflow scheduling.
Proves length-generalizable softmax transformers with chain-of-thought and relative positional encoding are Turing-complete.
SwingArena evaluates LLMs on GitHub issue solving via adversarial framework modeling submitter-reviewer collaboration with retrieval-augmented code generation.
LoRA-Pre low-rank optimizer reduces momentum matrix memory via online linear learner decomposition while maintaining optimization performance.
ScaleRL provides principled framework for predicting RL compute scaling in LLMs through 400,000 GPU-hour study.
Develops theory linking pre-training coverage to post-training success through model scaling and practical algorithms.
Polar Express computes polar decomposition with minimax-optimized update rules for efficient GPU-friendly training.
Uses persistent homology to characterize topological compression in LLM latent spaces induced by adversarial inputs.
Compresses KV cache in reasoning models via thought-adaptive quantization and eviction achieving near-lossless accuracy.
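A hedged sketch of the quantization half only (eviction and the thought-adaptive precision schedule are omitted): uniform per-channel quantization of cached keys/values.

```python
import torch

def quantize_kv(kv, num_bits=4):
    # kv: (..., seq_len, head_dim); quantize each channel over the sequence axis.
    levels = 2 ** num_bits - 1
    lo = kv.amin(dim=-2, keepdim=True)             # per-channel min
    hi = kv.amax(dim=-2, keepdim=True)             # per-channel max
    scale = (hi - lo).clamp(min=1e-8) / levels
    q = ((kv - lo) / scale).round().clamp(0, levels).to(torch.uint8)
    return q, scale, lo                             # dequantize as q * scale + lo
```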
VC-STaR mitigates visual hallucinations through contrastive VQA pairs for self-improving visual reasoning.
Shows tool-use enables state space models to achieve length generalization previously limited by fixed-size memory.
Proposes token-importance guided DPO with gradient attribution weighting and triplet loss for fine-grained LLM alignment.
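A hedged sketch of how importance weights could enter the DPO objective (the triplet-loss term and the gradient-attribution computation of the weights are omitted); all tensor names are illustrative.

```python
import torch.nn.functional as F

def weighted_dpo_loss(lp_c, lp_c_ref, lp_r, lp_r_ref, w_c, w_r, beta=0.1):
    # lp_*: per-token log-probs under the policy / frozen reference model;
    # w_*: per-token importance weights (gradient attribution in the paper).
    margin_c = (w_c * (lp_c - lp_c_ref)).sum(-1)  # weighted chosen log-ratio
    margin_r = (w_r * (lp_r - lp_r_ref)).sum(-1)  # weighted rejected log-ratio
    return -F.logsigmoid(beta * (margin_c - margin_r)).mean()
```

Weighting the per-token log-ratios lets alignment pressure concentrate on the tokens that actually drive the preference, rather than spreading it uniformly over the sequence.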
Proposes train-before-test approach showing model potential rankings transfer across benchmarks better than direct evaluation.
Proves transformers with unique-hard attention are exponentially more succinct than finite automata and LTL formulas but verification is EXPSPACE-complete.
Veritas deepfake detector uses pattern-aware reasoning via MLLMs to achieve superior generalization across unseen forgery techniques and data domains.
Vid-LLM is a video-based 3D multimodal LLM that extracts geometric cues from videos without external 3D data for 3D scene understanding.
Proposes visual planning paradigm using purely visual representations for reasoning in spatially-grounded tasks.
VLMs employ position IDs as content-independent spatial indices to solve visual binding across object features.
FAB enables adversaries to create compromised LLMs that exhibit dormant adversarial behaviors triggered only during downstream finetuning.
Creates first unified audio-visual embedding space for text, audio, and video with hierarchical fusion and prompt-awareness.
FALCON enables few-step flow-based sampling with accurate likelihoods for efficient Boltzmann distribution sampling.
AuxDPO introduces auxiliary variables that mitigate DPO misspecification and move its solutions closer to those of RLHF.
WSM establishes theoretical connection between LR decay and model merging for improved LLM pre-training.