ICLR 2026 Orals

Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training

Artyom Sorokin, Nazar Buzun, Aleksandr Anokhin, Egor Konstantinovich Vedernikov, Petr Anokhin, Mikhail Burtsev, Evgeny Burnaev

Reinforcement Learning & Agents · Thu, Apr 23 · 4:27 PM–4:37 PM · Amphitheater · Avg rating: 6.00 (range 2–8)

Abstract

Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and precludes the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks BabiLong and RULER for contexts up to 10M tokens. Code is available at: https://github.com/griver/Q-RAG.

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Fine-tunes only embedder model for multi-step retrieval via reinforcement learning in latent embedding space
  • Compute-efficient training on a single A100 GPU, versus the clusters of 8 GPUs used by RL-based multi-step retrievers
  • Achieves state-of-the-art results on BabiLong, RULER, MuSiQue, and HotpotQA with minimal performance degradation at ultra-long scales
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Reinforcement learning
  • Multi-step retrieval
  • Embedder fine-tuning
  • Value-based training
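To make the combination of methods above concrete, here is a minimal toy sketch of value-based training for multi-step retrieval. Everything in it (the linear "embedder" W, the orthonormal chunk embeddings, the bridge-then-answer task, the TD(0) updates along a demonstrated trajectory) is an illustrative assumption, not the authors' implementation; it only shows the general shape of learning Q(state, chunk) scores so that greedy retrieval chains two hops to reach the answer.

```python
import numpy as np

# Toy multi-hop setup: answering the query requires retrieving a "bridge"
# chunk first, then the "answer" chunk. Chunk embeddings are frozen and
# orthonormal; the trainable "embedder" is a single linear map W, and
# Q(state, chunk) = chunk . (W @ state).  All of this is a hypothetical
# sketch, not the Q-RAG implementation.
DIM = 8
chunks = np.eye(5, DIM)                 # 5 orthonormal chunk embeddings
BRIDGE, ANSWER = 1, 2                   # indices of the useful chunks
query = np.zeros(DIM)
query[5] = 1.0                          # initial retrieval state

def q_values(W, state):
    """Q(s, a) for every candidate chunk a."""
    return chunks @ (W @ state)

def next_state(state, chunk):
    """Fold the retrieved chunk into the state and renormalize."""
    s = state + chunk
    return s / np.linalg.norm(s)

def train(episodes=200, lr=0.2, gamma=0.9):
    """TD(0) updates along the demonstrated trajectory bridge -> answer."""
    W = np.zeros((DIM, DIM))
    for _ in range(episodes):
        s1, s2 = query, next_state(query, chunks[BRIDGE])
        # Step 2 (terminal): retrieving ANSWER yields reward 1.
        err2 = 1.0 - q_values(W, s2)[ANSWER]
        W += lr * err2 * np.outer(chunks[ANSWER], s2)
        # Step 1: no immediate reward; bootstrap from the best next value.
        target1 = gamma * q_values(W, s2).max()
        err1 = target1 - q_values(W, s1)[BRIDGE]
        W += lr * err1 * np.outer(chunks[BRIDGE], s1)
    return W

def greedy_rollout(W, steps=2):
    """Retrieve chunks greedily by Q-value."""
    state, picked = query, []
    for _ in range(steps):
        a = int(np.argmax(q_values(W, state)))
        picked.append(a)
        state = next_state(state, chunks[a])
    return picked

W = train()
print(greedy_rollout(W))  # -> [1, 2]: bridge chunk first, then the answer
```

Because only the embedder's parameters are updated (the chunk embeddings and any downstream generator stay frozen), this kind of value-based training is far cheaper than fine-tuning an LLM to drive multi-step retrieval, which is the resource argument made in the contributions above.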
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • BabiLong
  • RULER
  • MuSiQue
  • HotpotQA
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Use structured LLM feedback as a reward signal
  • Strengthen compositional and temporal reasoning directly in embedding space
  • Explore tighter integration with generation while preserving efficiency and scalability

Author keywords

  • Reinforcement Learning
  • RL
  • QA
  • Long-context
  • RAG
  • NLP
