ICLR 2026 Orals

Q-RAG: Long Context Multi‑Step Retrieval via Value‑Based Embedder Training

Artyom Sorokin, Nazar Buzun, Aleksandr Anokhin, Egor Konstantinovich Vedernikov, Petr Anokhin, Mikhail Burtsev, Evgeny Burnaev

Reinforcement Learning & Agents · Thu, Apr 23 · 4:27 PM–4:37 PM · Amphitheater · Avg rating: 6.00 (range 2–8)

Abstract

Retrieval-Augmented Generation (RAG) methods enhance LLM performance by efficiently filtering relevant context, reducing hallucinations and inference cost. However, most existing RAG methods focus on single-step retrieval, which is often insufficient for answering complex questions that require multi-step search. Recently, multi-step retrieval approaches have emerged, typically involving the fine-tuning of small LLMs to perform multi-step retrieval. This type of fine-tuning is highly resource-intensive and precludes the use of larger LLMs. In this work, we propose Q-RAG, a novel approach that fine-tunes the Embedder model for multi-step retrieval using reinforcement learning (RL). Q-RAG offers a competitive, resource-efficient alternative to existing multi-step retrieval methods for open-domain question answering and achieves state-of-the-art results on the popular long-context benchmarks BabiLong and RULER for contexts up to 10M tokens. Code is available at: https://github.com/griver/Q-RAG.

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Q-RAG fine-tunes embedders for multi-step retrieval using reinforcement learning, achieving state-of-the-art on long-context QA.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Fine-tunes only embedder model for multi-step retrieval via reinforcement learning in latent embedding space
  • Compute-efficient training on a single A100 GPU, versus the clusters of 8 GPUs used by RL-based multi-step retrievers
  • Achieves state-of-the-art results on BabiLong, RULER, MuSiQue, and HotpotQA with minimal performance degradation at ultra-long scales
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Reinforcement learning
  • Multi-step retrieval
  • Embedder fine-tuning
  • Value-based training
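To make the combination of methods above concrete, here is a minimal toy sketch of value-based training for multi-step retrieval. Everything in it (the linear "embedder" W, the orthonormal chunk embeddings, the bridge-then-answer task, the TD(0) updates along a demonstrated trajectory) is an illustrative assumption, not the authors' implementation; it only shows the general shape of learning Q(state, chunk) scores so that greedy retrieval chains two hops to reach the answer.

```python
import numpy as np

# Toy multi-hop setup: answering the query requires retrieving a "bridge"
# chunk first, then the "answer" chunk. Chunk embeddings are frozen and
# orthonormal; the trainable "embedder" is a single linear map W, and
# Q(state, chunk) = chunk . (W @ state).  All of this is a hypothetical
# sketch, not the Q-RAG implementation.
DIM = 8
chunks = np.eye(5, DIM)                 # 5 orthonormal chunk embeddings
BRIDGE, ANSWER = 1, 2                   # indices of the useful chunks
query = np.zeros(DIM)
query[5] = 1.0                          # initial retrieval state

def q_values(W, state):
    """Q(s, a) for every candidate chunk a."""
    return chunks @ (W @ state)

def next_state(state, chunk):
    """Fold the retrieved chunk into the state and renormalize."""
    s = state + chunk
    return s / np.linalg.norm(s)

def train(episodes=200, lr=0.2, gamma=0.9):
    """TD(0) updates along the demonstrated trajectory bridge -> answer."""
    W = np.zeros((DIM, DIM))
    for _ in range(episodes):
        s1, s2 = query, next_state(query, chunks[BRIDGE])
        # Step 2 (terminal): retrieving ANSWER yields reward 1.
        err2 = 1.0 - q_values(W, s2)[ANSWER]
        W += lr * err2 * np.outer(chunks[ANSWER], s2)
        # Step 1: no immediate reward; bootstrap from the best next value.
        target1 = gamma * q_values(W, s2).max()
        err1 = target1 - q_values(W, s1)[BRIDGE]
        W += lr * err1 * np.outer(chunks[BRIDGE], s1)
    return W

def greedy_rollout(W, steps=2):
    """Retrieve chunks greedily by Q-value."""
    state, picked = query, []
    for _ in range(steps):
        a = int(np.argmax(q_values(W, state)))
        picked.append(a)
        state = next_state(state, chunks[a])
    return picked

W = train()
print(greedy_rollout(W))  # -> [1, 2]: bridge chunk first, then the answer
```

Because only the embedder's parameters are updated (the chunk embeddings and any downstream generator stay frozen), this kind of value-based training is far cheaper than fine-tuning an LLM to drive multi-step retrieval, which is the resource argument made in the contributions above.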
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • BabiLong
  • RULER
  • MuSiQue
  • HotpotQA
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

The authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Use structured LLM feedback as a reward signal
  • Strengthen compositional and temporal reasoning directly in embedding space
  • Explore tighter integration with generation while preserving efficiency and scalability

Author keywords

  • Reinforcement Learning
  • RL
  • QA
  • Long-context
  • RAG
  • NLP
