ICLR 2026 Orals

Revela: Dense Retriever Learning via Language Modeling

Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Tongshuang Wu, Iryna Gurevych, Heinz Koeppl

LLMs & Reasoning · Thu, Apr 23 · 11:18 AM–11:28 AM · Amphitheater · Avg rating: 6.50 (6–8)

Abstract

Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?

To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on both CoIR and BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
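The core idea of the abstract can be sketched in a few lines: retriever embeddings score pairwise similarity among in-batch documents, a softmax over those scores weights cross-document context, and next token prediction then conditions on the mixed context so that gradients flow back into the retriever. The following is a minimal numpy sketch under those assumptions; the function names (`in_batch_attention`, `softmax`) and the additive mixing of local and cross-document context are illustrative simplifications, not the paper's exact formulation.

```python
# Hedged sketch of Revela-style in-batch attention (simplified; not the paper's code).
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def in_batch_attention(doc_embs, hidden_states):
    """
    doc_embs:      (B, d) retriever embeddings, one per in-batch document.
    hidden_states: (B, d) LM context vectors at the current position.
    Returns mixed contexts where each document attends to the other
    documents in the batch, weighted by retriever-computed similarity.
    """
    # Pairwise retriever similarity between in-batch documents.
    sims = doc_embs @ doc_embs.T                 # (B, B)
    np.fill_diagonal(sims, -np.inf)              # exclude each doc's own context
    weights = softmax(sims, axis=-1)             # (B, B) attention weights
    cross = weights @ hidden_states              # (B, d) cross-document context
    # Next token prediction conditions on local + cross-document context;
    # because `weights` depends on doc_embs, the LM loss trains the retriever.
    return hidden_states + cross

rng = np.random.default_rng(0)
B, d = 4, 8
mixed = in_batch_attention(rng.normal(size=(B, d)), rng.normal(size=(B, d)))
print(mixed.shape)  # (4, 8)
```

In a full training setup the mixed context would feed the LM's next-token-prediction loss, so no query-document labels are needed: the similarity scores are learned purely from how much each neighboring document helps predict the current document's tokens.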

One-sentence summary · Auto-generated by claude-haiku-4-5-20251001

Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.

Contributions · Auto-generated by claude-haiku-4-5-20251001
  • Self-supervised framework coupling dense retrieval with language modeling via in-batch attention
  • Weighted attention mechanism enabling retriever optimization as part of NTP training
  • Eliminates need for annotated or synthetic query-document pairs for training
  • Achieves unsupervised SoTA on BEIR with ~1000x less training data and 10x less compute
Methods used · Auto-generated by claude-haiku-4-5-20251001
  • Self-supervised learning
  • Language modeling
  • Next token prediction
  • In-batch attention
Datasets used · Auto-generated by claude-haiku-4-5-20251001
  • CoIR
  • BRIGHT
  • BEIR
Limitations (author-stated) · Auto-generated by claude-haiku-4-5-20251001

Authors did not state explicit limitations.

Future work (author-stated) · Auto-generated by claude-haiku-4-5-20251001
  • Iterative indexing approach that adaptively groups document chunks by learned representations
  • Scaling up with larger retriever and language model sizes
  • Incorporating more advanced attention mechanisms to enhance retriever learning
  • Extending the paradigm to multimodal settings such as images

Author keywords

  • Information Retrieval
  • Unsupervised Learning
