Revela: Dense Retriever Learning via Language Modeling
Fengyu Cai, Tong Chen, Xinran Zhao, Sihao Chen, Hongming Zhang, Tongshuang Wu, Iryna Gurevych, Heinz Koeppl
Abstract
Dense retrievers play a vital role in accessing external and specialized knowledge to augment language models (LMs). Training dense retrievers typically requires annotated query-document pairs, which are costly to create and scarce in specialized domains (e.g., code) or in complex settings (e.g., requiring reasoning). These practical challenges have sparked growing interest in self-supervised retriever learning. Since LMs are trained to capture token-level dependencies through a self-supervised learning objective (i.e., next token prediction), we can analogously cast retrieval as learning dependencies among chunks of tokens. This analogy naturally leads to the question: How can we adapt self-supervised learning objectives in the spirit of language modeling to train retrievers?
To answer this question, we introduce Revela, a unified and scalable training framework for self-supervised retriever learning via language modeling. Revela models semantic dependencies among documents by conditioning next token prediction on local and cross-document context through an in-batch attention mechanism. This attention is weighted by retriever-computed similarity scores, enabling the retriever to be optimized as part of language modeling. We evaluate Revela on domain-specific (CoIR), reasoning-intensive (BRIGHT), and general-domain (BEIR) benchmarks across various retriever backbones. Without annotated or synthetic query-document pairs, Revela surpasses larger supervised models and proprietary APIs on both CoIR and BRIGHT. It achieves BEIR's unsupervised SoTA with ~1000x less training data and 10x less compute. Performance increases with batch size and model size, highlighting Revela's scalability and its promise for self-supervised retriever learning.
Revela enables self-supervised retriever learning by adapting language modeling objectives, achieving unsupervised SoTA on multiple retrieval benchmarks.
- Self-supervised framework coupling dense retrieval with language modeling via in-batch attention
- Weighted attention mechanism enabling retriever optimization as part of next-token prediction (NTP) training
- Eliminates need for annotated or synthetic query-document pairs for training
- Achieves unsupervised SoTA on BEIR with ~1000x less training data and 10x less compute
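The similarity-weighted in-batch attention above can be illustrated with a minimal numpy sketch. All shapes, the random stand-in tensors, and the mean-pooling of document states are illustrative assumptions, not the paper's exact architecture; in a real autograd framework the NTP loss gradient would flow through the attention weights into the retriever.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, d = 4, 8, 16          # batch of 4 document chunks, 8 tokens each, hidden dim 16

# Hypothetical stand-ins: per-token LM hidden states and the retriever's
# document embeddings (in Revela both come from trained models).
hidden = rng.normal(size=(N, T, d))
doc_emb = rng.normal(size=(N, d))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Retriever-computed similarity between documents in the batch.
sim = doc_emb @ doc_emb.T                    # (N, N)
np.fill_diagonal(sim, -np.inf)               # a document does not retrieve itself
weights = softmax(sim, axis=-1)              # (N, N), each row sums to 1

# Cross-document context: each document's context is a similarity-weighted
# mixture of the other documents' pooled representations.
pooled = hidden.mean(axis=1)                 # (N, d), mean-pooled per document
cross_ctx = weights @ pooled                 # (N, d)

# Next-token prediction is then conditioned on both the local hidden states
# and the retrieved cross-document context; because `weights` depends on the
# retriever's similarities, optimizing the NTP loss also trains the retriever.
conditioned = hidden + cross_ctx[:, None, :]
print(conditioned.shape)  # (4, 8, 16)
```

Masking the diagonal before the softmax reflects the intuition that a chunk should learn to draw on *other* documents in the batch, which is what makes the learned similarities useful for retrieval.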
- Self-supervised learning
- Language modeling
- Next token prediction
- In-batch attention
- CoIR
- BRIGHT
- BEIR
Authors did not state explicit limitations.
Future directions (from the paper):
- Iterative indexing approach that adaptively groups document chunks by learned representations
- Scaling up with larger retriever and language model sizes
- Incorporating more advanced attention mechanisms to enhance retriever learning
- Extending the paradigm to multimodal settings such as images
Author keywords
- Information Retrieval
- Unsupervised Learning