Scaling Atomistic Protein Binder Design with Generative Pretraining and Test-Time Compute
Kieran Didi, Zuobai Zhang, Guoqing Zhou, Danny Reidenbach, Zhonglin Cao, Sooyoung Cha, Tomas Geffner, Christian Dallago, Jian Tang, Michael M. Bronstein, Martin Steinegger, Emine Kucukbenli, Arash Vahdat, Karsten Kreis
We introduce a novel method for state-of-the-art structure-based protein binder design that combines flow matching-based generative pretraining with inference-time compute scaling techniques.
Abstract
Protein interaction modeling is central to protein design, which has been transformed by machine learning with applications in drug discovery and beyond. In this landscape, structure-based de novo binder design is cast as either conditional generative modeling or sequence optimization via structure predictors ("hallucination"). We argue that this is a false dichotomy and propose Proteina-Complexa, a novel fully atomistic binder generation method unifying both paradigms. We extend recent flow-based latent protein generation architectures and leverage the domain-domain interactions of monomeric computationally predicted protein structures to construct Teddymer, a new large-scale dataset of synthetic binder-target pairs for pretraining. Combined with high-quality experimental multimers, this enables training a strong base model. We then perform inference-time optimization with this generative prior, unifying the strengths of previously distinct generative and hallucination methods. Proteina-Complexa sets a new state of the art in computational binder design benchmarks: it delivers markedly higher in-silico success rates than existing generative approaches, and our novel test-time optimization strategies greatly outperform previous hallucination methods under normalized compute budgets. We also demonstrate interface hydrogen bond optimization, fold class-guided binder generation, and extensions to small molecule targets and enzyme design tasks, again surpassing prior methods. Code, models and new data will be publicly released.
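The flow-matching pretraining mentioned in the abstract can be illustrated with a minimal conditional flow matching objective over a linear noise-to-data interpolant. This is a generic sketch, not the Proteina-Complexa implementation: the linear `velocity_model`, the tensor shapes, and the toy data are all illustrative assumptions standing in for a large structure-aware network trained on binder-target complexes.

```python
import numpy as np

rng = np.random.default_rng(0)

def velocity_model(x_t, t, W, b):
    """Toy linear velocity field v(x, t); stands in for a large neural network."""
    inp = np.concatenate([x_t, np.full((x_t.shape[0], 1), t)], axis=1)
    return inp @ W + b

def flow_matching_loss(x1, W, b):
    """Conditional flow matching loss with a linear interpolant.

    x1: clean data (e.g., binder atom coordinates), shape (N, dim).
    """
    x0 = rng.standard_normal(x1.shape)   # noise endpoint of the path
    t = rng.uniform()                    # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation between noise and data
    target_v = x1 - x0                   # conditional target velocity along the path
    pred_v = velocity_model(x_t, t, W, b)
    return float(np.mean((pred_v - target_v) ** 2))

dim = 3
W = rng.standard_normal((dim + 1, dim)) * 0.1
b = np.zeros(dim)
x1 = rng.standard_normal((8, dim))     # stand-in for one training structure
loss = flow_matching_loss(x1, W, b)
```

Minimizing this loss over many samples teaches the velocity field to transport noise to data, after which new structures can be generated by integrating the learned field from t = 0 to t = 1.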
Proteina-Complexa unifies generative modeling and hallucination for atomistic binder design via pretraining on Teddymer and test-time optimization.
- Novel fully atomistic binder generation method unifying conditional generation and sequence optimization
- Teddymer large-scale dataset of synthetic binder-target pairs for pretraining
- Inference-time optimization strategies achieving state-of-the-art de novo binder design
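The inference-time optimization highlighted above can be sketched in its simplest form as best-of-N selection: spend extra compute drawing several candidates from the generative prior, then keep the one scoring highest under an in-silico design objective. The `sample_binder` and `score` functions below are hypothetical placeholders, not the paper's actual sampler or metrics.

```python
import random

random.seed(0)

def sample_binder(prior):
    """Placeholder: one design drawn from a generative prior (here, random features)."""
    return [prior() for _ in range(16)]

def score(design):
    """Hypothetical in-silico objective (stand-in for e.g. predicted interface quality)."""
    return -sum((x - 0.5) ** 2 for x in design)

def best_of_n(prior, n):
    """Scale test-time compute: draw n candidates, keep the highest-scoring one."""
    candidates = [sample_binder(prior) for _ in range(n)]
    return max(candidates, key=score), candidates[0]

best, single = best_of_n(random.random, n=32)
# By construction, the selected design scores at least as well as any single draw.
```

Richer test-time strategies follow the same pattern of trading compute for quality, replacing naive resampling with guided or gradient-based search through the prior.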
- Protein design
- Flow-based generative models
- Test-time optimization
- Computational modeling
- Teddymer
- AlphaFoldDB
- TED
Limitations
- Focus on protein and small molecule targets; extending to DNA and RNA would require additional work
- Evaluations limited to in-silico metrics without experimental validation
- Does not target additional molecular properties like specificity or thermostability
Future work
- Train a single unified model capable of targeting and generating different molecular modalities
- Conduct experimental validation of generated binders in wet lab
- Integrate computational predictors for specificity and thermostability
Author keywords
- binder design
- protein design
- flow matching
- hallucination
- inference-time scaling
- generative modeling
- diffusion models
Related orals
Universal Inverse Distillation for Matching Models with Real-Data Supervision (No GANs)
RealUID provides universal distillation for matching models without GANs, incorporating real data into one-step generator training.
GLASS Flows: Efficient Inference for Reward Alignment of Flow and Diffusion Models
GLASS Flows samples Markov transitions via inner flow matching models to improve inference-time reward alignment in flow and diffusion models.
Neon: Negative Extrapolation From Self-Training Improves Image Generation
Neon inverts model degradation from self-training by extrapolating away from it, improving generative models with minimal compute.
Generative Human Geometry Distribution
Introduces a distribution-over-distribution model combining geometry distributions with two-stage flow matching for human 3D generation.
Cross-Domain Lossy Compression via Rate- and Classification-Constrained Optimal Transport
Cross-domain lossy compression unifies rate and classification constraints via optimal transport framework.