Extending Sequence Length is Not All You Need: Effective Integration of Multimodal Signals for Gene Expression Prediction
Zhao Yang, Yi Duan, Jiwei Zhu, Ying Ba, Chuan Cao, Bing Su
Abstract
Gene expression prediction, which infers mRNA expression levels from DNA sequences, presents significant challenges. Previous works often focus on extending the input sequence length to capture distal enhancers, which may influence target genes from hundreds of kilobases away. Our work first reveals that, for current models, long-sequence modeling can decrease performance; even carefully designed algorithms only mitigate the degradation caused by long sequences. Instead, we find that proximal multimodal epigenomic signals near target genes are more essential. Hence, we focus on how to better integrate these signals, an aspect that has been overlooked. We find that different signal types serve distinct biological roles: some directly mark active regulatory elements, while others reflect background chromatin patterns that may introduce confounding effects, and simple concatenation may lead models to develop spurious associations with these background patterns. To address this challenge, we propose Prism, a framework that learns multiple combinations of high-dimensional epigenomic features to represent distinct background chromatin states and uses backdoor adjustment to mitigate their confounding effects. Our experimental results demonstrate that proper modeling of multimodal epigenomic signals achieves state-of-the-art performance for gene expression prediction using only short sequences.
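Concretely, backdoor adjustment replaces the confounded conditional P(Y | X) with the interventional quantity P(Y | do(X)) = Σ_z P(Y | X, z) P(z), marginalizing over the confounder Z (here, background chromatin state). The sketch below is a minimal, hypothetical PyTorch illustration of that formula using learned state prototypes; the class name, prototype parameterization, and dimensions are our assumptions for exposition, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class BackdoorAdjustedHead(nn.Module):
    """Illustrative head computing E[Y | do(x)] = sum_z E[Y | x, z] P(z).

    Hypothetical sketch: names and shapes are assumptions, not Prism's code.
    """

    def __init__(self, feat_dim: int, num_states: int = 8):
        super().__init__()
        # Learnable prototypes, one per assumed background chromatin state z.
        self.states = nn.Parameter(torch.randn(num_states, feat_dim))
        # Prior logits for P(z), deliberately independent of the input x.
        self.state_logits = nn.Parameter(torch.zeros(num_states))
        # Predictor approximating E[Y | x, z] from the (x, z) pair.
        self.predictor = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, feat_dim) fused sequence + epigenomic features.
        b, k = x.size(0), self.states.size(0)
        x_rep = x.unsqueeze(1).expand(b, k, -1)             # (b, k, d)
        z_rep = self.states.unsqueeze(0).expand(b, k, -1)   # (b, k, d)
        y_given_xz = self.predictor(
            torch.cat([x_rep, z_rep], dim=-1)
        ).squeeze(-1)                                       # (b, k)
        p_z = torch.softmax(self.state_logits, dim=-1)      # (k,)
        # Backdoor adjustment: marginalize over states with the prior P(z),
        # not the input-conditioned posterior P(z | x).
        return (y_given_xz * p_z.unsqueeze(0)).sum(dim=-1)  # (b,)

head = BackdoorAdjustedHead(feat_dim=256)
y_hat = head(torch.randn(4, 256))  # (4,) predicted expression levels
```

The key design point the formula enforces: the state prior P(z) must not depend on the input, otherwise the model reduces back to the confounded conditional P(Y | X).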
Reveals that long-sequence modeling degrades gene expression prediction; proximal epigenomic signals with confounding mitigation suffice.
- Demonstrates current models suffer performance degradation with extended sequences for gene expression prediction
- Identifies proximal multimodal epigenomic signals as more essential than distal enhancers
- Proposes Prism framework using backdoor adjustment to mitigate confounding from background chromatin patterns
- Achieves state-of-the-art performance using only short sequences with minimal computational overhead
Topics
- Backdoor adjustment
- Causal inference
The authors did not state explicit limitations.
The authors did not state explicit future directions.
Author keywords
- DNA language model
- gene expression prediction
- multimodal information integration
- causal intervention
Related orals
Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models
Benchmarks practical privacy risks in LLMs adapted with differential privacy, revealing that distribution shifts and model choice impact protection effectiveness.
Half-order Fine-Tuning for Diffusion Model: A Recursive Likelihood Ratio Optimizer
Proposes a Recursive Likelihood Ratio optimizer for efficient fine-tuning of diffusion models via lower-variance gradient estimation.
Invisible Safety Threat: Malicious Finetuning for LLM via Steganography
Demonstrates that LLMs can be fine-tuned to generate harmful, steganographically hidden outputs while appearing benign to safety systems.
Reducing Belief Deviation in Reinforcement Learning for Active Reasoning of LLM Agents
Proposes the T3 algorithm to detect belief deviation in LLM agents and truncate trajectories for improved reinforcement learning in active reasoning tasks.
RefineStat: Efficient Exploration for Probabilistic Program Synthesis
RefineStat enforces semantic constraints and applies diagnostic-aware refinement for synthesizing valid probabilistic programs from smaller language models.