Seeing Through the Brain: New Insights from Decoding Visual Stimuli with fMRI
Zheng Huang, Enpei Zhang, Weikang Qiu, Yinghao Cai, Carl Yang, Elynn Chen, Xiang Zhang, Rex Ying, Dawei Zhou, Yujun Yan
We present PRISM, a framework to decode visual stimuli from fMRI with language model alignment
Abstract
Understanding how the brain encodes visual information is a central challenge in neuroscience and machine learning. A promising approach is to reconstruct visual stimuli—essentially images—from functional Magnetic Resonance Imaging (fMRI) signals. This involves two stages: transforming fMRI signals into a latent space and then using a pre-trained generative model to reconstruct images. The reconstruction quality depends on how closely the latent space matches the structure of neural activity and how faithfully the generative model renders images from that space. Yet, it remains unclear which type of latent space best supports this transformation and how it should be organized to represent visual stimuli effectively.
We present two key findings. First, fMRI signals are more similar to the text space of a language model than to either a vision-based space or a joint text–image space. Second, text representations and the generative model should be adapted to capture the compositional nature of visual stimuli, including objects, their detailed attributes, and relationships. Building on these insights, we propose PRISM, a model that Projects fMRI sIgnals into a Structured text space as an interMediate representation for visual stimuli reconstruction. It includes an object-centric diffusion module that generates images by composing individual objects to reduce object detection errors, and an attribute/relationship search module that automatically identifies key attributes and relationships that best align with the neural activity. Extensive experiments on real-world datasets demonstrate that our framework outperforms existing methods, achieving up to a 6% reduction in perceptual loss. These results highlight the importance of using structured text as an intermediate space to bridge fMRI signals and image reconstruction. Code is available at https://github.com/GraphmindDartmouth/PRISM.
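The first stage the abstract describes—mapping fMRI signals into a text-embedding space before handing them to a generative model—can be illustrated with a minimal linear-projection sketch on synthetic data. The dimensions, ridge-regression baseline, and variable names below are illustrative assumptions, not the paper's actual method:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumptions, not taken from the paper):
n_samples, n_voxels, d_text = 200, 500, 64

# Synthetic fMRI responses paired with text embeddings of the viewed stimuli.
X = rng.standard_normal((n_samples, n_voxels))
W_true = rng.standard_normal((n_voxels, d_text)) / np.sqrt(n_voxels)
Y = X @ W_true + 0.1 * rng.standard_normal((n_samples, d_text))

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression mapping voxel space -> text space."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

W = fit_ridge(X, Y, alpha=10.0)
Y_hat = X @ W

# Per-sample cosine similarity between predicted and true text embeddings:
# a simple proxy for how well neural activity aligns with the text space.
cos = np.sum(Y_hat * Y, axis=1) / (
    np.linalg.norm(Y_hat, axis=1) * np.linalg.norm(Y, axis=1)
)
print(f"mean cosine similarity in text space: {cos.mean():.3f}")
```

In the full PRISM pipeline, the predicted text-space representation would then condition a diffusion model that composes the individual objects; this sketch covers only the projection step.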
PRISM framework projects fMRI signals into structured text space for visual stimulus reconstruction with object-centric diffusion and attribute search modules.
- Discovery that fMRI signals align more closely with language model text space than vision or joint representations
- Object-centric diffusion module generating images by composing individual objects
- Attribute/relationship search module automatically identifying neural-aligned attributes
- fMRI decoding
- text space projection
- diffusion models
- object composition
Authors did not state explicit limitations.
Authors did not state explicit future directions.
Author keywords
- Neuroscience
- Functional Magnetic Resonance Imaging
- Image reconstruction
- Reconstruction
Related orals
Improving Diffusion Models for Class-imbalanced Training Data via Capacity Manipulation
Capacity manipulation improves diffusion models' handling of class-imbalanced data by reserving capacity for minority classes via low-rank decomposition.
Depth Anything 3: Recovering the Visual Space from Any Views
DA3 predicts spatially consistent 3D geometry from arbitrary camera views using plain transformer and depth-ray targets.
Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator
VIST3A stitches text-to-video models with 3D reconstruction systems and aligns them via reward finetuning for high-quality text-to-3D generation.
Radiometrically Consistent Gaussian Surfels for Inverse Rendering
RadioGS introduces radiometric consistency supervision for inverse rendering to accurately model indirect illumination in Gaussian-based representations.
True Self-Supervised Novel View Synthesis is Transferable
Presents XFactor, first geometry-free self-supervised model for transferable novel view synthesis without 3D inductive biases.