One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale [PDF]
This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be ...
Fan Bao et al.
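The abstract is cut off here, but in the published UniDiffuser paper the unification works by giving each modality its own independent noise level, so that conditioning (t = 0), marginalization (t = T), and joint modeling (shared t) all become special timestep choices of one objective. Below is a minimal training-step sketch under that reading; `joint_eps_net` and the linear alpha schedule are illustrative placeholders, not the paper's exact parameterization.

```python
import torch

def add_noise(x, eps, t, T=1000):
    # Simplified variance-preserving forward process with a linear
    # alpha schedule (illustrative; the paper uses its own schedule).
    alpha = (1.0 - t.float() / T).clamp(min=1e-5)
    alpha = alpha.view(-1, *([1] * (x.dim() - 1)))
    return alpha.sqrt() * x + (1.0 - alpha).sqrt() * eps

def unidiffuser_style_step(joint_eps_net, x_img, x_txt, T=1000):
    """One step that fits all distributions at once: t=0 keeps a
    modality clean (conditioning), t=T turns it into pure noise
    (marginalizes it out), and a shared t gives the joint."""
    b = x_img.shape[0]
    t_img = torch.randint(0, T + 1, (b,))  # independent timesteps
    t_txt = torch.randint(0, T + 1, (b,))  # per modality
    eps_img, eps_txt = torch.randn_like(x_img), torch.randn_like(x_txt)
    z_img = add_noise(x_img, eps_img, t_img, T)
    z_txt = add_noise(x_txt, eps_txt, t_txt, T)
    # A single (hypothetical) network predicts both noises jointly.
    pred_img, pred_txt = joint_eps_net(z_img, t_img, z_txt, t_txt)
    return ((pred_img - eps_img) ** 2).mean() + ((pred_txt - eps_txt) ** 2).mean()
```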
Multi-Modal Self-Supervised Learning for Recommendation [PDF]
The online emergence of multi-modal sharing platforms (e.g., TikTok, YouTube) is powering personalized recommender systems to incorporate various modalities (e.g., visual, textual, and acoustic) into the latent user representations.
Wei Wei et al.
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning [PDF]
We present the modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g., images and text) are embedded at arm's length ...
Weixin Liang et al.
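As a quick illustration of the phenomenon, the gap is commonly measured as the distance between the centroids of each modality's L2-normalized embeddings. The sketch below assumes precomputed embedding matrices and shows only this generic measurement, not the paper's full analysis.

```python
import numpy as np

def modality_gap(image_embs: np.ndarray, text_embs: np.ndarray) -> float:
    """Distance between modality centroids on the unit hypersphere,
    a common way to quantify the gap between embedding clusters."""
    img = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0)))

# Toy example: a constant per-dimension shift between two otherwise
# random embedding sets produces a clearly non-zero gap.
rng = np.random.default_rng(0)
gap = modality_gap(rng.normal(size=(512, 128)),
                   rng.normal(size=(512, 128)) + 0.5)
print(f"modality gap: {gap:.3f}")
```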
Cross-modal Memory Networks for Radiology Report Generation [PDF]
Medical imaging plays a significant role in clinical diagnosis, where text reports of the images are essential for understanding them and facilitating later treatment.
Zhihong Chen et al.
CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers [PDF]
Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from a supplementary modality (X-modality). However, ...
Huayao Liu et al.
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding [PDF]
Manual annotation of large-scale point cloud datasets for varying tasks such as 3D object classification, segmentation, and detection is often laborious owing to the irregular structure of point clouds.
Mohamed Afham et al.
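The snippet stops at the motivation, but the title names the technique: self-supervised cross-modal contrastive learning between a point cloud and its 2D rendering. A generic symmetric InfoNCE objective over paired embeddings, sketched below, conveys the idea; the inputs and temperature are placeholders, not CrossPoint's exact architecture or loss.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(pc_feats, img_feats, temperature=0.07):
    """Symmetric InfoNCE between paired point-cloud and image
    embeddings: matched pairs are positives, every other item in
    the batch is a negative (a generic formulation)."""
    pc = F.normalize(pc_feats, dim=1)
    img = F.normalize(img_feats, dim=1)
    logits = pc @ img.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(pc.shape[0])      # diagonal = positive pairs
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```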
Delivering Arbitrary-Modal Semantic Segmentation [PDF]
Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR ...
Jiaming Zhang et al.
Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models [PDF]
The ability to quickly learn a new task with minimal instruction, known as few-shot learning, is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be ...
Zhiqiu Lin et al.
Collaborative Diffusion for Multi-Modal Face Generation and Editing [PDF]
Diffusion models have recently emerged as a powerful generative tool. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one conditioning modality. To further unleash the users' ...
Ziqi Huang et al.
Multi-Modal Learning with Missing Modality via Shared-Specific Feature Modelling [PDF]
The missing modality issue is critical but non-trivial for multi-modal models to solve. Current methods aiming to handle the missing-modality problem in multi-modal tasks either deal with missing modalities only during evaluation or train separate ...
Hu Wang et al.
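The abstract is truncated, but the title points at the general pattern: factor each modality's representation into a shared component, recoverable from whichever modalities are present, plus a modality-specific one. A minimal sketch of that decomposition follows; the encoder names, dimensions, and averaging rule are entirely hypothetical, not the paper's model.

```python
import torch
import torch.nn as nn

class SharedSpecificEncoder(nn.Module):
    """Toy shared-specific factorization: per modality, one head maps
    into a common shared space and one keeps modality-specific detail.
    When a modality is missing, the shared code is still defined from
    the modalities that remain."""
    def __init__(self, dims, d_shared=256, d_spec=256):
        super().__init__()
        self.shared = nn.ModuleDict({m: nn.Linear(d, d_shared) for m, d in dims.items()})
        self.spec = nn.ModuleDict({m: nn.Linear(d, d_spec) for m, d in dims.items()})

    def forward(self, inputs):
        # inputs: dict containing only the modalities actually present
        shared = [self.shared[m](x) for m, x in inputs.items()]
        spec = [self.spec[m](x) for m, x in inputs.items()]
        # Average the shared parts so the fused code has a fixed size
        # regardless of which modalities are missing.
        return torch.stack(shared).mean(dim=0), spec

# Usage: encode with the depth modality missing at test time.
enc = SharedSpecificEncoder({"rgb": 2048, "depth": 1024})
fused, specific = enc({"rgb": torch.randn(4, 2048)})
```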