CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation With Transformers
Scene understanding based on image segmentation is a crucial component of autonomous vehicles. Pixel-wise semantic segmentation of RGB images can be advanced by exploiting complementary features from the supplementary modality ($X$-modality). However, ...
Huayao Liu et al.
Cross-modal Memory Networks for Radiology Report Generation
Medical imaging plays a significant role in the clinical practice of medical diagnosis, where text reports of the images are essential for understanding them and facilitating later treatments.
Zhihong Chen et al.
CrossPoint: Self-Supervised Cross-Modal Contrastive Learning for 3D Point Cloud Understanding
Manual annotation of large-scale point cloud datasets for tasks such as 3D object classification, segmentation, and detection is often laborious owing to the irregular structure of point clouds.
Mohamed Afham et al.
Delivering Arbitrary-Modal Semantic Segmentation
Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, ...
Jiaming Zhang et al.
Multi-Modal Self-Supervised Learning for Recommendation
The online emergence of multi-modal sharing platforms (e.g., TikTok, YouTube) is powering personalized recommender systems to incorporate various modalities (e.g., visual, textual, and acoustic) into latent user representations.
Wei Wei et al.
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with a modularized design for multi-modal pretraining, which can benefit from modality collaboration ...
Haiyang Xu et al.
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
This paper proposes a unified diffusion framework (dubbed UniDiffuser) to fit all distributions relevant to a set of multi-modal data in one model. Our key insight is that learning diffusion models for marginal, conditional, and joint distributions can be ...
Fan Bao et al.
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data modalities beyond text has not been fully studied.
Chenyang Lyu et al.
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark
Large language models have become a potential pathway toward achieving artificial general intelligence. Recent works on multi-modal large language models have demonstrated their effectiveness in handling visual modalities.
Zhen-fei Yin et al.
Multi-Modal Fusion Transformer for End-to-End Autonomous Driving
How should representations from complementary sensors be integrated for autonomous driving? Geometry-based sensor fusion has shown great promise for perception tasks such as object detection and motion forecasting.
Aditya Prakash et al.

