Results 11 to 20 of about 2,484,152
Bootstrap Latent Representations for Multi-modal Recommendation [PDF]
This paper studies the multi-modal recommendation problem, where the item multi-modality information (e.g., images and textual descriptions) is exploited to improve the recommendation accuracy.
Xin Zhou +7 more
openalex +3 more sources
MMBench: Is Your Multi-modal Model an All-around Player? [PDF]
Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development ...
Yuanzhan Liu +11 more
semanticscholar +1 more source
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [PDF]
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Kunchang Li +11 more
semanticscholar +1 more source
MaPLe: Multi-modal Prompt Learning [PDF]
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well ...
Muhammad Uzair Khattak +4 more
semanticscholar +1 more source
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [PDF]
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT.
Dong Zhang +6 more
semanticscholar +1 more source
Visual Prompt Multi-Modal Tracking [PDF]
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based ...
Jiawen Zhu +4 more
semanticscholar +1 more source
UniXcoder: Unified Cross-Modal Pre-training for Code Representation [PDF]
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models.
Daya Guo +5 more
semanticscholar +1 more source
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding [PDF]
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of ...
Aishwarya Kamath +5 more
semanticscholar +1 more source
Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning [PDF]
We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g.
Weixin Liang +4 more
semanticscholar +1 more source
MIMIC-IT: Multi-Modal In-Context Instruction Tuning [PDF]
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and ...
Bo Li +7 more
semanticscholar +1 more source

