Results 11 to 20 of about 1,234,147
MMBench: Is Your Multi-modal Model an All-around Player? [PDF]
Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development ...
Yuan Liu +11 more
semanticscholar +1 more source
MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [PDF]
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Kunchang Li +11 more
semanticscholar +1 more source
SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [PDF]
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT.
Dong Zhang +6 more
semanticscholar +1 more source
Visual Prompt Multi-Modal Tracking [PDF]
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based ...
Jiawen Zhu +4 more
semanticscholar +1 more source
MIMIC-IT: Multi-Modal In-Context Instruction Tuning [PDF]
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and ...
Bo Li +7 more
semanticscholar +1 more source
MaPLe: Multi-modal Prompt Learning [PDF]
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well ...
Muhammad Uzair Khattak +4 more
semanticscholar +1 more source
UniXcoder: Unified Cross-Modal Pre-training for Code Representation [PDF]
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models.
Daya Guo +5 more
semanticscholar +1 more source
mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [PDF]
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration ...
Haiyang Xu +14 more
semanticscholar +1 more source
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [PDF]
Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data modalities beyond text has not been fully studied.
Chenyang Lyu +7 more
semanticscholar +1 more source
MDETR - Modulated Detection for End-to-End Multi-Modal Understanding [PDF]
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of ...
Aishwarya Kamath +5 more
semanticscholar +1 more source