Results 11 to 20 of about 1,234,147

MMBench: Is Your Multi-modal Model an All-around Player? [PDF]

open access: yes, European Conference on Computer Vision, 2023
Large vision-language models (VLMs) have recently achieved remarkable progress, exhibiting impressive multimodal perception and reasoning abilities. However, effectively evaluating these large VLMs remains a major challenge, hindering future development ...
Yuan Liu   +11 more
semanticscholar   +1 more source

MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [PDF]

open access: yes, Computer Vision and Pattern Recognition, 2023
With the rapid development of Multi-modal Large Language Models (MLLMs), a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models.
Kunchang Li   +11 more
semanticscholar   +1 more source

SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities [PDF]

open access: yes, Conference on Empirical Methods in Natural Language Processing, 2023
Multi-modal large language models are regarded as a crucial step towards Artificial General Intelligence (AGI) and have garnered significant interest with the emergence of ChatGPT.
Dong Zhang   +6 more
semanticscholar   +1 more source

Visual Prompt Multi-Modal Tracking [PDF]

open access: yes, Computer Vision and Pattern Recognition, 2023
Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based ...
Jiawen Zhu   +4 more
semanticscholar   +1 more source

MIMIC-IT: Multi-Modal In-Context Instruction Tuning [PDF]

open access: yes, arXiv.org, 2023
High-quality instructions and responses are essential for the zero-shot performance of large language models on interactive natural language tasks. For interactive vision-language tasks involving intricate visual scenes, a large quantity of diverse and ...
Bo Li   +7 more
semanticscholar   +1 more source

MaPLe: Multi-modal Prompt Learning [PDF]

open access: yes, Computer Vision and Pattern Recognition, 2022
Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well ...
Muhammad Uzair Khattak   +4 more
semanticscholar   +1 more source

UniXcoder: Unified Cross-Modal Pre-training for Code Representation [PDF]

open access: yes, Annual Meeting of the Association for Computational Linguistics, 2022
Pre-trained models for programming languages have recently demonstrated great success on code intelligence. To support both code-related understanding and generation tasks, recent works attempt to pre-train unified encoder-decoder models.
Daya Guo   +5 more
semanticscholar   +1 more source

mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [PDF]

open access: yes, International Conference on Machine Learning, 2023
Recent years have witnessed a big convergence of language, vision, and multi-modal pretraining. In this work, we present mPLUG-2, a new unified paradigm with modularized design for multi-modal pretraining, which can benefit from modality collaboration ...
Haiyang Xu   +14 more
semanticscholar   +1 more source

Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration [PDF]

open access: yes, arXiv.org, 2023
Although instruction-tuned large language models (LLMs) have exhibited remarkable capabilities across various NLP tasks, their effectiveness on other data modalities beyond text has not been fully studied.
Chenyang Lyu   +7 more
semanticscholar   +1 more source

MDETR - Modulated Detection for End-to-End Multi-Modal Understanding [PDF]

open access: yes, IEEE International Conference on Computer Vision, 2021
Multi-modal reasoning systems rely on a pre-trained object detector to extract regions of interest from the image. However, this crucial module is typically used as a black box, trained independently of the downstream task and on a fixed vocabulary of ...
Aishwarya Kamath   +5 more
semanticscholar   +1 more source