Results 331 to 340 of about 1,322,987 (378)
Some of the following articles may not be open access.
International Conference on Machine Learning
Combining discrete and continuous data is an important capability for generative models. We present Discrete Flow Models (DFMs), a new flow-based model of discrete data that provides the missing link in enabling flow-based generative models to be applied ...
Andrew Campbell +4 more
semanticscholar +1 more source
BLINK: Multimodal Large Language Models Can See but Not Perceive
European Conference on Computer Vision
We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations.
Xingyu Fu +9 more
semanticscholar +1 more source
MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark
Annual Meeting of the Association for Computational Linguistics
This paper introduces MMMU-Pro, a robust version of the Massive Multi-discipline Multimodal Understanding and Reasoning (MMMU) benchmark. MMMU-Pro rigorously assesses multimodal models' true understanding and reasoning capabilities through a three-step ...
Xiang Yue +13 more
semanticscholar +1 more source
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Computer Vision and Pattern Recognition
We introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon.
Chengyue Wu +10 more
semanticscholar +1 more source
This chapter considers multimodality in Casey O’Callaghan’s strict sense of that term: a perceptual representation’s object is represented neither by a single sense modality nor merely as a collection of features each of which is represented by a single modality. By definition, multimodal representation cannot be simply a case of layering.
Denise Newfield, Sarah Crinall
openaire +2 more sources
HuatuoGPT-Vision, Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale
arXiv.org
The rapid development of multimodal large language models (MLLMs), such as GPT-4V, has led to significant advancements. However, these models still face challenges in medical multimodal capabilities due to limitations in the quantity and quality of ...
Junying Chen +11 more
semanticscholar +1 more source
Hallucination of Multimodal Large Language Models: A Survey
arXiv.org
This survey presents a comprehensive analysis of the phenomenon of hallucination in multimodal large language models (MLLMs), also known as Large Vision-Language Models (LVLMs), which have demonstrated significant advancements and remarkable abilities in ...
Zechen Bai +6 more
semanticscholar +1 more source
arXiv.org
Existing open-source multimodal large language models (MLLMs) generally follow a training process involving pre-training and supervised fine-tuning.
Weiyun Wang +10 more
semanticscholar +1 more source
A Multilevel Multimodal Fusion Transformer for Remote Sensing Semantic Segmentation
IEEE Transactions on Geoscience and Remote Sensing
Accurate semantic segmentation of remote sensing data plays a crucial role in the success of geoscience research and applications. Recently, multimodal fusion-based segmentation models have attracted much attention due to their outstanding performance as ...
Xianping Ma +3 more
semanticscholar +1 more source

