Results 311 to 320 of about 1,322,987
Some of the following articles may not be open access.
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
arXiv.org
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual ...
Jinguo Zhu +47 more
semanticscholar +1 more source
arXiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data ...
Zhe Chen +39 more
semanticscholar +1 more source
Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling
arXiv.org
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size.
Xiaokang Chen +7 more
semanticscholar +1 more source
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
arXiv.org
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) ...
Weiyun Wang +62 more
semanticscholar +1 more source
Generating Multimodal Grammars for Multimodal Dialogue Processing
IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2010
This paper presents a new multimodal grammar generation system (MGGS) that allows defining a multimodal grammar in an easy and intuitive way, overcoming the difficulties arising from the textual description of grammar production rules. The novelty of the proposed approach lies in adopting a by-example paradigm to define a multimodal grammar ...
D'Ulizia Arianna +2 more
openaire +3 more sources
Emerging Properties in Unified Multimodal Pretraining
arXiv.org
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation ...
Chaorui Deng +11 more
semanticscholar +1 more source
Multimodal Unsupervised Image-to-Image Translation
European Conference on Computer Vision, 2018
Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any ...
Xun Huang +3 more
semanticscholar +1 more source
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
arXiv.org
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source ...
Abdelrahman Abouelenin +72 more
semanticscholar +1 more source
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
arXiv.org
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs ...
Wenxuan Huang +8 more
semanticscholar +1 more source
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Science China Information Sciences
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements ...
Zhe Chen +27 more
semanticscholar +1 more source