Results 311 to 320 of about 1,322,987
Some of the following articles may not be open access.

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

arXiv.org
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual ...
Jinguo Zhu   +47 more
semanticscholar   +1 more source

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

arXiv.org
We introduce InternVL 2.5, an advanced multimodal large language model (MLLM) series that builds upon InternVL 2.0, maintaining its core model architecture while introducing significant enhancements in training and testing strategies as well as data ...
Zhe Chen   +39 more
semanticscholar   +1 more source

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

arXiv.org
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strategy, (2) expanded training data, and (3) scaling to larger model size.
Xiaokang Chen   +7 more
semanticscholar   +1 more source

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

arXiv.org
We introduce InternVL 3.5, a new family of open-source multimodal models that significantly advances versatility, reasoning capability, and inference efficiency along the InternVL series. A key innovation is the Cascade Reinforcement Learning (Cascade RL) ...
Weiyun Wang   +62 more
semanticscholar   +1 more source

Generating Multimodal Grammars for Multimodal Dialogue Processing

IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 2010
This paper presents a new multimodal grammar generation system (MGGS) that allows defining a multimodal grammar in a very easy and intuitive way, overcoming the difficulties arising from the textual description of grammar production rules. The novelty of the proposed approach lies in adopting a by-example paradigm to define a multimodal grammar ...
D'Ulizia Arianna   +2 more
openaire   +3 more sources

Emerging Properties in Unified Multimodal Pretraining

arXiv.org
Unifying multimodal understanding and generation has shown impressive capabilities in cutting-edge proprietary systems. In this work, we introduce BAGEL, an open-source foundational model that natively supports multimodal understanding and generation ...
Chaorui Deng   +11 more
semanticscholar   +1 more source

Multimodal Unsupervised Image-to-Image Translation

European Conference on Computer Vision, 2018
Unsupervised image-to-image translation is an important and challenging problem in computer vision. Given an image in the source domain, the goal is to learn the conditional distribution of corresponding images in the target domain, without seeing any ...
Xun Huang   +3 more
semanticscholar   +1 more source

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

arXiv.org
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data, significantly outperforming recent open-source ...
Abdelrahman Abouelenin   +72 more
semanticscholar   +1 more source

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

arXiv.org
DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs ...
Wenxuan Huang   +8 more
semanticscholar   +1 more source

How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites

Science China Information Sciences
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model (MLLM) to bridge the capability gap between open-source and proprietary commercial models in multimodal understanding. We introduce three simple improvements.
Zhe Chen   +27 more
semanticscholar   +1 more source
