On the Role of Visual Grounding in VQA
Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG identifies as an axiomatic requirement of the VQA task.
Reich, Daniel, Schultz, Tanja
SQuINTing at VQA Models: Introspecting VQA Models With Sub-Questions [PDF]
Existing VQA datasets contain questions with varying levels of complexity. While the majority of questions in these datasets require perception for recognizing existence, properties, and spatial relationships of entities, a significant portion of questions pose challenges that correspond to reasoning tasks - tasks that can only be answered through a ...
Ramprasaath R. Selvaraju +6 more
Adversarial VQA: A New Benchmark for Evaluating the Robustness of VQA Models [PDF]
To appear in ICCV 2021; Website: https://adversarialvqa.github.io/
Linjie Li +3 more
VQA With No Questions-Answers Training [PDF]
Methods for teaching machines to answer visual questions have made significant progress in recent years, but current methods still lack important human capabilities, including integrating new visual classes and concepts in a modular manner, providing explanations for the answers and handling new domains without explicit examples.
Ben Zion Vatashsky, Shimon Ullman
An Experimental Study of the Vision-Bottleneck in VQA [PDF]
As in many tasks combining vision and language, both modalities play a crucial role in Visual Question Answering (VQA). To properly solve the task, a given model should both understand the content of the proposed image and the nature of the question. While the fusion between modalities, which is another obviously important part of the problem, has been ...
Pierre Marza +4 more
MUST-VQA: MUltilingual Scene-Text VQA
In this paper, we present a framework for Multilingual Scene Text Visual Question Answering that deals with new languages in a zero-shot fashion. Specifically, we consider the task of Scene Text Visual Question Answering (STVQA) in which the question can be asked in different languages and it is not necessarily aligned to the scene text language. Thus, ...
Emanuele Vivoli +4 more
Object-Based Reasoning in VQA [PDF]
10 pages, 15 figures, published as a conference paper at 2018 IEEE Winter Conf. on Applications of Computer Vision (WACV'2018)
Mikyas T. Desta +2 more
VQA: Visual Question Answering [PDF]
The first three authors contributed equally.
Stanislaw Antol +6 more
How (not) to ensemble LVLMs for VQA
This paper studies ensembling in the era of Large Vision-Language Models (LVLMs). Ensembling is a classical method to combine different models to get increased performance. In the recent work on Encyclopedic-VQA the authors examine a wide variety of models to solve their task: from vanilla LVLMs, to models including the caption as extra context, to ...
Lisa Alazraki +5 more

