Vision-language models for medical report generation and visual question answering: a review [PDF]
Ghulam Rasool, Rasool Ghulam
exaly +2 more sources
Multi-Module Co-Attention Model for Visual Question Answering [PDF]
Visual Question Answering (VQA) is a typical multi-modal problem in computer vision and natural language processing. Most of the existing VQA models ignore the dynamic relationships of semantic information between two modes and the rich spatial structure ...
ZOU Pinrong, XIAO Feng, ZHANG Wenjuan, ZHANG Wanyu, WANG Chenyang
doaj +1 more source
A Comprehensive Review and Open Challenges on Visual Question Answering Models
Users are now able to actively interact with images and pose different questions based on images, thanks to recent developments in artificial intelligence. In turn, a response in a natural language answer is expected.
Fasi Ahamad Shaik +4 more
doaj +1 more source
SBVQA 2.0: Robust End-to-End Speech-Based Visual Question Answering for Open-Ended Questions
Speech-based Visual Question Answering (SBVQA) is a challenging task that aims to answer spoken questions about images. The challenges of this task involve the variability of speakers, the different recording environments, as well as the various objects ...
Faris Alasmary, Saad Al-Ahmadi
doaj +1 more source
Question-Agnostic Attention for Visual Question Answering [PDF]
Visual Question Answering (VQA) models employ attention mechanisms to discover image locations that are most relevant for answering a specific question. For this purpose, several multimodal fusion strategies have been proposed, ranging from relatively simple operations (e.g., linear sum) to more complex ones (e.g., Block).
Moshiur R. Farazi +2 more
openaire +2 more sources
VQA: Visual Question Answering [PDF]
The first three authors contributed equally.
Stanislaw Antol +6 more
openaire +3 more sources
Knowledge-based Visual Question Answering: A Survey [PDF]
As an important presentation form of the completeness of artificial intelligence and the visual Turing test, visual question answering (VQA), coupled with its potential application value, has received extensive attention from computer vision and natural ...
WANG Ruiping, WU Shihong, ZHANG Meihang, WANG Xiaoping
doaj +1 more source
TASTA: Text‐Assisted Spatial and Temporal Attention Network for Video Question Answering
Video question answering (VideoQA) is a typical task that integrates language and vision. The key for VideoQA is to extract relevant and effective visual information for answering a specific question. Information selection is believed to be necessary for
Tian Wang +5 more
doaj +1 more source
Generative Visual Question Answering
Multi-modal tasks involving vision and language in deep learning continue to rise in popularity and are leading to the development of newer models that can generalize beyond the extent of their training data. The current models lack temporal generalization which enables models to adapt to changes in future data.
Ethan Shen, Scotty Singh, Bhavesh Kumar
openaire +2 more sources
Counterfactual Mix-Up for Visual Question Answering
Counterfactuals have been shown to be a powerful method for alleviating unimodal bias in Visual Question Answering. However, existing counterfactual methods tend to generate samples that are not diverse or require
Jae Won Cho +3 more
doaj +1 more source

