Results 141 to 150 of about 10,385
Some of the following articles may not be open access.
Learning Comprehensive Visual Grounding for Video Captioning
IEEE Transactions on Circuits and Systems for Video Technology (Print)
The grounding accuracy of existing video captioners still falls short of expectations. Most existing methods perform grounded video captioning on sparse entity annotations.
Wenhui Jiang +5 more
semanticscholar +1 more source
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
arXiv.org
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications. Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources, which hinders the ...
Lin Xu +5 more
semanticscholar +1 more source
Action-aware Linguistic Skeleton Optimization Network for Non-autoregressive Video Captioning
ACM Trans. Multim. Comput. Commun. Appl.
Non-autoregressive video captioning methods generate visual words in parallel but often overlook semantic correlations among them, especially regarding verbs, leading to lower caption quality.
Shuqin Chen +6 more
semanticscholar +1 more source
arXiv.org
We present CAT-V (Caption AnyThing in Video), a training-free framework for fine-grained object-centric video captioning that enables detailed descriptions of user-selected objects through time. CAT-V integrates three key components: a Segmenter based on ...
Yunlong Tang +18 more
semanticscholar +1 more source
NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative
International Conference on Learning Representations
Existing video captioning benchmarks and models lack causal-temporal narrative, which is a sequence of events linked through cause and effect, unfolding over time and driven by characters or agents.
Asmar Nadeem +5 more
semanticscholar +1 more source
MoS2: Mixture of Scale and Shift Experts for Text-Only Video Captioning
ACM Multimedia
Video captioning is a challenging task and typically requires paired video-text data for training. However, manually annotating coherent textual descriptions for videos is laborious and time-consuming.
Heng Jia +5 more
semanticscholar +1 more source
EvCap: Element-Aware Video Captioning
IEEE Transactions on Circuits and Systems for Video Technology (Print)
Video captioning is a multi-modal task spanning computer vision and natural language processing. Previous methods generally follow two paradigms, i.e., template-based and sequence-based. Template-based methods can generate relatively accurate elements (e.g. ...
Sheng Liu +4 more
semanticscholar +1 more source
RETTA: Retrieval-enhanced test-time adaptation for zero-shot video captioning
Pattern Recognition
Despite the significant progress of fully-supervised video captioning, zero-shot methods remain much less explored. In this paper, we propose a novel zero-shot video captioning framework named Retrieval-Enhanced Test-Time Adaptation (RETTA), which takes ...
Yunchuan Ma +6 more
semanticscholar +1 more source
Traffic Scenario Understanding and Video Captioning via Guidance Attention Captioning Network
IEEE Transactions on Intelligent Transportation Systems (Print)
Describing a traffic scenario from the driver’s perspective is a challenging process for an Advanced Driving Assistance System (ADAS), involving different sub-tasks of detection, tracking, segmentation, etc.
Chunsheng Liu +6 more
semanticscholar +1 more source
AAAI Conference on Artificial Intelligence
Weakly-Supervised Dense Video Captioning (WSDVC) aims to localize and describe all events of interest in a video without requiring annotations of event boundaries. This setting poses a great challenge in accurately locating the temporal boundaries of events ...
Shiping Ge +6 more
semanticscholar +1 more source