Dense Procedure Captioning in Narrated Instructional Videos [PDF]

open access: yes, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019
Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by dense video captioning, we propose a model to generate procedure captions from narrated instructional videos, which consist of sequences of step-wise clips with descriptions.
Botian Shi   +6 more
openaire   +1 more source

Understanding Video Narratives Through Dense Captioning with Linguistic Modules, Contextual Semantics, and Caption Selection

open access: yes, AI
Dense video captioning involves identifying, localizing, and describing multiple events within a video. Capturing temporal and contextual dependencies between events is essential for generating coherent and accurate captions.
Dvijesh Bhatt, Priyank Thakkar
doaj   +1 more source

Dense Video Object Captioning from Disjoint Supervision

open access: yes, 2023
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language.
Zhou, Xingyi   +3 more
openaire   +2 more sources

Multilevel Language and Vision Integration for Text-to-Clip Retrieval

open access: yes, 2018
We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video.
He, Kun   +5 more
core   +1 more source

Areas of Attention for Image Captioning [PDF]

open access: yes, 2016
We propose "Areas of Attention", a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions.
Lucas, Thomas   +3 more
core   +5 more sources

Streaming Dense Video Captioning

open access: yes, 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a ...
Zhou, Xingyi   +7 more
openaire   +2 more sources

An Efficient Framework for Dense Video Captioning

open access: yesProceedings of the AAAI Conference on Artificial Intelligence, 2020
Dense video captioning is an extremely challenging task since an accurate and faithful description of events in a video requires a holistic knowledge of the video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first proposing event boundaries from a video and then captioning on a subset of ...
Maitreya Suin, A. N. Rajagopalan
openaire   +2 more sources

Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation

open access: yes, 2016
We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder--decoder pipeline, popular in image and video captioning systems. We propose ...
Laaksonen, Jorma, Shetty, Rakshith
core   +1 more source

Beyond Caption To Narrative: Video Captioning With Multiple Sentences

open access: yes, 2016
Recent advances in the image captioning task have led to increasing interest in the video captioning task. However, most work on video captioning focuses on generating captions from a single input of aggregated features, which hardly deviates from the image captioning process ...
Harada, Tatsuya   +2 more
core   +1 more source

Video Captioning via Hierarchical Reinforcement Learning

open access: yes, 2018
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is ...
Chen, Wenhu   +4 more
core   +1 more source