Results 21 to 30 of about 5,597
Dense Procedure Captioning in Narrated Instructional Videos [PDF]
Understanding narrated instructional videos is important for both research and real-world web applications. Motivated by dense video captioning, we propose a model that generates procedure captions from narrated instructional videos, which are sequences of step-wise clips, each with a description.
Botian Shi +6 more
openaire +1 more source
Dense video captioning involves identifying, localizing, and describing multiple events within a video. Capturing temporal and contextual dependencies between events is essential for generating coherent and accurate captions.
Dvijesh Bhatt, Priyank Thakkar
doaj +1 more source
Dense Video Object Captioning from Disjoint Supervision
We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video. This task unifies spatial and temporal localization in video, whilst also requiring fine-grained visual understanding that is best described by natural language.
Zhou, Xingyi +3 more
openaire +2 more sources
Multilevel Language and Vision Integration for Text-to-Clip Retrieval
We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video.
He, Kun +5 more
core +1 more source
Areas of Attention for Image Captioning [PDF]
We propose "Areas of Attention", a novel attention-based model for automatic image captioning. Our approach models the dependencies between image regions, caption words, and the state of an RNN language model, using three pairwise interactions.
Lucas, Thomas +3 more
core +5 more sources
Streaming Dense Video Captioning
An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video. Current state-of-the-art models, however, process a fixed number of downsampled frames, and make a ...
Zhou, Xingyi +7 more
openaire +2 more sources
An Efficient Framework for Dense Video Captioning
Dense video captioning is an extremely challenging task since an accurate and faithful description of events in a video requires a holistic knowledge of the video contents as well as contextual reasoning of individual events. Most existing approaches handle this problem by first proposing event boundaries from a video and then captioning on a subset of ...
Maitreya Suin, A. N. Rajagopalan
openaire +2 more sources
Frame- and Segment-Level Features and Candidate Pool Evaluation for Video Caption Generation
We present our submission to the Microsoft Video to Language Challenge of generating short captions describing videos in the challenge dataset. Our model is based on the encoder-decoder pipeline, popular in image and video captioning systems. We propose ...
Laaksonen, Jorma, Shetty, Rakshith
core +1 more source
Beyond Caption To Narrative: Video Captioning With Multiple Sentences
Recent advances in the image captioning task have led to increasing interest in the video captioning task. However, most works on video captioning focus on generating a single output from aggregated features, which hardly deviates from the image captioning process ...
Harada, Tatsuya +2 more
core +1 more source
Video Captioning via Hierarchical Reinforcement Learning
Video captioning is the task of automatically generating a textual description of the actions in a video. Although previous work (e.g. sequence-to-sequence model) has shown promising results in abstracting a coarse description of a short video, it is ...
Chen, Wenhu +4 more
core +1 more source