Video Representation
https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_Learning_Procedure-Aware_Video_Representation_From_Instructional_Videos_and_Their_Narrations_CVPR_2023_paper.pdf
Proposes a method that jointly learns a video representation encoding individual step concepts and a deep probabilistic model capturing the temporal dependencies and large individual variations in step ordering. The representation is learned by matching a video clip to its corresponding narration; the model is trained to predict the distribution of the video representation for a missing step, given the steps in its vicinity.

Method
Input: a set of video clips {v_1, ..., v_N}, where each clip v_i captures a potential action step and its time span. We assume a set of narration sentences {s_i} is associated with the videos, where each s_i describes an action.
The goal is to learn a video representation that encodes both the action steps and their temporal dependencies. Representation: a video encoder f extracts a step embedding x_i = f(v_i) from each clip v_i. Probabilistic model: the conditional probability p(x_j | {x_i}_{i \neq j}) models the temporal dependencies among steps.
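A minimal sketch of these two components, assuming precomputed clip features, hypothetical module names, and a Transformer over the step sequence as the conditional model; not the authors' implementation.

```python
# Sketch (PyTorch): a video encoder producing step embeddings x_i, and a
# conditional model whose output at the masked position parameterizes
# p(x_j | {x_i}_{i != j}). Shapes and module names are assumptions.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, feat_dim=768, emb_dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                  nn.Linear(emb_dim, emb_dim))

    def forward(self, clip_feats):           # (N, feat_dim) precomputed clip features
        x = self.proj(clip_feats)             # (N, emb_dim) step embeddings x_i
        return nn.functional.normalize(x, dim=-1)

class ConditionalStepModel(nn.Module):
    """Transformer over the step sequence; the output at the masked
    position conditions the distribution of the missing step x_j."""
    def __init__(self, emb_dim=512, n_layers=4, n_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.mask_token = nn.Parameter(torch.zeros(emb_dim))

    def forward(self, x, j):                  # x: (B, N, emb_dim), j: masked index
        x = x.clone()
        x[:, j] = self.mask_token              # hide the target step
        h = self.encoder(x)
        return h[:, j]                          # conditioning vector for step j
```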
Overview: leverage a pretrained text encoder that is kept fixed during learning, and extend the idea of masked token modeling popular in NLP. For a randomly sampled clip v_j, the model predicts the distribution of x_j from the probabilistic model and aligns its expectation with the corresponding text embedding s_j, while the representations x_i of the remaining clips are matched to their text embeddings s_i. The method thus seeks to characterize the distribution of x_j rather than predict a single point estimate.
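A hedged sketch of this masked-step objective, reusing the modules from the previous sketch, assuming frozen precomputed text embeddings, and using a simple cosine-alignment loss as a stand-in for the paper's exact objectives.

```python
# Sketch of the masked-step training objective described above.
# Assumptions: text_emb are frozen narration embeddings, f_video and
# cond_model are the modules sketched earlier, and cosine alignment
# stands in for the paper's actual losses.
import torch
import torch.nn.functional as F

def masked_step_loss(clip_feats, text_emb, f_video, cond_model):
    # clip_feats: (N, feat_dim) clip features; text_emb: (N, emb_dim) frozen text embeddings
    x = f_video(clip_feats)                          # step embeddings x_i
    j = torch.randint(0, x.size(0), (1,)).item()     # randomly masked step

    # (1) match each visible clip embedding to its own narration embedding
    visible = torch.ones(x.size(0), dtype=torch.bool)
    visible[j] = False
    loss_match = 1 - F.cosine_similarity(x[visible], text_emb[visible], dim=-1).mean()

    # (2) predict the masked step from its neighbours and align the
    #     expectation of p(x_j | ...) with the masked step's narration embedding
    x_hat = cond_model(x.unsqueeze(0), j).squeeze(0)
    loss_pred = 1 - F.cosine_similarity(x_hat, text_emb[j], dim=0)

    return loss_match + loss_pred
```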

Importantly, the model uses a diffusion process to model p(x_j | {x_i}_{i \neq j}), which gradually adds noise to the input x_j over T steps.
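An illustrative forward-diffusion step, assuming a standard linear noise schedule (the schedule and step count here are assumptions, not values from the paper), showing how x_j is gradually noised over T steps before the learned reverse process denoises it conditioned on the surrounding steps.

```python
# Forward diffusion q(x^(t) | x^(0)) with an assumed linear beta schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed noise schedule
alphas_cum = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x_j, t):
    """Sample a noised version of x_j at step t:
    x_j^(t) = sqrt(a_t) * x_j + sqrt(1 - a_t) * eps, with eps ~ N(0, I)."""
    a_t = alphas_cum[t]
    noise = torch.randn_like(x_j)
    return torch.sqrt(a_t) * x_j + torch.sqrt(1.0 - a_t) * noise
```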