Video Representation

https://openaccess.thecvf.com/content/CVPR2023/papers/Yu_Learning_Procedure-Aware_Video_Representation_From_Instructional_Videos_and_Their_Narrations_CVPR_2023_paper.pdf

Proposes a method that jointly learns a video representation to encode step concepts and a deep probabilistic model to capture temporal dependencies and the large individual variation in step ordering. Representation: learned by matching a video clip to its corresponding narration. Model: predicts the distribution of the video representation for a missing step, given the steps in its vicinity.

Model outcome

Method

Input: $N$ clips $\{v_1, v_2, \dots, v_N\}$, where each $v_i$ captures a potential action step and its time span. We assume sentences $\{s_1, s_2, \dots, s_N\}$ are associated with the clips, where $s_i$ describes the action in $v_i$.
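A toy example of the assumed input pairing (clip boundaries and narration text below are made up purely for illustration):

```python
# Toy example of the assumed input format: N clips paired with N narration
# sentences. Timestamps and text are invented for illustration only.
clips = [
    {"id": "v1", "start": 0.0, "end": 4.2},    # potential action step 1
    {"id": "v2", "start": 4.2, "end": 9.7},    # potential action step 2
    {"id": "v3", "start": 9.7, "end": 15.1},   # potential action step 3
]
sentences = [
    "crack the eggs into a bowl",    # s1 describes the action in v1
    "whisk the eggs",                # s2 describes the action in v2
    "pour the mixture into a pan",   # s3 describes the action in v3
]
assert len(clips) == len(sentences)
```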

Learn a video representation that encodes both action steps and temporal dependencies. Representation: a video encoder $f$ that extracts a representation $x_i$ from $v_i$, i.e. $x_i = f(v_i)$. Probabilistic model: the conditional probability $p(x_j = f(v_j) \mid \{ x_i = f(v_i) \}_{i \neq j})$ for all $j$; $p(x_j \mid \{ x_i \}_{i \neq j})$ models the temporal dependencies among steps.
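A rough sketch of these two components (module names, the linear stand-in for the video backbone, and the transformer context encoder are assumptions, not the paper's architecture):

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """f: maps a clip v_i (here pre-extracted clip features) to a step representation x_i."""
    def __init__(self, feat_dim=512, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)  # stand-in for a real video backbone

    def forward(self, clip_feats):        # clip_feats: (B, feat_dim)
        return self.proj(clip_feats)      # x_i = f(v_i), shape (B, emb_dim)

class ConditionalStepModel(nn.Module):
    """Summarizes the context {x_i}_{i != j} used to parameterize p(x_j | {x_i}_{i != j})."""
    def __init__(self, emb_dim=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(emb_dim, n_heads, batch_first=True)
        self.context = nn.TransformerEncoder(layer, n_layers)

    def forward(self, context_x):         # context_x: (B, N-1, emb_dim)
        h = self.context(context_x)       # temporal dependencies among steps
        return h.mean(dim=1)              # context summary, shape (B, emb_dim)
```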

Overview: leverage a pretrained text encoder $g$ that is kept fixed during learning, and extend the idea of masked token modeling popularized in NLP. For a randomly sampled clip $v_j$, train the model to predict the distribution of $x_j = f(v_j)$ from the probabilistic model, align its expectation $\mathbb{E}(x_j)$ with the corresponding text embedding $y_j = g(s_j)$, and match $\{ x_i = f(v_i) \}_{i \neq j}$ to their text embeddings $\{ y_i = g(s_i) \}_{i \neq j}$. The method seeks to characterize the distribution of $x_j$ rather than predict a single point estimate.
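A hedged sketch of this training objective, assuming $\mathbb{E}(x_j)$ is produced by the probabilistic model and that both alignment terms are implemented as simple cosine / InfoNCE-style losses (the paper's exact losses may differ):

```python
import torch
import torch.nn.functional as F

def alignment_loss(pred_mean_xj, y_j, x_context, y_context, temperature=0.07):
    """
    pred_mean_xj: (B, D)    predicted E[x_j] for the masked clip
    y_j:          (B, D)    text embedding g(s_j) of its narration (g is frozen)
    x_context:    (B, K, D) representations f(v_i), i != j
    y_context:    (B, K, D) their text embeddings g(s_i), i != j
    """
    # 1) Align the predicted expectation of the masked step with its narration.
    masked_loss = 1 - F.cosine_similarity(pred_mean_xj, y_j, dim=-1).mean()

    # 2) Match the observed clips to their own narrations (contrastive matching).
    x = F.normalize(x_context.flatten(0, 1), dim=-1)
    y = F.normalize(y_context.flatten(0, 1), dim=-1)
    logits = x @ y.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    context_loss = F.cross_entropy(logits, targets)

    return masked_loss + context_loss
```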

Diffusion model

Note that the model uses a diffusion process to model $p(x_j \mid \{ x_i \}_{i \neq j})$, which gradually adds noise to the input $x_j$ over $T$ steps.
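A minimal sketch of the forward (noising) direction, assuming a standard DDPM-style linear $\beta$ schedule (the schedule values and $T$ are assumptions):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)           # assumed linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, 0)  # \bar{alpha}_t

def q_sample(x_j, t, noise=None):
    """Sample the noised x_j at step t in closed form: q(x^(t) | x^(0))."""
    if noise is None:
        noise = torch.randn_like(x_j)
    a_bar = alphas_cumprod[t].sqrt().view(-1, 1)          # sqrt(\bar{alpha}_t)
    one_minus = (1 - alphas_cumprod[t]).sqrt().view(-1, 1)
    return a_bar * x_j + one_minus * noise                # (B, D)
```

The reverse (denoising) direction, conditioned on the surrounding steps' representations, would then be trained to recover $x_j$, and its mean prediction plays the role of $\mathbb{E}(x_j)$ in the alignment above.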
