Proposes a method that jointly learns a video representation encoding step concepts and a deep probabilistic model capturing temporal dependencies and the large individual variation in step ordering. Representation: learned by matching a video clip to its corresponding narration. Model: predicts the distribution of the video representation of a missing step, given the steps in its vicinity.
Model outcome
Method
Input: N clips {v_1, v_2, …, v_N}, where each v_i captures a potential action step and its time. We assume a set of sentences {s_1, s_2, …, s_N} is associated with the clips, with s_i describing the action in v_i.
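As a rough illustration of this setup (the shapes, clip length, and placeholder sentences below are assumptions for the sketch, not values from the paper):

```python
import torch

N = 8                                     # number of clips sampled from one video
clips = torch.randn(N, 3, 16, 224, 224)   # v_1..v_N: N clips of 16 RGB frames (assumed size)
sentences = [f"narration for step {i}" for i in range(N)]   # s_1..s_N (placeholders)
```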
Goal: learn a video representation that encodes both the action steps and their temporal dependencies. Representation: a video encoder f extracts a representation x_i from each clip v_i, i.e., x_i = f(v_i). Probabilistic model: the conditional probability p(x_j = f(v_j) ∣ {x_i = f(v_i)}_{i≠j}) for every j; p(x_j ∣ {x_i}_{i≠j}) models the temporal dependencies among steps.
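A minimal sketch of these two components, assuming simple stand-in architectures (a small 3D-conv encoder for f and a Transformer over step representations for the conditional model; neither is the paper's actual backbone):

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """f: clip v_i -> representation x_i (small stand-in backbone)."""
    def __init__(self, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
            nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, clips):            # clips: (N, 3, frames, H, W)
        return self.backbone(clips)      # x:     (N, dim)


class ConditionalStepModel(nn.Module):
    """Predicts x_j from the visible steps {x_i}_{i != j} (Transformer stand-in)."""
    def __init__(self, dim=512, nhead=8, nlayers=4, max_steps=64):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.pos = nn.Parameter(torch.zeros(max_steps, dim))   # step-order encoding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=nlayers)

    def forward(self, x, j):
        # Mask the j-th representation, add positional information, and let the
        # Transformer aggregate the remaining steps to predict the missing one.
        n = x.size(0)
        x = x.clone()
        x[j] = self.mask_token
        h = self.encoder((x + self.pos[:n]).unsqueeze(0)).squeeze(0)   # (N, dim)
        return h[j]                                                    # prediction for x_j
```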
Overview: leverage a pretrained text encoder g that is kept frozen during learning, and extend the idea of masked token modeling popularized in NLP. For a randomly sampled clip v_j, train the model to predict the distribution of x_j = f(v_j) under the probabilistic model, align its expectation E[x_j] with the corresponding text embedding y_j = g(s_j), and match the remaining representations {x_i = f(v_i)}_{i≠j} to their text embeddings {y_i = g(s_i)}_{i≠j}. The method seeks to characterize the full distribution of x_j rather than predict a single point estimate.
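A hedged sketch of this objective, assuming a contrastive (InfoNCE-style) alignment loss and using the conditional model's output as a stand-in for E[x_j]; the exact loss form, temperature, and helper names here are assumptions, not taken from the paper:

```python
import torch
import torch.nn.functional as F

def nce_loss(x, y, temperature=0.07):
    """Match each video feature x_i to its text embedding y_i among all candidates in y."""
    x = F.normalize(x, dim=-1)
    y = F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (M, M) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return F.cross_entropy(logits, targets)

def masked_step_loss(f, cond_model, text_emb, clips, j, temperature=0.07):
    """text_emb holds frozen embeddings y_i = g(s_i); j is the index of the masked clip."""
    x = f(clips)                                          # x_i = f(v_i), shape (N, dim)
    x_j_pred = cond_model(x, j)                           # stand-in for E[x_j]
    # (a) align E[x_j] with its narration y_j, contrasted against the other narrations
    sim = F.normalize(x_j_pred, dim=-1) @ F.normalize(text_emb, dim=-1).t() / temperature
    loss_masked = F.cross_entropy(sim.unsqueeze(0), torch.tensor([j], device=x.device))
    # (b) match the visible clips {x_i}_{i != j} to their own narrations {y_i}_{i != j}
    visible = torch.tensor([i for i in range(x.size(0)) if i != j], device=x.device)
    loss_visible = nce_loss(x[visible], text_emb[visible], temperature)
    return loss_masked + loss_visible
```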
Diffusion process
Note that the model uses a diffusion process to model p(x_j ∣ {x_i}_{i≠j}): the forward process gradually adds noise to the input x_j over T steps, and the model learns to reverse this process conditioned on the surrounding steps.
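A minimal sketch of this formulation, assuming a standard DDPM-style setup (linear noise schedule, noise-prediction objective, and a `denoiser` network conditioned on the visible steps; these specifics are assumptions, not necessarily the paper's exact parameterization):

```python
import torch

T = 1000                                            # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)               # linear noise schedule (assumed)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative product of (1 - beta_t)

def q_sample(x_j, t, noise):
    """Forward process: noisy version of x_j at step t, i.e., a sample from q(x_j^t | x_j^0)."""
    return alphas_bar[t].sqrt() * x_j + (1.0 - alphas_bar[t]).sqrt() * noise

def diffusion_loss(denoiser, x_j, context):
    """Noise-prediction objective; `context` encodes the visible steps {x_i}_{i != j}."""
    t = torch.randint(0, T, (1,)).item()            # random diffusion step
    noise = torch.randn_like(x_j)
    x_t = q_sample(x_j, t, noise)
    pred_noise = denoiser(x_t, t, context)          # eps_theta(x_j^t, t, {x_i}_{i != j})
    return ((pred_noise - noise) ** 2).mean()
```

At inference, running the learned reverse process from pure noise (conditioned on the visible steps) yields samples of x_j, which is how the model characterizes a distribution rather than a single prediction.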