# Video Representation

Proposes a method that jointly learns a video representation to encode step concepts and a deep probabilistic model to capture temporal dependencies and immense individual variations in step ordering. Representation: learned by matching a video clip to its corresponding narration. Model: predicts the distribution of the video representation for a missing step, given the steps in its vicinity.

<figure><img src="/files/d15BBgzsyzKXllmq39wj" alt=""><figcaption><p>Model outcome</p></figcaption></figure>

## Method

Input: $$N$$ clips $$\{v\_1, v\_2, \dots, v\_N\}$$ from an instructional video. Each $$v\_i$$ captures a potential action step and its time interval. We assume a set of sentences $$\{s\_1, s\_2, \dots, s\_N\}$$ is associated with the clips, where $$s\_i$$ describes the action in $$v\_i$$.
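
A minimal sketch of this assumed input structure, one `(clip, narration, time)` record per step; the `Step` class and field names are illustrative, not from the paper:

```python
import torch
from dataclasses import dataclass

@dataclass
class Step:
    clip: torch.Tensor   # video clip v_i, e.g. shaped (frames, 3, H, W)
    sentence: str        # narration s_i describing the action in v_i
    start: float         # clip start time (seconds)
    end: float           # clip end time (seconds)

# An instructional video is an ordered list of N such steps.
video: list[Step] = []
```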

Goal: learn a video representation that encodes both action steps and their temporal dependencies. Representation: a video encoder $$f$$ extracts the representation $$x\_i$$ from $$v\_i$$ ($$x\_i = f(v\_i)$$). Probabilistic model: the conditional probability $$p(x\_j = f(v\_j) \mid \{ x\_i = f(v\_i) \}\_{i \neq j}), \forall j$$; that is, $$p(x\_j \mid \{ x\_i \}\_{i \neq j})$$ models the temporal dependencies among steps.
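
A hedged sketch of the two components, assuming a Transformer over the step sequence as the context model; the class names, placeholder backbone, and dimensions are assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    """f: maps a (batched) clip v_i to its representation x_i = f(v_i)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        # Placeholder backbone; the paper would use a real video network here.
        self.backbone = nn.Sequential(nn.Flatten(), nn.LazyLinear(dim))

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        return self.backbone(clip)

class StepContextModel(nn.Module):
    """Attends over the step sequence {x_i} to predict the masked x_j."""
    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x_seq: torch.Tensor) -> torch.Tensor:
        # x_seq: (batch, N, dim), with position j replaced by a mask token
        return self.encoder(x_seq)
```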

Overview: leverage a pretrained text encoder $$g$$ that is kept fixed during learning, and extend the idea of masked token modeling popular in NLP. For a randomly sampled clip $$v\_j$$, mask it and train the model to predict the distribution of $$x\_j = f(v\_j)$$ with the probabilistic model. Align the expectation $$\mathbb{E}[x\_j]$$ with the corresponding text embedding $$y\_j = g(s\_j)$$, and match the observed $$\{ x\_i = f(v\_i) \}\_{i \neq j}$$ to their text embeddings $$\{ y\_i = g(s\_i) \}\_{i \neq j}$$. The method seeks to characterize the distribution of $$x\_j$$ instead of predicting a single point estimate.
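
A minimal sketch of one masked-step training update under these objectives, using cosine similarity as a stand-in for the paper's matching loss; the zero mask token, loss weighting, and function signatures are assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(f, g, context_model, clips, sentences, j):
    """One masked-step update; a hedged sketch, not the paper's exact loss.

    f: video encoder (trainable); g: frozen text encoder;
    context_model: predicts the masked representation from its neighbors.
    """
    x = torch.stack([f(c) for c in clips])          # {x_i = f(v_i)}, (N, dim)
    with torch.no_grad():
        y = torch.stack([g(s) for s in sentences])  # {y_i = g(s_i)}, g is fixed

    # Mask step j and predict its representation from the remaining steps.
    x_masked = x.clone()
    x_masked[j] = 0.0                               # stand-in mask token
    x_hat = context_model(x_masked.unsqueeze(0)).squeeze(0)[j]  # ~ E[x_j]

    # Align the predicted (expected) x_j with its narration embedding y_j,
    # and match the observed clips to their own narrations.
    loss_masked = 1 - F.cosine_similarity(x_hat, y[j], dim=0)
    idx = [i for i in range(len(clips)) if i != j]
    loss_observed = (1 - F.cosine_similarity(x[idx], y[idx], dim=-1)).mean()
    return loss_masked + loss_observed
```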

<figure><img src="/files/f76QZyghHXCCyLtmxpg5" alt=""><figcaption><p>method</p></figcaption></figure>

Note that the model uses a diffusion process to model $$p(x\_j \mid \{x\_i\}\_{i \neq j})$$, which gradually adds noise to the input $$x\_j$$ over $$T$$ steps; the reverse denoising process, conditioned on the surrounding steps, then characterizes the distribution of $$x\_j$$.
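
For the forward noising half, here is a generic DDPM-style sketch of the closed-form $$q(x\_t \mid x\_0)$$ applied to a step representation; the linear beta schedule and its endpoints are standard defaults assumed here, not values from the paper:

```python
import torch

def forward_diffusion(x_j, t, betas):
    """Closed-form forward noising q(x_t | x_0) of the masked step x_j."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)[t]     # cumulative product up to t
    noise = torch.randn_like(x_j)
    x_t = alpha_bar.sqrt() * x_j + (1 - alpha_bar).sqrt() * noise
    return x_t, noise

# Example: T = 1000 steps with a linear beta schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
x0 = torch.randn(512)             # a clean step representation x_j
x_t, eps = forward_diffusion(x0, t=500, betas=betas)
```

Training the reverse (denoising) model to undo this noise, conditioned on the neighboring steps, is what lets the method output a distribution over $$x\_j$$ rather than a single point.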

