# Lecture 8

## VAE

PixelRNN explicitly parameterizes the density function with a neural network, so we train it to maximize the likelihood of the training data:

$$
p\_W(x) = \prod\_{t=1}^Tp\_W(x\_t|x\_1, \dots,x\_{t-1})
$$
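
Maximizing this factorized likelihood amounts to summing per-pixel conditional log-probabilities. A minimal sketch (the tensor shapes, discretization into K pixel values, and function name are assumptions, not the lecture's code):

```python
import torch

def autoregressive_log_likelihood(logits, x):
    """log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1}) for each image in a batch.

    logits: (N, T, K) scores over K pixel values for each of T pixels,
            produced by an autoregressive model (e.g. a PixelRNN) that only
            conditions on pixels before position t.
    x:      (N, T) integer pixel values in [0, K).
    """
    log_probs = torch.log_softmax(logits, dim=-1)       # (N, T, K)
    per_pixel = log_probs.gather(-1, x.unsqueeze(-1))   # (N, T, 1)
    return per_pixel.squeeze(-1).sum(dim=1)             # (N,) total log-likelihood
```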

A VAE defines an intractable density that we cannot explicitly compute or optimize; instead, we optimize a lower bound on the density.

### Autoencoders (Non variational)

Unsupervised method for learning feature vectors from raw data x, without labels.

Features should extract useful information that we can use for downstream tasks.

Originally: Linear + nonlinearity (sigmoid)

Later: Deep, fully-connected

Later: ReLU CNN

input data x -> Encoder -> features Z

We use features to reconstruct the input data with a decoder.

features Z -> Decoder -> reconstructed data $$\hat{X}$$.

Loss: l2 distance between input and reconstructed data

$$
||\hat{X} - X||\_2^2
$$

Features need to be lower dimensional than data.

After training, throw away decoder and use encoder for a downstream task.

Autoencoders learn latent features for data without any labels.

Can use features to initialize a supervised model.

Not probabilistic: No way to sample new data from learned model.
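
A minimal sketch of such an autoencoder with the L2 reconstruction loss, assuming fully-connected layers and made-up layer sizes:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Plain (non-variational) autoencoder: x -> z -> x_hat."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        # Encoder maps raw data to a lower-dimensional feature vector z.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder reconstructs the input from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, in_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

# L2 reconstruction loss: ||x_hat - x||_2^2
x = torch.randn(16, 784)            # a fake batch of flattened inputs
x_hat = Autoencoder()(x)
loss = ((x_hat - x) ** 2).sum(dim=1).mean()
```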

### Variational Autoencoders

VAEs are a probabilistic spin on autoencoders: learn latent features z from raw data, and sample from the model to generate new data.

Assume training data $$\{x^{(i)}\}\_{i=1}^N$$ is generated from an unobserved representation z.

Intuition: x is an image, z contains the latent factors used to generate x.

After training, sample new data like this:

Sample $$z$$ from the prior $$p\_{\theta^\*}(z)$$. Assume a simple prior $$p(z)$$, e.g. a Gaussian.

Sample $$x$$ from the conditional $$p\_{\theta^\*}(x|z^{(i)})$$. Represent $$p(x|z)$$ with a neural network (the decoder).

Decoder must be probabilistic: the decoder inputs $$z$$ and outputs a mean $$\mu\_{x|z}$$ and covariance $$\Sigma\_{x|z}$$.

Sample $$x$$ from a Gaussian with that mean and covariance.
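
A sketch of what such a probabilistic decoder could look like: a network with two output heads for $$\mu\_{x|z}$$ and a diagonal $$\Sigma\_{x|z}$$ (parameterized by its log here, an assumption for numerical convenience), followed by sampling x. Layer sizes are made up:

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Decoder p(x|z): maps z to the mean and diagonal covariance of a Gaussian."""
    def __init__(self, latent_dim=32, out_dim=784):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, out_dim)       # mean of x given z
        self.logvar_head = nn.Linear(256, out_dim)   # log of the diagonal covariance

    def forward(self, z):
        h = self.hidden(z)
        return self.mu_head(h), self.logvar_head(h)

# Generate: sample z from the simple prior, then sample x ~ N(mu_{x|z}, Sigma_{x|z}).
decoder = GaussianDecoder()
z = torch.randn(1, 32)                               # z ~ p(z) = N(0, I)
mu, logvar = decoder(z)
x = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```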

How to train this model?

Maximize likelihood of data.

If we could observe z for each x, then we could train a conditional generative model p(x|z).

We don't observe z, so we need to marginalize:

$$
p\_\theta(x) = \int p\_\theta(x,z)dz = \int p\_\theta(x|z)p\_\theta(z)dz
$$

We can compute $$p\_\theta(x|z)$$ with the decoder network, and we assumed a Gaussian prior $$p\_\theta(z)$$. But it is impossible to integrate over all z.
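
In principle one could estimate the marginal by Monte Carlo, sampling z from the prior and averaging $$p\_\theta(x|z)$$; the sketch below (reusing the GaussianDecoder sketch assumed above) shows the idea and why it is hopeless in practice: almost no z drawn from the prior explains a given x, so a usable estimate would need astronomically many samples.

```python
import math
import torch
from torch.distributions import Normal

def naive_log_px(decoder, x, latent_dim=32, num_samples=1000):
    """Monte Carlo estimate: log p(x) ≈ log( 1/K · Σ_k p(x | z_k) ), z_k ~ p(z) = N(0, I).

    x: a single data point, shape (out_dim,).
    """
    z = torch.randn(num_samples, latent_dim)             # z_k ~ p(z)
    mu, logvar = decoder(z)                               # parameters of p(x | z_k)
    log_px_given_z = Normal(mu, torch.exp(0.5 * logvar)).log_prob(x).sum(dim=-1)
    return torch.logsumexp(log_px_given_z, dim=0) - math.log(num_samples)
```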

Could also try Bayes rule, but we cannot compute $$p\_\theta(z|x)$$.

Solution: train another network (the encoder) to learn $$q\_\Phi(z|x) \approx p\_\theta(z|x)$$.

Then

$$
p\_\theta(x) = \frac{p\_\theta(x|z)p\_\theta(z)}{p\_\theta(z|x)} \approx \frac{p\_\theta(x|z)p\_\theta(z)}{q\_\Phi(z|x)}
$$
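
The encoder can have the same structure as the probabilistic decoder: a network with two heads for the mean $$\mu\_{z|x}$$ and the (diagonal, log-parameterized) covariance $$\Sigma\_{z|x}$$ of z. A minimal sketch with assumed layer sizes:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Encoder q_phi(z|x): maps x to the mean and diagonal covariance of z."""
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)      # mu_{z|x}
        self.logvar_head = nn.Linear(256, latent_dim)  # log of the diagonal Sigma_{z|x}

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)
```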

<figure><img src="/files/8rV1d3LHYoXqzvBu89Wk" alt=""><figcaption><p>VAE</p></figcaption></figure>

Derivation of the variational lower bound:

$$\log p\_\theta(x) = \log \frac{p\_\theta(x|z)p(z)}{p\_\theta(z|x)}$$ Bayes Rule

$$= \log \frac{p\_\theta(x|z)p(z)q\_\Phi(z|x)}{p\_\theta(z|x)q\_\Phi(z|x)}$$ Multiply top and bottom by $$q\_\Phi(z|x)$$

$$= \log p\_\theta(x|z) - \log \frac{q\_\Phi(z|x)}{p(z)} + \log \frac{q\_\Phi(z|x)}{p\_\theta(z|x)}$$ Split using rules for logarithms.

$$\log p\_\theta(x) = E\_{z \sim q\_\Phi(z|x)} \[\log p\_\theta(x)]$$ We can wrap in an expectation over z since $$\log p\_\theta(x)$$ doesn't depend on z.

$$= E\_z\[\log p\_\theta(x|z)] - E\_z\[\log \frac{q\_\Phi(z|x)}{p(z)}] + E\_z\[\log \frac{q\_\Phi (z|x)}{p\_\theta(z|x)}]$$

$$= E\_{z \sim q\_\Phi(z|x)}\[\log p\_\theta(x|z)] - D\_{KL}(q\_\Phi(z|x), p(z)) + D\_{KL}(q\_\Phi(z|x), p\_\theta(z|x))$$

Part 1: the data reconstruction term.

Part 2: KL divergence between the prior and the samples from the encoder network.

Part 3: KL divergence between the encoder and the posterior of the decoder. We cannot compute this term, but KL ≥ 0, so dropping it gives a lower bound on the data likelihood:

$$\log p\_\theta(x) \geq E\_{z \sim q\_\Phi(z|x)} \[\log p\_\theta(x|z)] - D\_{KL}(q\_\Phi(z|x), p(z))$$

Jointly train the encoder q and decoder p to maximize the variational lower bound on the data likelihood, also called the Evidence Lower Bound (ELBO).
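
With a diagonal-Gaussian encoder and a unit Gaussian prior, the KL term has a closed form, so the (negative) ELBO can be written directly. A sketch, assuming the Gaussian encoder/decoder parameterization above (log-variance outputs are an assumption):

```python
import math
import torch

def negative_elbo(x, x_mu, x_logvar, z_mu, z_logvar):
    """-ELBO = -E_q[log p(x|z)] + KL(q_phi(z|x) || p(z)), averaged over the batch.

    x_mu, x_logvar: decoder output for a z sampled from the encoder (Gaussian p(x|z)).
    z_mu, z_logvar: encoder output (Gaussian q_phi(z|x)); prior p(z) = N(0, I).
    """
    # Reconstruction term: -log N(x; x_mu, diag(exp(x_logvar))), one-sample estimate.
    recon = 0.5 * (x_logvar + (x - x_mu) ** 2 / torch.exp(x_logvar)
                   + math.log(2 * math.pi)).sum(dim=-1)
    # KL term in closed form for a diagonal Gaussian vs. the unit Gaussian prior.
    kl = 0.5 * (torch.exp(z_logvar) + z_mu ** 2 - 1 - z_logvar).sum(dim=-1)
    return (recon + kl).mean()
```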

<figure><img src="/files/GOyQjCrrwKzlzHj7JbBC" alt=""><figcaption><p>VAE</p></figcaption></figure>

### VAE: Training

1. Run the input data through the encoder to get a distribution over latent codes.
2. The encoder output should match the prior $$p(z)$$ (this is the KL term of the ELBO).
3. Sample a code z from the encoder output.
4. Run the sampled code through the decoder to get a distribution over data samples.
5. The original input data should be likely under the distribution output from (4).
6. Can sample a reconstruction from (4). A training-step sketch follows this list.
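
One training step might look like the sketch below, reusing the GaussianEncoder, GaussianDecoder, and negative_elbo sketches from above; the reparameterized sampling in step (3) is standard practice so gradients can flow through the sample:

```python
import torch

encoder, decoder = GaussianEncoder(), GaussianDecoder()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

x = torch.randn(16, 784)                          # a fake batch of flattened inputs

# (1)-(2) encoder gives q_phi(z|x); its KL to the prior is part of the loss.
z_mu, z_logvar = encoder(x)
# (3) sample z from the encoder output (reparameterized so gradients flow).
z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)
# (4) decoder gives a distribution over reconstructions.
x_mu, x_logvar = decoder(z)
# (5) the input should be likely under that distribution: maximize the ELBO.
loss = negative_elbo(x, x_mu, x_logvar, z_mu, z_logvar)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```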

### VAE: Generating Data

1. Sample z from prior p(z)
2. Run sample z through decoder to get distribution over data x.
3. Sample from distribution in (2) to generate data.

Diagonal prior on p(z) causes dimensions of z to be independent.

### VAE: Editing Images

1. Run the input through the encoder to get a distribution over latent codes.
2. Sample a code z from the encoder output.
3. Modify some dimensions of the sampled code.
4. Run the modified z through the decoder to get a distribution over data samples.
5. Sample new data from (4); a sketch follows this list.
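
A sketch of this editing procedure, reusing the encoder/decoder sketches above (the edited dimension and offset are arbitrary illustration values):

```python
import torch

encoder, decoder = GaussianEncoder(), GaussianDecoder()

with torch.no_grad():
    x = torch.randn(1, 784)                                         # an input image (flattened)
    z_mu, z_logvar = encoder(x)                                     # (1) distribution over latent codes
    z = z_mu + torch.exp(0.5 * z_logvar) * torch.randn_like(z_mu)   # (2) sample a code
    z[:, 0] += 2.0                                                  # (3) modify one dimension of z
    x_mu, x_logvar = decoder(z)                                     # (4) distribution over data samples
    x_new = x_mu + torch.exp(0.5 * x_logvar) * torch.randn_like(x_mu)  # (5) sample new data
```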

<figure><img src="/files/ZmQhNlRTxw4LPLb9mZZR" alt=""><figcaption><p>Summary</p></figcaption></figure>

