Lecture 8
VAE
PixelRNN explicitly parameterizes the density function with a neural network, so we train by maximizing the likelihood of the training data.
VAEs define an intractable density that we cannot explicitly compute or optimize; instead, we optimize a lower bound on the density.
Autoencoders (Non-variational)
Unsupervised method for learning feature vectors from raw data x, without labels.
Features should extract useful information that we can use for downstream tasks.
Originally: Linear + nonlinearity (sigmoid)
Later: Deep, fully-connected
Later: ReLU CNN
Input data x -> Encoder -> features z
We use features to reconstruct the input data with a decoder.
Features z -> Decoder -> reconstructed data x̂
Loss: l2 distance between input and reconstructed data
Features need to be lower dimensional than data.
After training, throw away decoder and use encoder for a downstream task.
Autoencoders learn latent features for data without any labels.
Can use features to initialize a supervised model.
Not probabilistic: No way to sample new data from learned model.
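A minimal sketch of this pipeline in PyTorch (the framework choice, layer sizes, 784-dim flattened input, and 64-dim feature code are illustrative assumptions, not values from the lecture):

```python
import torch
import torch.nn as nn

# Minimal (non-variational) autoencoder sketch.
# Sizes are illustrative: e.g. flattened 28x28 images -> 64-dim features.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=64):
        super().__init__()
        # Encoder: raw data x -> lower-dimensional features z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: features z -> reconstructed data x_hat
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

model = Autoencoder()
x = torch.randn(32, 784)            # a batch of (fake) flattened inputs
x_hat = model(x)
loss = ((x_hat - x) ** 2).mean()    # L2 reconstruction loss
loss.backward()
# After training: keep model.encoder as a feature extractor, discard the decoder.
```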
Variational Autoencoders
Goal: like an autoencoder, learn latent features z from raw data, but also be able to sample from the model to generate new data.
Assume the training data is generated from an unobserved latent representation z.
Intuition: x is an image, z are the latent factors used to generate x.
After training: sample new data like this:
Sample z from the prior. Assume a simple (e.g. Gaussian) prior p(z).
Sample x from the conditional p(x|z). Represent p(x|z) with a neural network (the decoder).
The decoder must be probabilistic: it takes z as input and outputs a mean and a covariance for x.
Sample x from a Gaussian with that mean and covariance.
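A sketch of this generation procedure, assuming a diagonal-Gaussian decoder implemented with a mean head and a log-variance head; the unit-Gaussian prior N(0, I) and all sizes are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    """Maps a latent code z to the mean and diagonal covariance of p(x|z)."""
    def __init__(self, latent_dim=20, data_dim=784):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, data_dim)       # mean of p(x|z)
        self.logvar_head = nn.Linear(256, data_dim)   # log of the diagonal covariance

    def forward(self, z):
        h = self.hidden(z)
        return self.mu_head(h), self.logvar_head(h)

decoder = GaussianDecoder()

# 1. Sample z from the simple prior p(z) = N(0, I).
z = torch.randn(1, 20)
# 2. The decoder outputs the mean and (diagonal) covariance of p(x|z).
mu, logvar = decoder(z)
# 3. Sample x from a Gaussian with that mean and covariance.
x = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
```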
How to train this model?
Maximize likelihood of data.
If we could observe z for each x, then we could train a conditional generative model p(x|z).
We don't observe z, so we need to marginalize:
$$p(x) = \int p(x|z)\,p(z)\,dz$$
We can compute p(x|z) with the decoder network, and we assumed a Gaussian prior p(z), but it is impossible to integrate over all z.
Could also try Bayes' rule, $p(x) = p(x|z)\,p(z)\,/\,p(z|x)$, but we cannot compute p(z|x).
Solution: train another network (the encoder) to learn an approximation $q(z|x) \approx p(z|x)$.
Then
$$p(x) = \frac{p(x|z)\,p(z)}{p(z|x)} \approx \frac{p(x|z)\,p(z)}{q(z|x)}$$
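A sketch of such an encoder network, mirroring the decoder above: it outputs the mean and (diagonal) covariance of q(z|x). The architecture and sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GaussianEncoder(nn.Module):
    """Amortized approximation q(z|x) to the intractable posterior p(z|x)."""
    def __init__(self, data_dim=784, latent_dim=20):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(data_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)      # mean of q(z|x)
        self.logvar_head = nn.Linear(256, latent_dim)  # log of the diagonal covariance

    def forward(self, x):
        h = self.hidden(x)
        return self.mu_head(h), self.logvar_head(h)

encoder = GaussianEncoder()
mu, logvar = encoder(torch.rand(16, 784))  # distribution parameters for a batch of 16 inputs
```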
VAE
Bayes' rule:
$$\log p(x) = \log \frac{p(x|z)\,p(z)}{p(z|x)}$$
Multiply top and bottom by q(z|x):
$$= \log \frac{p(x|z)\,p(z)\,q(z|x)}{p(z|x)\,q(z|x)}$$
Split using rules for logarithms:
$$= \log p(x|z) - \log\frac{q(z|x)}{p(z)} + \log\frac{q(z|x)}{p(z|x)}$$
We can wrap $\log p(x)$ in an expectation over $z \sim q(z|x)$ since it doesn't depend on z:
$$\log p(x) = \mathbb{E}_{z\sim q(z|x)}\big[\log p(x|z)\big] - D_{KL}\big(q(z|x)\,\|\,p(z)\big) + D_{KL}\big(q(z|x)\,\|\,p(z|x)\big)$$
Part 1: data reconstruction term.
Part 2: KL divergence between the prior and samples from the encoder network.
Part 3: KL divergence between the encoder and the posterior of the decoder. We cannot compute this term, but KL >= 0, so dropping it gives a lower bound on the data likelihood:
$$\log p(x) \ge \mathbb{E}_{z\sim q(z|x)}\big[\log p(x|z)\big] - D_{KL}\big(q(z|x)\,\|\,p(z)\big)$$
Jointly train the encoder q and decoder p to maximize this variational lower bound on the data likelihood, also called the Evidence Lower Bound (ELBO).
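With a unit-Gaussian prior $p(z) = \mathcal{N}(0, I)$ and a diagonal-Gaussian encoder (the standard choices, assumed here), the KL term has a closed form:
$$D_{KL}\big(\mathcal{N}(\mu, \mathrm{diag}(\sigma^2))\,\|\,\mathcal{N}(0, I)\big) = \frac{1}{2}\sum_{j}\left(\mu_j^2 + \sigma_j^2 - 1 - \log\sigma_j^2\right)$$
The reconstruction term is estimated by sampling z from q(z|x), as in the training procedure below.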
VAE: Training
1. Run the input data x through the encoder to get a distribution over latent codes.
2. The encoder output should match the prior p(z) (this is the KL term).
3. Sample a code z from the encoder output.
4. Run the sampled code through the decoder to get a distribution over data samples.
5. The original input data should be likely under the distribution output from (4) (this is the reconstruction term).
6. Can also sample a reconstruction from (4).
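A sketch of one training step tying these pieces together; the single-linear-layer encoder/decoder, the Gaussian likelihood for p(x|z), and the reparameterization z = μ + σ·ε used to sample from the encoder output are all simplifying assumptions:

```python
import torch
import torch.nn as nn

data_dim, latent_dim = 784, 20

# Tiny encoder q(z|x) and decoder p(x|z), each outputting a mean and a log-variance.
enc = nn.Linear(data_dim, 2 * latent_dim)
dec = nn.Linear(latent_dim, 2 * data_dim)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

x = torch.rand(32, data_dim)                      # a batch of (fake) training data

# 1-2. Run the input through the encoder to get a distribution over latent codes;
#      the KL term below pushes this distribution toward the prior p(z) = N(0, I).
mu_z, logvar_z = enc(x).chunk(2, dim=1)
# 3. Sample a code z from the encoder output (reparameterization trick).
z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
# 4. Run the sampled code through the decoder to get a distribution over data.
mu_x, logvar_x = dec(z).chunk(2, dim=1)
# 5. The original input should be likely under that distribution:
#    Gaussian negative log-likelihood (up to a constant) as the reconstruction term.
recon = 0.5 * (logvar_x + (x - mu_x) ** 2 / logvar_x.exp()).sum(dim=1).mean()
# KL(q(z|x) || N(0, I)) in closed form.
kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - 1 - logvar_z).sum(dim=1).mean()

loss = recon + kl                                 # negative ELBO
opt.zero_grad()
loss.backward()
opt.step()
```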
VAE: Generating Data
1. Sample z from the prior p(z).
2. Run the sampled z through the decoder to get a distribution over data x.
3. Sample from the distribution in (2) to generate data.
A diagonal prior on p(z) causes the dimensions of z to be independent.
VAE: Edit images
1. Run the input image through the encoder to get a distribution over latent codes.
2. Sample a code z from the encoder output.
3. Modify some dimensions of the sampled code.
4. Run the modified z through the decoder to get a distribution over data samples.
5. Sample new data from (4).
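A sketch of this editing procedure; which dimension to modify and by how much are arbitrary choices here, and the one-layer encoder/decoder heads are stand-ins for trained networks:

```python
import torch
import torch.nn as nn

data_dim, latent_dim = 784, 20
enc = nn.Linear(data_dim, 2 * latent_dim)   # encoder head: mean and log-variance of q(z|x)
dec = nn.Linear(latent_dim, 2 * data_dim)   # decoder head: mean and log-variance of p(x|z)

x = torch.rand(1, data_dim)                 # flattened input image to edit

with torch.no_grad():                       # editing is pure inference, no gradients needed
    # 1-2. Encoder gives a distribution over latent codes; sample a code z from it.
    mu_z, logvar_z = enc(x).chunk(2, dim=1)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    # 3. Modify some dimension of the sampled code (dimension 0, shifted by +2.0, chosen arbitrarily).
    z[:, 0] += 2.0
    # 4. Run the modified z through the decoder to get a distribution over images.
    mu_x, logvar_x = dec(z).chunk(2, dim=1)
    # 5. Sample the edited image from that distribution (or just take its mean).
    x_edit = mu_x + torch.exp(0.5 * logvar_x) * torch.randn_like(mu_x)
```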
