Lecture 8
VAE
PixelRNN explicitly parameterizes the density function with a neural network, so we train it by maximizing the likelihood of the training data.
VAEs define an intractable density that we cannot explicitly compute or optimize; instead, we optimize a lower bound on the density.
Autoencoders (non-variational)
Unsupervised method for learning feature vectors from raw data x, without labels.
Features should extract useful information that we can use for downstream tasks.
Originally: Linear + nonlinearity (sigmoid)
Later: Deep, fully-connected
Later: ReLU CNN
Input data x -> Encoder -> Features z
We use the features to reconstruct the input data with a decoder.
Features z -> Decoder -> Reconstructed data x̂
Loss: L2 distance between the input and the reconstructed data.
Features need to be lower-dimensional than the data.
After training, throw away decoder and use encoder for a downstream task.
Autoencoders learn latent features for data without any labels.
Can use features to initialize a supervised model.
Not probabilistic: No way to sample new data from learned model.
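As a concrete illustration, here is a minimal PyTorch sketch of a fully-connected autoencoder trained with an L2 reconstruction loss; the layer sizes and data dimensions are arbitrary assumptions, not values from the lecture.

```python
# Minimal (non-variational) autoencoder sketch; sizes are illustrative assumptions.
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder maps raw data x to a lower-dimensional feature vector z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder reconstructs x_hat from z.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return x_hat

x = torch.randn(16, 784)           # fake batch of flattened inputs
model = Autoencoder()
x_hat = model(x)
loss = ((x_hat - x) ** 2).mean()   # L2 reconstruction loss
loss.backward()
```

After training, only the encoder would be kept for the downstream task, as noted above.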
Variational Autoencoders
A probabilistic spin on autoencoders: learn latent features z from raw data, and sample from the model to generate new data.
Assume the training data $\{x^{(i)}\}_{i=1}^{N}$ is generated from an unobserved (latent) representation z.
Intuition: x is an image, z are the latent factors used to generate x.
After training, sample new data like this:
Sample z from the prior $p_{\theta^*}(z)$. Assume a simple prior p(z).
Sample x from the conditional $p_{\theta^*}(x \mid z^{(i)})$. Represent $p(x \mid z)$ with a neural network.
The decoder must be probabilistic: it takes z as input and outputs a mean $\mu_{x \mid z}$ and covariance $\Sigma_{x \mid z}$.
Sample x from a Gaussian with that mean and covariance.
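Here is a minimal sketch of such a probabilistic decoder, assuming (a common choice, not stated above) a diagonal covariance parameterized by its log; the layer sizes and prior are illustrative assumptions.

```python
# Hypothetical probabilistic decoder p_theta(x|z): outputs the mean and the
# diagonal (log-)covariance of a Gaussian over x.
import torch
import torch.nn as nn

class GaussianDecoder(nn.Module):
    def __init__(self, latent_dim=32, data_dim=784, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(latent_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, data_dim)       # mean mu_{x|z}
        self.logvar_head = nn.Linear(hidden, data_dim)   # log of diagonal Sigma_{x|z}

    def forward(self, z):
        h = self.net(z)
        return self.mu_head(h), self.logvar_head(h)

decoder = GaussianDecoder()
z = torch.randn(1, 32)                                   # z from an assumed N(0, I) prior
mu, logvar = decoder(z)
x = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # sample x ~ N(mu, Sigma)
```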
How to train this model?
Maximize the likelihood of the data.
If we could observe z for each x, then we could train a conditional generative model $p(x \mid z)$.
We don't observe z, so we need to marginalize: $p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, dz$.
We can compute $p_\theta(x \mid z)$ with the decoder network, and we assumed a simple Gaussian prior $p_\theta(z)$, but it is impossible to integrate over all z.
We could also try Bayes' rule, $p_\theta(x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(z \mid x)}$, but we cannot compute $p_\theta(z \mid x)$.
Solution: train another network, the encoder, to learn $q_\Phi(z \mid x) \approx p_\theta(z \mid x)$.
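As a rough sketch (an assumed architecture that mirrors the decoder above), the encoder is just another network that maps x to the mean and diagonal (log-)covariance of a Gaussian over z:

```python
# Hypothetical amortized encoder q_phi(z|x): maps data x to the mean and
# diagonal (log-)covariance of a Gaussian over latent codes z.
import torch.nn as nn

class GaussianEncoder(nn.Module):
    def __init__(self, data_dim=784, latent_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(data_dim, hidden), nn.ReLU())
        self.mu_head = nn.Linear(hidden, latent_dim)      # mu_{z|x}
        self.logvar_head = nn.Linear(hidden, latent_dim)  # log of diagonal Sigma_{z|x}

    def forward(self, x):
        h = self.net(x)
        return self.mu_head(h), self.logvar_head(h)
```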
Then we can derive a lower bound on the data likelihood that we can actually compute and optimize:
$$\log p_\theta(x) = \log \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)} \qquad \text{(Bayes' rule)}$$

$$= \log \frac{p_\theta(x \mid z)\, p(z)\, q_\Phi(z \mid x)}{p_\theta(z \mid x)\, q_\Phi(z \mid x)} \qquad \text{(multiply top and bottom by } q_\Phi(z \mid x)\text{)}$$

$$= \log p_\theta(x \mid z) - \log \frac{q_\Phi(z \mid x)}{p(z)} + \log \frac{q_\Phi(z \mid x)}{p_\theta(z \mid x)} \qquad \text{(split using rules for logarithms)}$$

Since $\log p_\theta(x)$ does not depend on z, we can wrap it in an expectation over $z \sim q_\Phi(z \mid x)$:

$$\log p_\theta(x) = \mathbb{E}_{z \sim q_\Phi(z \mid x)}[\log p_\theta(x)]$$

$$= \mathbb{E}_z[\log p_\theta(x \mid z)] - \mathbb{E}_z\left[\log \frac{q_\Phi(z \mid x)}{p(z)}\right] + \mathbb{E}_z\left[\log \frac{q_\Phi(z \mid x)}{p_\theta(z \mid x)}\right]$$

$$= \mathbb{E}_{z \sim q_\Phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}\big(q_\Phi(z \mid x)\,\|\,p(z)\big) + D_{KL}\big(q_\Phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$$
Part 1: data reconstruction.
Part 2: KL divergence between the encoder distribution $q_\Phi(z \mid x)$ and the prior $p(z)$.
Part 3: KL divergence between the encoder distribution and the true posterior $p_\theta(z \mid x)$. KL divergence is $\ge 0$, so dropping this term gives a lower bound on the data likelihood:
$$\log p_\theta(x) \ge \mathbb{E}_{z \sim q_\Phi(z \mid x)}[\log p_\theta(x \mid z)] - D_{KL}\big(q_\Phi(z \mid x)\,\|\,p(z)\big)$$
Jointly train the encoder q and the decoder p to maximize this variational lower bound on the data likelihood, also called the Evidence Lower Bound (ELBO).
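For reference, under the common modeling choice (an assumption here, consistent with the "simple prior" above) of a diagonal Gaussian encoder $q_\Phi(z \mid x) = \mathcal{N}(\mu_{z \mid x}, \operatorname{diag}(\sigma_{z \mid x}^2))$ and a standard normal prior $p(z) = \mathcal{N}(0, I)$, the KL term has a closed form:

$$D_{KL}\big(q_\Phi(z \mid x)\,\|\,p(z)\big) = \frac{1}{2} \sum_{j} \left( \sigma_j^2 + \mu_j^2 - 1 - \log \sigma_j^2 \right)$$

so it can be computed exactly from the encoder outputs, while the reconstruction term is estimated by sampling z.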

VAE: Training
1. Run the input data x through the encoder to get a distribution over latent codes z.
2. The encoder output should match the prior p(z) (this is the KL term of the ELBO).
3. Sample a code z from the encoder output.
4. Run the sampled code through the decoder to get a distribution over data samples.
5. The original input data should be likely under the distribution output from (4); we can also sample a reconstruction from (4).
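Putting the pieces together, here is a hedged sketch of one training step that maximizes the ELBO, reusing the hypothetical GaussianEncoder and GaussianDecoder sketched above. Sampling z as mu + sigma * eps (the standard reparameterization trick, an implementation detail not covered above) keeps the sample differentiable with respect to the encoder; the step numbers in the comments refer to the list above.

```python
# One VAE training step (sketch): maximize the ELBO by minimizing its negative.
# GaussianEncoder / GaussianDecoder are the illustrative modules defined above.
import torch

encoder, decoder = GaussianEncoder(), GaussianDecoder()
params = list(encoder.parameters()) + list(decoder.parameters())
opt = torch.optim.Adam(params, lr=1e-3)

x = torch.randn(16, 784)                      # fake batch standing in for training data

# (1) Encoder gives a distribution q_phi(z|x) over latent codes.
mu_z, logvar_z = encoder(x)

# (3) Sample z from the encoder output (reparameterization trick).
z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)

# (4) Decoder gives a distribution p_theta(x|z) over data samples.
mu_x, logvar_x = decoder(z)

# (5) Reconstruction term: Gaussian log-likelihood of x (up to an additive constant).
recon = -0.5 * (logvar_x + (x - mu_x) ** 2 / torch.exp(logvar_x)).sum(dim=1).mean()

# (2) KL term: closed-form KL between q_phi(z|x) and the assumed prior N(0, I).
kl = 0.5 * (torch.exp(logvar_z) + mu_z ** 2 - 1 - logvar_z).sum(dim=1).mean()

loss = -(recon - kl)                          # negative ELBO
opt.zero_grad()
loss.backward()
opt.step()
```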
VAE: Generating Data
1. Sample z from the prior p(z).
2. Run the sampled z through the decoder to get a distribution over data x.
3. Sample from the distribution in (2) to generate new data.
A diagonal prior on p(z) causes the dimensions of z to be independent.
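A short sketch of this generation procedure, reusing the hypothetical decoder from the training sketch above (untrained here; this only shows the mechanics, assuming a standard normal prior):

```python
# Generate new data with a trained VAE (sketch): sample z from the prior,
# decode, then sample x from the decoder's Gaussian.
import torch

with torch.no_grad():
    z = torch.randn(1, 32)                        # (1) z ~ p(z) = N(0, I), assumed prior
    mu_x, logvar_x = decoder(z)                   # (2) distribution over data x
    x_new = mu_x + torch.exp(0.5 * logvar_x) * torch.randn_like(mu_x)  # (3) sample x
```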
VAE: Editing Images
1. Run the input image through the encoder to get a distribution over latent codes.
2. Sample a code z from the encoder output.
3. Modify some dimensions of the sampled code.
4. Run the modified z through the decoder to get a distribution over data samples.
5. Sample new data from (4).
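A sketch of the editing procedure with the same hypothetical modules; the edited dimension and offset are arbitrary assumptions, and the step numbers refer to the list above.

```python
# Edit an image with a trained VAE (sketch): encode, perturb one latent
# dimension, decode, and sample the edited image.
import torch

with torch.no_grad():
    x = torch.randn(1, 784)                       # (1) an input image (fake here)
    mu_z, logvar_z = encoder(x)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)   # (2) sample z
    z[:, 0] += 2.0                                # (3) modify one dimension of z
    mu_x, logvar_x = decoder(z)                   # (4) distribution over data samples
    x_edited = mu_x + torch.exp(0.5 * logvar_x) * torch.randn_like(mu_x)  # (5) sample
```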
