Lecture 9

GANs

Give up on modeling $p(x)$ explicitly, but allow drawing samples from $p(x)$.

Assume we have data $x_i$ drawn from a distribution $p_{data}(x)$; we want to sample from $p_{data}$.

Idea: Introduce a latent variable $z$ with a simple prior $p(z)$. Sample $z \sim p(z)$ and pass it through a generator network $x = G(z)$. Then $x$ is a sample from the generator distribution $p_G$. We want $p_G = p_{data}$.
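A minimal sketch of this sampling procedure in PyTorch; the generator architecture, latent dimension, and image size below are illustrative assumptions, not part of the lecture:

```python
# Sample z ~ p(z) from a simple prior and map it through a generator network.
# The architecture below is a hypothetical fully connected generator.
import torch
import torch.nn as nn

latent_dim = 128

G = nn.Sequential(
    nn.Linear(latent_dim, 256),
    nn.ReLU(),
    nn.Linear(256, 28 * 28),
    nn.Tanh(),  # outputs in [-1, 1], matching images scaled to that range
)

z = torch.randn(16, latent_dim)  # z ~ p(z), here a standard Gaussian prior
x = G(z)                         # x = G(z): 16 samples from p_G
print(x.shape)                   # torch.Size([16, 784])
```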


Training Objective

Objective: $\min_G \max_D \big( E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)}[\log(1 - D(G(z)))] \big) = \min_G \max_D V(G,D)$

Train using alternating gradient updates (a code sketch follows the loop below).

For t in 1, ..., T:

  1. Update D (gradient ascent on V): $D = D + \alpha_D \frac{\partial V}{\partial D}$

  2. Update G (gradient descent on V): $G = G - \alpha_G \frac{\partial V}{\partial G}$
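A sketch of one such alternating update in PyTorch, assuming hypothetical fully connected networks `G` and `D` (sigmoid output) and illustrative hyperparameters; Adam is used here in place of the plain gradient steps above:

```python
# One alternating update: ascend V(G, D) in D, then descend it in G.
import torch
import torch.nn as nn

latent_dim = 128
G = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)

def train_step(x_real):
    batch = x_real.shape[0]

    # 1. Update D: maximize log D(x) + log(1 - D(G(z)))
    #    (implemented as minimizing the negative).
    z = torch.randn(batch, latent_dim)
    x_fake = G(z).detach()  # do not backprop into G on the D step
    loss_D = -(torch.log(D(x_real)).mean() + torch.log(1 - D(x_fake)).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # 2. Update G: minimize log(1 - D(G(z))) (the original minimax loss).
    z = torch.randn(batch, latent_dim)
    loss_G = torch.log(1 - D(G(z))).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()

# Usage: call train_step(x_real) for each minibatch of real images,
# with x_real flattened to shape (batch, 784).
```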

At the start of training, the generator is very bad and the discriminator can easily tell real from fake, so $D(G(z))$ is close to 0.

Problem: vanishing gradient for G.

Solution: $G$ is trained to minimize $\log(1 - D(G(z)))$. Instead, train $G$ to minimize $-\log(D(G(z)))$. Then $G$ gets a strong gradient at the start of training.
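A small numerical illustration (treating $D(G(z))$ as a free variable rather than backpropagating through actual networks): the slope of $\log(1 - D(G(z)))$ stays small when $D(G(z)) \approx 0$, while the slope of $-\log D(G(z))$ is large there:

```python
# Compare gradients of the two generator losses with respect to D(G(z))
# when the discriminator confidently rejects fakes (D(G(z)) near 0).
import torch

d_fake = torch.tensor([1e-4, 0.01, 0.5], requires_grad=True)  # D(G(z)) values

# Original (saturating) loss: log(1 - D(G(z)))
grad_sat = torch.autograd.grad(torch.log(1 - d_fake).sum(), d_fake)[0]

# Non-saturating loss: -log(D(G(z)))
d_fake2 = d_fake.detach().requires_grad_(True)
grad_ns = torch.autograd.grad(-torch.log(d_fake2).sum(), d_fake2)[0]

print(grad_sat)  # ~[-1.0, -1.01, -2.0]: nearly flat when D(G(z)) ~ 0
print(grad_ns)   # ~[-10000, -100, -2.0]: strong gradient when D(G(z)) ~ 0
```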

Claim: the minimax objective achieves its global minimum when $p_G = p_{data}$. To see why, rewrite the objective:

$\min_G \max_D \big( E_{x \sim p_{data}}[\log D(x)] + E_{z \sim p(z)} [\log (1-D(G(z)))] \big)$

$= \min_G \max_D \big( E_{x\sim p_{data}}[\log D(x)] + E_{x \sim p_G} [\log (1-D(x))] \big)$ (change of variables on the second term)

$= \min_G \max_D \int_x \big( p_{data}(x) \log D(x) + p_G(x) \log (1-D(x)) \big)\,dx$ (definition of expectation)

$= \min_G \int_x \max_D \big( p_{data}(x) \log D(x) + p_G(x) \log (1-D(x)) \big)\,dx$ (push $\max_D$ inside the integral; the maximization can be done pointwise in $x$)

Side computation to find the max: let $f(y) = a\log y + b \log (1-y)$ with $a = p_{data}(x)$, $b = p_G(x)$, and $y = D(x)$.

$f^\prime(y) = \frac{a}{y} - \frac{b}{1-y}$, and $f^\prime(y) = 0 \iff y = \frac{a}{a+b}$ (a maximum, since $f^{\prime\prime}(y) = -\frac{a}{y^2} - \frac{b}{(1-y)^2} < 0$).

Optimal discriminator: $D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$

$= \min_G \int_x \big( p_{data}(x) \log D_G^*(x) + p_G(x) \log (1-D_G^*(x)) \big)\,dx$ (plug in the optimal discriminator)

$= \min_G \int_x \big( p_{data}(x) \log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)} + p_G(x) \log \frac{p_G(x)}{p_{data}(x) + p_G(x)} \big)\,dx$

$= \min_G \big( E_{x \sim p_{data}} [\log \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}] + E_{x\sim p_G} [\log \frac{p_G(x)}{p_{data}(x) + p_G(x)}] \big)$ (definition of expectation)

$= \min_G \big( E_{x \sim p_{data}} [\log \frac{2}{2} \cdot \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}] + E_{x\sim p_G} [\log \frac{2}{2} \cdot \frac{p_G(x)}{p_{data}(x) + p_G(x)}] \big)$ (multiply and divide by 2)

$= \min_G \big( E_{x \sim p_{data}} [\log \frac{2 \cdot p_{data}(x)}{p_{data}(x) + p_G(x)}] + E_{x\sim p_G} [\log \frac{2 \cdot p_G(x)}{p_{data}(x) + p_G(x)}] - \log 4 \big)$ (pull the constant $\frac{1}{2}$ out of each log: $\log\frac{1}{2} + \log\frac{1}{2} = -\log 4$)

KL divergence: $KL(p,q) = E_{x \sim p} [\log \frac{p(x)}{q(x)}]$

$= \min_G \big( KL(p_{data}, \frac{p_{data} + p_G}{2}) + KL(p_G, \frac{p_{data} + p_G}{2}) - \log 4 \big)$

Jensen-Shannon divergence: $JSD(p,q) = \frac{1}{2}KL(p, \frac{p+q}{2}) + \frac{1}{2} KL(q, \frac{p+q}{2})$

$= \min_G \big( 2 \cdot JSD(p_{data}, p_G) - \log 4 \big)$

JSD is always nonnegative, and zero only when the two distributions are equal, so $p_G = p_{data}$ achieves the global minimum (with value $-\log 4$).
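A quick numerical sanity check of these two properties for discrete distributions (the helper functions below are ad hoc, not from the lecture):

```python
# Verify: JSD(p, q) >= 0, and JSD(p, q) = 0 exactly when p = q.
import numpy as np

def kl(p, q):
    # KL(p, q) = E_{x~p}[log p(x)/q(x)]; terms with p(x) = 0 contribute 0.
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def jsd(p, q):
    # JSD(p, q) = 1/2 KL(p, m) + 1/2 KL(q, m) with m = (p + q) / 2.
    m = (np.asarray(p, dtype=float) + np.asarray(q, dtype=float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]
print(jsd(p, q))  # positive (~0.066 nats) for different distributions
print(jsd(p, p))  # 0.0 when the distributions are equal
```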

Summary: the global maximum (over $D$) and minimum (over $G$) are achieved when:

  1. $D_G^*(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$ (optimal discriminator for any $G$)

  2. $p_G(x) = p_{data}(x)$ (optimal generator for the optimal $D$)

Caveats:

  1. $G$ and $D$ are neural nets with fixed architectures; we don't know whether they can actually represent the optimal $D$ and $G$.

  2. This says nothing about whether training converges to the optimal solution.

Conditional GANs

Learn $p(x|y)$ instead of $p(x)$. Make the generator and discriminator both take the label $y$ as an additional input (see the sketch below).
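A minimal sketch of one common way to implement this (the embedding-and-concatenate mechanism is an assumption, not specified in the notes):

```python
# Condition both networks on the class label y via a learned embedding.
import torch
import torch.nn as nn

latent_dim, num_classes, embed_dim = 128, 10, 32

label_embed = nn.Embedding(num_classes, embed_dim)
G = nn.Sequential(nn.Linear(latent_dim + embed_dim, 256), nn.ReLU(),
                  nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784 + embed_dim, 256), nn.LeakyReLU(0.2),
                  nn.Linear(256, 1), nn.Sigmoid())

z = torch.randn(16, latent_dim)
y = torch.randint(0, num_classes, (16,))   # class labels to condition on
e = label_embed(y)

x_fake = G(torch.cat([z, e], dim=1))       # sample from p_G(x | y)
score = D(torch.cat([x_fake, e], dim=1))   # D also sees the label y
print(x_fake.shape, score.shape)           # (16, 784) and (16, 1)
```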

Batch Normalization
