GANs
Give up on modeling $p(x)$ explicitly, but still allow drawing samples from $p(x)$.
Assume we have data $x_i$ drawn from a distribution $p_{data}(x)$; we want to sample from $p_{data}$.
Idea: Introduce a latent variable $z$ with a simple prior $p(z)$. Sample $z \sim p(z)$ and pass it through a generator network $x = G(z)$. Then $x$ is a sample from the generator distribution $p_G$. We want $p_G = p_{data}$.
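A minimal sketch of this sampling step, with hypothetical dimensions and a toy MLP generator (PyTorch assumed; purely illustrative):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
latent_dim, data_dim = 64, 784

# Toy MLP generator: maps z ~ p(z) to a sample x = G(z).
G = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

z = torch.randn(16, latent_dim)   # z ~ p(z) = N(0, I), the simple prior
x = G(z)                          # x is a sample from p_G
print(x.shape)                    # torch.Size([16, 784])
```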
Training Objective
The objective is the minimax game $\min_G \max_D V(G,D)$, where $V(G,D) = \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p(z)}[\log(1 - D(G(z)))]$.
Train using alternating gradient updates:
For $t = 1, \dots, T$:
Update $D$ (gradient ascent): $D = D + \alpha_D \frac{\partial V}{\partial D}$
Update $G$ (gradient descent): $G = G - \alpha_G \frac{\partial V}{\partial G}$
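A minimal sketch of this alternating loop under assumed toy dimensions, with a shifted Gaussian standing in for real data (PyTorch assumed; an illustration of the scheme, not a reference implementation):

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions; real data loading is omitted.
latent_dim, data_dim, batch, T = 64, 2, 128, 1000

G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
D = nn.Sequential(nn.Linear(data_dim, 64), nn.LeakyReLU(0.2),
                  nn.Linear(64, 1), nn.Sigmoid())   # D(x) in (0, 1)

opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
eps = 1e-7  # numerical safety inside the logs

for t in range(T):
    x_real = torch.randn(batch, data_dim) + 3.0   # stand-in for a minibatch from p_data
    z = torch.randn(batch, latent_dim)            # z ~ p(z) = N(0, I)
    x_fake = G(z)

    # D step: gradient ascent on V(G, D) = E[log D(x)] + E[log(1 - D(G(z)))]
    V = torch.log(D(x_real) + eps).mean() + torch.log(1 - D(x_fake.detach()) + eps).mean()
    opt_D.zero_grad(); (-V).backward(); opt_D.step()   # maximize V = minimize -V

    # G step: gradient descent on E[log(1 - D(G(z)))]
    loss_G = torch.log(1 - D(x_fake) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
```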
At the start of training the generator is very bad, so the discriminator can easily tell real from fake and $D(G(z))$ is close to 0.
Problem: vanishing gradient for $G$.
Solution: $G$ is originally trained to minimize $\log(1 - D(G(z)))$. Instead, train $G$ to minimize $-\log(D(G(z)))$ (the non-saturating loss). Then $G$ gets a strong gradient at the start of training, as the check below illustrates.
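A quick numerical check of the two generator losses' gradients with respect to the discriminator logit (toy numbers, not from the source):

```python
import torch

# Early in training D(G(z)) is near 0, i.e. the discriminator logit s is very negative.
s = torch.tensor(-5.0, requires_grad=True)  # logit of D(G(z)); sigmoid(-5) ~ 0.0067
d = torch.sigmoid(s)                        # D(G(z))

saturating = torch.log(1 - d)      # original generator loss: log(1 - D(G(z)))
non_saturating = -torch.log(d)     # modified loss: -log D(G(z))

g_sat, = torch.autograd.grad(saturating, s, retain_graph=True)
g_non, = torch.autograd.grad(non_saturating, s)
print(g_sat.item())   # ~ -0.0067  (vanishing gradient)
print(g_non.item())   # ~ -0.9933  (strong gradient)
```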
Claim: the minimax objective attains its global minimum exactly when $p_G = p_{data}$. Derivation:
$\min_G \max_D \left( \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{z\sim p(z)}[\log(1 - D(G(z)))] \right)$
$= \min_G \max_D \left( \mathbb{E}_{x\sim p_{data}}[\log D(x)] + \mathbb{E}_{x\sim p_G}[\log(1 - D(x))] \right)$ (change of variables on the second term)
$= \min_G \max_D \int_x \left( p_{data}(x)\log D(x) + p_G(x)\log(1 - D(x)) \right) dx$ (definition of expectation)
$= \min_G \int_x \max_D \left( p_{data}(x)\log D(x) + p_G(x)\log(1 - D(x)) \right) dx$ (push $\max_D$ inside the integral; the integrand can be maximized pointwise since $D(x)$ can take an independent value at each $x$)
Side computation for the inner max: let $f(y) = a\log y + b\log(1 - y)$ with $a = p_{data}(x)$, $b = p_G(x)$, $y = D(x)$.
$f'(y) = \frac{a}{y} - \frac{b}{1 - y}$, and $f'(y) = 0 \iff y = \frac{a}{a + b}$ (local max).
Optimal discriminator: $D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x) + p_G(x)}$
$= \min_G \int_x \left( p_{data}(x)\log D^*_G(x) + p_G(x)\log(1 - D^*_G(x)) \right) dx$
$= \min_G \int_x \left( p_{data}(x)\log \frac{p_{data}(x)}{p_{data}(x)+p_G(x)} + p_G(x)\log \frac{p_G(x)}{p_{data}(x)+p_G(x)} \right) dx$
$= \min_G \left( \mathbb{E}_{x\sim p_{data}}\left[\log \frac{p_{data}(x)}{p_{data}(x)+p_G(x)}\right] + \mathbb{E}_{x\sim p_G}\left[\log \frac{p_G(x)}{p_{data}(x)+p_G(x)}\right] \right)$ (definition of expectation)
$= \min_G \left( \mathbb{E}_{x\sim p_{data}}\left[\log \frac{2}{2}\cdot\frac{p_{data}(x)}{p_{data}(x)+p_G(x)}\right] + \mathbb{E}_{x\sim p_G}\left[\log \frac{2}{2}\cdot\frac{p_G(x)}{p_{data}(x)+p_G(x)}\right] \right)$ (multiply by a constant $\frac{2}{2}$)
$= \min_G \left( \mathbb{E}_{x\sim p_{data}}\left[\log \frac{2\cdot p_{data}(x)}{p_{data}(x)+p_G(x)}\right] + \mathbb{E}_{x\sim p_G}\left[\log \frac{2\cdot p_G(x)}{p_{data}(x)+p_G(x)}\right] - \log 4 \right)$
KL divergence: $KL(p, q) = \mathbb{E}_{x\sim p}\left[\log \frac{p(x)}{q(x)}\right]$
$= \min_G \left( KL\left(p_{data}, \frac{p_{data}+p_G}{2}\right) + KL\left(p_G, \frac{p_{data}+p_G}{2}\right) - \log 4 \right)$
Jensen-Shannon divergence: $JSD(p, q) = \frac{1}{2}KL\left(p, \frac{p+q}{2}\right) + \frac{1}{2}KL\left(q, \frac{p+q}{2}\right)$
$= \min_G \left( 2\cdot JSD(p_{data}, p_G) - \log 4 \right)$
JSD is always nonnegative, and zero only when the two distributions are equal; thus $p_G = p_{data}$ gives the global minimum, with value $-\log 4$.
Summary: the global min and max happen when:
$D^*_G(x) = \frac{p_{data}(x)}{p_{data}(x)+p_G(x)}$ (optimal discriminator for any $G$)
$p_G(x) = p_{data}(x)$ (optimal generator for the optimal $D$)
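A small numerical sanity check of these results on toy discrete distributions (the specific numbers are made up for illustration; NumPy assumed):

```python
import numpy as np

# Toy discrete distributions over 4 outcomes (hypothetical numbers).
p_data = np.array([0.1, 0.4, 0.3, 0.2])
p_g    = np.array([0.25, 0.25, 0.25, 0.25])

def kl(p, q):
    """KL(p, q) = E_{x~p}[log p(x)/q(x)] for discrete p, q."""
    return np.sum(p * np.log(p / q))

def jsd(p, q):
    """JSD(p, q) = 0.5*KL(p, m) + 0.5*KL(q, m), where m = (p+q)/2."""
    m = (p + q) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Optimal discriminator for this fixed G: D*(x) = p_data(x) / (p_data(x) + p_G(x))
d_star = p_data / (p_data + p_g)

# The inner objective evaluated at D* should equal 2*JSD(p_data, p_G) - log 4.
v_at_d_star = np.sum(p_data * np.log(d_star) + p_g * np.log(1 - d_star))
print(np.allclose(v_at_d_star, 2 * jsd(p_data, p_g) - np.log(4)))       # True

# At the global optimum p_G = p_data: D*(x) = 1/2 everywhere, value is -log 4.
print(p_data / (p_data + p_data))                                       # [0.5 0.5 0.5 0.5]
print(np.isclose(2 * jsd(p_data, p_data) - np.log(4), -np.log(4)))      # True
```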
Caveats:
$G$ and $D$ are neural nets with fixed architectures; we don't know whether they can actually represent the optimal $D$ and $G$.
The argument says nothing about whether alternating gradient updates actually converge to the optimal solution.
Conditional GANs
Learn $p(x \mid y)$ instead of $p(x)$. Make the generator and discriminator both take the label $y$ as an additional input, e.g. as sketched below.
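One common way to feed the label in is to concatenate a learned label embedding to the inputs of both networks. A minimal sketch under assumed dimensions (an illustrative design choice, not the only formulation):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
latent_dim, data_dim, n_classes, emb_dim = 64, 784, 10, 16

class CondGenerator(nn.Module):
    """Generator x = G(z, y): concatenates z with an embedding of the label y."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(nn.Linear(latent_dim + emb_dim, 256), nn.ReLU(),
                                 nn.Linear(256, data_dim), nn.Tanh())
    def forward(self, z, y):
        return self.net(torch.cat([z, self.embed(y)], dim=1))

class CondDiscriminator(nn.Module):
    """Discriminator D(x, y): also sees the label it should judge x against."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(n_classes, emb_dim)
        self.net = nn.Sequential(nn.Linear(data_dim + emb_dim, 256), nn.LeakyReLU(0.2),
                                 nn.Linear(256, 1), nn.Sigmoid())
    def forward(self, x, y):
        return self.net(torch.cat([x, self.embed(y)], dim=1))

G, D = CondGenerator(), CondDiscriminator()
z = torch.randn(8, latent_dim)
y = torch.randint(0, n_classes, (8,))
x = G(z, y)          # sample conditioned on label y
score = D(x, y)      # discriminator also conditions on y
```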