Conditional Generation
Generating images, sound, or text conditioned on latent or observable attributes, e.g. sketches, speaker style, music type, instrument.
Pix2Pix
Architecture of a Conditional GAN (image from Isola, Zhu, Zhou, Efros 2016):
data:image/s3,"s3://crabby-images/97918/979181993001d9f6d4b0f694cc32697e7459c155" alt=""
Example of building translation (image from Christopher Hesse):
data:image/s3,"s3://crabby-images/ba864/ba86490c4013d36f07d0d949ac55e3fe92d3b358" alt=""
Hesse has written a very good tutorial. His illustration of the Pix2Pix architecture:
data:image/s3,"s3://crabby-images/8369e/8369e28be927403baa8e1c2a1ce0f2ba0b726e9a" alt=""
Key Ideas:
- Generator is encoder-decoder
- BatchNorm, ReLU
- Skip connections between encoder and decoder
- Discriminator stacks Input/Output on channel axis
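A minimal sketch of the two concatenations listed above, using plain nested lists in `[channel][row][col]` layout to stay dependency-free. The helper names (`zeros`, `concat_channels`, `shape`) and all sizes are illustrative assumptions, not taken from the Pix2Pix code.

```python
def zeros(channels, height, width):
    """Build a [channel][row][col] tensor filled with 0.0."""
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]


def shape(t):
    """Return (channels, height, width) of a nested-list tensor."""
    return (len(t), len(t[0]), len(t[0][0]))


def concat_channels(a, b):
    """Stack two tensors along the channel axis; spatial sizes must match."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("spatial sizes must match")
    return a + b  # list concatenation acts on the outermost (channel) axis


# Discriminator input: conditioning image and candidate output, stacked.
sketch = zeros(1, 4, 4)  # e.g. a 1-channel edge map (assumed size)
photo = zeros(3, 4, 4)   # real or generated 3-channel image
disc_input = concat_channels(sketch, photo)

# Skip connection: decoder features joined with the encoder features
# of the same spatial resolution.
enc_feat = zeros(8, 2, 2)
dec_feat = zeros(8, 2, 2)
skip = concat_channels(dec_feat, enc_feat)
```

With these assumed sizes the discriminator sees a 4-channel tensor, and the layer after the skip connection sees 16 channels instead of 8.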
Conditional PixelCNN, Gated PixelCNN
Conditional PixelCNNs, also called Gated PixelCNNs, build on PixelCNNs, which were introduced in the PixelRNN paper (Oord, Kalchbrenner, Kavukcuoglu (2016), Pixel Recurrent Neural Networks).
Reminder on PixelRNN
data:image/s3,"s3://crabby-images/25a9c/25a9c019b96c1db372b3ca9e348e74c31a849cfe" alt=""
- Row LSTM: condition each row on above row, using 1D convolution -> triangular receptive field
- Diagonal BiLSTM: using a skew trick for parallelization, each pixel depends on a 45°-rotated half-space
- PixelCNN: same pixel ordering, but the recurrent layers are replaced by masked convolutions, which gives a bounded receptive field.
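To make the masked convolution concrete, here is a small sketch of how such a mask could be built for a k×k kernel. It assumes the raster-scan convention of the PixelRNN paper: every row above the centre is visible (which also includes the upper-right neighbour), and in the centre row only positions to the left are visible. The `include_center` flag is a hypothetical name for the distinction the paper calls mask A (centre hidden, first layer) versus mask B (centre visible, later layers).

```python
def causal_mask(k, include_center):
    """Return a k x k 0/1 mask that hides 'future' pixels in raster order."""
    c = k // 2  # index of the centre row/column
    mask = [[0] * k for _ in range(k)]
    for r in range(k):
        for col in range(k):
            above = r < c                      # any row above the centre
            left = r == c and col < c          # same row, strictly left
            center = include_center and r == c and col == c
            if above or left or center:
                mask[r][col] = 1
    return mask


def apply_mask(kernel, mask):
    """Zero out the kernel weights that would look at future pixels."""
    return [[w * m for w, m in zip(w_row, m_row)]
            for w_row, m_row in zip(kernel, mask)]


mask_a = causal_mask(3, include_center=False)
mask_b = causal_mask(3, include_center=True)
masked = apply_mask([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]], mask_a)
```

The mask is applied to the weights before every convolution, so the masked weights never contribute to the output or receive gradient.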
The generative process for PixelCNN is as follows:
- For $i = 1..N$, for $j = 1..M$ (sample pixel $x_{i,j}$):
  - For $l = 1..L$ (increasing layers):
    - Using masked convolution, convolve the upper, left, and upper-left pixels of layer $l-1$ to get the activations of layer $l$.
  - At the last layer $L$, which combines information from the whole effective receptive field, compute the distribution $p(x_{i,j} \mid x_{<(i,j)})$. Sample from it to generate pixel $x_{i,j}$.
  - Other pixels to the bottom and right of $x_{i,j}$ now have their dependencies satisfied and can be sampled as well.
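The loop above can be sketched with a toy stand-in for the network. `toy_conditional` is a hypothetical placeholder (not a PixelCNN): it returns the probability that a binary pixel is 1 given its upper, left, and upper-left neighbours. The point of the sketch is only the control flow, namely one model evaluation per sampled pixel.

```python
import random


def toy_conditional(image, i, j):
    """Hypothetical stand-in model: P(pixel = 1 | already generated neighbours)."""
    context = []
    if i > 0:
        context.append(image[i - 1][j])      # upper
    if j > 0:
        context.append(image[i][j - 1])      # left
    if i > 0 and j > 0:
        context.append(image[i - 1][j - 1])  # upper-left
    if not context:
        return 0.5                           # first pixel: no context yet
    return 0.1 + 0.8 * sum(context) / len(context)


def sample_image(n, m, rng):
    """Sample an n x m binary image pixel by pixel in raster order."""
    image = [[0] * m for _ in range(n)]
    model_calls = 0
    for i in range(n):
        for j in range(m):
            p = toy_conditional(image, i, j)  # one "forward pass" per pixel
            model_calls += 1
            image[i][j] = 1 if rng.random() < p else 0
    return image, model_calls


image, model_calls = sample_image(4, 5, random.Random(0))
```

For an N×M image the sampler makes N·M model evaluations, which is what makes naive generation slow.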
This generative process is very slow when implemented naively because a full forward pass is required just to sample a single pixel.
However, training and validation are fully parallel because teacher forcing is used: the ground-truth pixels, rather than the generated ones, are used to compute the activations. A single forward pass then provides the training signal for all pixels at once.
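A small sketch of why teacher forcing removes the sequential dependency, again with a hypothetical toy model in place of the network. Every per-pixel loss term reads only the ground-truth image, so the terms can be evaluated in any order (or all at once on parallel hardware).

```python
import math


def toy_conditional(truth, i, j):
    """Hypothetical stand-in model: P(pixel = 1 | ground-truth neighbours)."""
    context = []
    if i > 0:
        context.append(truth[i - 1][j])
    if j > 0:
        context.append(truth[i][j - 1])
    if i > 0 and j > 0:
        context.append(truth[i - 1][j - 1])
    if not context:
        return 0.5
    return 0.1 + 0.8 * sum(context) / len(context)


def pixel_nll(truth, i, j):
    """Negative log-likelihood of one ground-truth pixel."""
    p = toy_conditional(truth, i, j)
    return -math.log(p if truth[i][j] == 1 else 1.0 - p)


def teacher_forced_nll(truth, positions):
    """Sum the per-pixel losses, visiting the pixels in the given order."""
    return sum(pixel_nll(truth, i, j) for i, j in positions)


truth = [[1, 1], [1, 1]]
raster = [(i, j) for i in range(2) for j in range(2)]
loss_forward = teacher_forced_nll(truth, raster)
loss_reversed = teacher_forced_nll(truth, list(reversed(raster)))
```

Both orders give the same loss, which would not be possible if each term depended on previously sampled pixels.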
Improvements in Gated PixelCNN
Ideas:
- Replace ReLU with gated activation unit
data:image/s3,"s3://crabby-images/c7eeb/c7eeb55b7f464914ca73ac34d193a7c3f4ca6444" alt=""
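As a sketch, the gated activation unit multiplies a tanh "feature" branch by a sigmoid "gate" branch, y = tanh(f) ⊙ σ(g), where f and g are the outputs of two separate masked convolutions. The function below assumes those two pre-activations are already computed and given as flat lists; the names `f_pre` and `g_pre` are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function, used as the gate."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_activation(f_pre, g_pre):
    """Element-wise tanh(f) * sigmoid(g) over two equally long lists."""
    if len(f_pre) != len(g_pre):
        raise ValueError("feature and gate branches must have the same size")
    return [math.tanh(f) * sigmoid(g) for f, g in zip(f_pre, g_pre)]


closed = gated_activation([3.0], [-50.0])  # gate near 0: output suppressed
opened = gated_activation([3.0], [50.0])   # gate near 1: tanh passes through
```

Unlike ReLU, the gate lets the layer scale each feature smoothly between fully blocked and fully passed.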
Applications:
- Image completion
- Image interpolation
data:image/s3,"s3://crabby-images/d7688/d76888e0baeec41f6288a1ca9ef0288be921cc39" alt=""
- Class-conditional sampling
data:image/s3,"s3://crabby-images/09780/0978004e2a2eaba664aaa66417b0b6a41eef0307" alt=""
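Class conditioning can be sketched as a class-dependent bias added to both branches of the gated unit, y = tanh(f + V_f^T h) ⊙ σ(g + V_g^T h), with h a one-hot class vector. The example below works on the feature vector of a single pixel; `V_f`, `V_g`, and all numeric values are made-up illustrations, not learned parameters.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def class_bias(h, V):
    """Compute V^T h for a class vector h (length C) and V of shape C x K."""
    return [sum(h[c] * V[c][k] for c in range(len(h)))
            for k in range(len(V[0]))]


def conditional_gated_activation(f_pre, g_pre, h, V_f, V_g):
    """Gated unit whose two branches are shifted by a class-dependent bias."""
    b_f = class_bias(h, V_f)
    b_g = class_bias(h, V_g)
    return [math.tanh(f + bf) * sigmoid(g + bg)
            for f, g, bf, bg in zip(f_pre, g_pre, b_f, b_g)]


V_f = [[0.0, 0.0], [1.0, -1.0]]  # one row per class, one column per feature
V_g = [[0.0, 0.0], [2.0, 2.0]]
f_pre, g_pre = [0.5, -0.5], [0.0, 0.0]

y_class0 = conditional_gated_activation(f_pre, g_pre, [1, 0], V_f, V_g)
y_class1 = conditional_gated_activation(f_pre, g_pre, [0, 1], V_f, V_g)
```

The same bias is added at every pixel location, so the class tells the model what to draw but not where.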
Pixel VAE