Lecture 6
Attention and Transformers
Replace Convolution with "Local Attention"
Convolution: the output at each position is the inner product of the conv kernel with its receptive field in the input.
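A minimal sketch of this view for the 1-D case (NumPy; the array names and sizes are illustrative, not from the lecture):

```python
import numpy as np

def conv1d(x, kernel):
    """1-D convolution (no padding, cross-correlation convention as in most
    deep learning libraries): each output is the inner product of the kernel
    with the receptive field it covers in the input."""
    k = len(kernel)
    return np.array([
        np.dot(kernel, x[i:i + k])       # inner product over one local receptive field
        for i in range(len(x) - k + 1)
    ])

x = np.arange(8, dtype=float)            # toy input sequence
w = np.array([1.0, 0.0, -1.0])           # toy conv kernel
print(conv1d(x, w))                      # each output depends only on a local window
```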


RNN
Works on ordered sequences. Good at long sequences: after one RNN layer, the final hidden state sees the whole sequence. Not parallelizable: hidden states must be computed sequentially.
Key idea: an RNN has an internal state that is updated as the sequence is processed.
We process a sequence of vectors by applying a recurrence formula at every time step:

h_t = f_W(h_{t-1}, x_t)

where h_t is the new state, f_W is some function with parameters W, h_{t-1} is the old state, and x_t is the input at time step t.
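A minimal sketch of this recurrence as a vanilla RNN (NumPy; the weight names and sizes are illustrative):

```python
import numpy as np

def rnn_step(h_prev, x_t, Wh, Wx, b):
    """One application of h_t = f_W(h_{t-1}, x_t).
    For a vanilla RNN, f_W is a tanh of a linear function of the old state and the input."""
    return np.tanh(Wh @ h_prev + Wx @ x_t + b)

rng = np.random.default_rng(0)
H, D = 4, 3                              # hidden size, input size (illustrative)
Wh, Wx, b = rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H)

h = np.zeros(H)                          # initial state
for x_t in rng.normal(size=(5, D)):      # process a sequence of 5 input vectors
    h = rnn_step(h, x_t, Wh, Wx, b)      # same f_W and the same weights at every time step
```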
RNN graph:




Sequence to Sequence = (Many to one) + (One to many)
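A minimal sketch of that decomposition with vanilla RNNs (NumPy; all names, sizes, and the toy output projection are illustrative assumptions, not the lecture's exact model):

```python
import numpy as np

def rnn_step(h, x, Wh, Wx, b):
    """Vanilla RNN recurrence, reused for both encoder and decoder."""
    return np.tanh(Wh @ h + Wx @ x + b)

rng = np.random.default_rng(0)
H, D = 4, 3                               # hidden size, vector size (illustrative)
enc = (rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H))
dec = (rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H))
Wy  = rng.normal(size=(D, H))             # toy output projection

# Many to one: the encoder compresses the whole input sequence into one context vector.
h = np.zeros(H)
for x_t in rng.normal(size=(6, D)):       # input sequence of length 6
    h = rnn_step(h, x_t, *enc)

# One to many: the decoder unrolls that single context vector into an output sequence.
s, y = h, np.zeros(D)
outputs = []
for _ in range(4):                        # produce 4 output vectors
    s = rnn_step(s, y, *dec)
    y = Wy @ s                            # feed each output back in as the next input
    outputs.append(y)
```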

Convolution on input.
Works on multidimensional grids. Bad at long sequences: need to stack many conv layers for outputs to see the whole sequence. Highly parallel: each output can be computed in parallel.
Self attention
Works on sets of vectors. Good at long sequences: after one attention layer, each output sees all inputs. Highly parallel: each output can be computed in parallel. Very memory intensive.
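A minimal sketch of one self-attention layer (NumPy; single head, scaled dot-product, no masking; all names and sizes are illustrative):

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a set of vectors X.
    X: (N, D) inputs; returns (N, D) outputs, each a weighted sum over all inputs."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[1])      # (N, N): similarity of every pair of positions
    scores -= scores.max(axis=1, keepdims=True) # numerically stable softmax
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)           # attention weights; the (N, N) matrix is
                                                # why self-attention is memory intensive
    return A @ V                                # each output sees every input

rng = np.random.default_rng(0)
N, D = 6, 8                                     # sequence length, feature size (illustrative)
X = rng.normal(size=(N, D))
Y = self_attention(X, *(rng.normal(size=(D, D)) for _ in range(3)))
print(Y.shape)                                  # (6, 8)
```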
Transformer
Self-attention is where all vectors interact with each other; an MLP is then applied independently to each vector, with a residual connection around each part.

A transformer is a sequence of transformer blocks.
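A minimal sketch of one block and a stack of them (NumPy; layer normalization and multi-head attention are omitted for brevity; all parameter names and sizes are illustrative):

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    return softmax(Q @ K.T / np.sqrt(K.shape[1])) @ V   # the only place vectors interact

def transformer_block(X, params):
    Wq, Wk, Wv, W1, b1, W2, b2 = params
    X = X + self_attention(X, Wq, Wk, Wv)       # residual connection around attention
    H = np.maximum(0.0, X @ W1 + b1)            # MLP applied to each vector independently
    return X + H @ W2 + b2                      # residual connection around the MLP

rng = np.random.default_rng(0)
N, D, F = 6, 8, 32                              # tokens, model size, MLP hidden size
def make_params():
    return (rng.normal(size=(D, D)), rng.normal(size=(D, D)), rng.normal(size=(D, D)),
            rng.normal(size=(D, F)), np.zeros(F), rng.normal(size=(F, D)), np.zeros(D))

blocks = [make_params() for _ in range(3)]      # a transformer = a stack of blocks
X = rng.normal(size=(N, D))
for params in blocks:
    X = transformer_block(X, params)
print(X.shape)                                  # (6, 8)
```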
Adding attention to RNN models lets them look at different parts of the input at each timestep.
Generalized self-attention is a new, powerful neural network primitive.
Transformers are a new neural network model that uses only attention.