Lecture 5
Regularization
Data Augmentation
Horizontal flips
Random crops and scales
Color: contrast and brightness
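A minimal sketch of this augmentation pipeline using torchvision transforms (the crop size and jitter strengths are illustrative choices, not values from the lecture):

```python
import torchvision.transforms as T

# Training-time augmentation: crop/scale, flip, and color jitter.
train_transform = T.Compose([
    T.RandomResizedCrop(224),                      # random crops and scales
    T.RandomHorizontalFlip(p=0.5),                 # horizontal flips
    T.ColorJitter(brightness=0.4, contrast=0.4),   # color: contrast and brightness
    T.ToTensor(),
])
```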
Attention and Transformer
RNNs
Input and Output:
one to one
one to many
many to one
many to many
many to many (aligned)
Sequence to Sequence
Input: sequence $x_1, \dots, x_T$
Output: sequence $y_1, \dots, y_{T'}$
Encoder: $h_t = f_W(x_t, h_{t-1})$
From the final hidden state predict: initial decoder state $s_0$ and context vector $c$ (often $c = h_T$).
Decoder: $s_t = g_U(y_{t-1}, s_{t-1}, c)$
Attention
Compute alignment scores $e_{t,i} = f_{att}(s_{t-1}, h_i)$, where $f_{att}$ is an MLP.
Normalize alignment scores to get attention weights $a_{t,i} = \mathrm{softmax}_i(e_{t,i})$, with $0 < a_{t,i} < 1$ and $\sum_i a_{t,i} = 1$.
Compute the context vector as a weighted sum: $c_t = \sum_i a_{t,i} h_i$.
Use it in the decoder: $s_t = g_U(y_{t-1}, s_{t-1}, c_t)$.
Repeat: use $s_1$ to compute a new context vector $c_2$, then use $c_2$ to compute $s_2$, and so on.
Use a different context vector in each timestep of the decoder. The input sequence is not bottlenecked through a single vector. At each timestep of the decoder, the context vector looks at a different part of the input sequence.
The decoder doesn't use the fact that $h_1, \dots, h_T$ form an ordered sequence; it treats them as an unordered set. A similar architecture can therefore be used given any set of vectors.
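A minimal sketch of the attention decoder loop with teacher forcing, assuming encoder hidden states h of shape (T, H); f_att and g_u stand in for the alignment MLP and the decoder update, and every name here is illustrative:

```python
import torch
import torch.nn.functional as F

def decode_with_attention(h, s0, y_prev, f_att, g_u):
    # h: (T, H) encoder hidden states; s0: initial decoder state;
    # y_prev: list of previous-output embeddings (teacher forcing);
    # f_att: alignment MLP, g_u: decoder update (both illustrative callables).
    s, states = s0, []
    for y in y_prev:
        e = torch.stack([f_att(s, h_i) for h_i in h])   # alignment scores e_{t,i}, shape (T,)
        a = F.softmax(e, dim=0)                         # attention weights a_{t,i}
        c = (a.unsqueeze(1) * h).sum(dim=0)             # context vector c_t = sum_i a_{t,i} h_i
        s = g_u(y, s, c)                                # new decoder state s_t
        states.append(s)
    return states
```

Each iteration recomputes the attention weights from the latest decoder state, so every timestep gets its own context vector.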
Images with RNNs and Attention

Each timestep of the decoder uses a different context vector that looks at a different part of the image.
Used in image captioning.
Attention Layer
Inputs:
Query vector: $q$ (shape: $D_Q$)
Input vectors: $X$ (shape: $N_X \times D_X$)
Similarity function: $f_{att}$
Computation:
Similarities: $e$ (shape: $N_X$), where $e_i = f_{att}(q, X_i)$
Attention weights: $a = \mathrm{softmax}(e)$ (shape: $N_X$)
Output vector: $y = \sum_i a_i X_i$ (shape: $D_X$)
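A minimal sketch of this first version, taking the similarity function as an argument (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def attention(q, X, f_att):
    # q: (D_Q,) query, X: (N_X, D_X) input vectors, f_att: (q, x) -> scalar similarity
    e = torch.stack([f_att(q, x) for x in X])    # similarities e, shape (N_X,)
    a = F.softmax(e, dim=0)                      # attention weights a, shape (N_X,)
    return (a.unsqueeze(1) * X).sum(dim=0)       # output y = sum_i a_i X_i, shape (D_X,)

# With Change 1 below, f_att is simply the dot product:
# y = attention(q, X, lambda q, x: q @ x)
```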
Change 1: use a dot product for similarity: $e_i = q \cdot X_i$ (requires $D_X = D_Q$).
Change 2: use a scaled dot product: $e_i = q \cdot X_i / \sqrt{D_Q}$, so large dimensions don't produce huge scores that saturate the softmax.
Change 3: allow multiple query vectors $Q$ (shape: $N_Q \times D_Q$); then $E = QX^T$ (shape: $N_Q \times N_X$), $A = \mathrm{softmax}(E, \mathrm{dim}=1)$, $Y = AX$ (shape: $N_Q \times D_X$).
Change 4: Separate key and value.
Key Matrix: $W_K$ (Shape: $D_X \times D_Q$)
Value Matrix: $W_V$ (Shape: $D_X \times D_V$)
Key Vectors: $K = XW_K$ (Shape: $N_X \times D_Q$)
Value Vectors: $V = XW_V$ (Shape: $N_X \times D_V$)
Final Attention Layer becomes:
Inputs:
Query vectors: $Q$ (shape: $N_Q \times D_Q$)
Input vectors: $X$ (shape: $N_X \times D_X$)
Key Matrix: $W_K$ (Shape: $D_X \times D_Q$)
Value Matrix: $W_V$ (Shape: $D_X \times D_V$)
Computation:
Key Vectors: $K = XW_K$ (Shape: $N_X \times D_Q$)
Value Vectors: $V = XW_V$ (Shape: $N_X \times D_V$)
Similarities: $E = QK^T$ (shape: $N_Q \times N_X$), with $E_{i,j} = Q_i \cdot K_j / \sqrt{D_Q}$
Attention weights: $A = \mathrm{softmax}(E, \mathrm{dim}=1)$ (shape: $N_Q \times N_X$)
Output vectors: $Y = AV$ (shape: $N_Q \times D_V$), with $Y_i = \sum_j A_{i,j} V_j$
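A minimal sketch of the final layer in matrix form, with the scaling by $\sqrt{D_Q}$ folded into the similarity computation (shapes follow the notation above):

```python
import torch
import torch.nn.functional as F

def attention_layer(Q, X, W_K, W_V):
    # Q: (N_Q, D_Q), X: (N_X, D_X), W_K: (D_X, D_Q), W_V: (D_X, D_V)
    K = X @ W_K                            # key vectors,   (N_X, D_Q)
    V = X @ W_V                            # value vectors, (N_X, D_V)
    E = Q @ K.T / Q.shape[1] ** 0.5        # scaled similarities, (N_Q, N_X)
    A = F.softmax(E, dim=1)                # attention weights over the N_X inputs
    return A @ V                           # output vectors Y, (N_Q, D_V)
```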

Self-Attention Layer
One query vector per input vector: query matrix $W_Q$ (shape: $D_X \times D_Q$), queries $Q = XW_Q$ (shape: $N_X \times D_Q$).

Consider permuting the input vectors.
The queries and keys will be the same, but permuted. The same holds for the similarities, attention weights, and outputs.
The self-attention layer is permutation equivariant: $f(\sigma(X)) = \sigma(f(X))$.
The self-attention layer works on sets of vectors.
Self-attention doesn't know the order of the vectors it is processing, so add a positional encoding to the input. It can be a learned lookup table or a fixed function.
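A minimal sketch of a single-head self-attention layer as a PyTorch module (dimension names are illustrative; a positional encoding would be added to x before calling it):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention: one query per input vector."""
    def __init__(self, d_x, d_q, d_v):
        super().__init__()
        self.w_q = nn.Linear(d_x, d_q, bias=False)   # query matrix W_Q
        self.w_k = nn.Linear(d_x, d_q, bias=False)   # key matrix W_K
        self.w_v = nn.Linear(d_x, d_v, bias=False)   # value matrix W_V

    def forward(self, x):                            # x: (N_X, d_x)
        q = self.w_q(x)                              # (N_X, d_q)
        k = self.w_k(x)                              # (N_X, d_q)
        v = self.w_v(x)                              # (N_X, d_v)
        e = q @ k.T / q.shape[-1] ** 0.5             # scaled similarities, (N_X, N_X)
        a = F.softmax(e, dim=-1)                     # attention weights
        return a @ v                                 # outputs, (N_X, d_v)
```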
Masked Self-Attention Layer
Don't let vectors look ahead in the sequence: set $E_{i,j} = -\infty$ for $j > i$ before the softmax, so those attention weights become zero. Used for language modeling (predicting the next word).
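A sketch of the masking step, assuming a matrix of raw scores e of shape (N, N) where row i attends over positions j:

```python
import torch
import torch.nn.functional as F

N = 5
e = torch.randn(N, N)                                    # raw similarity scores
mask = torch.triu(torch.ones(N, N), diagonal=1).bool()   # True where j > i (the "future")
e = e.masked_fill(mask, float('-inf'))                   # block attention to future positions
a = F.softmax(e, dim=-1)                                 # row i only attends to j <= i
```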

Multihead Self-Attention
Use $H$ independent attention heads in parallel.
Run self-attention in parallel on each head (with different weights per head), then concatenate the head outputs; a sketch follows below.
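A minimal sketch of multi-head self-attention using the common projection-then-split formulation (a sketch, not necessarily the lecture's exact layout; all names are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h = num_heads
        self.d_head = d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # per-head W_Q, W_K, W_V in one matrix
        self.out = nn.Linear(d_model, d_model)       # mixes the concatenated head outputs

    def forward(self, x):                            # x: (N, d_model)
        n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)       # each (N, d_model)
        # Reshape to (h, N, d_head) so each head attends independently.
        q, k, v = (t.view(n, self.h, self.d_head).transpose(0, 1) for t in (q, k, v))
        e = q @ k.transpose(-2, -1) / self.d_head ** 0.5   # scaled similarities, (h, N, N)
        a = F.softmax(e, dim=-1)                           # attention weights per head
        y = (a @ v).transpose(0, 1).reshape(n, d)    # concatenate heads back to (N, d_model)
        return self.out(y)
```

Each head attends over the full sequence but in its own lower-dimensional subspace; the final linear layer mixes the concatenated head outputs.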
