Lecture 5
Regularization
Data Augmentation
Horizontal flips
Random crops and scales.
Color: contrast and brightness
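A minimal NumPy sketch of these train-time augmentations (the flip probability, crop scale range, and jitter strengths below are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def augment(img, rng=None):
    """Randomly augment an H x W x C image with values in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    H, W, _ = img.shape

    # Horizontal flip with probability 0.5
    if rng.random() < 0.5:
        img = img[:, ::-1, :]

    # Random crop at a random scale (normally resized back to a fixed size afterwards)
    scale = rng.uniform(0.8, 1.0)
    ch, cw = int(H * scale), int(W * scale)
    top = rng.integers(0, H - ch + 1)
    left = rng.integers(0, W - cw + 1)
    img = img[top:top + ch, left:left + cw, :]

    # Color jitter: random contrast (scale around the mean) and brightness (additive shift)
    contrast = rng.uniform(0.8, 1.2)
    brightness = rng.uniform(-0.1, 0.1)
    mean = img.mean()
    return np.clip((img - mean) * contrast + mean + brightness, 0.0, 1.0)
```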
Attention and Transformer
RNNs
Input and Output:
one to one
one to many
many to one
many to many
many to many (aligned)
Sequence to Sequence
Input: Sequence $x_1, \dots, x_T$
Output: Sequence $y_1, \dots, y_{T'}$
Encoder: $h_t = f_W(x_t, h_{t-1})$

From the final hidden state predict: initial decoder state $s_0$, context vector $c$ (often $c = h_T$)
Decoder: $s_t = g_U(y_{t-1}, s_{t-1}, c)$
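A rough sketch of the two recurrences with vanilla RNN (tanh) cells; the specific cell type, teacher forcing, and taking $s_0 = c = h_T$ directly are simplifying assumptions:

```python
import numpy as np

def seq2seq_forward(x_seq, y_prev_seq, Wx, Wh, bh, Uy, Us, Uc, bs):
    """Encoder: h_t = tanh(Wx x_t + Wh h_{t-1} + bh)
    Decoder: s_t = tanh(Uy y_{t-1} + Us s_{t-1} + Uc c + bs)"""
    h = np.zeros(Wh.shape[0])
    for x_t in x_seq:                        # encoder recurrence over x_1 ... x_T
        h = np.tanh(Wx @ x_t + Wh @ h + bh)

    s, c = h.copy(), h.copy()                # s_0 and context c taken from final hidden state
    states = []
    for y_prev in y_prev_seq:                # decoder recurrence (teacher forcing: feed ground-truth y_{t-1})
        s = np.tanh(Uy @ y_prev + Us @ s + Uc @ c + bs)
        states.append(s)
    return np.stack(states)                  # decoder states s_1 ... s_{T'}
```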
Attention
Compute alignment scores $e_{t,i} = f_{att}(s_{t-1}, h_i)$, where $f_{att}$ is an MLP.
Normalize alignment scores to get attention weights $0 < a_{t,i} < 1$, $\sum_i a_{t,i} = 1$
Compute context vector $c_t = \sum_i a_{t,i} h_i$

Repeat: use $s_1$ to compute a new context vector $c_2$, then use $c_2$ to compute $s_2, y_2$
Use a different context vector at each timestep of the decoder. The input sequence is not bottlenecked through a single vector. At each timestep of the decoder, the context vector looks at a different part of the input sequence.
The decoder doesn't use the fact that the $h_i$ form an ordered sequence; it treats them as an unordered set. A similar architecture can be used given any set of vectors $\{h_i\}$.
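A sketch of a single decoder timestep with attention; the additive (MLP) form of $f_{att}$ below is an assumption in the spirit of Bahdanau-style attention:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def attention_step(s_prev, H_enc, W1, W2, v):
    """Compute context vector c_t from decoder state s_{t-1} and encoder states H_enc (T x H)."""
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i), here a small MLP: v . tanh(W1 s + W2 h)
    e = np.array([v @ np.tanh(W1 @ s_prev + W2 @ h_i) for h_i in H_enc])
    a = softmax(e)                               # attention weights: 0 < a_{t,i} < 1, sum_i a_{t,i} = 1
    c = a @ H_enc                                # context vector c_t = sum_i a_{t,i} h_i
    return c, a
```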
Images with RNNs and Attention

Each timestep of the decoder uses a different context vector that looks at a different part of the image.
Used in image captioning.
Attention Layer
Inputs:
Query vector: $q$ (shape: $D_Q$)
Input vectors: $X$ (shape: $N_X \times D_X$)
Similarity function: $f_{att}$
Computation:
Similarities: $e$ (shape: $N_X$), $e_i = f_{att}(q, X_i)$
Attention weights: $a = \text{softmax}(e)$ (shape: $N_X$)
Output vector: $y = \sum_i a_i X_i$ (shape: $D_X$)
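A minimal sketch of this layer, with the similarity function $f_{att}$ passed in as a callable (shapes follow the definitions above):

```python
import numpy as np

def basic_attention(q, X, f_att):
    """q: query (D_Q,), X: input vectors (N_X x D_X), f_att: similarity function."""
    e = np.array([f_att(q, x_i) for x_i in X])   # similarities (N_X,)
    a = np.exp(e - e.max())
    a = a / a.sum()                              # attention weights (N_X,), softmax of e
    return a @ X                                 # output y = sum_i a_i X_i, shape (D_X,)
```

For example, `basic_attention(q, X, lambda q, x: q @ x)` recovers the dot-product similarity introduced next.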
Change 1: could use dot product for similarity:
$e_i = q \cdot X_i$
Change 2: use scaled dot product:
$e_i = \frac{q \cdot X_i}{\sqrt{D_Q}}$
Change 3: multiple query vectors $Q$ (shape: $N_Q \times D_Q$):
$E = \frac{Q X^T}{\sqrt{D_Q}}$ (shape: $N_Q \times N_X$), $E_{i,j} = \frac{Q_i \cdot X_j}{\sqrt{D_Q}}$
Change 4: Separate key and value.
Key matrix: $W_K$ (shape: $D_X \times D_Q$)
Value matrix: $W_V$ (shape: $D_X \times D_V$)
Key vectors: $K = X W_K$ (shape: $N_X \times D_Q$)
Value vectors: $V = X W_V$ (shape: $N_X \times D_V$)
Final Attention Layer becomes:
Inputs:
Query vectors: $Q$ (shape: $N_Q \times D_Q$)
Input vectors: $X$ (shape: $N_X \times D_X$)
Key matrix: $W_K$ (shape: $D_X \times D_Q$)
Value matrix: $W_V$ (shape: $D_X \times D_V$)
Computation:
Key vectors: $K = X W_K$ (shape: $N_X \times D_Q$)
Value vectors: $V = X W_V$ (shape: $N_X \times D_V$)
Similarities: $E = \frac{Q K^T}{\sqrt{D_Q}}$ (shape: $N_Q \times N_X$), $E_{i,j} = \frac{Q_i \cdot K_j}{\sqrt{D_Q}}$
Attention weights: $A = \text{softmax}(E, \text{dim}=1)$ (shape: $N_Q \times N_X$)
Output vectors: $Y = A V$ (shape: $N_Q \times D_V$), $Y_i = \sum_j A_{i,j} V_j$
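Putting the four changes together, a sketch of the full attention layer (softmax written out explicitly):

```python
import numpy as np

def attention_layer(Q, X, W_K, W_V):
    """Q: queries (N_Q x D_Q), X: inputs (N_X x D_X), W_K: (D_X x D_Q), W_V: (D_X x D_V)."""
    K = X @ W_K                                  # key vectors   (N_X x D_Q)
    V = X @ W_V                                  # value vectors (N_X x D_V)
    E = Q @ K.T / np.sqrt(Q.shape[1])            # similarities  (N_Q x N_X)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # attention weights: softmax over the key dimension
    return A @ V                                 # outputs Y = A V, shape (N_Q x D_V)
```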

Self Attention Layer
One query vector per input vector

Consider permuting the input vectors.
Queries and keys will be the same, but permuted. Same for the similarities, attention weights, and outputs.
Self-attention layer is permutation equivariant: $f(s(x)) = s(f(x))$
Self-attention layer works on sets of vectors.
Self-attention doesn't know the order of the vectors it is processing, so add a positional encoding $E$ to the input. It can be a learned lookup table or a fixed function.
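A sketch of a self-attention layer; the query projection $W_Q$ (so that $Q = X W_Q$, one query per input) and the optional additive positional encoding are standard, though the exact form of the encoding is left unspecified here:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V, pos_enc=None):
    """X: input vectors (N x D_X); pos_enc: optional positional encoding (N x D_X)."""
    if pos_enc is not None:
        X = X + pos_enc                          # inject order information before attention
    Q = X @ W_Q                                  # one query vector per input vector (N x D_Q)
    K = X @ W_K                                  # keys   (N x D_Q)
    V = X @ W_V                                  # values (N x D_V)
    E = Q @ K.T / np.sqrt(Q.shape[1])            # similarities (N x N)
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)         # attention weights
    return A @ V                                 # outputs (N x D_V)
```

Without `pos_enc`, permuting the rows of X just permutes the rows of the output, which is the permutation equivariance noted above.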
Masked Self-Attention Layer
Don't let vectors look ahead in the sequence. Used for language modeling (predict the next word).
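A sketch of the masking: set similarities to $-\infty$ wherever a position would attend to a later position, so the softmax gives them zero weight:

```python
import numpy as np

def masked_self_attention(X, W_Q, W_K, W_V):
    """Self-attention where position i only attends to positions j <= i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(Q.shape[1])            # similarities (N x N)
    N = E.shape[0]
    mask = np.tril(np.ones((N, N), dtype=bool))  # True where attending is allowed (j <= i)
    E = np.where(mask, E, -np.inf)               # future positions get -inf -> zero attention weight
    A = np.exp(E - E.max(axis=1, keepdims=True))
    A = A / A.sum(axis=1, keepdims=True)
    return A @ V
```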

Multihead Self-Attention
Use H independent attention heads in parallel.
Run self-attention in parallel on each set of input vectors (different weights per head), then concatenate the outputs.
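A sketch of H heads run in parallel, each with its own projection matrices; the per-head outputs are concatenated (a real implementation typically adds a final output projection, omitted here):

```python
import numpy as np

def multihead_self_attention(X, heads):
    """X: inputs (N x D_X); heads: list of (W_Q, W_K, W_V) tuples, one per head."""
    outputs = []
    for W_Q, W_K, W_V in heads:                  # different weights per head
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        E = Q @ K.T / np.sqrt(Q.shape[1])
        A = np.exp(E - E.max(axis=1, keepdims=True))
        A = A / A.sum(axis=1, keepdims=True)
        outputs.append(A @ V)                    # (N x D_V) per head
    return np.concatenate(outputs, axis=1)       # concatenate heads: (N x H*D_V)
```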
