Lecture 5

Regularization

Data Augmentation

  • Horizontal flips

  • Random crops and scales.

  • Color: contrast and brightness
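A minimal sketch of such an augmentation pipeline using torchvision transforms; the specific parameter values below are illustrative choices, not from the lecture:

```python
import torchvision.transforms as T

# Illustrative training-time augmentation pipeline (parameter values are arbitrary)
train_transform = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                 # horizontal flips
    T.RandomResizedCrop(224, scale=(0.08, 1.0)),   # random crops and scales
    T.ColorJitter(brightness=0.4, contrast=0.4),   # color: contrast and brightness
    T.ToTensor(),
])
```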

Attention and Transformer

RNNs

Input and Output:

  • one to one

  • one to many

  • many to one

  • many to many

  • many to many (aligned)

Sequence to Sequence

Input: Sequence $x_1, \dots, x_T$

Output: Sequence $y_1, \dots, y_{T'}$

Encoder: $h_t = f_w(x_t, h_{t-1})$

From the final hidden state, predict the initial decoder state $s_0$ and the context vector $c$ (often $c = h_T$).

Decoder: $s_t = g_u(y_{t-1}, s_{t-1}, c)$
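A minimal sketch of these two recurrences, assuming GRU cells for $f_w$ and $g_u$; the cell type, the dimensions, and the output layer are assumptions made for illustration:

```python
import torch
import torch.nn as nn

D_in, D_out, H = 32, 32, 64          # assumed input, output, and hidden sizes
f_w = nn.GRUCell(D_in, H)            # encoder cell: h_t = f_w(x_t, h_{t-1})
g_u = nn.GRUCell(D_out + H, H)       # decoder cell: s_t = g_u(y_{t-1}, s_{t-1}, c)
W_out = nn.Linear(H, D_out)          # maps decoder state s_t to an output y_t

x = torch.randn(10, 1, D_in)         # input sequence x_1..x_T (T = 10, batch = 1)

# Encoder: run f_w over the input sequence
h = torch.zeros(1, H)
for t in range(x.shape[0]):
    h = f_w(x[t], h)

# From the final hidden state predict s_0 and context c (here simply s_0 = c = h_T)
s, c = h, h

# Decoder: feed back the previous output y_{t-1} along with the fixed context c
y_prev = torch.zeros(1, D_out)
for t in range(5):                   # generate T' = 5 outputs
    s = g_u(torch.cat([y_prev, c], dim=1), s)
    y_prev = W_out(s)
```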

Attention

Compute alignment scores $e_{t,i} = f_{att}(s_{t-1}, h_i)$, where $f_{att}$ is an MLP.

Normalize alignment scores to get attention weights: $0 < a_{t,i} < 1$, $\sum_i a_{t,i} = 1$

Compute the context vector as $c_t = \sum_i a_{t,i} h_i$
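A minimal sketch of one decoder timestep's attention computation; the two-layer MLP form of $f_{att}$ and the sizes are assumptions:

```python
import torch
import torch.nn as nn

H, T = 64, 10
h = torch.randn(T, H)                     # encoder hidden states h_1..h_T
s_prev = torch.randn(1, H)                # previous decoder state s_{t-1}

# f_att: a small MLP scoring (s_{t-1}, h_i) pairs (an assumed architecture)
f_att = nn.Sequential(nn.Linear(2 * H, H), nn.Tanh(), nn.Linear(H, 1))

e = f_att(torch.cat([s_prev.expand(T, H), h], dim=1)).squeeze(1)  # alignment scores e_{t,i}, shape (T,)
a = torch.softmax(e, dim=0)               # attention weights: 0 < a_{t,i} < 1, sum to 1
c = (a.unsqueeze(1) * h).sum(dim=0)       # context vector c_t = sum_i a_{t,i} h_i, shape (H,)
```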

With attention

Repeat: use $s_1$ to compute a new context vector $c_2$, then use $c_2$ to compute $s_2$, $y_2$.

Use a different context vector at each timestep of the decoder. The input sequence is not bottlenecked through a single vector; at each decoder timestep, the context vector looks at a different part of the input sequence.

The decoder doesn't use the fact that the $h_i$ form an ordered sequence; it treats them as an unordered set. A similar architecture can be used given any set of vectors $\{h_i\}$.

Images with RNNs and Attention

Each timestep of the decoder uses a different context vector that looks at a different part of the image.

Used in image captioning.

Attention Layer

Inputs:

Query vector: $q$ (shape: $D_Q$)

Input vectors: $X$ (shape: $N_X \times D_X$)

Similarity function: $f_{att}$

Computation:

Similarities: $e$ (shape: $N_X$), $e_i = f_{att}(q, X_i)$

Attention weights: $a = \text{softmax}(e)$ (shape: $N_X$)

Output vector: $y = \sum_i a_i X_i$ (shape: $D_X$)
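The same computation written as a small function over an arbitrary similarity function $f_{att}$; this is a sketch, and the names here are hypothetical:

```python
import torch

def attention(q, X, f_att):
    """Generic attention: one query q, inputs X of shape (N_X, D_X), similarity f_att."""
    e = torch.stack([f_att(q, X[i]) for i in range(X.shape[0])])  # similarities, shape (N_X,)
    a = torch.softmax(e, dim=0)                                   # attention weights
    return (a.unsqueeze(1) * X).sum(dim=0)                        # output y = sum_i a_i X_i

# Example with dot-product similarity (Change 1 below)
X = torch.randn(10, 64)
q = torch.randn(64)
y = attention(q, X, lambda q, x: torch.dot(q, x))
```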

Change 1: could use dot product for similarity.

$e_i = q \cdot X_i$

Change 2: use scaled dot product

$e_i = \frac{q \cdot X_i}{\sqrt{D_Q}}$

Change 3: multiple query vectors

$E = \frac{QX^T}{\sqrt{D_Q}}$ (shape: $N_Q \times N_X$), $E_{i,j} = \frac{Q_i \cdot X_j}{\sqrt{D_Q}}$
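A sketch of Changes 2 and 3 together: scaled dot-product similarities for a set of query vectors. The shapes are illustrative, and the final weighted sum over the inputs themselves is the pre-key/value form of the output (keys and values are separated in Change 4):

```python
import math
import torch

N_Q, N_X, D_Q = 4, 10, 64
Q = torch.randn(N_Q, D_Q)            # query vectors
X = torch.randn(N_X, D_Q)            # input vectors (here D_X = D_Q so dot products are defined)

E = Q @ X.T / math.sqrt(D_Q)         # similarities, shape (N_Q, N_X)
A = torch.softmax(E, dim=1)          # attention weights: one distribution per query
Y = A @ X                            # outputs, shape (N_Q, D_Q)
```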

Change 4: Separate key and value.

Key Matrix: $W_K$ (Shape: $D_X \times D_Q$)

Value Matrix: $W_V$ (Shape: $D_X \times D_V$)

Key Vectors: $K = XW_K$ (Shape: $N_X \times D_Q$)

Value Vectors: $V = XW_V$ (Shape: $N_X \times D_V$)

Final Attention Layer becomes:

Inputs:

Query vectors: $Q$ (shape: $N_Q \times D_Q$)

Input vectors: $X$ (shape: $N_X \times D_X$)

Key Matrix: $W_K$ (Shape: $D_X \times D_Q$)

Value Matrix: $W_V$ (Shape: $D_X \times D_V$)

Computation:

Key Vectors: $K = XW_K$ (Shape: $N_X \times D_Q$)

Value Vectors: $V = XW_V$ (Shape: $N_X \times D_V$)

Similarities: $E = \frac{QK^T}{\sqrt{D_Q}}$ (shape: $N_Q \times N_X$), $E_{i,j} = \frac{Q_i \cdot K_j}{\sqrt{D_Q}}$

Attention weights: $A = \text{softmax}(E, \text{dim}=1)$ (shape: $N_Q \times N_X$)

Output vectors: $Y = AV$ (shape: $N_Q \times D_V$), $Y_i = \sum_j A_{i,j} V_j$
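A minimal sketch of this final attention layer in PyTorch; the dimensions are illustrative, and bias-free linear layers are an assumption:

```python
import math
import torch
import torch.nn as nn

class AttentionLayer(nn.Module):
    def __init__(self, D_X, D_Q, D_V):
        super().__init__()
        self.W_K = nn.Linear(D_X, D_Q, bias=False)   # key matrix   (D_X x D_Q)
        self.W_V = nn.Linear(D_X, D_V, bias=False)   # value matrix (D_X x D_V)
        self.D_Q = D_Q

    def forward(self, Q, X):
        K = self.W_K(X)                              # key vectors   (N_X, D_Q)
        V = self.W_V(X)                              # value vectors (N_X, D_V)
        E = Q @ K.T / math.sqrt(self.D_Q)            # similarities  (N_Q, N_X)
        A = torch.softmax(E, dim=1)                  # attention weights (N_Q, N_X)
        return A @ V                                 # outputs Y = AV (N_Q, D_V)

layer = AttentionLayer(D_X=64, D_Q=64, D_V=64)
Y = layer(torch.randn(4, 64), torch.randn(10, 64))  # N_Q = 4 queries, N_X = 10 inputs
```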


Self Attention Layer

One query vector per input vector: queries are computed from the inputs themselves as $Q = XW_Q$, with a query matrix $W_Q$ (shape: $D_X \times D_Q$).

Self Attention

Consider permuting the input vectors.

Queries, keys, and values will be the same, but permuted. The same holds for the similarities, attention weights, and outputs.

The self-attention layer is permutation equivariant: $f(s(x)) = s(f(x))$.

Self-attention layer works on sets of vectors.

Self-attention doesn't know the order of the vectors it is processing, so add a positional encoding $E$ to the input. This can be a learned lookup table or a fixed function.
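A sketch of a self-attention layer with a learned positional-embedding table; the learned table (versus a fixed function), the shared dimension $D$, and the maximum length are assumptions:

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, D, max_len=512):
        super().__init__()
        self.W_Q = nn.Linear(D, D, bias=False)       # one query per input vector: Q = X W_Q
        self.W_K = nn.Linear(D, D, bias=False)
        self.W_V = nn.Linear(D, D, bias=False)
        self.pos = nn.Embedding(max_len, D)          # learned positional lookup table E
        self.D = D

    def forward(self, X):                            # X: (N, D)
        X = X + self.pos(torch.arange(X.shape[0]))   # add positional encoding to the input
        Q, K, V = self.W_Q(X), self.W_K(X), self.W_V(X)
        A = torch.softmax(Q @ K.T / math.sqrt(self.D), dim=1)
        return A @ V

Y = SelfAttention(D=64)(torch.randn(10, 64))
```

Without the positional term, permuting the rows of $X$ would simply permute the rows of the output (permutation equivariance); adding the positional encoding is what makes order visible to the layer.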

Masked Self-Attention Layer

Don't let vectors look ahead in the sequence. Used for language modeling (predicting the next word).
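A sketch of the masking step: a lower-triangular mask sets similarities to future positions to $-\infty$ before the softmax, so their attention weights become zero (the query/key/value projections are omitted here for brevity):

```python
import math
import torch

N, D = 10, 64
X = torch.randn(N, D)
Q = K = V = X                                        # identity projections, just to show the mask

E = Q @ K.T / math.sqrt(D)                           # similarities (N, N)
mask = torch.tril(torch.ones(N, N)).bool()           # allow position i to see positions j <= i
E = E.masked_fill(~mask, float('-inf'))              # block attention to future positions
A = torch.softmax(E, dim=1)                          # future weights become exactly 0
Y = A @ V
```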


Multihead Self-Attention

Use $H$ independent attention heads in parallel.

Run self attention in parallel on each set of input vectors (different weights per head).
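A minimal sketch of multi-head self-attention: project the inputs, split into $H$ heads, run scaled dot-product attention per head in parallel, then concatenate. Packing the per-head weights into single matrices and adding an output projection are standard choices, not details from these notes; PyTorch's built-in nn.MultiheadAttention provides the same functionality.

```python
import math
import torch
import torch.nn as nn

class MultiheadSelfAttention(nn.Module):
    def __init__(self, D, H):
        super().__init__()
        assert D % H == 0
        self.H, self.d = H, D // H                    # H heads, each of width D / H
        self.W_Q = nn.Linear(D, D, bias=False)        # per-head weights packed into one matrix
        self.W_K = nn.Linear(D, D, bias=False)
        self.W_V = nn.Linear(D, D, bias=False)
        self.W_O = nn.Linear(D, D, bias=False)        # output projection after concatenation

    def forward(self, X):                             # X: (N, D)
        N = X.shape[0]
        # Split each projection into H heads: (H, N, D/H)
        Q = self.W_Q(X).reshape(N, self.H, self.d).transpose(0, 1)
        K = self.W_K(X).reshape(N, self.H, self.d).transpose(0, 1)
        V = self.W_V(X).reshape(N, self.H, self.d).transpose(0, 1)
        A = torch.softmax(Q @ K.transpose(1, 2) / math.sqrt(self.d), dim=2)  # (H, N, N)
        Y = (A @ V).transpose(0, 1).reshape(N, self.H * self.d)              # concatenate heads
        return self.W_O(Y)

Y = MultiheadSelfAttention(D=64, H=8)(torch.randn(10, 64))
```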

