# Lecture 5

## Regularization

**Data Augmentation**

* Horizontal flips
* Random crops and scales
* Color: contrast and brightness
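
As a concrete illustration, a minimal NumPy sketch of these augmentations. The crop fraction and the brightness/contrast ranges are arbitrary choices for illustration, not values from the lecture.

```python
import numpy as np

def augment(img, rng):
    """Randomly flip, crop, and color-jitter an (H, W, 3) image in [0, 1]."""
    H, W, _ = img.shape
    if rng.random() < 0.5:                  # horizontal flip
        img = img[:, ::-1]
    ch, cw = int(0.9 * H), int(0.9 * W)     # random crop to 90% of each side
    y = rng.integers(0, H - ch + 1)
    x = rng.integers(0, W - cw + 1)
    img = img[y:y + ch, x:x + cw]
    a = rng.uniform(0.8, 1.2)               # contrast scale
    b = rng.uniform(-0.1, 0.1)              # brightness shift
    return np.clip(a * img + b, 0.0, 1.0)

rng = np.random.default_rng(0)
out = augment(np.zeros((32, 32, 3)), rng)
```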

## Attention and Transformer

**RNNs**

Input and Output:

* one to one
* one to many
* many to one
* many to many
* many to many (aligned)

**Sequence to Sequence**

Input: Sequence $$x\_1, \dots, x\_T$$

Output: Sequence $$y\_1, \dots, y\_{T^\prime}$$

Encoder: $$h\_t = f\_w(x\_t, h\_{t-1})$$

<figure><img src="/files/WJRXjTjHmhovJvJkzH0V" alt=""><figcaption></figcaption></figure>

From initial hidden state predict: initial decoder state $$s\_0$$, context vector $$c$$

Decoder: $$s\_t = g\_u(y\_{t-1}, s\_{t-1}, c)$$

### Attention

Compute alignment scores $$e\_{t,i} = f\_{att}(s\_{t-1}, h\_i)$$, $$f\_{att}$$ is an MLP.

Normalize alignment scores to get attention weights $$0 < a\_{t,i} < 1, \sum\_i a\_{t,i} = 1$$

Compute the context vector as $$c\_t = \sum\_i a\_{t,i} h\_i$$
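
The three steps above can be sketched in NumPy, using a one-hidden-layer MLP for $$f\_{att}$$. The parameter shapes `W1`, `W2`, `v` are illustrative assumptions, not from the lecture.

```python
import numpy as np

def attention_context(s_prev, h, W1, W2, v):
    """One decoder timestep of additive (MLP) attention."""
    # Alignment scores e_{t,i} = f_att(s_{t-1}, h_i); f_att is a small MLP
    e = np.tanh(h @ W1 + s_prev @ W2) @ v       # shape (T,)
    # Normalize with softmax: 0 < a_i < 1 and sum_i a_i = 1
    a = np.exp(e - e.max())
    a /= a.sum()
    # Context vector c_t = sum_i a_{t,i} h_i
    return a, a @ h

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 4))                     # T=5 encoder states of size D=4
s_prev = rng.normal(size=4)                     # previous decoder state
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 3))
v = rng.normal(size=3)                          # MLP params (illustrative shapes)
a, c = attention_context(s_prev, h, W1, W2, v)
```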

<figure><img src="/files/kdkemWPjCQQZbl2jckSi" alt=""><figcaption><p>With attention</p></figcaption></figure>

Repeat: use $$s\_1$$ to compute new context vector $$c\_2$$, use $$c\_2$$ to compute $$s\_2, y\_2$$

Use a different context vector at each timestep of the decoder. The input sequence is not bottlenecked through a single vector. At each timestep, the context vector looks at a different part of the input sequence.

The decoder doesn't use the fact that the $$h\_i$$ form an ordered sequence; it treats them as an unordered set. A similar architecture can be used given any set of vectors $$\{h\_i\}$$.

### Images with RNNs and Attention

<figure><img src="/files/2ye5ybn1larTuY4yEyfw" alt=""><figcaption></figcaption></figure>

Each timestep of the decoder uses a different context vector that looks at a different part of the image.

Used in image captioning.

**Attention Layer**

Inputs:

Query vector: $$q$$ (shape: $$D\_Q$$)

Input vectors: $$X$$ (shape: $$N\_X \times D\_X$$)

Similarity function: $$f\_{att}$$

Computation:

Similarities: $$e$$ (shape: $$N\_X$$), $$e\_i = f\_{att}(q, X\_i)$$

Attention weights: $$a = softmax(e)$$ (shape $$N\_X$$)

Output vector: $$y = \sum\_i a\_i X\_i$$ (shape: $$D\_X$$)

Change 1: could use dot product for similarity.

$$e\_i = q \cdot X\_i$$

Change 2: use scaled dot product. Dot products grow in magnitude with dimension, which saturates the softmax, so divide by $$\sqrt{D\_Q}$$:

$$e\_i = \frac{q \cdot X\_i}{\sqrt{D\_Q}}$$

Change 3: multiple query vectors

$$E = \frac{QX^T}{\sqrt{D\_Q}}$$ (shape: $$N\_Q \times N\_X$$) $$E\_{i,j} = \frac{(Q\_i \cdot X\_j)}{\sqrt{D\_Q}}$$
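A sketch of changes 1–3 in NumPy. The sizes are arbitrary, and here $$D\_X = D\_Q$$ so the dot products are defined.

```python
import numpy as np

rng = np.random.default_rng(0)
N_Q, N_X, D_Q = 2, 5, 8
Q = rng.normal(size=(N_Q, D_Q))            # N_Q query vectors
X = rng.normal(size=(N_X, D_Q))            # input vectors; here D_X = D_Q

E = Q @ X.T / np.sqrt(D_Q)                 # E[i, j] = Q_i . X_j / sqrt(D_Q)
A = np.exp(E - E.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)          # softmax over each row
Y = A @ X                                  # one output vector per query
```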

Change 4: Separate key and value.

Key Matrix: $$W\_K$$ (Shape: $$D\_X \times D\_Q$$)

Value Matrix: $$W\_V$$ (Shape: $$D\_X \times D\_V$$)

Key Vectors: $$K = XW\_K$$ (Shape: $$N\_X \times D\_Q$$)

Value Vectors: $$V = XW\_V$$ (Shape: $$N\_X \times D\_V$$)

Final Attention Layer becomes:

Inputs:

Query vectors: $$Q$$ (shape: $$N\_Q \times D\_Q$$)

Input vectors: $$X$$ (shape: $$N\_X \times D\_X$$)

Key Matrix: $$W\_K$$ (Shape: $$D\_X \times D\_Q$$)

Value Matrix: $$W\_V$$ (Shape: $$D\_X \times D\_V$$)

Computation:

Key Vectors: $$K = XW\_K$$ (Shape: $$N\_X \times D\_Q$$)

Value Vectors: $$V = XW\_V$$ (Shape: $$N\_X \times D\_V$$)

Similarities: $$E = \frac{QK^T}{\sqrt{D\_Q}}$$ (shape: $$N\_Q \times N\_X$$), $$E\_{i,j} = \frac{(Q\_i \cdot K\_j)}{\sqrt{D\_Q}}$$

Attention weights: $$A = softmax(E)$$, with softmax over the $$N\_X$$ dimension (shape: $$N\_Q \times N\_X$$)

Output vectors: $$Y = AV$$ (shape: $$N\_Q \times D\_V$$), $$Y\_i = \sum\_j A\_{i,j}V\_j$$
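The full computation can be sketched in NumPy; the sizes below are arbitrary illustrations.

```python
import numpy as np

def attention_layer(Q, X, W_K, W_V):
    """Attention layer with separate keys and values, a minimal sketch."""
    K = X @ W_K                              # key vectors, (N_X, D_Q)
    V = X @ W_V                              # value vectors, (N_X, D_V)
    E = Q @ K.T / np.sqrt(Q.shape[-1])       # similarities, (N_Q, N_X)
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)       # softmax over the N_X dimension
    return A, A @ V                          # weights (N_Q, N_X), outputs (N_Q, D_V)

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))                  # N_Q=3 queries of dimension D_Q=4
X = rng.normal(size=(5, 6))                  # N_X=5 inputs of dimension D_X=6
W_K = rng.normal(size=(6, 4))                # D_X x D_Q
W_V = rng.normal(size=(6, 7))                # D_X x D_V
A, Y = attention_layer(Q, X, W_K, W_V)
```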

<figure><img src="/files/rQIMjefrbUXdWiMjTxSy" alt=""><figcaption><p>Attention Layer</p></figcaption></figure>

**Self-Attention Layer**

One query vector per input vector

<figure><img src="/files/pp0QGplPEcTGws6xHTfv" alt=""><figcaption><p>Self Attention</p></figcaption></figure>

Consider permuting the input vectors.

The queries and keys will be the same, but permuted. The same holds for the similarities, attention weights, and outputs.

The self-attention layer is permutation equivariant: $$f(s(x)) = s(f(x))$$

Self-attention layer works on sets of vectors.
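Permutation equivariance can be checked numerically with a minimal NumPy self-attention (one query per input, all projected from $$X$$); the shapes are arbitrary.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Self-attention: queries, keys, and values all projected from X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    E = Q @ K.T / np.sqrt(W_Q.shape[1])      # scaled dot-product similarities
    A = np.exp(E - E.max(axis=-1, keepdims=True))
    A /= A.sum(axis=-1, keepdims=True)       # row-wise softmax
    return A @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 6))
W_Q, W_K = rng.normal(size=(6, 6)), rng.normal(size=(6, 6))
W_V = rng.normal(size=(6, 5))
perm = np.array([2, 0, 3, 1])
out = self_attention(X, W_Q, W_K, W_V)
out_perm = self_attention(X[perm], W_Q, W_K, W_V)
# Permuting the inputs permutes the outputs the same way: f(s(x)) = s(f(x))
```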

Self-attention doesn't know the order of the vectors it processes, so add a positional encoding $$E$$ to the input. It can be a learned lookup table or a fixed function.
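One common fixed function is the sinusoidal encoding from the Transformer paper; a sketch, assuming an even dimension $$D$$:

```python
import numpy as np

def sinusoidal_positions(N, D):
    """Fixed sinusoidal positional encoding for N positions; D assumed even."""
    pos = np.arange(N)[:, None]                 # positions 0..N-1
    i = np.arange(D // 2)[None, :]              # frequency index
    angles = pos / (10000.0 ** (2 * i / D))     # (N, D/2)
    E = np.zeros((N, D))
    E[:, 0::2] = np.sin(angles)                 # even dims: sine
    E[:, 1::2] = np.cos(angles)                 # odd dims: cosine
    return E

E = sinusoidal_positions(10, 8)
```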

**Masked Self-Attention Layer**

Don't let vectors look ahead in the sequence. Used for language modeling (predicting the next word).

<figure><img src="/files/5QU5eOTsf0WLjswcyROW" alt=""><figcaption><p>masked</p></figcaption></figure>

**Multihead Self-Attention**

Use $$H$$ independent attention heads in parallel.

Run self-attention in parallel on each set of input vectors (different weights per head).

<figure><img src="/files/atycRtK7Ckaw1gQwnPRE" alt=""><figcaption><p>multihead</p></figcaption></figure>

