# Lecture 6

## Attention and Transformers

Replace Convolution with "Local Attention"

Convolution: the output at each position is the inner product of the conv kernel with the receptive field in the input.

<figure><img src="/files/wMEjfhXvGM971Uet3E2I" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/VV0VVFwgnifePOjKG13R" alt=""><figcaption></figcaption></figure>

## RNN

Works on ordered sequences. Good at long sequences: after one RNN layer, the final hidden state has seen the whole sequence. Not parallelizable: hidden states must be computed sequentially.

Key idea: an RNN has an internal state that is updated as the sequence is processed.

We process sequence of vectors by applying recurrence formula:

$$
h_t = f_w(h_{t-1}, x_t)
$$

where $$h_t$$ is the new state, $$f_w$$ is some function with parameters $$w$$, $$h_{t-1}$$ is the old state, and $$x_t$$ is the input at the current time step. The same function and parameters are used at every time step.
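A minimal sketch of this recurrence, assuming the standard vanilla-RNN choice $$f_w(h, x) = \tanh(W_{hh} h + W_{xh} x)$$ (weight names and dimensions are illustrative):

```python
import torch

D, H = 4, 8                      # input dim, hidden dim
W_xh = torch.randn(H, D) * 0.1   # input-to-hidden weights
W_hh = torch.randn(H, H) * 0.1   # hidden-to-hidden weights

def f_w(h_prev, x_t):
    # New state from old state and current input (bias omitted for brevity)
    return torch.tanh(W_hh @ h_prev + W_xh @ x_t)

h = torch.zeros(H)               # initial state
xs = torch.randn(5, D)           # sequence of 5 input vectors
for x_t in xs:                   # sequential: not parallelizable over time
    h = f_w(h, x_t)              # same weights w reused at every step
```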

RNN graph:

<figure><img src="/files/ib6OOjFnnLl6Tu7gL3va" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/AAujFPyFbIcqEWmQzaut" alt=""><figcaption><p>Many to Many</p></figcaption></figure>

<figure><img src="/files/LE4TMrUBK3diCgUj3XkT" alt=""><figcaption></figcaption></figure>

<figure><img src="/files/oFBTwMpUbnPOOu2h7hqN" alt=""><figcaption></figcaption></figure>

Sequence to Sequence = (Many to One) + (One to Many): a many-to-one encoder summarizes the input, then a one-to-many decoder generates the output.

<figure><img src="/files/q9bBSvNksHROxq7tNjHY" alt=""><figcaption></figcaption></figure>

### Convolution on input

Works on multidimensional grids. Bad at long sequences: you need to stack many conv layers for an output to see the whole sequence. Highly parallel: each output can be computed in parallel.
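A quick back-of-the-envelope sketch of why depth is needed: with kernel size $$k$$, each conv layer grows the receptive field by $$k - 1$$ (a standard fact, not spelled out in the notes):

```python
# Receptive field after stacking conv layers with stride 1
def receptive_field(num_layers, kernel_size):
    return 1 + num_layers * (kernel_size - 1)

# To cover a length-1024 sequence with k=3 kernels, count the layers needed
layers = 0
while receptive_field(layers, 3) < 1024:
    layers += 1
print(layers)  # 512
```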

### Self attention

Works on sets of vectors. Good at long sequences: after one self-attention layer, each output sees all inputs. Highly parallel: each output can be computed in parallel. Very memory intensive.
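A minimal single-head sketch (dimensions and weight initialization are illustrative). The $$T \times T$$ attention matrix is also why self-attention is memory intensive:

```python
import torch
import torch.nn.functional as F

T, D = 5, 8                       # sequence length, model dim
x = torch.randn(T, D)             # set of input vectors
W_q, W_k, W_v = (torch.randn(D, D) * 0.1 for _ in range(3))

Q, K, V = x @ W_q, x @ W_k, x @ W_v
# (T, T) scaled dot-product weights: every position attends to every position
att = F.softmax(Q @ K.T / D ** 0.5, dim=-1)
out = att @ V                     # each output is a weighted sum of ALL inputs
```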

### Transformer

Self-attention is the only part where vectors interact with each other; an MLP is then applied independently to each vector, with residual connections around both.

<figure><img src="/files/XP0XvWTapKt6wdGzh6ZQ" alt=""><figcaption></figcaption></figure>

A transformer is a sequence of transformer blocks.
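A minimal sketch of one such block in PyTorch (hyperparameters are illustrative; the layer normalization is an assumption based on the standard transformer block, since the notes above mention only the residual connections):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(),
                                 nn.Linear(4 * d, d))
        self.norm2 = nn.LayerNorm(d)

    def forward(self, x):  # x: (batch, seq, d)
        # Self-attention: the only place vectors interact; plus residual
        x = self.norm1(x + self.attn(x, x, x, need_weights=False)[0])
        # MLP acts on each vector independently; plus residual
        x = self.norm2(x + self.mlp(x))
        return x

# A transformer is just a stack of such blocks
model = nn.Sequential(*[Block() for _ in range(6)])
y = model(torch.randn(2, 10, 64))
```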

Adding attention to RNN models lets them look at different parts of the input at each timestep.
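A minimal sketch of this mechanism, assuming simple dot-product attention over the encoder's hidden states (variable names are illustrative):

```python
import torch
import torch.nn.functional as F

T, H = 6, 8
enc_states = torch.randn(T, H)     # one encoder hidden state per input position
dec_state = torch.randn(H)         # current decoder state

scores = enc_states @ dec_state    # alignment score for each input position
weights = F.softmax(scores, dim=0) # attention distribution over the input
context = weights @ enc_states     # context vector fed back into the decoder
```

At each decoding timestep the weights are recomputed, so the decoder can attend to a different part of the input every step.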

Generalized self-attention is a new, powerful neural network primitive.

Transformers are a new neural network model that uses only attention.

