# Attention is all you need

Proposed the Transformer, a model architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.

Self-attention: an attention mechanism relating different positions of a single sequence in order to compute a representation of that sequence.

### Model architecture

Encoder: maps an input sequence $$(x\_1, \dots, x\_n)$$ to a sequence of continuous representations $$(z\_1, \dots, z\_n)$$. Given these, the decoder generates an output sequence one symbol at a time, consuming the previously generated symbols as additional input when generating the next.

**Encoder and Decoder Stacks**

The encoder is composed of a stack of $$N = 6$$ identical layers. Each layer has 2 sub-layers: the first is a multi-head self-attention mechanism, the second is a position-wise FFN. A residual connection is employed around each of the 2 sub-layers, followed by layer normalization. The output of each sub-layer is $$LayerNorm(x + Sublayer(x))$$, where $$Sublayer(x)$$ is the function implemented by the sub-layer itself.
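The residual-then-normalize pattern can be sketched in NumPy (a minimal illustration only; the real layer normalization also has learned gain and bias parameters, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # normalize each position's feature vector to zero mean, unit variance
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def sublayer_connection(x, sublayer):
    # LayerNorm(x + Sublayer(x)): residual connection, then layer norm
    return layer_norm(x + sublayer(x))

x = np.random.default_rng(0).standard_normal((2, 4))
out = sublayer_connection(x, lambda t: 2.0 * t)  # toy stand-in sub-layer
```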

The decoder is also a stack of 6 identical layers. In addition to the two sub-layers, each decoder layer inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, residual connections are employed around each of the sub-layers, followed by layer normalization. The self-attention sub-layer is also masked to prevent positions from attending to subsequent positions. This ensures that predictions for position $$i$$ can depend only on the known outputs at positions less than $$i$$.
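One common way to realize this masking is to set the attention scores for future positions to $$-\infty$$ before the softmax, so their weights become exactly zero. A minimal NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    # positions j > i get -inf, so position i cannot attend to the future
    return np.triu(np.full((n, n), -np.inf), k=1)

def masked_softmax(scores):
    scores = scores + causal_mask(scores.shape[-1])
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# with all-zero scores, each row i attends uniformly over positions 0..i only
weights = masked_softmax(np.zeros((4, 4)))
```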

**Attention**

An attention function maps a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors.

$$
Attention(Q,K,V) = softmax(\frac{QK^T}{\sqrt{d\_k}})V
$$

Here, $$d\_k$$ is the dimension of the keys and $$d\_v$$ is the dimension of the values. (The dot products are divided by $$\sqrt{d\_k}$$ because for large $$d\_k$$ they grow large in magnitude, pushing the softmax into regions with extremely small gradients.)
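The formula above can be sketched directly in NumPy (a minimal single-sequence illustration, without batching or masking):

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_queries, n_keys)
    return softmax(scores) @ V          # (n_queries, d_v)
```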

**Multi-Head Attention**

Allows the model to jointly attend to information from different representation subspaces at different positions.

$$MultiHead(Q, K, V) = Concat(head\_1, \dots, head\_h)W^O$$

where $$head\_i = Attention(QW\_i^Q, KW\_i^K, VW\_i^V)$$ and the projections are parameter matrices $$W\_i^Q$$, $$W\_i^K$$, $$W\_i^V$$, and $$W^O$$.

In this work, $$h = 8$$ parallel attention layers, or heads, are used.
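Putting the two formulas together, multi-head attention can be sketched as follows (random projection matrices stand in for learned parameters; a smaller $$d\_{model}$$ and $$h = 4$$ are used to keep the toy example small):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head(Q, K, V, W_Q, W_K, W_V, W_O):
    # project into each head's subspace, attend, concatenate, project back
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(W_Q, W_K, W_V)]
    return np.concatenate(heads, axis=-1) @ W_O

rng = np.random.default_rng(0)
d_model, h = 16, 4
d_k = d_model // h  # per-head dimension (also used for d_v here)
W_Q = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_K = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_V = [rng.standard_normal((d_model, d_k)) for _ in range(h)]
W_O = rng.standard_normal((h * d_k, d_model))

x = rng.standard_normal((5, d_model))           # toy sequence of 5 tokens
out = multi_head(x, x, x, W_Q, W_K, W_V, W_O)   # self-attention: Q = K = V = x
```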

### FFN

$$
FFN(x) = \max (0, xW\_1 +b\_1)W\_2 + b\_2
$$
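This is two linear transformations with a ReLU in between, applied identically and independently to each position. A minimal NumPy sketch (the paper uses $$d\_{model} = 512$$ and an inner dimension of 2048; smaller sizes are used here for illustration):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    # max(0, x W1 + b1) W2 + b2, applied to each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64
W1, b1 = rng.standard_normal((d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)), np.zeros(d_model)

x = rng.standard_normal((5, d_model))  # toy sequence of 5 positions
out = ffn(x, W1, b1, W2, b2)
```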

Positional encoding: positional encodings are added to the input embeddings at the bottoms of the encoder and decoder stacks, in order to make use of the order of the sequence.
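The notes do not reproduce the formula, but the paper's sinusoidal encoding is $$PE(pos, 2i) = \sin(pos / 10000^{2i/d\_{model}})$$ and $$PE(pos, 2i+1) = \cos(pos / 10000^{2i/d\_{model}})$$. A minimal NumPy sketch:

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    # PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
    # PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
    pos = np.arange(n_positions)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)  # even dimensions
    pe[:, 1::2] = np.cos(angles)  # odd dimensions
    return pe

pe = positional_encoding(10, 16)  # one row per position, added to embeddings
```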

