Attention is all you need
Proposes the Transformer, a model architecture based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
Self-attention: attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence.
Model architecture
Encoder: maps an input sequence of symbols (x1, ..., xn) to a sequence of continuous representations z = (z1, ..., zn). Given z, the decoder then generates an output sequence (y1, ..., ym) one symbol at a time, consuming the previously generated symbols as additional input when generating the next (auto-regressive).
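A minimal sketch of this auto-regressive loop (greedy decoding, not from the paper), assuming hypothetical encode and decode_step functions standing in for the encoder and decoder stacks:

```python
def greedy_decode(encode, decode_step, src_tokens, bos_id, eos_id, max_len=50):
    # encode / decode_step are hypothetical stand-ins for the encoder and decoder stacks.
    memory = encode(src_tokens)            # z = (z1, ..., zn)
    out = [bos_id]                         # start with a begin-of-sequence symbol
    for _ in range(max_len):
        logits = decode_step(out, memory)  # scores over the vocabulary for the next symbol
        nxt = int(logits.argmax())         # greedy choice
        out.append(nxt)
        if nxt == eos_id:
            break
    return out
```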
Encoder and Decoder Stacks
The encoder is composed of a stack of N = 6 identical layers. Each layer has 2 sub-layers: the first is a multi-head self-attention mechanism, the second is a position-wise FFN. A residual connection is employed around each of the 2 sub-layers, followed by layer normalization. The output of each sub-layer is LayerNorm(x + Sublayer(x)), where Sublayer(x) is the function implemented by the sub-layer itself.
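A minimal NumPy sketch of this sub-layer wrapper (simplified: no learned gain/bias in the layer norm and no dropout):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize over the feature dimension (simplified LayerNorm without learned parameters).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Output of each sub-layer: LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))
```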
The decoder is also a stack of N = 6 identical layers. In addition to the two sub-layers, the decoder inserts a third sub-layer, which performs multi-head attention over the output of the encoder stack. Similar to the encoder, residual connections are employed around each of the sub-layers, followed by layer normalization. The self-attention sub-layer is also modified to prevent positions from attending to subsequent positions (masking). This ensures that predictions for position i can depend only on the known outputs at positions less than i.
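A minimal sketch of the masking idea: positions a query must not attend to get a score of -inf before the softmax, so their attention weights become zero.

```python
import numpy as np

def causal_mask(n):
    # mask[i, j] is True where position i may attend to position j (only j <= i).
    return np.tril(np.ones((n, n), dtype=bool))

def apply_mask(scores, mask):
    # scores: (n, n) attention logits; disallowed positions are set to -inf.
    return np.where(mask, scores, -np.inf)
```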
Attention
Function mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values.
Scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Here, d_k is the dimension of the keys (and queries), d_v is the dimension of the values. (Dividing by sqrt(d_k) keeps large dot products from pushing the softmax into regions with extremely small gradients.)
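A minimal NumPy sketch of scaled dot-product attention as defined above (optional mask as in the decoder):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)  # (..., n_q, n_k)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ V                      # weighted sum of the values
```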
Multi-head attention
Allows model to jointly attend to information from different representation subspaces at different positions.
MultiHead(Q, K, V) = Concat(head_1, ..., head_h) W^O, where head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V). The projections are parameter matrices W_i^Q, W_i^K (d_model x d_k), W_i^V (d_model x d_v), and W^O (h*d_v x d_model).
In this work h = 8 parallel attention layers, or heads, are used, with d_k = d_v = d_model / h = 64.
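A minimal NumPy sketch of multi-head attention for 2D (seq_len, d_model) inputs; the per-head projection matrices are passed in as lists with the shapes assumed above:

```python
import numpy as np

def multi_head_attention(Q, K, V, W_q, W_k, W_v, W_o):
    # W_q, W_k, W_v: lists of h per-head projection matrices; W_o: (h*d_v, d_model).
    heads = []
    for Wq_i, Wk_i, Wv_i in zip(W_q, W_k, W_v):
        q, k, v = Q @ Wq_i, K @ Wk_i, V @ Wv_i
        scores = q @ k.T / np.sqrt(k.shape[-1])                # scaled dot products
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)                     # softmax over keys
        heads.append(w @ v)                                    # head_i
    return np.concatenate(heads, axis=-1) @ W_o                # Concat(heads) W^O
```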
FFN
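The position-wise FFN is two linear transformations with a ReLU in between, FFN(x) = max(0, x W1 + b1) W2 + b2, applied to each position separately and identically (d_model = 512, d_ff = 2048 in the paper). A minimal NumPy sketch:

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x W1 + b1) W2 + b2, applied independently at every position.
    return np.maximum(0, x @ W1 + b1) @ W2 + b2
```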
Positional encoding: add positional encodings to the input embeddings at the bottoms of the encoder and decoder stacks, so the model can make use of the order of the sequence (there is no recurrence or convolution to provide it).
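The paper uses fixed sinusoidal encodings, PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch (assumes d_model is even):

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    # Rows are positions, columns alternate sine / cosine of different frequencies.
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]           # even feature indices (= 2i)
    angles = pos / np.power(10000.0, i / d_model)   # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```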