Transformer for Images (ViT)
https://arxiv.org/pdf/2010.11929.pdf
Transformer: computationally efficient and scalable; models with over 100B parameters have been trained.
This paper applies the Transformer directly to images. On large datasets, ViT attains excellent results, matching or exceeding convolutional baselines such as BiT.

Method:
ViT splits the image into fixed-size patches, flattens each patch, and maps it to D dimensions with a trainable linear projection (the patch embedding). A learnable [class] token is prepended, position embeddings are added, and the sequence is fed to a standard Transformer encoder; an MLP head on the [class] token's output does the classification.
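A minimal sketch of the patch-embedding step, using NumPy rather than the authors' code. The sizes are assumptions matching the ViT-Base/16 configuration: a 224x224 RGB image, 16x16 patches, embedding dimension D=768.

```python
import numpy as np

H = W = 224   # assumed image size
P = 16        # assumed patch size
C = 3         # RGB channels
D = 768       # assumed embedding dimension
N = (H // P) * (W // P)   # number of patches = 196

rng = np.random.default_rng(0)
image = rng.standard_normal((H, W, C))

# Split the image into N non-overlapping patches and flatten
# each one into a P*P*C vector.
patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
patches = patches.reshape(N, P * P * C)   # shape (196, 768)

# Trainable linear projection E maps each flattened patch to D dims
# (random here; learned during training in the real model).
E = rng.standard_normal((P * P * C, D)) * 0.02
tokens = patches @ E                      # shape (196, 768)

print(tokens.shape)
```

In the full model these 196 tokens, plus the prepended [class] token and position embeddings, form the input sequence to the Transformer encoder.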
The biggest difference from NLP Transformers is this input-encoding step; the Transformer encoder itself is left unchanged.