Transformer for Images

https://arxiv.org/pdf/2010.11929.pdf

Transformer: computationally efficient, scalable, and has been trained at over 100B parameters.

This paper applies a standard Transformer directly to images. When pre-trained on large datasets, the Vision Transformer (ViT) attains excellent results compared to state-of-the-art convolutional networks such as BiT.

Encoding

Method:

ViT splits the image into fixed-size patches, flattens each patch, and maps it to D dimensions with a trainable linear projection (the patch embedding). The resulting sequence goes through a standard Transformer encoder, and an MLP head produces the classification output.
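A minimal numpy sketch of the patch-embedding step described above, assuming the paper's base configuration (224x224 RGB input, 16x16 patches, D = 768); the projection matrix `W` stands in for the trainable linear layer and is randomly initialized here:

```python
import numpy as np

def patch_embed(img, patch_size=16, d_model=768, seed=0):
    """Split an image into non-overlapping patches, flatten each,
    and project to d_model dimensions with a linear map."""
    H, W_img, C = img.shape
    P = patch_size
    # (H, W, C) -> (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (N, P*P*C)
    patches = img.reshape(H // P, P, W_img // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape(-1, P * P * C)
    # W is a trainable parameter in the real model; random stand-in here
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((P * P * C, d_model)) * 0.02
    return patches @ W  # shape: (num_patches, d_model)

emb = patch_embed(np.zeros((224, 224, 3)))
print(emb.shape)  # (196, 768): 14 x 14 patches, each projected to 768 dims
```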

The biggest difference from the standard NLP Transformer is this input encoding process: a learnable [class] token is prepended to the patch embeddings, and position embeddings are added before the sequence enters the encoder.
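The remaining encoding steps (the [class] token and position embeddings) can be sketched the same way; `cls` and `pos` are trainable parameters in the real model and random placeholders here:

```python
import numpy as np

def vit_encode(patch_emb, seed=0):
    """Prepend a learnable [class] token and add 1-D position
    embeddings, yielding the sequence fed to the Transformer encoder."""
    n, d = patch_emb.shape
    rng = np.random.default_rng(seed)
    cls = rng.standard_normal((1, d)) * 0.02       # trainable [class] token
    pos = rng.standard_normal((n + 1, d)) * 0.02   # trainable position embeddings
    return np.concatenate([cls, patch_emb], axis=0) + pos

seq = vit_encode(np.zeros((196, 768)))
print(seq.shape)  # (197, 768): 196 patch tokens plus one [class] token
```

The final hidden state at the [class] position is what the MLP head reads for classification.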
