Property | Data |
---|---|
Created | 2023-02-22 |
Updated | 2023-02-22 |
Author | @YiTing, @Aiden |
Tags | #study |
Title | Venue | Year | Code |
---|---|---|---|
ViT: Transformer for image recognition at scale | ICLR | '21 | ✓, ✓ |
In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place.
We show that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on image classification tasks.
This is the first paper to apply a pure Transformer to computer vision at scale.
Thanks to the Transformer's computational efficiency and scalability, it has become possible to train models of unprecedented size; as the models and datasets grow, there is still no sign of saturating performance.
This paper experiments with applying a standard Transformer directly to images, with the fewest possible modifications: split an image into patches and feed the sequence of linear embeddings of these patches as input to a Transformer (see the sketch after this list):
- Image patches are treated the same way as tokens (words) in an NLP application.
- The model is trained on image classification in a supervised fashion.
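A minimal sketch of this patch-embedding step, assuming a PyTorch implementation of my own (`PatchEmbed` is a hypothetical name, and the strided convolution is an equivalent shorthand for "split into patches + linear projection"; this is not the authors' released code):

```python
# Hypothetical sketch: image -> patch embeddings + learnable [class] token.
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):                      # hypothetical helper, not from the paper's code
    def __init__(self, img_size=224, patch_size=16, in_chans=3, dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2               # 14 * 14 = 196
        # A conv with kernel = stride = patch size is equivalent to
        # "split into non-overlapping patches, flatten, linearly project".
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))          # learnable [class] token
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, dim))

    def forward(self, x):                          # x: (B, 3, 224, 224)
        x = self.proj(x)                           # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)           # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)             # (B, 197, 768)
        return x + self.pos_embed                  # add position embeddings

tokens = PatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                                # torch.Size([2, 197, 768])
```

The resulting token sequence is fed to a standard Transformer encoder, and the [class] token's output is used for classification.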
CNNs have inductive biases that Transformers lack:
- Locality: closer objects tend to have similar features.
- Translation equivariance: $(f \circ g)(x) \equiv (g \circ f)(x)$, where $f$ is the convolution and $g$ is the translation. Equivariance means that if the input is changed in a certain way, the output changes in the same way, maintaining the relationship between input and output (a short derivation follows this list).
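As a quick check of this property for convolution (using the standard discrete convolution $(f * x)(p) = \sum_q f(q)\,x(p-q)$ and a shift operator $(T_{\Delta}x)(p) = x(p-\Delta)$; the notation is mine, not the paper's):

$$
(f * T_{\Delta}x)(p) = \sum_{q} f(q)\,x(p - q - \Delta) = (f * x)(p - \Delta) = \big(T_{\Delta}(f * x)\big)(p)
$$

i.e. translating the input and then convolving gives the same result as convolving and then translating.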
Therefore, Transformers don't generalize well when trained on insufficient amounts of data. However, when the models are trained on larger datasets (14M-300M images), the finding is that large-scale training trumps inductive bias.
The training resolution is 224, so the shape of an image is 224 ⨉ 224 ⨉ 3. Patching the image into 16 ⨉ 16 ⨉ 3 patches gives (224/16)² = 196 patches.
With the extra learnable [class] embedding prepended, the sequence length is 196 + 1 = 197; each patch flattens to 16 ⨉ 16 ⨉ 3 = 768 dimensions, so the sequence shape is 197 ⨉ 768.
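A plain-Python check of these numbers (assuming, as above, that the token dimension equals the flattened patch dimension of 768):

```python
# Shape bookkeeping for ViT at 224x224 resolution with 16x16 patches.
img_size, patch_size, channels = 224, 16, 3

num_patches = (img_size // patch_size) ** 2      # 14 * 14 = 196
patch_dim = patch_size * patch_size * channels   # 16 * 16 * 3 = 768
seq_len = num_patches + 1                        # +1 for the [class] token -> 197

print(num_patches, patch_dim, seq_len)           # 196 768 197
```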