May 29, 2025
This post is an annotated version of the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" by Alexey Dosovitskiy et al. The paper introduces the Vision Transformer (ViT), an approach to image classification that applies the transformer architecture, originally designed for natural language processing, to image data.
I have added my own notes for future reference.