Re-implementation of Vision Transformer
- Image Preprocessing
  - Patching
  - Flattening
  - Linear Projection
- Position Embedding
- CLS Token
- Transformer
  - Attention
  - Feedforward
- MLP Head Classifier
- MNIST (98%)
- CIFAR-10 (75.25%)
- Tiny ImageNet (44%) (same parameter count as ResNet-18)
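
A rough end-to-end sketch of the pipeline listed above, in PyTorch. All names (`TinyViT`, `d_model`, `depth`, etc.) and hyperparameter values are illustrative placeholders rather than this repo's actual configuration; the CLS token, learned position embeddings, and classification from the CLS output follow the ViT paper's setup.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT sketch: patchify -> linear projection -> prepend CLS token
    -> add position embeddings -> Transformer encoder -> classify from CLS output."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3,
                 d_model=192, depth=6, n_heads=3, n_classes=10):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch size is equivalent to
        # patchify + flatten + trainable linear projection.
        self.patch_embed = nn.Conv2d(in_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, n_classes))

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)       # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend CLS, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # logits from the CLS token


# Example: CIFAR-10-sized input (batch of 8 RGB 32x32 images) -> (8, 10) logits.
logits = TinyViT()(torch.randn(8, 3, 32, 32))
```

Changing `img_size`, `in_channels`, and `n_classes` would cover the MNIST and Tiny ImageNet runs listed above.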
- Patch + Position Embedding (Extra learnable class embedding)
- Linear Projection of Flattened Patches
- Transformer Encoder
- MLP Head (Contains GeLU)
- Class (Bird, Ball, Car)
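
The MLP head above is noted as containing GeLU; one plausible reading (an assumption, not confirmed by this repo) is a single hidden layer with a GELU nonlinearity before the class logits:

```python
import torch.nn as nn

# Hypothetical MLP head with a GELU hidden layer; dimensions are placeholders
# (192-dim CLS embedding in, 10 classes out).
mlp_head = nn.Sequential(
    nn.LayerNorm(192),
    nn.Linear(192, 384),
    nn.GELU(),
    nn.Linear(384, 10),
)
```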
- Embedded Patches
- Norm (LayerNorm)
- MHA (Multi-Head Self-Attention)
- Add (residual: identity + MHA output)
- Norm (LayerNorm)
- MLP
- Add (residual: identity + MLP output)
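
A sketch of one encoder block with exactly that ordering (pre-norm LayerNorms, residual adds around the MHA and MLP sublayers); `d_model`, `n_heads`, and `mlp_ratio` are assumed hyperparameter names, not necessarily the repo's:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Norm -> MHA -> Add, then Norm -> MLP -> Add (pre-norm ViT encoder block)."""
    def __init__(self, d_model=192, n_heads=3, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                      # x: (B, N + 1, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]      # Add: identity + MHA output
        x = x + self.mlp(self.norm2(x))                        # Add: identity + MLP output
        return x
```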
- Reshape the image x ∈ ℝ^(H×W×C) into a sequence of flattened 2D patches x_p ∈ ℝ^(N×(P²·C)), where (H, W) is the image resolution, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size D through all of its layers, so the flattened patches are mapped to D dimensions with a trainable linear projection; the outputs of this projection are the patch embeddings.
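
That reshape-and-project step written out explicitly, as a sketch assuming square patches that divide the image evenly; `PatchEmbedding` and `d_model` are illustrative names:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Reshape (B, C, H, W) into N = HW / P^2 flattened patches of size P^2 * C,
    then map each to D dimensions with a trainable linear projection."""
    def __init__(self, patch_size=4, in_channels=3, d_model=192):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, d_model)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        n_patches = (H // P) * (W // P)              # N = HW / P^2
        x = x.reshape(B, C, H // P, P, W // P, P)    # split H and W into patch grids
        x = x.permute(0, 2, 4, 1, 3, 5)              # (B, H/P, W/P, C, P, P)
        x = x.reshape(B, n_patches, P * P * C)       # x_p: (B, N, P^2 * C)
        return self.proj(x)                          # patch embeddings: (B, N, D)
```

This is equivalent to the strided-convolution shortcut used in the end-to-end sketch further up.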