Re-implementation of Vision Transformer
- Image Preprocessing
  - Patching
  - Flattening
  - Linear Projection
- Position Embedding
- CLS Token
- Transformer
  - Attention
  - Feedforward
- MLP Head Classifier
- MNIST (98%)
- CIFAR-10 (75.25%)
- Tiny ImageNet (44%) (same parameter count as ResNet-18)
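
A rough end-to-end sketch of the pipeline listed above, in PyTorch. All names (`TinyViT`, `d_model`, `depth`, etc.) and hyperparameter values are illustrative placeholders rather than this repo's actual configuration; the CLS token, learned position embeddings, and classification from the CLS output follow the ViT paper's setup.

```python
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    """Minimal ViT sketch: patchify -> linear projection -> prepend CLS token
    -> add position embeddings -> Transformer encoder -> classify from CLS output."""
    def __init__(self, img_size=32, patch_size=4, in_channels=3,
                 d_model=192, depth=6, n_heads=3, n_classes=10):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A conv with kernel = stride = patch size is equivalent to
        # patchify + flatten + trainable linear projection.
        self.patch_embed = nn.Conv2d(in_channels, d_model,
                                     kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, n_classes))

    def forward(self, x):                                    # x: (B, C, H, W)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, N, D)
        cls = self.cls_token.expand(x.size(0), -1, -1)       # (B, 1, D)
        x = torch.cat([cls, x], dim=1) + self.pos_embed      # prepend CLS, add positions
        x = self.encoder(x)
        return self.head(x[:, 0])                            # logits from the CLS token


# Example: CIFAR-10-sized input (batch of 8 RGB 32x32 images) -> (8, 10) logits.
logits = TinyViT()(torch.randn(8, 3, 32, 32))
```

Changing `img_size`, `in_channels`, and `n_classes` would cover the MNIST and Tiny ImageNet runs listed above.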
- Patch + Position Embedding (Extra learnable class embedding)
- Linear Projection of Flattened Patches
- Transformer Encoder
- MLP Head (Contains GeLU)
- Class (Bird, Ball, Car)
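
The MLP head above is noted as containing GeLU; one plausible reading (an assumption, not confirmed by this repo) is a single hidden layer with a GELU nonlinearity before the class logits:

```python
import torch.nn as nn

# Hypothetical MLP head with a GELU hidden layer; dimensions are placeholders
# (192-dim CLS embedding in, 10 classes out).
mlp_head = nn.Sequential(
    nn.LayerNorm(192),
    nn.Linear(192, 384),
    nn.GELU(),
    nn.Linear(384, 10),
)
```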
- Embedded Patches
- Norm (LayerNorm)
- MHA (Multi-Head Self-Attention)
- Add (residual: identity + MHA output)
- Norm (LayerNorm)
- MLP
- Add (residual: identity + MLP output)
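
A sketch of one encoder block with exactly that ordering (pre-norm LayerNorms, residual adds around the MHA and MLP sublayers); `d_model`, `n_heads`, and `mlp_ratio` are assumed hyperparameter names, not necessarily the repo's:

```python
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Norm -> MHA -> Add, then Norm -> MLP -> Add (pre-norm ViT encoder block)."""
    def __init__(self, d_model=192, n_heads=3, mlp_ratio=4, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):                                      # x: (B, N + 1, D)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]      # Add: identity + MHA output
        x = x + self.mlp(self.norm2(x))                        # Add: identity + MLP output
        return x
```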
- Reshape the image x ∈ ℝ^(H×W×C) into a sequence of flattened 2D patches x_p ∈ ℝ^(N×(P²·C)), where (H, W) is the image resolution, C is the number of channels, (P, P) is the resolution of each image patch, and N = HW/P² is the resulting number of patches, which also serves as the effective input sequence length for the Transformer. The Transformer uses a constant latent vector size D through all of its layers, so the flattened patches are mapped to D dimensions with a trainable linear projection; the outputs of this projection are the patch embeddings.
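
That reshape-and-project step written out explicitly, as a sketch assuming square patches that divide the image evenly; `PatchEmbedding` and `d_model` are illustrative names:

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Reshape (B, C, H, W) into N = HW / P^2 flattened patches of size P^2 * C,
    then map each to D dimensions with a trainable linear projection."""
    def __init__(self, patch_size=4, in_channels=3, d_model=192):
        super().__init__()
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_size * patch_size * in_channels, d_model)

    def forward(self, x):                            # x: (B, C, H, W)
        B, C, H, W = x.shape
        P = self.patch_size
        n_patches = (H // P) * (W // P)              # N = HW / P^2
        x = x.reshape(B, C, H // P, P, W // P, P)    # split H and W into patch grids
        x = x.permute(0, 2, 4, 1, 3, 5)              # (B, H/P, W/P, C, P, P)
        x = x.reshape(B, n_patches, P * P * C)       # x_p: (B, N, P^2 * C)
        return self.proj(x)                          # patch embeddings: (B, N, D)
```

This is equivalent to the strided-convolution shortcut used in the end-to-end sketch further up.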