- **Transformer**: a pure transformer encoder used as a text classifier. The embedded sequences go directly through the model and are classified.
- **ViT-Lite**: like the transformer, but the word embeddings pass through a convolutional layer, similar to the design in ViT, before the transformer layers.
- **CVT**: similar to ViT-Lite, but without the class token, using sequence pooling instead.
- **CCT**: CVT with a more elaborate convolutional layer, similar to the vision model.
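Sequence pooling replaces the class token with an attention-weighted average over the encoder's output tokens. A minimal sketch in plain PyTorch (the module name `SeqPool` is ours for illustration; the repo's implementation may differ in detail):

```python
import torch
import torch.nn as nn

class SeqPool(nn.Module):
    """Attention-based sequence pooling: learn a scalar score per token,
    softmax the scores over the sequence, and return the weighted sum."""
    def __init__(self, embed_dim: int):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim)
        weights = torch.softmax(self.attn(x), dim=1)  # (batch, seq_len, 1)
        return (weights * x).sum(dim=1)               # (batch, embed_dim)

pool = SeqPool(embed_dim=128)
out = pool(torch.randn(8, 32, 128))
print(out.shape)  # torch.Size([8, 128])
```

The pooled vector then feeds a linear classifier head, so the model never needs an extra learnable class token.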
Our base model is implemented in pure PyTorch and Torchvision; no extra packages are required. Please refer to PyTorch's Getting Started page for detailed installation instructions.
For each model (transformer/vit/cvt/cct), sizes 2, 4 and 6 are available.

```python3
from src.text import text_cct_2

model = text_cct_2(kernel_size=1)
```
For kernel size, we have found that sizes 1, 2 and 4 perform best.
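The kernel size here is that of a 1D convolution applied along the embedded token sequence, so it controls how many neighboring word embeddings get mixed before the transformer layers. A rough sketch of such a convolutional text tokenizer in plain PyTorch (`ConvTokenizer1d` is our illustrative name; the actual layer in `src.text` may differ):

```python
import torch
import torch.nn as nn

class ConvTokenizer1d(nn.Module):
    """Embed token ids, then run a 1D convolution along the sequence.
    kernel_size controls how many neighboring embeddings are mixed;
    note the output length can differ slightly for even kernel sizes."""
    def __init__(self, vocab_size: int, embed_dim: int, kernel_size: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, embed_dim,
                              kernel_size=kernel_size,
                              padding=kernel_size // 2)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(ids)       # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)     # Conv1d expects (batch, channels, seq_len)
        x = torch.relu(self.conv(x))
        return x.transpose(1, 2)  # back to (batch, seq_len', embed_dim)

tok = ConvTokenizer1d(vocab_size=1000, embed_dim=64, kernel_size=2)
out = tok(torch.randint(0, 1000, (4, 16)))
```

With `kernel_size=1` the convolution reduces to a per-token linear projection, which is why small kernels remain competitive for text.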
You can go even further and create your own custom variant by importing the classes directly (e.g. `TextCCT`).
| Model | Kernel size | AGNews | TREC | # Params |
|:-----:|:-----------:|:------:|:----:|:--------:|
| CCT-2 | 1 | 93.45% | 91.00% | 0.238M |
| CCT-2 | 2 | 93.51% | 91.80% | 0.276M |
| CCT-2 | 4 | 93.80% | 91.00% | 0.353M |
| CCT-4 | 1 | 93.55% | 91.80% | 0.436M |
| CCT-4 | 2 | 93.24% | 93.60% | 0.475M |
| CCT-4 | 4 | 93.09% | 93.00% | 0.551M |
| CCT-6 | 1 | 93.78% | 91.60% | 3.237M |
| CCT-6 | 2 | 93.33% | 92.20% | 3.313M |
| CCT-6 | 4 | 92.95% | 92.80% | 3.467M |
```bibtex
@article{hassani2021escaping,
    title        = {Escaping the Big Data Paradigm with Compact Transformers},
    author       = {Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi},
    year         = 2021,
    url          = {https://arxiv.org/abs/2104.05704},
    eprint       = {2104.05704},
    archiveprefix = {arXiv},
    primaryclass = {cs.CV}
}
```