This repo provides an up-to-date list of progress in deep learning vision architectures and image classification, including but not limited to papers (backbone design, loss design, tricks, etc.), datasets, codebases, and frameworks. Please feel free to open an issue to add new progress.
Note: The papers are grouped by publication year. Within each group, the papers are sorted by citation count. An underlined paper marks a milestone in the field. Third-party code links prefer PyTorch implementations. Architectures found by NAS are not included in this repo; please refer to my other repo, awesome-architecture-search.
- MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
Cited by 25
ICLR
2022
Apple
MobileViT
PDF
Official Code (Stars 513)
TL;DR: This paper presents MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers.
-
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Cited by 4.1k
ICLR
2021
Google Research, Brain Team
Vision Transformer
ViT
PDF
Official Code (Stars 5.2k)
TL;DR: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In this context, the authors seek to directly apply a pure transformer to sequences of image patches (called Vision Transformer), which performs very well on image classification tasks.
-
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Cited by 1.4k
ICCV
2021
Microsoft Research Asia
Swin Transformer
PDF
Official Code (Stars 8.2k)
TL;DR: This paper presents a new vision Transformer, called Swin Transformer, whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
-
Res2Net: A New Multi-Scale Backbone Architecture
Cited by 865
TPAMI
2021
Nankai University
Res2Net
PDF
Official Code (Stars 881)
TL;DR: The authors propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within one single residual block. The Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
-
RepVGG: Making VGG-style ConvNets Great Again
Cited by 160
CVPR
2021
Tsinghua University
MEGVII Technology
RepVGG
PDF
Official Code (Stars 2.4k)
TL;DR: The authors propose a simple but powerful architecture named RepVGG, which has a multi-branch topology during training and a single-branch (VGG-style) topology at inference. This decoupling of the training-time and inference-time architectures is realized by a structural re-parameterization technique.
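As a rough illustration of the structural re-parameterization idea in the RepVGG entry, the sketch below folds three parallel branches (a 3x3 convolution, a 1x1 convolution, and an identity shortcut) into a single 3x3 kernel and checks that both views give the same output. It makes strong simplifying assumptions: one channel, no batch normalization, no bias, and plain Python lists instead of tensors. The helper names `conv3x3` and `fuse_branches` are hypothetical and are not taken from the official code.

```python
def conv3x3(image, kernel):
    """'Same' 3x3 convolution (cross-correlation) on a 2D list, zero padded."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-1, 2):
                for dj in range(-1, 2):
                    y, x = i + di, j + dj
                    if 0 <= y < h and 0 <= x < w:
                        acc += image[y][x] * kernel[di + 1][dj + 1]
            out[i][j] = acc
    return out


def fuse_branches(k3, k1):
    """Fold a 1x1 kernel (the scalar k1) and an identity branch into k3."""
    fused = [row[:] for row in k3]
    # Both the 1x1 conv and the identity only touch the centre tap.
    fused[1][1] += k1 + 1.0
    return fused


image = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k3 = [[0.0, 1.0, 0.0], [1.0, -4.0, 1.0], [0.0, 1.0, 0.0]]
k1 = 0.5

# Training-time view: the sum of the three branches.
branch3 = conv3x3(image, k3)
multi_branch = [
    [branch3[i][j] + k1 * image[i][j] + image[i][j] for j in range(3)]
    for i in range(3)
]

# Inference-time view: one 3x3 convolution with the fused kernel.
single_branch = conv3x3(image, fuse_branches(k3, k1))
```

In a real network each branch is followed by batch normalization, which would also have to be folded into the kernel and bias before the branches are summed; that step is omitted here.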
-
Self-Training With Noisy Student Improves ImageNet Classification
Cited by 1.1k
CVPR
2020
Google Research
NoisyStudent
PDF
Official Code (Stars 670)
TL;DR: The authors present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet. To achieve this result, they first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels on 300M unlabeled images. They then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images. They iterate this process by putting the student back as the teacher.
-
RandAugment: Practical automated data augmentation with a reduced search space
Cited by 866
CVPR
2020
Google Research, Brain Team
RandAugment
Data Augmentation
PDF
Third-party Code (Stars 11.7k)
TL;DR: The authors propose a simplified search space for data augmentation that vastly reduces the computational expense of automated augmentation and permits the removal of a separate proxy task. Despite the simplifications, the method achieves equal or better performance than previous automated augmentation strategies.
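To make the reduced search space concrete: RandAugment is controlled by two integers, the number of operations `N` applied per image and a shared magnitude `M`. The sketch below is a toy version under loud assumptions: the image is a grayscale 2D list of ints in 0..255, the three operations and the 0-30 magnitude scale are illustrative stand-ins rather than the paper's operation list, and `rand_augment` is a hypothetical name.

```python
import random


def brightness(img, m):
    """Add an offset that grows with magnitude m (0-30 scale, assumed)."""
    shift = int(255 * m / 30)
    return [[min(255, p + shift) for p in row] for row in img]


def invert(img, m):
    """Invert pixel values; the magnitude is ignored for this op."""
    return [[255 - p for p in row] for row in img]


def shift_right(img, m):
    """Translate right by a pixel count that grows with m, zero filling."""
    w = len(img[0])
    k = min(w, int(w * m / 30))
    return [[0] * k + row[: w - k] for row in img]


OPS = [brightness, invert, shift_right]  # toy op set, not the paper's list


def rand_augment(img, n, m, rng=random):
    """Apply n uniformly sampled ops, all at the same magnitude m."""
    for op in rng.choices(OPS, k=n):
        img = op(img, m)
    return img
```

With only `n` and `m` to tune, a plain grid search on the target task is feasible, which is why a separate proxy task is no longer needed.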
-
Searching for MobileNetV3
Cited by 2.1k
ICCV
2019
Google AI
Google Brain
MobileNetV3
PDF
Official Code (Stars 73.7k)
Third-party Code (Stars 11.7k)
TL;DR: This paper presents the next generation of MobileNets (MobileNetV3) based on a combination of complementary architecture search techniques as well as a novel architecture design.
-
CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features
Cited by 1.5k
ICCV
2019
Clova AI Research, NAVER Corp.
CutMix
Data Augmentation
PDF
Official Code (Stars 1.0k)
TL;DR: Prior regional dropout strategies have proved effective at guiding the model to attend to less discriminative parts of objects (e.g., a leg as opposed to the head of a person). The authors therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, and the ground-truth labels are mixed proportionally to the area of the patches.
-
AutoAugment: Learning Augmentation Policies from Data
Cited by 958
CVPR
2019
Google Brain
AutoAugment
Data Augmentation
PDF
Third-party Code (Stars 11.7k)
TL;DR: Data augmentation is an effective technique for improving the accuracy of modern image classifiers. However, current data augmentation implementations are manually designed. In this paper, the authors describe a simple procedure called AutoAugment to automatically search for improved data augmentation policies.
-
Selective Kernel Networks
Cited by 883
CVPR
2019
Nanjing University of Science and Technology
Momenta
SKNet
PDF
Official Code (Stars 504)
TL;DR: The authors propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches.
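The selection step of the SK unit can be sketched as a per-channel softmax across branches, followed by a weighted sum of the branch features. This is a minimal sketch under assumptions: each branch feature is reduced to one value per channel, and the attention logits are passed in directly (in the paper they are produced by small fully connected layers from the globally pooled fused feature). The names `softmax` and `select` are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def select(branch_features, branch_logits):
    """Fuse branches per channel with softmax attention across branches.

    branch_features[b][c]: feature of branch b (e.g. 3x3 or 5x5), channel c.
    branch_logits[b][c]:   attention logit of branch b for channel c.
    """
    num_branches = len(branch_features)
    num_channels = len(branch_features[0])
    fused = []
    for c in range(num_channels):
        weights = softmax([branch_logits[b][c] for b in range(num_branches)])
        fused.append(
            sum(weights[b] * branch_features[b][c] for b in range(num_branches))
        )
    return fused
```

Equal logits average the branches, while a large logit for one branch makes the channel follow that branch's kernel size almost exclusively.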
-
Squeeze-and-Excitation Networks
Cited by 12.8k
CVPR
2018
Momenta
University of Oxford
SENet
PDF
Official Code (Stars 2.9k)
TL;DR: Motivated by the benefits that prior works obtained from enhancing spatial encoding, the authors propose a novel architectural unit, which they term the "Squeeze-and-Excitation" (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
-
MobileNetV2: Inverted Residuals and Linear Bottlenecks
Cited by 9.5k
CVPR
2018
Google Inc.
MobileNetV2
PDF
Official Code (Stars 73.7k)
Third-party Code (Stars 11.7k)
TL;DR: Building on MobileNetV1, the authors devise a new mobile architecture, MobileNetV2, built around an inverted residual structure in which the shortcut connections are between the thin bottleneck layers.
-
ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices
Cited by 4.0k
CVPR
2018
Megvii Inc (Face++)
ShuffleNetV1
PDF
Third-party Code (Stars 1.3k)
TL;DR: The authors introduce an extremely computation-efficient CNN architecture named ShuffleNet, which utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy.
-
mixup: Beyond Empirical Risk Minimization
Cited by 3.7k
ICLR
2018
MIT
FAIR
MixUP
Data Augmentation
PDF
Official Code (Stars 946)
TL;DR: The authors propose mixup, a simple learning principle/data augmentation, which trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in-between training examples.
-
ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design
Cited by 2.2k
ECCV
2018
Megvii Inc (Face++)
Tsinghua University
ShuffleNetV2
PDF
Third-party Code (Stars 11.7k)
TL;DR: Prior architecture design is mostly guided by the indirect metric of computation complexity (i.e., FLOPs). In contrast, the authors propose to use the direct metric (i.e., speed on the target platform) and derive several practical guidelines for efficient network (ShuffleNetV2) design from empirical observations.
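The channel shuffle operation named in the ShuffleNet entry of this group can be sketched in a few lines. The version below is an illustration only: it permutes a flat list of channel indices rather than a real feature tensor, and `channel_shuffle` is a hypothetical helper name.

```python
def channel_shuffle(channels, groups):
    """Interleave channels so each group receives channels from every group.

    Equivalent to viewing the channels as a (groups, per_group) grid,
    transposing it to (per_group, groups), and flattening.
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [
        channels[g * per_group + i]
        for i in range(per_group)
        for g in range(groups)
    ]


# Channels 0-2 belong to group 0, channels 3-5 to group 1.
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
# shuffled == [0, 3, 1, 4, 2, 5]
```

Without this permutation, stacked group convolutions would only ever see channels from their own group; the shuffle lets information flow between groups at no arithmetic cost.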
-
Densely Connected Convolutional Networks
Cited by 25.1k
CVPR
2017
Cornell University
Tsinghua University
Facebook AI Research
DenseNet
PDF
Official Code (Stars 4.5k)
Third-party Code (Stars 11.7k)
TL;DR: The authors observe that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. Based on this, they introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.
-
MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications
Cited by 13.2k
arXiv
2017
Google Inc.
MobileNetV1
PDF
Official Code (Stars 73.7k)
TL;DR: The authors present a class of efficient models called MobileNets for mobile and embedded vision applications, which is a streamlined architecture with depthwise separable convolutions.
-
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning
Cited by 10.8k
AAAI
2017
Google Inc.
InceptionV4
PDF
Official Code (Stars 73.7k)
TL;DR: The authors propose InceptionV4 by combining Inception architectures with residual connections. Moreover, the authors examine whether Inception can be made more efficient with a deeper and wider structure.
-
Xception: Deep Learning with Depthwise Separable Convolutions
Cited by 8.8k
CVPR
2017
Google Inc.
Xception
PDF
Third-party Code (Stars 8.5k)
TL;DR: The authors present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution).
-
Aggregated Residual Transformations for Deep Neural Networks
Cited by 6.8k
CVPR
2017
UC San Diego
Facebook AI Research
ResNeXt
PDF
Official Code (Stars 1.8k)
TL;DR: This paper presents a simple, highly modularized network architecture for image classification, which is constructed by repeating a building block that aggregates a set of transformations with the same topology.
-
Improved Regularization of Convolutional Neural Networks with Cutout
Cited by 1.7k
arXiv
2017
University of Guelph
Cutout
Data Augmentation
PDF
Official Code (Stars 456)
TL;DR: The authors show that the simple regularization technique of randomly masking out square regions of input during training, called cutout, can be used to improve the robustness and overall performance of convolutional neural networks.
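The masking step described in the Cutout entry can be sketched as follows. This is a minimal sketch, assuming a grayscale image stored as a 2D list and a mask that is clipped where it overlaps the border; the name `cutout` and the `rng` parameter are illustrative choices rather than the official API.

```python
import random


def cutout(img, size, rng=random):
    """Zero a size x size square centred at a random pixel, clipped at borders."""
    h, w = len(img), len(img[0])
    cy, cx = rng.randrange(h), rng.randrange(w)
    y0, y1 = max(0, cy - size // 2), min(h, cy + size // 2)
    x0, x1 = max(0, cx - size // 2), min(w, cx + size // 2)
    out = [row[:] for row in img]  # copy so the input image is left untouched
    for y in range(y0, y1):
        for x in range(x0, x1):
            out[y][x] = 0
    return out


image = [[1] * 8 for _ in range(8)]
masked = cutout(image, size=4, rng=random.Random(0))
```

Because the centre may fall near an edge, the masked area varies from image to image, so some training samples keep most of their content.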
-
Deep Residual Learning for Image Recognition
Cited by 117.8k
CVPR
2016
Microsoft Research
ResNet
PDF
Official Code (Stars 5.9k)
Third-party Code (Stars 11.7k)
TL;DR: This paper presents a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously, which reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
-
Rethinking the Inception Architecture for Computer Vision
Cited by 19.7k
CVPR
2016
Google Inc.
InceptionV3
PDF
Official Code (Stars 73.7k)
Third-party Code (Stars 11.7k)
TL;DR: Building on versions 1 and 2 of the Inception family, the authors explore ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.
-
SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5 MB Model Size
Cited by 5.7k
arXiv
2016
DeepScale
SqueezeNet
PDF
Official Code (Stars 2.1k)
TL;DR: This paper presents a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters.
-
Very Deep Convolutional Networks for Large-Scale Image Recognition
Cited by 78.9k
ICLR
2015
Visual Geometry Group
University of Oxford
VGG
PDF
Third-party Code (Stars 11.7k)
TL;DR: From the empirical results, the authors found that a network (VGG) with increased depth and very small (3x3) convolution filters yields a significant performance improvement over prior-art configurations.
-
Going Deeper with Convolutions
Cited by 39.9k
CVPR
2015
Google Inc.
GoogLeNet
InceptionV1
PDF
Official Code (Stars 73.7k)
TL;DR: The authors propose a deep convolutional neural network architecture codenamed Inception, which adopts a multi-branch topology that increases the depth and width of the network while keeping the computational budget constant.
-
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Cited by 36.8k
ICML
2015
Google Inc.
Batch Normalization
InceptionV2
PDF
Official Code (Stars 73.7k)
TL;DR: The authors propose Batch Normalization (BN) to alleviate the issue of internal covariate shift, which allows the use of much higher learning rates and less careful initialization. With the proposed BN, the authors devise a new architecture called InceptionV2.
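The training-time BN transform can be sketched for a single feature over one mini-batch: normalize with the batch mean and variance, then apply a learnable scale and shift. This is a simplified sketch under assumptions: the defaults for `gamma`, `beta`, and `eps` are illustrative, running statistics for inference are left out, and `batch_norm` is a hypothetical helper name.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    n = len(batch)
    mean = sum(batch) / n
    # Biased (divide-by-n) variance of the mini-batch.
    var = sum((x - mean) ** 2 for x in batch) / n
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

The output has approximately zero mean and unit variance regardless of the scale of the inputs, which is what makes training less sensitive to the learning rate and initialization.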
- ImageNet Classification with Deep Convolutional Neural Networks
Cited by 108.3k
NeurIPS
2012
University of Toronto
AlexNet
PDF
Third-party Code (Stars 11.7k)
TL;DR: This is a pioneering work that applies a deep convolutional neural network (AlexNet) to a large-scale image classification task (ImageNet) and achieves very impressive performance.
- Transformers in Vision: A Survey
Cited by 305
arXiv
2021
Mohamed bin Zayed University of Artificial Intelligence
Transformers
Survey
PDF
TL;DR: This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline, covering the fundamental concepts of transformers, their extensive applications in vision, the respective advantages and limitations of popular vision transformers, and an analysis of open research directions and possible future work.
- ImageNet
Download Link
TL;DR: ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds and thousands of images. Currently, the most commonly used versions in academia are ImageNet-1k and ImageNet-21k. 1) ImageNet-1k contains 1,281,167 training images and 50,000 validation images across 1,000 object classes. 2) ImageNet-21k, which is bigger and more diverse, consists of 14,197,122 images, each tagged in a single-label fashion with one of 21,841 possible classes. ImageNet-21k has no official train-validation split, and its classes are not well balanced: some classes contain only 1-10 samples, while others contain thousands. Lastly, it is recommended to download this dataset from Academic Torrents instead of the official website.
How to cite:
ImageNet: A Large-Scale Hierarchical Image Database
Cited by 38.9k
CVPR
2009
Princeton University
ImageNet
PDF
- CIFAR
Download Link
TL;DR: There are two versions of the CIFAR dataset: 1) The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images. 2) The CIFAR-100 dataset is just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 test images per class. Please refer to the official website for more details.
How to cite:
Learning Multiple Layers of Features from Tiny Images
Cited by 15.4k
Tech Report
2009
Alex Krizhevsky
CIFAR-10
CIFAR-100
PDF
- Food-101
Download Link
TL;DR: Food-101 is a challenging dataset of 101 food categories with 101,000 images. For each class, 250 manually reviewed test images are provided, as well as 750 training images. The training images were deliberately not cleaned and thus still contain some noise, mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels.
How to cite:
Food-101 – Mining Discriminative Components with Random Forests
Cited by 912
ECCV
2014
ETH Zürich
Food-101
PDF
- rwightman/pytorch-image-models
Stars 18.6k
timm
TL;DR: A collection of image models, layers, utilities, optimizers, schedulers, data-loaders / augmentations, and reference training / validation scripts that aim to pull together a wide variety of SOTA models with ability to reproduce ImageNet training results.
- Papers with Code
paperswithcode
benchmark
leaderboard
TL;DR: This website provides a list of state-of-the-art papers, along with leaderboards and benchmarks of SOTA results on various datasets.