Awesome - Deep Vision Architecture


This repo provides an up-to-date list of progress made in deep learning vision architectures and image classification, including but not limited to papers (backbone design, loss design, tricks, etc.), datasets, codebases, and frameworks. Please feel free to open an issue to add new progress.

Note: The papers are grouped by publication year. Within each group, papers are sorted by citation count. An underlined paper marks a milestone in the field. Third-party code prefers PyTorch. Architectures found by NAS are not included in this repo; please refer to my other repo awesome-architecture-search.

  • MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer Cited by 25 ICLR 2022 Apple MobileViT PDF Official Code (Stars 513) TL;DR: This paper presents MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT offers a different perspective on the global processing of information with transformers.
  • An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale Cited by 4.1k ICLR 2021 Google Research, Brain Team Vision Transformer ViT PDF Official Code (Stars 5.2k) TL;DR: While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In this context, the authors seek to directly apply a pure transformer to sequences of image patches (called Vision Transformer), which performs very well on image classification tasks.
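
A minimal sketch of the patch-embedding step described above (the 16x16 patch size and 768-dim embedding are the ViT-Base defaults; variable names are our own). A strided convolution implements the per-patch linear projection in one step:

```python
import torch
import torch.nn as nn

# A 16x16 kernel with stride 16 applies one linear projection per
# non-overlapping 16x16 patch, i.e. it patchifies and embeds at once.
patch_embed = nn.Conv2d(3, 768, kernel_size=16, stride=16)

x = torch.randn(1, 3, 224, 224)
tokens = patch_embed(x).flatten(2).transpose(1, 2)  # (1, 196, 768) token sequence
```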

  • Swin Transformer: Hierarchical Vision Transformer using Shifted Windows Cited by 1.4k ICCV 2021 Microsoft Research Asia Swin Transformer PDF Official Code (Stars 8.2k) TL;DR: This paper presents a new vision Transformer, called Swin Transformer, whose representation is computed with shifted windows. The shifted windowing scheme brings greater efficiency by limiting self-attention computation to non-overlapping local windows while also allowing for cross-window connection.
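
A minimal sketch of the windowing idea (helper name and shapes are our own; the official code additionally uses relative position biases and attention masks for the shifted case):

```python
import torch

def window_partition(x, ws):
    # (B, H, W, C) -> (B * num_windows, ws * ws, C); self-attention is then
    # computed independently inside each non-overlapping ws x ws window.
    B, H, W, C = x.shape
    x = x.view(B, H // ws, ws, W // ws, ws, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

# In alternating blocks, the feature map is cyclically shifted before
# partitioning so that neighbouring windows can exchange information:
# shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
```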

  • Res2Net: A New Multi-Scale Backbone Architecture Cited by 865 TPAMI 2021 Nankai University Res2Net PDF Official Code (Stars 881) TL;DR: The authors propose a novel building block for CNNs, namely Res2Net, by constructing hierarchical residual-like connections within a single residual block. Res2Net represents multi-scale features at a granular level and increases the range of receptive fields for each network layer.
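
A minimal sketch of the hierarchical connections inside one Res2Net block (class and variable names are our own; the official block wraps this between 1x1 convolutions):

```python
import torch
import torch.nn as nn

class Res2Conv(nn.Module):
    # Split the channels into `scale` groups; each group after the first is
    # convolved together with the previous group's output, so later groups
    # see progressively larger receptive fields.
    def __init__(self, channels, scale=4):
        super().__init__()
        width = channels // scale
        self.scale = scale
        self.convs = nn.ModuleList(
            [nn.Conv2d(width, width, 3, padding=1, bias=False)
             for _ in range(scale - 1)]
        )

    def forward(self, x):
        chunks = torch.chunk(x, self.scale, dim=1)
        out, y = [chunks[0]], None            # first split passes through
        for i, conv in enumerate(self.convs):
            y = conv(chunks[i + 1]) if y is None else conv(chunks[i + 1] + y)
            out.append(y)
        return torch.cat(out, dim=1)
```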

  • RepVGG: Making VGG-style ConvNets Great Again Cited by 160 CVPR 2021 Tsinghua University MEGVII Technology RepVGG PDF Official Code (Stars 2.4k) TL;DR: The authors propose a simple but powerful architecture named RepVGG, which has a multi-branch topology during training and a single-branch (VGG-like) topology at inference. This decoupling of the training-time and inference-time architectures is realized by a structural re-parameterization technique.
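
A minimal sketch of the re-parameterization step (function name is our own; BatchNorm fusion, which the paper also folds into the kernels, is omitted here):

```python
import torch
import torch.nn.functional as F

def reparameterize(w3x3, w1x1, channels):
    # Fold the parallel 1x1 and identity branches into the 3x3 kernel:
    # pad the 1x1 kernel spatially, and write identity as a centered 3x3 kernel.
    w1x1_padded = F.pad(w1x1, [1, 1, 1, 1])        # (C, C, 1, 1) -> (C, C, 3, 3)
    w_id = torch.zeros(channels, channels, 3, 3)
    for i in range(channels):
        w_id[i, i, 1, 1] = 1.0                     # identity as a convolution
    return w3x3 + w1x1_padded + w_id               # single inference-time kernel
```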

  • Self-Training With Noisy Student Improves ImageNet Classification Cited by 1.1k CVPR 2020 Google Research NoisyStudent PDF Official Code (Stars 670) TL;DR: The authors present a simple self-training method that achieves 88.4% top-1 accuracy on ImageNet. To achieve this result, they first train an EfficientNet model on labeled ImageNet images and use it as a teacher to generate pseudo labels for 300M unlabeled images. They then train a larger EfficientNet as a student model on the combination of labeled and pseudo-labeled images, and iterate this process by putting the student back as the teacher.

  • RandAugment: Practical automated data augmentation with a reduced search space Cited by 866 CVPR 2020 Google Research, Brain Team RandAugment Data Augmentation PDF Third-party Code (Stars 11.7k) TL;DR: The authors propose a simplified search space for data augmentation that vastly reduces the computational expense of automated augmentation and permits the removal of a separate proxy task. Despite the simplifications, their method achieves equal or better performance than previous automated augmentation strategies.
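
A minimal sketch of the two-knob idea, assuming a small hypothetical op pool (the paper's pool has roughly 14 transforms, and its magnitude scaling differs per op):

```python
import random
import torchvision.transforms.functional as TF

# Hypothetical, abbreviated op pool for illustration only.
OPS = [
    lambda img, m: TF.rotate(img, angle=30 * m),
    lambda img, m: TF.adjust_brightness(img, 1 + m),
    lambda img, m: TF.adjust_contrast(img, 1 + m),
]

def rand_augment(img, n=2, m=0.3):
    # Apply n ops sampled uniformly at random, all at one shared magnitude m;
    # the entire search space is just the pair (n, m).
    for op in random.choices(OPS, k=n):
        img = op(img, m)
    return img
```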

  • Searching for MobileNetV3 Cited by 2.1k ICCV 2019 Google AI Google Brain MobileNetV3 PDF Official Code (Stars 73.7k) Third-party Code (Stars 11.7k) TL;DR: This paper presents the next generation of MobileNets (MobileNetV3) based on a combination of complementary architecture search techniques as well as a novel architecture design.

  • CutMix: Regularization Strategy to Train Strong Classifiers with Localizable Features Cited by 1.5k ICCV 2019 Clova AI Research, NAVER Corp. CutMix Data Augmentation PDF Official Code (Stars 1.0k) TL;DR: Prior works on regional dropout have proven effective at guiding the model to attend to less discriminative parts of objects (e.g., the leg as opposed to the head of a person). The authors therefore propose the CutMix augmentation strategy: patches are cut and pasted among training images, and the ground-truth labels are mixed proportionally to the area of the patches.
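
A minimal sketch of CutMix for a batch with one-hot labels (function name and the one-hot label format are our own choices, not the official code):

```python
import torch

def cutmix(images, labels_onehot, alpha=1.0):
    B, _, H, W = images.shape
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(B)
    # Cut a box covering roughly (1 - lam) of the image at a random center.
    rh, rw = int(H * (1 - lam) ** 0.5), int(W * (1 - lam) ** 0.5)
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - rh // 2, 0), min(cy + rh // 2, H)
    x1, x2 = max(cx - rw // 2, 0), min(cx + rw // 2, W)
    images = images.clone()
    images[:, :, y1:y2, x1:x2] = images[index, :, y1:y2, x1:x2]
    # Recompute lam from the exact pasted area, then mix labels proportionally.
    lam = 1 - (y2 - y1) * (x2 - x1) / (H * W)
    return images, lam * labels_onehot + (1 - lam) * labels_onehot[index]
```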

  • AutoAugment: Learning Augmentation Policies from Data Cited by 958 CVPR 2019 Google Brain AutoAugment Data Augmentation PDF Third-party Code (Stars 11.7k) TL;DR: Data augmentation is an effective technique for improving the accuracy of modern image classifiers. However, current data augmentation implementations are manually designed. In this paper, the authors describe a simple procedure called AutoAugment to automatically search for improved data augmentation policies.

  • Selective Kernel Networks Cited by 883 CVPR 2019 Nanjing University of Science and Technology Momenta SKNet PDF Official Code (Stars 504) TL;DR: The authors propose a dynamic selection mechanism in CNNs that allows each neuron to adaptively adjust its receptive field size based on multiple scales of input information. A building block called Selective Kernel (SK) unit is designed, in which multiple branches with different kernel sizes are fused using softmax attention that is guided by the information in these branches.
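
A minimal sketch of the selective-kernel fusion (simplified: the paper inserts BN and ReLU into the squeeze step and uses grouped/dilated convolutions; names are our own):

```python
import torch
import torch.nn as nn

class SKUnit(nn.Module):
    # Two branches with different kernel sizes; their outputs are fused by
    # softmax attention computed from a shared global descriptor.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2, bias=False)
        self.fc = nn.Linear(channels, channels // reduction)
        self.fcs = nn.ModuleList(
            [nn.Linear(channels // reduction, channels) for _ in range(2)]
        )

    def forward(self, x):
        feats = torch.stack([self.conv3(x), self.conv5(x)], dim=1)  # (B, 2, C, H, W)
        z = self.fc(feats.sum(dim=1).mean(dim=(2, 3)))              # fuse + squeeze
        attn = torch.stack([fc(z) for fc in self.fcs], dim=1)       # (B, 2, C)
        attn = torch.softmax(attn, dim=1).unsqueeze(-1).unsqueeze(-1)
        return (feats * attn).sum(dim=1)                            # select
```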

  • Squeeze-and-Excitation Networks Cited by 12.8k CVPR 2018 Momenta University of Oxford SENet PDF Official Code (Stars 2.9k) TL;DR: Building on the benefit of enhancing spatial encoding in prior works, the authors propose a novel architectural unit, which they term the “Squeeze-and-Excitation” (SE) block, that adaptively recalibrates channel-wise feature responses by explicitly modelling interdependencies between channels.
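
A minimal sketch of an SE block in PyTorch (reduction ratio 16 is the paper's default; the class name is our own):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)   # squeeze: global context per channel
        self.fc = nn.Sequential(              # excitation: channel-wise gates
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w                          # recalibrate the feature maps
```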

  • MobileNetV2: Inverted Residuals and Linear Bottlenecks Cited by 9.5k CVPR 2018 Google Inc. MobileNetV2 PDF Official Code (Stars 73.7k) Third-party Code (Stars 11.7k) TL;DR: Based on MobileNetV1, the authors devise a new mobile architecture, MobileNetV2, which is based on an inverted residual structure where the shortcut connections are between the thin bottleneck layers.
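
A minimal sketch of one inverted-residual body (stride-1 case; function name is our own). It would be used as x + inverted_residual(ch)(x) when the input and output shapes match:

```python
import torch.nn as nn

def inverted_residual(ch, expand_ratio=6):
    # Expand to a wide representation, filter it depthwise, then project
    # back down through a *linear* (activation-free) bottleneck.
    hidden = ch * expand_ratio
    return nn.Sequential(
        nn.Conv2d(ch, hidden, 1, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
        nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
        nn.Conv2d(hidden, ch, 1, bias=False), nn.BatchNorm2d(ch),  # linear bottleneck
    )
```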

  • ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices Cited by 4.0k CVPR 2018 Megvii Inc (Face++) ShuffleNetV1 PDF Third-party Code (Stars 1.3k) TL;DR: The authors introduce an extremely computation-efficient CNN architecture named ShuffleNet, which utilizes two new operations, pointwise group convolution and channel shuffle, to greatly reduce computation cost while maintaining accuracy.
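
A minimal sketch of the channel shuffle operation (function name is our own):

```python
import torch

def channel_shuffle(x, groups):
    # Reshape (B, C, H, W) into groups, transpose the group and channel axes,
    # and flatten back, so information flows across the group convolutions.
    b, c, h, w = x.shape
    x = x.view(b, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(b, c, h, w)
```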

  • mixup: Beyond Empirical Risk Minimization Cited by 3.7k ICLR 2018 MIT FAIR MixUP Data Augmentation PDF Official Code (Stars 946) TL;DR: The authors propose mixup, a simple learning principle and data augmentation method that trains a neural network on convex combinations of pairs of examples and their labels. By doing so, mixup regularizes the neural network to favor simple linear behavior in between training examples.
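
A minimal sketch, assuming one-hot labels (the official code instead mixes the loss terms; alpha is the Beta-distribution hyperparameter from the paper):

```python
import torch

def mixup(images, labels_onehot, alpha=0.2):
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    index = torch.randperm(images.size(0))    # random partner for each example
    mixed_images = lam * images + (1 - lam) * images[index]
    mixed_labels = lam * labels_onehot + (1 - lam) * labels_onehot[index]
    return mixed_images, mixed_labels
```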

  • ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design Cited by 2.2k ECCV 2018 Megvii Inc (Face++) Tsinghua University ShuffleNetV2 PDF Third-party Code (Stars 11.7k) TL;DR: Prior architecture design is mostly guided by the indirect metric of computation complexity (i.e., FLOPs). In contrast, the authors propose to use the direct metric (i.e., speed on the target platform) and derive several practical guidelines for efficient network (ShuffleNetV2) design from empirical observations.

  • Densely Connected Convolutional Networks Cited by 25.1k CVPR 2017 Cornell University Tsinghua University Facebook AI Research DenseNet PDF Official Code (Stars 4.5k) Third-party Code (Stars 11.7k) TL;DR: The authors observe that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. Based on this, they introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.
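
A minimal sketch of a dense block (simplified: the paper uses BN-ReLU-Conv composite layers with bottlenecks; names are our own):

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all preceding feature maps
    # and contributes `growth` new channels.
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False)
            for i in range(num_layers)
        ])

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)
```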

  • MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications Cited by 13.2k arXiv 2017 Google Inc. MobileNetV1 PDF Official Code (Stars 73.7k) TL;DR: The authors present a class of efficient models called MobileNets for mobile and embedded vision applications, based on a streamlined architecture that uses depthwise separable convolutions.

  • Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning Cited by 10.8k AAAI 2017 Google Inc. InceptionV4 PDF Official Code (Stars 73.7k) TL;DR: The authors propose InceptionV4 by combining Inception architectures with residual connections. Moreover, the authors seek to check whether Inception can be made more efficient with deeper and wider structures.

  • Xception: Deep Learning with Depthwise Separable Convolutions Cited by 8.8k CVPR 2017 Google Inc. Xception PDF Third-party Code (Stars 8.5k) TL;DR: The authors present an interpretation of Inception modules in convolutional neural networks as being an intermediate step in-between regular convolution and the depthwise separable convolution operation (a depthwise convolution followed by a pointwise convolution).
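
A minimal sketch of the depthwise separable convolution described above (function name is our own):

```python
import torch.nn as nn

def depthwise_separable(in_ch, out_ch, kernel_size=3):
    return nn.Sequential(
        # Depthwise: one spatial filter per channel (groups == in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size, padding=kernel_size // 2,
                  groups=in_ch, bias=False),
        # Pointwise: a 1x1 convolution that mixes channels.
        nn.Conv2d(in_ch, out_ch, 1, bias=False),
    )
```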

  • Aggregated Residual Transformations for Deep Neural Networks Cited by 6.8k CVPR 2017 UC San Diego Facebook AI Research ResNeXt PDF Official Code (Stars 1.8k) TL;DR: This paper presents a simple, highly modularized network architecture for image classification, which is constructed by repeating a building block that aggregates a set of transformations with the same topology.

  • Improved Regularization of Convolutional Neural Networks with Cutout Cited by 1.7k arXiv 2017 University of Guelph Cutout Data Augmentation PDF Official Code (Stars 456) TL;DR: The authors show that the simple regularization technique of randomly masking out square regions of input during training, called cutout, can be used to improve the robustness and overall performance of convolutional neural networks.
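
A minimal sketch of cutout for a single (C, H, W) tensor (the box may be clipped at the image borders, as in the paper; names are our own):

```python
import torch

def cutout(image, size=16):
    _, H, W = image.shape
    cy, cx = torch.randint(H, (1,)).item(), torch.randint(W, (1,)).item()
    y1, y2 = max(cy - size // 2, 0), min(cy + size // 2, H)
    x1, x2 = max(cx - size // 2, 0), min(cx + size // 2, W)
    image = image.clone()
    image[:, y1:y2, x1:x2] = 0.0      # zero out a random square region
    return image
```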

  • Deep Residual Learning for Image Recognition Cited by 117.8k CVPR 2016 Microsoft Research ResNet PDF Official Code (Stars 5.9k) Third-party Code (Stars 11.7k) TL;DR: This paper presents a residual learning framework (ResNet) to ease the training of networks that are substantially deeper than those used previously, which reformulates the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions.
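
A minimal sketch of the basic residual block (same-shape case only; the paper also uses projection shortcuts when dimensions change):

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    # The stacked convolutions learn a residual F(x); the output is F(x) + x.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # identity shortcut
```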

  • Rethinking the Inception Architecture for Computer Vision Cited by 19.7k CVPR 2016 Google Inc. InceptionV3 PDF Official Code (Stars 73.7k) Third-party Code (Stars 11.7k) TL;DR: Building on versions 1 and 2 of the Inception family, the authors explore ways to scale up networks that utilize the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

  • SqueezeNet: AlexNet-level Accuracy with 50x Fewer Parameters and <0.5MB Model Size Cited by 5.7k arXiv 2016 DeepScale SqueezeNet PDF Official Code (Stars 2.1k) TL;DR: This paper presents a small DNN architecture called SqueezeNet, which achieves AlexNet-level accuracy on ImageNet with 50x fewer parameters.

  • Very Deep Convolutional Networks for Large-Scale Image Recognition Cited by 78.9k ICLR 2015 Visual Geometry Group University of Oxford VGG PDF Third-party Code (Stars 11.7k) TL;DR: From empirical results, the authors found that a network (VGG) with increased depth and very small (3x3) convolution filters leads to a significant performance improvement over prior-art configurations.

  • Going Deeper with Convolutions Cited by 39.9k CVPR 2015 Google Inc. GoogLeNet InceptionV1 PDF Official Code (Stars 73.7k) TL;DR: The authors propose a deep convolutional neural network architecture codenamed Inception, which adopts a multi-branch topology, increasing the depth and width of the network while keeping the computational budget constant.
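
A minimal sketch of the multi-branch Inception idea (simplified: the dimension-reducing 1x1 convolutions placed before the 3x3/5x5 branches are omitted; names are our own):

```python
import torch
import torch.nn as nn

class InceptionBranches(nn.Module):
    # Parallel 1x1 / 3x3 / 5x5 / pooling branches, concatenated on channels.
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, out_ch, 1)
        self.b3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.b5 = nn.Conv2d(in_ch, out_ch, 5, padding=2)
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, out_ch, 1))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)
```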

  • Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift Cited by 36.8k ICML 2015 Google Inc. Batch Normalization InceptionV2 PDF Official Code (Stars 73.7k) TL;DR: The authors propose Batch Normalization (BN) to alleviate the issue of internal covariate shift, which allows the use of much higher learning rates and less careful initialization. With the proposed BN, the authors devise a new architecture called InceptionV2.
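
A minimal sketch of the training-time normalization (gamma and beta are the learned scale/shift with shape (1, C, 1, 1); the running statistics used at inference are omitted):

```python
import torch

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # Normalize each channel over the (N, H, W) axes with batch statistics,
    # then apply the learned scale and shift.
    mean = x.mean(dim=(0, 2, 3), keepdim=True)
    var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
```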

  • ImageNet Classification with Deep Convolutional Neural Networks Cited by 108.3k NeurIPS 2012 University of Toronto AlexNet PDF Third-party Code (Stars 11.7k) TL;DR: This is a pioneering work that exploits a deep convolutional neural network (AlexNet) for the large-scale image classification task ImageNet, achieving very impressive performance.

Surveys

  • Transformers in Vision: A Survey Cited by 305 arXiv 2021 MBZ University of Artificial Intelligence Transformers Survey PDF TL;DR: This survey aims to provide a comprehensive overview of Transformer models in the computer vision discipline, covering fundamental concepts of transformers, extensive applications of transformers in vision, the respective advantages and limitations of popular vision transformers, and an analysis of open research directions and possible future work.

Datasets

  • ImageNet Download Link TL;DR: ImageNet is an image database organized according to the WordNet hierarchy (currently only the nouns), in which each node of the hierarchy is depicted by hundreds to thousands of images. Currently, the most commonly used versions in academia are ImageNet-1k and ImageNet-21k. 1) ImageNet-1k contains 1,281,167 training images and 50,000 validation images across 1,000 object classes. 2) ImageNet-21k, which is bigger and more diverse, consists of 14,197,122 images, each tagged in a single-label fashion with one of 21,841 possible classes. The dataset has no official train-validation split, and the classes are not well balanced: some classes contain only 1-10 samples, while others contain thousands. Lastly, it is recommended to download this dataset from Academic Torrents instead of the official website. How to cite: ImageNet: A Large-Scale Hierarchical Image Database Cited by 38.9k CVPR 2009 Princeton University ImageNet PDF
  • CIFAR Download Link TL;DR: There are two versions of the CIFAR dataset: 1) The CIFAR-10 dataset consists of 60,000 32x32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. 2) CIFAR-100 is just like CIFAR-10, except it has 100 classes containing 600 images each; there are 500 training images and 100 test images per class. Please refer to the official website for more details. How to cite: Learning Multiple Layers of Features from Tiny Images Cited by 15.4k Tech Report 2009 Alex Krizhevsky CIFAR-10 CIFAR-100 PDF
  • Food-101 Download Link TL;DR: This is a challenging dataset of 101 food categories with 101,000 images. For each class, 250 manually reviewed test images are provided, as well as 750 training images. On purpose, the training images were not cleaned and thus still contain some amount of noise, mostly in the form of intense colors and sometimes wrong labels. All images were rescaled to have a maximum side length of 512 pixels. How to cite: Food-101 - Mining Discriminative Components with Random Forests Cited by 912 ECCV 2014 ETH Zürich Food-101 PDF

Codebases

  • rwightman/pytorch-image-models Stars 18.6k timm TL;DR: A collection of image models, layers, utilities, optimizers, schedulers, data loaders/augmentations, and reference training/validation scripts that aims to pull together a wide variety of SOTA models with the ability to reproduce ImageNet training results.

Misc

  • Papers with Code paperswithcode benchmark leaderboard TL;DR: This website provides a list of state-of-the-art papers and leaderboards/benchmarks of SoTA results on various datasets.
