This repository provides the PyTorch implementation of the paper HIRL: A General Framework for Hierarchical Image Representation Learning, along with re-implementations of multiple strong image self-supervised learning (SSL) methods. It contains the complete source code and model weights needed to reproduce the results in the paper.
HIRL is an effective and flexible framework for learning the hierarchical semantic information underlying a large-scale image database. It can be flexibly combined with off-the-shelf image SSL approaches and improves them by learning multiple levels of image semantics. We employ three representative CNN-based SSL methods and three representative Vision Transformer-based SSL methods as baselines. After being adapted to the HIRL framework, all six baseline methods improve on diverse downstream tasks.
Note: To minimize the dependencies required to reproduce our results on classification-related downstream tasks, we put the source code of the two transfer learning tasks (object detection and instance segmentation on COCO) in the det-seg branch. Please switch to that branch to reproduce the results on these two tasks.
- [2022/05/27] The initial release! We release all source code for pre-training and downstream evaluation. We release all pre-trained model weights for (HIRL-)MoCo v2, (HIRL-)SimSiam, (HIRL-)SwAV, (HIRL-)MoCo v3, (HIRL-)DINO and (HIRL-)iBOT.
- [2022/07/07] Release all source code and model weights for BEiT and HIRL-BEiT!
- Incorporate more baseline image SSL methods in this codebase. To add: CAE, MAE, SimMIM, etc.
- Adapt more baselines into the HIRL framework. To add: HIRL-CAE, HIRL-MAE, HIRL-SimMIM, etc.
- Explore other ways to learn hierarchical image representations beyond semantic path discrimination.
Method | Arch. | Epochs | Batch Size | KNN (%) | Linear (%) | Fine-tune (%) | URL | Config |
---|---|---|---|---|---|---|---|---|
MoCo v2 | ResNet-50 | 200 | 256 | 55.74 | 67.60 | 73.14 | model | cfg |
HIRL-MoCo v2 | ResNet-50 | 200 | 256 | 57.56 | 68.40 | 73.86 | model | cfg |
SimSiam | ResNet-50 | 200 | 512 | 60.17 | 69.74 | 72.25 | model | cfg |
HIRL-SimSiam | ResNet-50 | 200 | 512 | 62.68 | 69.81 | 72.88 | model | cfg |
SwAV | ResNet-50 | 200 | 4096 | 63.45 | 72.68 | 76.82 | model | cfg |
HIRL-SwAV | ResNet-50 | 200 | 4096 | 63.99 | 73.43 | 77.18 | model | cfg |
SwAV | ResNet-50 | 800 | 4096 | 64.84 | 73.36 | 77.77 | model | cfg |
HIRL-SwAV | ResNet-50 | 800 | 4096 | 65.43 | 74.80 | 78.05 | model | cfg |
MoCo v3 | ViT-B/16 | 400 | 4096 | 71.29 | 76.44 | 81.94 | model | cfg |
HIRL-MoCo v3 | ViT-B/16 | 400 | 4096 | 71.68 | 75.12 | 82.19 | model | cfg |
DINO | ViT-B/16 | 400 | 1024 | 76.01 | 78.07 | 82.09 | model | cfg |
HIRL-DINO | ViT-B/16 | 400 | 1024 | 76.84 | 78.32 | 83.24 | model | cfg |
iBOT | ViT-B/16 | 400 | 1024 | 76.64 | 79.00 | 82.47 | model | cfg |
HIRL-iBOT | ViT-B/16 | 400 | 1024 | 77.49 | 79.36 | 83.37 | model | cfg |
BEiT | ViT-B/16 | 400 | 2048 | / | / | 83.10 | model | cfg |
HIRL-BEiT | ViT-B/16 | 400 | 2048 | / | / | 83.41 | model | cfg |
This repository has been tested with the following environment:
- Linux
- Python 3.6+
- PyTorch 1.10.0
- CUDA 11.3
The environment can be prepared with the following steps:
- Create a virtual environment with conda:
conda create -n hirl python=3.7.3 -y
conda activate hirl
- Install PyTorch with the official instructions. For example:
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 cudatoolkit=11.3 -c pytorch -c conda-forge
- Install dependencies:
## install apex for LARC
pip install git+https://github.com/NVIDIA/apex \
--no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext"
## install other dependencies
pip install -r requirements.txt
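After installation, a quick sanity check (a minimal snippet, not part of this codebase) can confirm that the expected PyTorch build is installed and that the apex extension imports correctly:

```python
# Environment sanity check (illustrative, not part of the codebase).
import torch

print(torch.__version__)           # expect 1.10.0
print(torch.version.cuda)          # expect 11.3
print(torch.cuda.is_available())   # expect True on a GPU machine

# apex is only needed for LARC; this import fails if the apex build is broken.
from apex.parallel.LARC import LARC  # noqa: F401
```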
We support ImageNet ILSVRC 2012 for pre-training, KNN evaluation, linear classification, fine-tuning, semi-supervised evaluation and unsupervised clustering evaluation.
We recommend symlinking the dataset folder to `./datasets/ImageNet1K`. The folder structure would be:
datasets/
ImageNet1K/
ILSVRC/
Annotations/
Data/
train/
val/
ImageSets/
meta/
After downloading and unzipping the dataset, go to `./datasets/ImageNet1K/ILSVRC/Data/val/` and move the images into labeled sub-folders with this script.
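If the linked script is unavailable, the reorganization can also be done with a few lines of Python. The sketch below assumes a hypothetical mapping file `val_labels.txt` whose lines have the form `<image name> <WordNet ID>`; the actual mapping file and its format depend on how you obtained the labels.

```python
# Hypothetical sketch: move each validation image into a sub-folder named
# after its WordNet ID. Assumes "val_labels.txt" lines like
# "ILSVRC2012_val_00000001.JPEG n01751748".
import os
import shutil

val_dir = "./datasets/ImageNet1K/ILSVRC/Data/val"
with open("val_labels.txt") as f:
    for line in f:
        image_name, wnid = line.split()
        os.makedirs(os.path.join(val_dir, wnid), exist_ok=True)
        shutil.move(os.path.join(val_dir, image_name),
                    os.path.join(val_dir, wnid, image_name))
```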
We also support Places205 (the resized 256x256 version) for transfer classification experiments.
We recommend symlinking the dataset folder to `./datasets/places205`. The folder structure would be:
datasets/
places205/
data/vision/torralba/deeplearning/images256/
trainvalsplit_places205/
train_places205.csv
val_places205.csv
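For reference, each CSV split lists one sample per line. A minimal PyTorch `Dataset` over such a split could look like the sketch below; the `<relative path> <integer label>` line format is an assumption, not a documented contract of this codebase.

```python
# Hypothetical sketch of a dataset over the Places205 CSV splits.
from PIL import Image
from torch.utils.data import Dataset

class Places205(Dataset):
    def __init__(self, root, csv_file, transform=None):
        self.root = root
        self.transform = transform
        with open(csv_file) as f:
            # assumed line format: "<relative image path> <integer label>"
            self.samples = [line.split() for line in f if line.strip()]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        image = Image.open(f"{self.root}/{path}").convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, int(label)
```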
We provide an easy YAML-based configuration file system. Configs can be modified via command-line arguments.
To run an experiment:
python3 launch.py --launch ./tools/train.py -c [config file] [config options]
The config options are in `key=value` format, e.g., `output_dir=your_path batch_size=64`. Sub-modules are separated by `.`; for example, `optimizer.name=AdamW` modifies the sub-key `name` in `optimizer` with the value `AdamW`.
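Conceptually, a dotted key walks the nested configuration and overwrites a leaf value. The sketch below illustrates the idea; it is not the parser used in this codebase.

```python
# Illustrative sketch of applying a "key.subkey=value" override to a nested
# config dict; the actual parser in this codebase may behave differently
# (e.g., type conversion of values).
def apply_override(config: dict, option: str) -> None:
    key_path, value = option.split("=", 1)
    keys = key_path.split(".")
    node = config
    for key in keys[:-1]:
        node = node[key]      # descend into the sub-module
    node[keys[-1]] = value    # overwrite the leaf key

config = {"optimizer": {"name": "SGD", "lr": 0.1}}
apply_override(config, "optimizer.name=AdamW")
assert config["optimizer"]["name"] == "AdamW"
```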
A full example:
python3 launch.py --launch ./tools/train.py -c configs/pretrain/hirl/hirl_mocov2_resnet50_200eps.yaml \
pipeline.num_mlp_layer=3 output_dir=./experiments/pretrain/hirl/mocov2_3layers/
All the pre-training configuration files are in `./configs/pretrain/`. To reproduce the pre-training, please follow the corresponding config file. It is also straightforward to use customized config files. Suppose the customized config file is stored at `./customized_configs/custom_mocov2.yaml`; the experiment can then be launched by:
python3 launch.py --launch ./tools/train.py -c ./customized_configs/custom_mocov2.yaml
`launch.py` automatically finds a free port for launching single-node experiments. However, some pre-training methods are trained across multiple nodes. In this case, the number of nodes (`--nn`), the node rank (`--nr`), the master port (`--port`) and the master address (`-ma`) should be set.
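For reference, the usual way to find a free port is to bind a socket to port 0 and read back the port the OS assigns; `launch.py` may implement this differently.

```python
# Common trick for picking a free port: bind to port 0 and let the OS choose.
import socket

def find_free_port() -> int:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("", 0))
        return s.getsockname()[1]
```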
Taking two-node iBOT pre-training as an example, use the following commands at node 0 and node 1, respectively.
# use this command at node 0
python3 launch.py --nn 2 --nr 0 --port [port] -ma [address of node 0] --launch ./tools/train.py \
-c configs/pretrain/baseline/ibot_vit_base_400eps.yaml
# use this command at node 1
python3 launch.py --nn 2 --nr 1 --port [port] -ma [address of node 0] --launch ./tools/train.py \
-c configs/pretrain/baseline/ibot_vit_base_400eps.yaml
We provide two independent scripts for KNN and unsupervised clustering evaluation on ImageNet.
For downstream evaluation that involves a training process, you can use `./tools/train.py` with specific configs. In most cases, the only required config option is `pretrained`.
Perform KNN evaluation on a pretrained model:
python3 launch.py --launch ./eval_common/eval_knn.py --backbone_prefix backbone --pretrained [pretrained model file in .pth]
Note: set `--backbone_prefix model.backbone` for HIRL-based models, and set `--arch vit_base` for MoCo v3, DINO, iBOT and BEiT.
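For intuition, KNN evaluation classifies each validation feature by a vote over its nearest neighbors among the training features. Below is a simplified, self-contained sketch using cosine similarity and a majority vote; the actual `eval_knn.py` may differ (e.g., by weighting votes with a temperature and processing features in batches).

```python
# Simplified sketch of KNN evaluation on frozen features.
import torch
import torch.nn.functional as F

def knn_accuracy(train_feats, train_labels, val_feats, val_labels, k=20):
    train_feats = F.normalize(train_feats, dim=1)
    val_feats = F.normalize(val_feats, dim=1)
    sim = val_feats @ train_feats.t()      # cosine similarity matrix
    _, idx = sim.topk(k, dim=1)            # indices of the k nearest neighbors
    votes = train_labels[idx]              # (num_val, k) neighbor labels
    preds = votes.mode(dim=1).values       # majority vote per validation sample
    return (preds == val_labels).float().mean().item()
```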
Perform ImageNet linear classification based on a pretrained model (e.g., MoCo v2):
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/lincls/mocov2/mocov2_resnet50_200eps_lincls.yaml \
pretrained=[pretrained model file in .pth]
The corresponding HIRL-MoCo v2:
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/lincls/mocov2/hirl_mocov2_resnet50_200eps_lincls.yaml \
pretrained=[pretrained model file in .pth]
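For reference, linear classification ("linear probing") freezes the pre-trained backbone and trains only a linear classifier on the frozen features. The sketch below is conceptual; the actual evaluation is driven by the configs above, and the hyper-parameters here are illustrative.

```python
# Conceptual sketch of linear probing with a ResNet-50 backbone.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.fc = nn.Identity()          # expose the 2048-d pooled features
for param in backbone.parameters():
    param.requires_grad = False      # freeze the pre-trained backbone
backbone.eval()

head = nn.Linear(2048, 1000)         # the only trainable module
optimizer = torch.optim.SGD(head.parameters(), lr=0.1, momentum=0.9)
```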
Perform ImageNet fine-tuning based on a pretrained model (e.g., MoCo v2):
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/finetune/mocov2/mocov2_resnet50_200eps_finetune.yaml \
pretrained=[pretrained model file in .pth]
The corresponding HIRL-MoCo v2:
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/finetune/mocov2/hirl_mocov2_resnet50_200eps_finetune.yaml \
pretrained=[pretrained model file in .pth]
Perform ImageNet semi-supervised learning based on a pretrained model (e.g., MoCo v2):
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/semisup/mocov2/mocov2_resnet50_200eps_semisup_1percent.yaml \
pretrained=[pretrained model file in .pth]
The corresponding HIRL-MoCo v2:
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/imagenet/semisup/mocov2/hirl_mocov2_resnet50_200eps_semisup_1percent.yaml \
pretrained=[pretrained model file in .pth]
Perform Places205 fine-tuning based on a pretrained model (e.g., MoCo v2):
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/places205/finetune/mocov2/mocov2_resnet50_200eps_finetune_places205.yaml \
pretrained=[pretrained model file in .pth]
The corresponding HIRL-MoCo v2:
python3 launch.py --launch ./tools/train.py \
-c configs/downstream/places205/finetune/mocov2/hirl_mocov2_resnet50_finetune_places205.yaml \
pretrained=[pretrained model file in .pth]
Perform clustering evaluation on a baseline model:
python3 launch.py --launch ./eval_common/eval_clustering.py \
--backbone_prefix backbone --pretrained [pretrained model file in .pth]
Note: set `--backbone_prefix model.backbone` for HIRL-based models, and set `--arch vit_base` for MoCo v3, DINO, iBOT and BEiT.
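For intuition, unsupervised clustering evaluation typically runs k-means on the frozen features and scores the cluster assignments against the ground-truth labels. The scikit-learn sketch below is simplified; `eval_clustering.py` may use different clustering settings and report additional metrics.

```python
# Simplified sketch of clustering evaluation on frozen features.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score, adjusted_rand_score

def clustering_scores(features: np.ndarray, labels: np.ndarray, num_classes: int):
    assignments = KMeans(n_clusters=num_classes, n_init=10).fit_predict(features)
    return {
        "AMI": adjusted_mutual_info_score(labels, assignments),
        "ARI": adjusted_rand_score(labels, assignments),
    }
```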
See the det-seg branch.
This repository is released under the MIT license as in the LICENSE file.
If you find this repository useful in your research, please cite the following paper:
@article{xu2022hirl,
title={HIRL: A General Framework for Hierarchical Image Representation Learning},
author={Xu, Minghao and Guo, Yuanfan and Zhu, Xuanyu and Li, Jiawen and Sun, Zhenbang and Tang, Jian and Xu, Yi and Ni, Bingbing},
journal={arXiv preprint arXiv:2205.13159},
year={2022}
}
The baseline methods in this codebase are built upon the following open-source projects. We would like to thank the authors for releasing their source code.