Mosaic ResNet

The most efficient recipes for training ResNets on ImageNet

This folder contains starter code for training torchvision ResNet architectures using our most efficient training recipes (see our short blog post or detailed blog post for details). These recipes were developed to hit baseline accuracy on ImageNet 7x faster or to maximize ImageNet accuracy over long training durations. Although these recipes were developed for training ResNet on ImageNet, they could be used to train other image classification models on other datasets. Give it a try!

The specific files in this folder are:

  • model.py - Creates a ComposerModel from a torchvision ResNet model
  • data.py - Provides a MosaicML streaming dataset for ImageNet and a PyTorch dataset for a local copy of ImageNet
  • main.py - Trains a ResNet on ImageNet using the Composer Trainer.
  • tests/ - A suite of tests to check each training component
  • yamls/
    • resnet50.yaml - Configuration for a ResNet50 training run to be used as the first argument to main.py
    • mcloud_run.yaml - Configuration to use if running on the MosaicML Cloud

Now that you have explored the code, let's jump into the prerequisites for training.

Prerequisites

Here's what you need to train:

  • Docker image with PyTorch 1.12+, e.g. MosaicML's PyTorch image
    • Recommended tag: mosaicml/pytorch:1.12.1_cu116-python3.9-ubuntu20.04
    • The image comes pre-configured with the following dependencies:
      • PyTorch Version: 1.12.1
      • CUDA Version: 11.6
      • Python Version: 3.9
      • Ubuntu Version: 20.04
  • ImageNet dataset
  • System with NVIDIA GPUs

Installation

To get started, clone this repo and install the requirements:

git clone https://github.com/mosaicml/examples.git
cd examples
pip install ".[resnet]"  # or pip install ".[resnet-cpu]" if no NVIDIA GPU
cd resnet

Dataloader Testing

If you want to train on ImageNet or any other dataset, you'll need to either convert it to a streaming dataset using this script or use a local torchvision ImageFolder. If you plan to train on ImageNet, download instructions can be found here.

The commands below test whether your dataset is set up correctly:

# Test locally stored dataset
python data.py path/to/data

# Test remote storage dataset
python data.py s3://my-bucket/my-dir/data /tmp/path/to/local

How to start training

Now that you've installed dependencies and tested your dataset, let's start training!

Please remember: for both train_dataset and eval_dataset, edit the path and (if streaming) local arguments in resnet50.yaml to point to your data.
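For a streaming dataset, the edited entry might look roughly like the fragment below. The path and local argument names come from the instructions above; the exact YAML nesting and the paths themselves are placeholders for illustration:

```yaml
train_dataset:
  path: s3://my-bucket/my-dir/data   # remote streaming location (placeholder)
  local: /tmp/dataset-cache          # local cache directory (placeholder)
```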

Single-Node training

We run the main.py script using our composer launcher, which generates a process for each device in a node.

If training on a single node, the composer launcher will autodetect the number of devices, so all you need to do is run:

composer main.py yamls/resnet50.yaml

To train with high performance on multi-node clusters, the easiest way is with MosaicML Cloud ;)

But if you really must try this manually on your own cluster, just provide a few distributed-training variables to composer, either directly via CLI arguments or via environment variables that composer can read. Then launch the appropriate command on each node:

Multi-Node via CLI args

# Using 2 nodes with 8 devices each
# Total world size is 16
# IP Address for Node 0 = [0.0.0.0]

# Node 0
composer --world_size 16 --node_rank 0 --master_addr 0.0.0.0 --master_port 7501 main.py yamls/resnet50.yaml

# Node 1
composer --world_size 16 --node_rank 1 --master_addr 0.0.0.0 --master_port 7501 main.py yamls/resnet50.yaml

Multi-Node via environment variables

# Using 2 nodes with 8 devices each
# Total world size is 16
# IP Address for Node 0 = [0.0.0.0]

# Node 0
export WORLD_SIZE=16
export NODE_RANK=0
export MASTER_ADDR=0.0.0.0
export MASTER_PORT=7501
composer main.py yamls/resnet50.yaml

# Node 1
export WORLD_SIZE=16
export NODE_RANK=1
export MASTER_ADDR=0.0.0.0
export MASTER_PORT=7501
composer main.py yamls/resnet50.yaml
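Under the hood, each launched process is assigned a distinct global rank derived from its node rank and the number of devices per node, which is why both nodes need the same world size and master address. A small sketch of the arithmetic (the helper name is ours, not part of composer):

```python
def global_ranks(node_rank, devices_per_node=8):
    """Global ranks owned by one node.

    Each process's global rank is node_rank * devices_per_node + local_rank,
    so ranks partition cleanly across nodes.
    """
    start = node_rank * devices_per_node
    return list(range(start, start + devices_per_node))

# With 2 nodes of 8 devices each (world size 16):
# node 0 owns ranks 0-7, node 1 owns ranks 8-15.
```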

Results

You should see logs like those below printed to your terminal. You can also easily enable other experiment trackers, such as Weights and Biases or CometML, through Composer's logging integrations.

[epoch=0][batch=16/625]: wall_clock/train: 17.1607
[epoch=0][batch=16/625]: wall_clock/val: 10.9666
[epoch=0][batch=16/625]: wall_clock/total: 28.1273
[epoch=0][batch=16/625]: lr-DecoupledSGDW/group0: 0.0061
[epoch=0][batch=16/625]: trainer/global_step: 16
[epoch=0][batch=16/625]: trainer/batch_idx: 16
[epoch=0][batch=16/625]: memory/alloc_requests: 38424
[epoch=0][batch=16/625]: memory/free_requests: 37690
[epoch=0][batch=16/625]: memory/allocated_mem: 6059054353408
[epoch=0][batch=16/625]: memory/active_mem: 1030876672
[epoch=0][batch=16/625]: memory/inactive_mem: 663622144
[epoch=0][batch=16/625]: memory/reserved_mem: 28137488384
[epoch=0][batch=16/625]: memory/alloc_retries: 3
[epoch=0][batch=16/625]: trainer/grad_accum: 2
[epoch=0][batch=16/625]: loss/train/total: 7.1292
[epoch=0][batch=16/625]: metrics/train/Accuracy: 0.0005
[epoch=0][batch=17/625]: wall_clock/train: 17.8836
[epoch=0][batch=17/625]: wall_clock/val: 10.9666
[epoch=0][batch=17/625]: wall_clock/total: 28.8502
[epoch=0][batch=17/625]: lr-DecoupledSGDW/group0: 0.0066
[epoch=0][batch=17/625]: trainer/global_step: 17
[epoch=0][batch=17/625]: trainer/batch_idx: 17
[epoch=0][batch=17/625]: memory/alloc_requests: 40239
[epoch=0][batch=17/625]: memory/free_requests: 39497
[epoch=0][batch=17/625]: memory/allocated_mem: 6278452575744
[epoch=0][batch=17/625]: memory/active_mem: 1030880768
[epoch=0][batch=17/625]: memory/inactive_mem: 663618048
[epoch=0][batch=17/625]: memory/reserved_mem: 28137488384
[epoch=0][batch=17/625]: memory/alloc_retries: 3
[epoch=0][batch=17/625]: trainer/grad_accum: 2
[epoch=0][batch=17/625]: loss/train/total: 7.1243
[epoch=0][batch=17/625]: metrics/train/Accuracy: 0.0010
train          Epoch   0:    3%|| 17/625 [00:17<07:23,  1.37ba/s, loss/train/total=7.1292]

Using Mosaic Recipes

As described in our ResNet blog post, we cooked up three recipes to train ResNets faster and with higher accuracy:

  • The Mild recipe is for short training runs
  • The Medium recipe is for longer training runs
  • The Hot recipe is for the longest training runs, intended to maximize accuracy

To use a recipe, specify its name via the recipe_name argument. Specifying a recipe changes several aspects of the training run:

  1. Changes the loss function to binary cross entropy instead of standard cross entropy to improve accuracy.
  2. Changes the train crop size to 176 (instead of 224) and the evaluation resize size to 232 (instead of 256). The smaller train crop size increases throughput and accuracy.
  3. Changes the number of training epochs to the optimal value for each training recipe. Feel free to change these in resnet50.yaml to better suit your model and/or dataset.
  4. Specifies unique sets of speedup methods for model training.
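To sketch point 1: binary cross entropy treats each of the classes as an independent sigmoid prediction, rather than a single softmax over all classes. A pure-Python illustration of the difference (the recipes themselves use torch loss functions; these helpers are ours):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax_cross_entropy(logits, target_idx):
    """Standard cross entropy: -log of the softmax probability of the target class."""
    m = max(logits)  # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target_idx]

def binary_cross_entropy(logits, target_idx):
    """BCE: each class is an independent sigmoid; target class is 1, the rest are 0."""
    loss = 0.0
    for i, logit in enumerate(logits):
        p = sigmoid(logit)
        loss += -math.log(p) if i == target_idx else -math.log(1.0 - p)
    return loss / len(logits)
```

With BCE, the model is rewarded both for pushing the target logit up and for pushing every other logit down independently, rather than only for their relative ordering.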

Here is an example command to run the mild recipe on a single node:

composer main.py yamls/resnet50.yaml recipe_name=mild

NOTE

The ResNet50 and smaller models are dataloader-bottlenecked when training with our recipes on 8x NVIDIA A100s, meaning the model's throughput is limited by the speed at which data can be loaded. One way to alleviate this bottleneck is to use the FFCV dataloader format. This tutorial walks you through creating an FFCV dataloader, and we also have example code for an ImageNet FFCV dataloader here.

Our best results use FFCV, so an FFCV version of ImageNet is required to exactly match our best results.


Saving and Loading checkpoints

At the bottom of yamls/resnet50.yaml, we provide arguments for saving and loading model weights. Please specify the save_folder or load_path arguments if you need to save or load checkpoints!
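For example, the relevant section of the yaml might look roughly like the fragment below. The save_folder and load_path argument names come from the instructions above; the paths are placeholders, not values from the repo:

```yaml
# Save checkpoints to this folder (placeholder path)
save_folder: ./checkpoints

# Load model weights from a previously saved checkpoint (placeholder path)
load_path: ./checkpoints/latest-rank0.pt
```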

Contact Us

If you run into any problems with the code, please file GitHub issues directly on this repo.