SSD-ResNet34 BFloat16 training

Description

This document has instructions for running SSD-ResNet34 BFloat16 training using Intel-optimized TensorFlow.

Datasets

SSD-ResNet34 training uses the COCO training dataset. Use the instructions to download and preprocess the dataset.

For accuracy testing, download the COCO validation dataset, using the instructions here.

Quick Start Scripts

| Script name | Description |
|-------------|-------------|
| bfloat16_training_demo.sh | Executes a demo run with a limited number of training steps to test performance. Set the number of steps using the TRAIN_STEPS environment variable (defaults to 100). |
| bfloat16_training.sh | Runs multi-instance training to convergence. Download the backbone model specified in the instructions below and pass that directory path in the BACKBONE_MODEL_DIR environment variable. |
| bfloat16_training_accuracy.sh | Runs the model in eval mode to check accuracy. Specify which checkpoint files to use with the CHECKPOINT_DIR environment variable. |

Run the model

Set up your environment using the instructions below, depending on whether you are using AI Kit:


To run using AI Kit you will need:

  • git
  • numactl
  • wget
  • contextlib2
  • cpio
  • Cython
  • horovod
  • jupyter
  • lxml
  • matplotlib
  • numpy>=1.17.4
  • opencv-python
  • openmpi
  • openssh
  • pillow>=9.3.0
  • protobuf-compiler
  • pycocotools
  • tensorflow-addons==0.11.0
  • Activate the tensorflow 2.5.0 conda environment
    conda activate tensorflow

To run without AI Kit you will need:

  • Python 3
  • git
  • numactl
  • wget
  • intel-tensorflow>=2.5.0
  • contextlib2
  • cpio
  • Cython
  • horovod
  • jupyter
  • lxml
  • matplotlib
  • numpy>=1.17.4
  • opencv-python
  • openmpi
  • openssh
  • pillow>=9.3.0
  • protobuf-compiler
  • pycocotools
  • tensorflow-addons==0.11.0
  • A clone of the Model Zoo repo
    git clone https://github.com/IntelAI/models.git

For more information on the dependencies, see the installation instructions for object detection models at the TensorFlow Model Garden repository.
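As a quick sanity check before continuing, the command-line tools from the lists above can be probed on the PATH. The following is a minimal sketch, not part of the Model Zoo: `missing_tools` is a hypothetical helper, and the mapping from package names to commands (protobuf-compiler to `protoc`, openmpi to `mpirun`, openssh to `ssh`) is an assumption that may differ on your system. It does not check the Python packages.

```shell
# Print the name of every given command that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_tools() {
    status=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Command names assumed for the packages listed above.
missing_tools git numactl wget cpio protoc mpirun ssh \
    || echo "Install the tools printed above before continuing." >&2
```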

SSD-ResNet34 training uses code from the TensorFlow Model Garden. Clone the repo at the commit specified below and set the TF_MODELS_DIR environment variable to point to that directory. Then apply the TF2 patch from the Model Zoo to the TensorFlow models directory.

# Clone the tensorflow/models repo at the specified commit.
# Note: the commit required for this section is different from the one used for dataset preparation.
git clone https://github.com/tensorflow/models.git tf_models
cd tf_models
export TF_MODELS_DIR=$(pwd)
git checkout 8110bb64ca63c48d0caee9d565e5b4274db2220a

# Apply the patch from the model zoo directory to the TensorFlow Models repo
git apply <model zoo directory>/models/object_detection/tensorflow/ssd-resnet34/training/bfloat16/tf-2.0.diff

# Protobuf compilation from the TF models research directory
cd research
protoc object_detection/protos/*.proto --python_out=.

cd ../..
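Because this section needs a different commit than dataset preparation, it may be worth confirming which commit is actually checked out. The sketch below assumes `git` is available; `repo_at_commit` is a hypothetical helper and is not part of either repository.

```shell
# Commit required by this section, copied from the checkout step above.
EXPECTED_COMMIT=8110bb64ca63c48d0caee9d565e5b4274db2220a

# Succeeds only if the repository at $1 has commit $2 checked out.
repo_at_commit() {
    actual=$(git -C "$1" rev-parse HEAD 2>/dev/null) || return 1
    [ "$actual" = "$2" ]
}

# Falls back to the current directory if TF_MODELS_DIR is not set.
if repo_at_commit "${TF_MODELS_DIR:-.}" "$EXPECTED_COMMIT"; then
    echo "TensorFlow models repo is at the expected commit."
else
    echo "TF_MODELS_DIR is not at $EXPECTED_COMMIT; re-run the checkout step." >&2
fi
```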

To run the bfloat16_training_demo.sh quickstart script, set the OUTPUT_DIR (location where you want log and checkpoint files to be written) and DATASET_DIR (path to the COCO training dataset). Use an empty output directory to prevent conflicts with checkpoint files from previous runs. You can optionally set the TRAIN_STEPS (defaults to 100) and MPI_NUM_PROCESSES (defaults to 1).

# cd to your model zoo directory
cd models

export TF_MODELS_DIR=<path to the clone of the TensorFlow models repo>
export DATASET_DIR=<path to the COCO training data>
export OUTPUT_DIR=<path to the directory where the log and checkpoint files will be written>
export TRAIN_STEPS=<optional, defaults to 100>
export MPI_NUM_PROCESSES=<optional, defaults to 1>
# Optional: set BATCH_SIZE to use a custom batch size; otherwise the script runs with its default value.
export BATCH_SIZE=<customized batch size value>

./quickstart/object_detection/tensorflow/ssd-resnet34/training/cpu/bfloat16/bfloat16_training_demo.sh
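The advice to use an empty output directory can be enforced with a small pre-flight check. This is a minimal sketch under the assumption that a non-empty directory should stop the run; `dir_is_empty` and `prepare_output_dir` are hypothetical helpers, not part of the quickstart scripts.

```shell
# Succeeds if $1 is a directory with no entries (hidden files included).
dir_is_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

# Create the output directory if needed, then refuse to proceed if it
# already holds files, such as checkpoints from a previous run.
prepare_output_dir() {
    mkdir -p "$1" || return 1
    if ! dir_is_empty "$1"; then
        echo "Output directory $1 is not empty; choose a fresh one." >&2
        return 1
    fi
}
```

For example, running `prepare_output_dir "$OUTPUT_DIR"` before the quickstart script and launching the script only if it succeeds would avoid mixing new checkpoints with old ones.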

To run training and achieve convergence, download the backbone model using the commands below and set your download directory path as the BACKBONE_MODEL_DIR. Again, the DATASET_DIR should point to the COCO training dataset and the OUTPUT_DIR is the location where log and checkpoint files will be written. Use an empty OUTPUT_DIR to prevent conflicts with previously generated checkpoint files. You can optionally set the MPI_NUM_PROCESSES (defaults to 4).

export BACKBONE_MODEL_DIR="$(pwd)/backbone_model"
mkdir -p "$BACKBONE_MODEL_DIR"
wget -P "$BACKBONE_MODEL_DIR" https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/checkpoint
wget -P "$BACKBONE_MODEL_DIR" https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.data-00000-of-00001
wget -P "$BACKBONE_MODEL_DIR" https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.index
wget -P "$BACKBONE_MODEL_DIR" https://storage.googleapis.com/intel-optimized-tensorflow/models/ssd-backbone/model.ckpt-28152.meta

# cd to your model zoo directory
cd models

export TF_MODELS_DIR=<path to the clone of the TensorFlow models repo>
export DATASET_DIR=<path to the COCO training data>
export OUTPUT_DIR=<path to the directory where the log file and checkpoints will be written>
export MPI_NUM_PROCESSES=<optional, defaults to 4>
# Optional: set BATCH_SIZE to use a custom batch size; otherwise the script runs with its default value.
export BATCH_SIZE=<customized batch size value>

./quickstart/object_detection/tensorflow/ssd-resnet34/training/cpu/bfloat16/bfloat16_training.sh
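An interrupted download can leave the backbone directory incomplete, so it can help to verify the four files before launching a long training run. The sketch below only checks that each file named in the wget commands above exists and is non-empty; it does not validate the contents. `backbone_complete` is a hypothetical helper, not part of the Model Zoo.

```shell
# Succeeds if every backbone checkpoint file exists in directory $1 and is
# non-empty. Reports each missing or empty file on stderr.
backbone_complete() {
    ok=0
    for f in checkpoint \
             model.ckpt-28152.data-00000-of-00001 \
             model.ckpt-28152.index \
             model.ckpt-28152.meta; do
        if [ ! -s "$1/$f" ]; then
            echo "missing or empty: $f" >&2
            ok=1
        fi
    done
    return "$ok"
}
```

Calling `backbone_complete "$BACKBONE_MODEL_DIR"` before the training script gives an early failure instead of an error partway through startup.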

To run in eval mode (to check accuracy), set the CHECKPOINT_DIR to the directory where your checkpoint files are located, set the DATASET_DIR to the COCO validation dataset location, and the OUTPUT_DIR to the location where log files will be written. You can optionally set the MPI_NUM_PROCESSES (defaults to 1).

# cd to your model zoo directory
cd models

export TF_MODELS_DIR=<path to the clone of the TensorFlow models repo>
export DATASET_DIR=<path to the COCO validation data>
export OUTPUT_DIR=<path to the directory where the log file will be written>
export CHECKPOINT_DIR=<directory where your checkpoint files are located>
export MPI_NUM_PROCESSES=<optional, defaults to 1>
# Optional: set BATCH_SIZE to use a custom batch size; otherwise the script runs with its default value.
export BATCH_SIZE=<customized batch size value>

./quickstart/object_detection/tensorflow/ssd-resnet34/training/cpu/bfloat16/bfloat16_training_accuracy.sh
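To see which checkpoint a given CHECKPOINT_DIR points at, the `checkpoint` index file in that directory can be inspected. This is a minimal sketch, assuming the directory contains the plain-text `checkpoint` file that TensorFlow writes next to checkpoint data, with a `model_checkpoint_path: "..."` line. `latest_checkpoint` is a hypothetical helper; whether the accuracy script selects the same checkpoint is not confirmed here.

```shell
# Print the checkpoint prefix recorded in $1/checkpoint.
# Fails if the index file is absent.
latest_checkpoint() {
    [ -f "$1/checkpoint" ] || return 1
    sed -n 's/^model_checkpoint_path: "\(.*\)"$/\1/p' "$1/checkpoint" | head -n 1
}
```

For instance, `latest_checkpoint "$CHECKPOINT_DIR"` prints a prefix such as `model.ckpt-28152` when the index file records that checkpoint.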

Additional Resources