This project is a simple vision-model training framework implemented in PyTorch. It provides the following features:
- Model training
- Model evaluation
- Logging to file / tensorboard / wandb
- Distributed training and evaluation
In this project, digit classification on MNIST is used as a toy task. Download the MNIST dataset and put it in the ./datasets/MNIST/ folder without decompressing it.
The file main.py is the entry point of the project. Run it with python and the appropriate parameters to launch different jobs, as follows:
```shell
# For model training:
python main.py --mode train --outputs-dir ./outputs/ --use-distributed False --exp-name train

# For model evaluation:
python main.py --mode eval --outputs-dir ./outputs/ --use-distributed False --eval-model ./outputs/train/checkpoint_4.pth --exp-name eval
```
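Below is a minimal sketch of how an entry point like main.py could dispatch these two modes. It only uses the flags shown in the commands above; the bodies of the two branches are placeholders, not the project's actual training/evaluation code.

```python
# Hypothetical sketch of a main.py entry point; the real script may differ.
import argparse

def parse_args():
    parser = argparse.ArgumentParser("CV framework entry point")
    parser.add_argument("--mode", choices=["train", "eval"], required=True)
    parser.add_argument("--outputs-dir", default="./outputs/temp/")
    parser.add_argument("--use-distributed", default="False")
    parser.add_argument("--eval-model", default=None)
    parser.add_argument("--exp-name", default="default")
    return parser.parse_args()

def main():
    args = parse_args()
    if args.mode == "train":
        # build the model, data loaders, and optimizer, then run the training loop
        print(f"training experiment '{args.exp_name}' -> {args.outputs_dir}")
    else:
        # load the checkpoint given by --eval-model and report metrics
        print(f"evaluating checkpoint {args.eval_model}")

if __name__ == "__main__":
    main()
```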
In the models package, you can call build_model() in models/__init__.py to get the whole network model (ResNet18 is used as the example in this project).
Usually, different modules of the overall model are written in different files. Since the sample network is tiny, it only has a single module, defined in models/resnet18.py. models/utils.py also contains some model-related utility code, such as save_checkpoint() and load_checkpoint(). For some tasks and models, the loss function (criterion) may be complex (for example, MOTR), and thus it deserves a separate code file.
Data file reading, data transformations, and data loading (Dataset, Sampler, DataLoader) are all included in the data package. Their build methods are defined in data/__init__.py. For some data, the official transforms are not sufficient; custom transforms can be placed in a new file, as sketched below.
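Here is a minimal sketch of what the build functions and a custom transform could look like. The function names, arguments, and path handling are illustrative assumptions, not the actual contents of data/__init__.py.

```python
# Hypothetical sketch of the data package's build functions and a custom transform.
import torch
from torch.utils.data import DataLoader, DistributedSampler
from torchvision import datasets, transforms

class AddGaussianNoise:
    """Example of a custom transform that the official transforms do not provide."""
    def __init__(self, std: float = 0.05):
        self.std = std

    def __call__(self, tensor: torch.Tensor) -> torch.Tensor:
        return tensor + torch.randn_like(tensor) * self.std

def build_dataset(data_path: str, train: bool = True):
    tfs = [transforms.ToTensor()]
    if train:
        tfs.append(AddGaussianNoise(0.05))  # augmentation only for training
    # How the root path maps onto ./datasets/MNIST/ depends on the project layout.
    return datasets.MNIST(root=data_path, train=train, download=False,
                          transform=transforms.Compose(tfs))

def build_dataloader(dataset, batch_size: int, num_workers: int,
                     use_distributed: bool = False):
    # In DDP mode each process only sees its shard of the data.
    sampler = DistributedSampler(dataset) if use_distributed else None
    return DataLoader(dataset, batch_size=batch_size, num_workers=num_workers,
                      shuffle=(sampler is None), sampler=sampler)
```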
The framework supports basic and common logging formats, such as .txt files, .yaml files, wandb, and tensorboard. Using them is simple: you mainly need to pay attention to the relevant methods of the Metrics class (in log/log.py) and the Logger class (in log/logger.py). A rough sketch of this pattern follows.
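The real Metrics and Logger classes may expose different methods; the sketch below only illustrates the general accumulate-then-log pattern, with the tensorboard/wandb hooks omitted.

```python
# Hypothetical sketch of a Metrics/Logger pair; not the actual API in log/.
from collections import defaultdict

class Metrics:
    """Accumulates scalar values (loss, accuracy, ...) over an epoch."""
    def __init__(self):
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def update(self, name: str, value: float, n: int = 1):
        self.sums[name] += value * n
        self.counts[name] += n

    def average(self, name: str) -> float:
        return self.sums[name] / max(self.counts[name], 1)

class Logger:
    """Writes scalars to a .txt file; tensorboard/wandb writers would be added here."""
    def __init__(self, log_path: str):
        self.log_path = log_path

    def log_scalar(self, name: str, value: float, step: int):
        with open(self.log_path, "a") as f:
            f.write(f"step {step}: {name} = {value:.6f}\n")
```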
The utils package contains general-purpose helpers that do not fall into the categories above, such as is_distributed(), which determines whether the process is running in DDP mode.
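A helper like is_distributed() is commonly implemented as shown below; this is an assumption about its body, not the project's exact code.

```python
# Hypothetical sketch of is_distributed(); the real helper may differ.
import torch.distributed as dist

def is_distributed() -> bool:
    # True only when torch.distributed has been initialized for DDP.
    return dist.is_available() and dist.is_initialized()
```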
All runtime settings are recorded in a .yaml file such as configs/resnet18_mnist.yaml. In addition, some settings can be overridden by script parameters at runtime, such as --batch-size, --lr, and so on. An example config:
```yaml
MODE: # "train" or "eval", for the main.py script.
DEVICE: cuda
AVAILABLE_GPUS: 0,1,2,3,4,5,6,7
# DATA:
DATASET: MNIST
DATA_PATH: ./dataset/MNIST/
NUM_CLASSES: 10
NUM_WORKERS: 2
# MODEL:
PRETRAINED:
# Train Setting:
SEED: 42
USE_DISTRIBUTED: False
LR: 0.0001
WEIGHT_DECAY: 0.0001
SCHEDULER_TYPE: MultiStep
SCHEDULER_MILESTONES: [3, ]
SCHEDULER_GAMMA: 0.5
BATCH_SIZE: 256
BATCH_SIZE_AVERAGE: True
EPOCHS: 5
RESUME_MODEL:
RESUME_OPTIMIZER: True
RESUME_SCHEDULER: True
RESUME_STATES: True
# Eval:
EVAL_MODEL:
# Outputs:
OUTPUTS_DIR: ./outputs/temp/
OUTPUTS_PER_STEP: 40
USE_TENSORBOARD: True
USE_WANDB: True
PROJECT_NAME: CV_Framework
EXP_NAME: default
EXP_GROUP:
```
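A minimal sketch of how such a config file could be loaded and selectively overridden by command-line parameters like --batch-size and --lr (assuming PyYAML; the merge logic here is illustrative, not the project's actual implementation):

```python
# Hypothetical sketch of merging the YAML config with CLI overrides.
import argparse
import yaml

def load_config(path: str) -> dict:
    with open(path, "r") as f:
        return yaml.safe_load(f)

def merge_cli_overrides(config: dict, args: argparse.Namespace) -> dict:
    # Only override keys that were actually provided on the command line.
    if args.batch_size is not None:
        config["BATCH_SIZE"] = args.batch_size
    if args.lr is not None:
        config["LR"] = args.lr
    return config

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--config", default="./configs/resnet18_mnist.yaml")
    parser.add_argument("--batch-size", type=int, default=None)
    parser.add_argument("--lr", type=float, default=None)
    args = parser.parse_args()
    cfg = merge_cli_overrides(load_config(args.config), args)
    print(cfg["BATCH_SIZE"], cfg["LR"])
```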