An Automatic DNN Training Problem Detection and Repair System


AUTOTRAINER

TL;DR

Automatically detecting and fixing DNN training problems at run time.

For Revision

We thank the editor and the reviewers for their thorough evaluation of our manuscript and for their thoughtful comments, which have contributed significantly to its improvement. We have revised the paper in accordance with their comments and have updated the experiment results with precision and recall metrics here. We will continue to improve our tools to serve the community.

Repo Structure

  1. AutoTrainer: It mainly contains the source code of AUTOTRAINER (the folders data and utils) and the demo cases. You can find an easy start here. The way to run the demo cases is shown here.
  2. Motivation: It contains two test cases showing that 1) the occurrence of a training problem is highly random, and 2) the time at which a training problem occurs is random. The way to reproduce these cases can be found here.
  3. Supplement_data: It contains the required experiments and corresponding results (e.g., all model details and the accuracy improvement table). We also provide all necessary experiment data here and the repair result figures here.
  4. misc.: The README.md shows how to use our demos, the repo structure, the way to reproduce our experiments, and our experiment results. The requirements.txt lists all the dependencies of AUTOTRAINER.
- AutoTrainer/                 
    - data/    
    - demo_case/  
        - Gradient_Vanish_Case/
        - Oscillating_Loss_Case/
        - Improper_Activation_case/
    - utils/         
    - reproduce.py             
    - README.md                  
- Motivation/                      
    - DyingReLU/
    - OscillatingLoss/
    - README.md
- Supplement_data/
- README.md
- requirements.txt

Demo

Here we prepare several simple demo cases based on the Circle and Blob datasets from sklearn. You can just enter the corresponding folder and run demo.py directly to see how AUTOTRAINER solves the problems in these cases.

$ pip install -r requirements.txt
$ cd AutoTrainer/demo_case/Gradient_Vanish_Case
# or use `cd AutoTrainer/demo_case/Oscillating_Loss_Case`
$ python demo.py
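To give a feel for the kind of data the demos train on, here is a standard-library-only sketch that generates a two-class "circles" toy dataset similar to sklearn's make_circles. The function name, radii, and defaults below are our own illustrative choices, not AUTOTRAINER code:

```python
import math
import random

def make_circle_data(n=200, noise=0.05, seed=0):
    """Generate a two-class 'circles' dataset: an inner and an outer ring.

    This is only a sketch of sklearn.datasets.make_circles-style data;
    the demos themselves use sklearn's generators.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2                       # alternate between the two rings
        radius = 0.5 if label == 1 else 1.0  # inner ring for class 1
        theta = rng.uniform(0, 2 * math.pi)
        X.append((radius * math.cos(theta) + rng.gauss(0, noise),
                  radius * math.sin(theta) + rng.gauss(0, noise)))
        y.append(label)
    return X, y

X, y = make_circle_data(n=200)
```

Because the two classes are concentric rings, they are not linearly separable, which is what makes such toy data a compact stress test for a small DNN.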

Results

Effectiveness

avatar

To evaluate the effectiveness of AUTOTRAINER, we ran 701 collected model training scripts. Among these models, AUTOTRAINER detected 422 buggy models and 506 training problems. AUTOTRAINER then tried the candidate solutions and repaired 414 buggy models, a repair rate of 98.42%.

Additionally, the distribution of accuracy improvements for the 414 repaired buggy models is shown in the figure above. The average accuracy improvement reaches 36.42%. In particular, 133 models achieve an improvement of over 50%, and the maximum improvement reaches 90.17%.

Efficiency

avatar

To evaluate the efficiency of AUTOTRAINER, we ran all 495 model trainings with and without AUTOTRAINER enabled. For normal training, the runtime overhead is closely tied to the frequency of the problem checker and is about 1%. The figure above shows how this frequency affects the runtime overhead. It is worth mentioning that the runtime overhead on smaller datasets is usually larger (e.g., Blob vs. MNIST in the figure).
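The relationship between checker frequency and overhead can be sketched with a back-of-the-envelope cost model. The per-step and per-check costs below are illustrative assumptions, not measured values from AUTOTRAINER:

```python
def checker_overhead(steps, check_every, step_cost=1.0, check_cost=0.05):
    """Estimated relative runtime overhead when a problem checker runs
    once every `check_every` training steps.

    `step_cost` and `check_cost` are hypothetical per-call costs in
    arbitrary time units; real overhead depends on model and dataset.
    """
    n_checks = steps // check_every
    return (n_checks * check_cost) / (steps * step_cost)

# Checking every 5 steps with a checker costing 5% of one step:
# 20 checks * 0.05 / 100 steps -> 1% overhead.
print(checker_overhead(steps=100, check_every=5))
```

The model also illustrates why smaller datasets see larger relative overhead: if each training step is cheaper (smaller `step_cost`) while the checker's cost stays fixed, the same checking frequency yields a larger overhead fraction.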

For the repaired trainings, AUTOTRAINER takes 1.14 times more training time on average. A deeper analysis of the overhead of the individual components shows that retraining accounts for over 99%, while the other two parts (i.e., the problem checker and the repair) take less than 1%. To repair a problem, AUTOTRAINER may make several attempts, which means it may train several models.

Reproducing results:

  1. Download the data from Google Drive.
  2. Find the model and the corresponding configuration you want to reproduce or test. The models are saved in different directories named after their datasets. Each model, its configuration file, and its experimental results are placed in a separate subdirectory.
  3. Use reproduce.py to test and reproduce our experiments (make sure you have installed the environment and read the 'setup' of AUTOTRAINER).
$ cd AutoTrainer
$ python reproduce.py -mp THE_MODEL_PATH -cp THE_CONFIGURATION_PATH
# the result will be saved in the `tmp` directory and the output message will be shown on the terminal.
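A minimal sketch of how a reproduce.py-style command line could be parsed. The -mp/-cp flags come from the usage above; the long option names and help strings are our assumptions, not necessarily what reproduce.py defines:

```python
import argparse

def build_parser():
    # Flag names -mp / -cp match the README usage; the rest is illustrative.
    parser = argparse.ArgumentParser(
        description="Reproduce an AUTOTRAINER experiment.")
    parser.add_argument("-mp", "--model_path", required=True,
                        help="path to the saved model to test")
    parser.add_argument("-cp", "--config_path", required=True,
                        help="path to the training configuration (.pkl)")
    return parser

args = build_parser().parse_args(["-mp", "model.h5", "-cp", "config.pkl"])
print(args.model_path, args.config_path)
```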

Using your own data?

  1. Prepare your model and the training configuration file. The training configuration should be saved as a pkl file and include the batch size, dataset, maximum training epochs, loss, optimizer, and the optimizer's parameters. You can refer to this to complete the configuration file.
  2. Rewrite get_dataset() in reproduce.py if you need to use your own dataset. You should add the code to load and preprocess your data.
  3. Adjust the configuration parameters. These parameters are stored in params in reproduce.py and are all set to the default values mentioned in our paper. You can adjust them according to your learning task.
  4. Run reproduce.py following the guide.
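Step 1 above can be sketched as follows. The dictionary key names below are illustrative guesses covering the fields the README lists; check them against the example configuration linked above before use:

```python
import os
import pickle
import tempfile

# Hypothetical configuration covering the fields named in the README:
# batch size, dataset, max training epoch, loss, optimizer, and its parameters.
config = {
    "batch_size": 32,
    "dataset": "mnist",
    "epoch": 50,                          # max training epochs
    "loss": "categorical_crossentropy",
    "optimizer": "adam",
    "opt_kwargs": {"lr": 1e-3},           # optimizer parameters (assumed key)
}

path = os.path.join(tempfile.gettempdir(), "config.pkl")
with open(path, "wb") as f:
    pickle.dump(config, f)
```

The resulting config.pkl would then be passed to reproduce.py via the -cp flag.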

Cite our paper

@inproceedings{DBLP:conf/icse/ZhangZMS21,
  author    = {Xiaoyu Zhang and
               Juan Zhai and
               Shiqing Ma and
               Chao Shen},
  title     = {{AUTOTRAINER:} An Automatic {DNN} Training Problem Detection and Repair
               System},
  booktitle = {43rd {IEEE/ACM} International Conference on Software Engineering,
               {ICSE} 2021, Madrid, Spain, 22-30 May 2021},
  pages     = {359--371},
  publisher = {{IEEE}},
  year      = {2021},
  url       = {https://doi.org/10.1109/ICSE43902.2021.00043},
  doi       = {10.1109/ICSE43902.2021.00043},
  timestamp = {Sat, 06 Aug 2022 22:05:44 +0200},
  biburl    = {https://dblp.org/rec/conf/icse/ZhangZMS21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

