PULP-TrainLib is the first Deep Neural Network (DNN) training library for multi-core RISC-V MCUs (such as PULP), enabling On-Device Learning on ultra-low-power devices of this class.
PULP-TrainLib features a variety of training primitives and support functions to enable backpropagation-based training on multicore MCUs. In more detail, it provides:
- A set of performance-tunable DNN layer primitives for training, based on Matrix Multiplication (MM).
- Commonly used loss functions, like MSE and CrossEntropy.
- SGD-based optimizers.
- Activation (ReLU, etc) and support functions.
PULP-TrainLib is fully released as open-source under Apache License Version 2.0.
To ease the deployment of DNN training tasks on MCUs, PULP-TrainLib is equipped with additional tools:
- TrainLib_Deployer, an automated code-generation tool to generate the C code to validate and train a user-specified DNN model on a PULP architecture.
- AutoTuner, a pre-deployment tool to select the fastest configuration of each layer of a DNN model, according to the shapes of the layer, the training step and the tiling strategy.
If you use any part of PULP-TrainLib, please cite:
@InProceedings{10.1007/978-3-031-15074-6_13,
author="Nadalini, Davide
and Rusci, Manuele
and Tagliavini, Giuseppe
and Ravaglia, Leonardo
and Benini, Luca
and Conti, Francesco",
editor="Orailoglu, Alex
and Reichenbach, Marc
and Jung, Matthias",
title="PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-core MCUs Through Performance-Driven Autotuning",
booktitle="Embedded Computer Systems: Architectures, Modeling, and Simulation",
year="2022",
publisher="Springer International Publishing",
address="Cham",
pages="200--216",
abstract="An open challenge in making Internet-of-Things sensor nodes ``smart'' and self-adaptive is to enable on-chip Deep Neural Network (DNN) training on Ultra-Low-Power (ULP) microcontroller units (MCUs). To this aim, we present a framework, based on PULP-TrainLib, to deploy DNN training tasks on RISC-V-based Parallel-ULP (PULP) MCUs. PULP-TrainLib is a library of parallel software DNN primitives enabling the execution of forward and backward steps on PULP MCUs. To optimize PULP-TrainLib's kernels, we propose a strategy to automatically select and configure (autotune) the fastest among a set of tiling options and optimized floating-point matrix multiplication kernels, according to the tensor shapes of every DNN layer. Results on an 8-core RISC-V MCU show that our auto-tuned primitives improve MAC/clk by up to 2.4$\times$ compared to ``one-size-fits-all'' matrix multiplication, achieving up to 4.39 MAC/clk - 36.6$\times$ better than a commercial STM32L4 MCU executing the same DNN layer training workload. Furthermore, our strategy proves to be 30.7$\times$ faster than AIfES, a state-of-the-art training library for MCUs, while training a complete TinyML model.",
isbn="978-3-031-15074-6"
}
This repository is released under the Apache License Version 2.0.
PULP-TrainLib is the first open-source training library for RISC-V-based multicore MCUs, including a set of performance-tunable DNN layer primitives to enable DNN training on ultra-low-power devices. The training flow of PULP-TrainLib's primitives follows the canonical backpropagation algorithm and currently uses a streaming approach (batch size = 1). First, the Forward (FW) step computes the prediction for a given input. Then, the Backward step computes, layer by layer, the gradient of the loss function with respect to the weights (WG-BW) and the gradient of the loss function with respect to the input (IG-BW). The structure of the training flow of a single layer (e.g. a Fully-Connected layer) is depicted as follows:
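The three steps can also be sketched in a few lines of plain Python for a Fully-Connected layer with batch size 1. This is only an illustrative sketch: the function names (`fc_forward`, `fc_weight_grad`, `fc_input_grad`, `mse_grad`) are hypothetical and do not correspond to PULP-TrainLib's C API, and the MSE loss is chosen here only as an example.

```python
# Illustrative sketch of FW, WG-BW and IG-BW for a Fully-Connected layer
# (batch size = 1). Names are hypothetical, not PULP-TrainLib's API.

def fc_forward(weights, x):
    """FW step: y[o] = sum_i W[o][i] * x[i]."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def fc_weight_grad(out_grad, x):
    """WG-BW step: dL/dW[o][i] = dL/dy[o] * x[i] (outer product)."""
    return [[g * xi for xi in x] for g in out_grad]

def fc_input_grad(weights, out_grad):
    """IG-BW step: dL/dx[i] = sum_o W[o][i] * dL/dy[o] (MM with W transposed)."""
    n_out, n_in = len(weights), len(weights[0])
    return [sum(weights[o][i] * out_grad[o] for o in range(n_out))
            for i in range(n_in)]

def mse_grad(y, target):
    """Gradient of the MSE loss mean((y - t)^2) with respect to y."""
    n = len(y)
    return [2.0 * (yi - ti) / n for yi, ti in zip(y, target)]

W = [[0.5, -1.0, 2.0],
     [1.0, 0.0, -0.5]]          # 2 outputs, 3 inputs
x = [1.0, 2.0, 3.0]
target = [4.0, 0.0]

y = fc_forward(W, x)            # prediction
dy = mse_grad(y, target)        # gradient entering the layer from the loss
dW = fc_weight_grad(dy, x)      # gradient with respect to the weights
dx = fc_input_grad(W, dy)       # gradient passed to the previous layer
print(y, dy, dW, dx)
```

Each of the three functions is a matrix-vector (or outer) product, which is why a single optimized MM kernel can serve all training steps.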
Note that every training step for most of the layers is implemented as a Matrix Multiplication (MM) between tensor data. E.g., for a Conv2D and Fully-Connected Layer, the structure and sizes of the involved matrices can be represented as follows:
Convolutions are implemented as Image-to-Column (or Image-to-Row) pre-processing of the data followed by an MM. The sizes of the tensors are denoted as (CI, HI, WI) for the input feature map, (CO, CI, Hk, Wk) for the weights, and (CO, HO, WO) for the output feature map. To tune the performance of the training primitives, specific optimizations can be selected case-by-case for the MM algorithm.
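As a rough illustration of the Image-to-Column + MM scheme, the following plain-Python sketch computes the forward step of a Conv2D layer using the tensor sizes named above. It assumes stride 1 and no padding, so HO = HI - Hk + 1 and WO = WI - Wk + 1; the helper names (`im2col`, `matmul`, `conv2d_forward`) are hypothetical and are not the library's functions.

```python
# Illustrative im2col + MM forward step for Conv2D (stride 1, no padding).
# Helper names are hypothetical, not PULP-TrainLib's API.

def im2col(x, hk, wk):
    """x is [CI][HI][WI]; returns a (CI*Hk*Wk) x (HO*WO) matrix plus HO, WO."""
    ci, hi, wi = len(x), len(x[0]), len(x[0][0])
    ho, wo = hi - hk + 1, wi - wk + 1
    rows = []
    for c in range(ci):
        for kh in range(hk):
            for kw in range(wk):
                # One row per kernel element, one column per output pixel.
                rows.append([x[c][oh + kh][ow + kw]
                             for oh in range(ho) for ow in range(wo)])
    return rows, ho, wo

def matmul(a, b):
    """Naive MM: (M x K) times (K x N)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def conv2d_forward(x, w):
    """w is [CO][CI][Hk][Wk]; the result is [CO][HO][WO]."""
    hk, wk = len(w[0][0]), len(w[0][0][0])
    cols, ho, wo = im2col(x, hk, wk)
    # Flatten each filter to one row of CI*Hk*Wk values (same order as im2col).
    w_mat = [[v for ch in filt for row in ch for v in row] for filt in w]
    out = matmul(w_mat, cols)  # CO x (HO*WO)
    return [[r[oh * wo:(oh + 1) * wo] for oh in range(ho)] for r in out]

x = [[[1, 2, 3],
      [4, 5, 6],
      [7, 8, 9]]]              # CI=1, HI=3, WI=3
w = [[[[1, 0],
       [0, 1]]]]               # CO=1, CI=1, Hk=2, Wk=2
out = conv2d_forward(x, w)
print(out)                     # [[[6, 8], [12, 14]]]
```

The weight matrix has shape CO x (CI·Hk·Wk) and the im2col matrix has shape (CI·Hk·Wk) x (HO·WO), so the whole convolution reduces to a single MM.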
The development of C code for running On-Device Learning can be a time-consuming process. To make deployment easier, PULP-TrainLib provides TrainLib_Deployer, a code generation tool which creates all the necessary files to run DNN validation and training on a PULP-based MCU. To minimize the memory occupation, the TrainLib_Deployer adopts a data-reuse approach to store tensors in C arrays. The flow of the TrainLib_Deployer is illustrated as follows:
The input arguments of the TrainLib_Deployer are the architecture of the model to be trained on an MCU and the setup (memory and number of cores) of the target device. The tool assumes that the On-Device Learning routine runs on an MCU equipped with N parallel cores for computation. While running, the tool verifies whether the model fits in memory. As output, the tool generates a project folder containing the code to run a verification task of the target DNN model on the target device (PyTorch Golden Model, or GM; C code; Makefile).
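The memory check can be pictured with the following simplified sketch. It is not the TrainLib_Deployer's actual accounting (which also applies data reuse): it assumes a stack of Fully-Connected layers and simply sums weights, weight gradients, output activations and output gradients. The function names and the example sizes are hypothetical.

```python
# Simplified memory-fit estimate for a stack of Fully-Connected layers.
# This is an assumption-laden sketch, not the TrainLib_Deployer's real logic.

def fc_layer_bytes(n_in, n_out, bytes_per_elem=4):
    """Bytes for weights, weight gradients, output activations and gradients."""
    weights = n_in * n_out
    weight_grads = weights
    out_act = n_out
    out_grad = n_out
    return (weights + weight_grads + out_act + out_grad) * bytes_per_elem

def model_bytes(layers, bytes_per_elem=4):
    """layers is a list of (n_in, n_out); the input sample is counted once."""
    total = layers[0][0] * bytes_per_elem
    for n_in, n_out in layers:
        total += fc_layer_bytes(n_in, n_out, bytes_per_elem)
    return total

def model_fits(layers, mem_bytes, bytes_per_elem=4):
    return model_bytes(layers, bytes_per_elem) <= mem_bytes

layers = [(16, 8), (8, 4)]                # hypothetical two-layer model, FP32
print(model_bytes(layers))                # 1440
print(model_fits(layers, 64 * 1024))      # True
print(model_fits(layers, 1024))           # False
```

Passing `bytes_per_elem=2` gives the corresponding estimate for FP16 data.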
PULP-TrainLib optimizes the core computational kernel of DNN training primitives, the Matrix Multiplication (MM), with various unrolling and parallelization schemes. To select the best optimization for a given training step and tile size, PULP-TrainLib provides AutoTuner, which exhaustively searches for the fastest kernel among the library of available optimized MM kernels. AutoTuner's flow can be represented as follows:
Given the properties of the target device and the layer and training-step information for a generic layer (e.g. 8 cores, 64kB, Conv2D, Forward), AutoTuner exhaustively searches for the fastest tile shape that fits the specified memory amount and for the MM kernel that minimizes the latency on that tile shape. For further info, readers may refer to "PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning" SAMOS Pre-Print Version.
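The search itself can be sketched as a pair of nested loops over tile shapes and kernels. The sketch below is only an illustration of exhaustive search: the kernel names, the toy latency model (a fixed per-tile overhead plus MACs divided by a throughput) and the function names are assumptions made for this example, whereas a real tuning run would rely on latencies obtained by profiling each kernel.

```python
from itertools import product

# Illustrative exhaustive search over tile shapes and MM kernels.
# Kernel names and the latency model are hypothetical.

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def tile_bytes(tm, tk, tn, bytes_per_elem=4):
    """Memory for one (tm x tk) by (tk x tn) tile and its (tm x tn) output."""
    return (tm * tk + tk * tn + tm * tn) * bytes_per_elem

def autotune(m, k, n, mem_bytes, kernels, latency_fn):
    """Return (total_latency, tile_shape, kernel) with minimum total latency."""
    best = None
    for tm, tk, tn in product(divisors(m), divisors(k), divisors(n)):
        if tile_bytes(tm, tk, tn) > mem_bytes:
            continue  # tile does not fit the memory budget
        n_tiles = (m // tm) * (k // tk) * (n // tn)
        for kernel in kernels:
            total = n_tiles * latency_fn(kernel, tm, tk, tn)
            if best is None or total < best[0]:
                best = (total, (tm, tk, tn), kernel)
    return best

# Toy model: MACs per cycle for each (hypothetical) kernel, plus a fixed
# 100-cycle overhead per tile for data movement and loop setup.
MACS_PER_CYCLE = {"naive": 1.0, "unrolled": 2.0}

def toy_latency(kernel, tm, tk, tn):
    return tm * tk * tn / MACS_PER_CYCLE[kernel] + 100

best = autotune(8, 8, 8, mem_bytes=512,
                kernels=list(MACS_PER_CYCLE), latency_fn=toy_latency)
print(best)
```

With this toy model the full 8x8x8 problem does not fit in 512 bytes, so the search settles on the largest tile that does fit, paired with the faster kernel.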
PULP-TrainLib's library files are located under the lib/ folder (lib's README).
The tests/ folder provides useful tests to try out and verify PULP-TrainLib's layers and functions (tests are performed with respect to PyTorch Golden Models). Each test can be customized according to the user specifications and profiles the execution of the layer's primitives with PULP's performance counters. If further info is needed, please refer to the tests' README.
The tools/ folder contains useful tools which ease the usage of PULP-TrainLib, such as the TrainLib_Deployer and AutoTuner. For further info, please refer to the tools' README.
The assets/ folder contains useful support files for PULP-TrainLib. Inside CI_test_suite, users can find a testing environment that can be used to verify PULP-TrainLib's primitives for Continuous Integration (TO BE COMPLETED).
To learn how to generate the code with our TrainLib_Deployer and more details about the optimizations used in this library, a tutorial repository is available online. This repository contains tutorials and a guide to easily install a conda environment with all the necessary requirements to run PULP-TrainLib.
PULP-TrainLib requires PULP-SDK and the RISC-V GNU GCC TOOLCHAIN to be used and compiled. Please refer to the links to set up your working environment correctly.
To successfully run the tests, Python (>= 3.6) is needed, together with PyTorch (>= 1.9.0). To install the dependencies (with CPU only), run:
python -m pip install argparse
python -m pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
python -m pip install torchsummary
If you require the GPU (CUDA >= 10.2) version for your applications, instead run:
python -m pip install argparse
python -m pip install torch torchvision torchaudio
python -m pip install torchsummary
The tests have been verified using torch version "1.9.0+cpu".
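To check an existing environment against these requirements, a short script along the following lines may help. It is a minimal sketch: the helper names `parse_version` and `meets_minimum` are made up for this example, and only the leading numeric part of each version field is compared, so build tags such as "+cpu" are ignored.

```python
import re
import sys

def parse_version(text):
    """Turn a string such as '1.9.0+cpu' into (1, 9, 0)."""
    core = text.split("+")[0]            # drop local build tags like '+cpu'
    parts = []
    for field in core.split("."):
        match = re.match(r"\d+", field)  # keep only the leading digits
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)

def meets_minimum(version, minimum):
    return parse_version(version) >= parse_version(minimum)

print("Python >= 3.6:", sys.version_info >= (3, 6))
try:
    import torch
    print("PyTorch >= 1.9.0:", meets_minimum(torch.__version__, "1.9.0"))
except ImportError:
    print("PyTorch is not installed")
```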
To get started with PULP-TrainLib, just clone this repository on your local PC.
Before compiling any project, source pulp-sdk/configs/pulp_open.sh from the terminal from which you intend to compile your project. The configs/ folder is located inside the path to your pulp-sdk directory.
When generating a DNN for PULP with the TrainLib Deployer, make sure to launch the python task from a terminal in which you did not source the pulp_open.sh.
To add new functionalities, users can follow the naming convention of PULP-TrainLib and provide primitives and a related test inside the tests/ folder. Before integrating new features, we recommend extending the continuous integration test suite to functionally verify the primitives.
PULP-TrainLib's repository is organized with these branches:
- main: main branch, targeting PULP architectures.
- trainlib-tutorial: branch reserved for tutorial purposes (see https://github.com/dnadalini/PULP-TrainLib-Tutorial).
- pulp-trainlib-paper: branch to reproduce the results provided in the paper "PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning".
- pulp-trainlib-stm32: this is a PULP-TrainLib port compatible with STM32 and other MCUs (FP32 format only).
Note: checked are complete, unchecked are ongoing
- Forward passes for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- Weight gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- Input gradients for DepthWise, PointWise Convolutions and Conv2D, Fully-Connected (FP32, FP16)
- CWH data layout for DepthWise, PointWise and 2D Convolutions (FP32, FP16)
- HWC data layout for PointWise Convolution (FP32, FP16) and 2D Convolutions (FP32, FP16)
- stride and padding (only naive 2D Convolutions, without im2col+mm optimization)
- ReLU, Sigmoid activation functions (FP32, FP16)
- Gradient Descent optimizer (FP32, FP16)
- Max and Average Pooling (FP32, FP16)
- RNN training primitives (FP32)
- Multihead Self Attention training primitives (FP32)
- Residual connection (FP32, FP16)
- InstanceNorm (FP32, FP16)
- Biases for Conv2D and Fully-Connected (FP32, FP16)
- Padding operators for DepthWise and 2D Convolution (im2col + mm)
- HWC data layout management for DepthWise Convolution (FP32, FP16)
- Stride operators for 2D Convolutions and DepthWise (im2col + mm)
- RNN training primitives (FP16)
- Multihead Self Attention training primitives (FP16)
- Biases for DepthWise and PointWise Convolutions (FP32, FP16)
- Sparse Update (layer-wise) in TrainLib_Deployer
- Partial Im2Col / Im2Row for Conv2D (FP32, FP16)
- Integration of biases in TrainLib-Deployer (Conv2D)
- AutoTuner working with "NUM_TILING_SOLUTIONS = 1"
- Sporadic bugs in "mm_u2" in FP32 (mostly on leftovers)
- Performance bugs in im2col/im2row with DMA loading (performance tends to be lower than im2col/im2row with cores)
- Missing integration for RNN / MHSA in TrainLib_Deployer
- FP32 MHSA primitives (Input Grad)
- Missing integration of sigmoid function in TrainLib_Deployer
- Performance of FP16 sigmoid may need to be optimized with an FP16 exponential (e.g., https://github.com/0xBYTESHIFT/fp16/blob/master/include/half/half.hpp)
- Davide Nadalini
- Alberto Dequino
- Manuele Rusci
- Francesco Conti
- Cristian Cioflan
- Luca Bompani
- Lan Mei
- Calin Diaconu
- Giacomo Saporetti
- Francesco Conoscenti
- Leonardo Ravaglia
D. Nadalini, M. Rusci, L. Benini, and F. Conti, "Reduced Precision Floating-Point Optimization for Deep Neural Network On-Device Learning on MicroControllers" ArXiv Pre-Print
D. Nadalini, M. Rusci, G. Tagliavini, L. Ravaglia, L. Benini, and F. Conti, "PULP-TrainLib: Enabling On-Device Training for RISC-V Multi-Core MCUs through Performance-Driven Autotuning" SAMOS Pre-Print Version, Springer Published Version