mnist: 28x28x1 input, 4->8->16, valid padding
mnist_q_valid.h: 2.4KB Flash, 1.4KB RAM
Suitable for MCUs with >=16KB Flash and >=2KB RAM
cifar: 32x32x3 input, 32->32->64->1024->10, 5x5 conv
cifar10_q.h: 89KB Flash, 11KB RAM
Suitable for MCUs with >=128KB Flash and >=20KB RAM
vww96: Visual Wake Words (VWW) model based on MobileNet v1 0.25, 96x96x3 input
vww96_q.tmdl: 227KB Flash, 54KB RAM
Suitable for MCUs with >=256KB Flash and >=64KB RAM
https://mlcommons.org/en/inference-tiny-07/
mbnet128: MobileNet v1 0.25, 128x128x3 input
mbnet128_0.25_q.tmdl: 485KB Flash, 96KB RAM
Suitable for MCUs with >=512KB Flash and >=128KB RAM
https://github.com/fchollet/deep-learning-models/releases
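As a quick way to read the recommendations above, the following minimal sketch checks which models a given part could host. The thresholds are copied from the list above; the function and dictionary names are hypothetical, and meeting the recommended minimums is only a rough guide, since the rest of the firmware also needs Flash and RAM.

```python
# Recommended minimums per model, copied from the list above: (Flash KB, RAM KB).
# The names below are illustrative and not part of any library.
MODEL_REQUIREMENTS_KB = {
    "mnist": (16, 2),
    "cifar": (128, 20),
    "vww96": (256, 64),
    "mbnet128": (512, 128),
}


def models_that_fit(flash_kb, ram_kb):
    """Return the models whose recommended Flash and RAM minimums are both met."""
    return [
        name
        for name, (min_flash, min_ram) in MODEL_REQUIREMENTS_KB.items()
        if flash_kb >= min_flash and ram_kb >= min_ram
    ]


print(models_that_fit(64, 20))   # a part with 64KB Flash and 20KB RAM
print(models_that_fit(256, 48))  # a part with 256KB Flash and 48KB RAM
```

A part that passes this check can still fail in practice if the application code or other buffers use up the remaining memory.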
Inference times are in ms.
Rows are sorted by performance, comparing models in this priority order: mbnet128 > vww96 > cifar > mnist
Note 1: The Arduino runs a different, smaller mnist model because of its limited memory (see the entry marked (*)).
Note 2: Each entry records the inference time of the fastest model type on that chip; for example, the C906 entry uses the FP16 result.
Note 3: XXX means the model cannot run on that chip.
Chip/Board | Core | Flash | RAM | Freq | mbnet | vww96 | cifar | mnist | Note |
---|---|---|---|---|---|---|---|---|---|
BL808's NPU | BLAI | 16MB | 0.8+64MB | 320M | 5 | 3 | <1 | <1 | |
i5-4590T | AMD64 | 256GB | 8GB | 2000M | 7/24 | 5/17 | 0.9/4 | 0.04/<1 | native/wasm |
RK3399's A72 | ARM A72 | 32GB | 4GB | 1800M | 15 | 10 | 3 | 0.07 | |
TH1520 | RV64V(C910) | 16GB | 4GB | 1848M | 17 | 10.4 | 3.6 | 0.15 | fp16 |
RK3399's A53 | ARM A53 | 32GB | 4GB | 1600M | 29 | 19 | 5 | 0.14 | |
D1-H | RV64V | 128GB | 2GB | 1008M | 43 | 22 | 3.5 | 0.29 | |
ZYNQ7010's A9 | ARM A9 | 32GB | 1GB | 667M | 66 | 44 | 7.8 | 0.34 | |
BL808's C906 | RV64V | 16MB | 0.8+64MB | 480M | 81 | 57 | 10 | <1 | |
STM32H750 | ARM CM7 | 1MB | 1024KB | 480M | 94 | 64 | 15 | <1 | |
BL808's E907 | RV32P | 16MB | 0.8+64MB | 320M | 188 | 149 | 35 | <1 | mdl in psram |
F1C200S | ARM926EJ-S | 16MB | 64MB | 608M | 213 | 145 | 38.5 | 0.75 | |
MT7621A | MIPS1004Kc | 128MB | 128MB | 880M | 413 | 244 | 41 | 1 | need confirm |
AT32F403A | ARM CM4 | 1MB | 96KB | 240M | 477 | 136 | 30 | 0.6 | mbnet in 224k ram mode |
STM32G474RE | ARM CM4 | 512KB | 128KB | 170M | XXX | 195 | 43 | 1 | |
CH32V307 | RV32F QingKe V4F | 480KB | 128KB | 144M | XXX | 357 | 64 | 1 | |
STM32F411CE | ARM CM4 | 512KB | 128KB | 150M | 558 | 366 | 75 | 2 | |
W801 | CSKYV2 ck804ef | 2MB | 160+128KB+8MB | 240M | 606 | 364 | 52 | <2 | mdl in psram/(or sram if fit) |
ESP32-S3 | Xtensa LX7 | 8MB | 512KB | 240M | 610 | 381 | 86 | 5 | mdl in flash |
LPC4337 | ARM CM4F | 1MB | 136KB | 204M | 654 | 627 | 91 | 3 | need confirm |
XR806 | ARMv8-M Star-MC1 | 2MB | 288KB | 160M | 712 | 453 | 104 | 1 | |
ESP32 | Xtensa LX6 | 4MB | 520KB | 240M | 755 | 476 | 132 | 2 | mdl in flash |
ACM32F403 | ARM CM33 | 512KB | 192KB | 180M | XXX | 458 | 139 | 2 | |
STM32F767 | ARM CM7 | 2MB | 512KB | 216M | 869 | 640 | 185 | 3 | need confirm |
SC5864B's DSP | HIFI3 | 32MB | 8MB+192KB | 160M | 805 | --- | --- | 15 | need confirm |
STM32L496 | ARM CM4 | 1MB | 320KB | 80M | 809 | 695 | 162 | 3 | |
NRF52832 | ARM CM4 | 512KB | 64KB | 64M | XXX | XXX | 217 | 4 | |
ESP32-S2 | Xtensa LX7 | 4MB | 320KB | 240M | 1079 | 659 | 241 | 2.5 | need confirm |
RP2040 | ARM CM0+ | 16MB | 264KB | 280M | 1211 | 716 | 200 | 2 | overclock 280M |
CH32V203G6 | RV32 QingKe V4B | 32KB | 10KB | 144M | XXX | XXX | XXX | 2.5 | |
ESP32-C3 | RV32 | 4MB | 400KB | 160M | 2370 | 1430 | 127 | 6 | mdl in flash |
MM32F3270 | ARM CM3 | 512KB | 128KB | 96M | 3126 | 2919 | 257 | 11 | |
Nuclei N300 | RV32IMAFDC | 16MB | 512KB | 16M | XXX | 2946 | 795 | 11 | need confirm |
STM32F103RC | ARM CM3 | 256KB | 48KB | 72M | XXX | XXX | 445 | 7 | |
STM32F103C8 | ARM CM3 | 64KB | 20KB | 72M | XXX | XXX | XXX | 7 | |
CH32V103 | RV32 QingKe V3A | 64KB | 20KB | 72M | XXX | XXX | XXX | 13 | |
SAMD21G18 | ARM CM0+ | 256KB | 32KB | 48M | XXX | XXX | 700 | 14 | seeed XIAO |
STM32G030F6 | ARM CM0+ | 32KB | 8KB | 64M | XXX | XXX | XXX | 18 | |
CM0(Kintex-7) | ARM CM0 | --- | 1024KB | 50M | XXX | XXX | 1362 | 23 | Kintex-7 |
CH582 | RV32 QingKe V4A | 448KB | 32KB | 60M | XXX | XXX | XXX | 31 | |
STC32G12K128 | 80251 | 128KB | 12KB | 35M | XXX | XXX | XXX | 37 | |
PicoRV32(GW2A) | RV32 | 1MB | 64KB | 54M | XXX | XXX | 26935 | 385 | Tang Primer 20K |
Atmega328 | AVR | 32KB | 2KB | 16M | XXX | XXX | XXX | 50(*) | |
Inference times normalized to a 100 MHz clock to compare CPU efficiency, using the cifar model:
Chip/Board | Core | cifar(ms) |
---|---|---|
BL808's NPU | BLAI | 2 |
D1-H | RV64V | 35 |
BL808's C906 | RV64V | 48 |
RK3399's A72 | ARM A72 | 52 |
ZYNQ7010's A9 | ARM A9 | 52 |
TH1520 | RV64V(C910) | 66 |
STM32H750 | ARM CM7 | 72 |
AT32F403A | ARM CM4 | 72 |
STM32G474RE | ARM CM4 | 73 |
RK3399's A53 | ARM A53 | 79 |
CH32V307 | RV32 IMAC | 92 |
BL808's E907 | RV32P | 112 |
STM32F411CE | ARM CM4 | 113 |
W801 | CSKYV2 ck804ef | 125 |
Nuclei N300 | RV32IMAFDC | 127 |
STM32L496 | ARM CM4 | 130 |
NRF52832 | ARM CM4 | 139 |
XR806 | ARMv8-M Star-MC1 | 166 |
ESP32-C3 | RV32 | 203 |
ESP32-S3 | Xtensa LX7 | 206 |
F1C200S | ARM926EJ-S | 234 |
MM32F3270 | ARM CM3 | 247 |
ACM32F403 | ARM CM33 | 250 |
ESP32 | Xtensa LX6 | 317 |
STM32F103RC | ARM CM3 | 320 |
SAMD21G18 | ARM CM0+ | 336 |
MT7621A | MIPS1004Kc | 360 |
RP2040 | ARM CM0+ | 560 |
ESP32-S2 | Xtensa LX7 | 578 |
CM0(Kintex-7) | ARM CM0 | 681 |
PicoRV32(GW2A) | RV32 | 14545 |
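The normalized values above appear to follow `measured_ms * freq_MHz / 100`, i.e. they assume inference time scales inversely with clock frequency. That formula is inferred from the table rows rather than stated explicitly, and the function name below is hypothetical. A minimal sketch:

```python
def normalize_to_100mhz(time_ms, freq_mhz):
    """Estimate the inference time at 100 MHz.

    Assumes time is inversely proportional to clock frequency, which
    ignores memory and bus effects that do not scale with the core clock.
    """
    return time_ms * freq_mhz / 100.0


# Inputs taken from the main benchmark table (cifar column, Freq column).
print(round(normalize_to_100mhz(10, 480)))    # BL808's C906
print(round(normalize_to_100mhz(15, 480)))    # STM32H750
print(round(normalize_to_100mhz(3.5, 1008)))  # D1-H
```

A few rows do not reproduce exactly from the main table (for example RK3399's A72: 3 ms at 1800M gives 54, while the table lists 52), presumably because the main table shows rounded times.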
mbnet inference time for different input sizes
BL808 C906 core at 480M, using RV64V with an FP16 model
input size | infer time |
---|---|
96x96 | 60ms |
128x128 | 81ms |
160x160 | 156ms |
192x192 | 183ms |
224x224 | 296ms |
Optimization for MCUs without an FPU
STM32F103C8 running mnist
Options | infer time |
---|---|
TM_FASTSCALE=0 | 16ms |
TM_FASTSCALE=1 | 10ms |
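This section does not say what `TM_FASTSCALE` changes internally. One common way to avoid floating-point work on an FPU-less core is to replace a float scale factor with an integer multiply and shift. The sketch below illustrates only that general idea; the shift width and all names are assumptions, and it is not taken from the library's source.

```python
SHIFT = 16  # hypothetical number of fractional bits, chosen for illustration


def to_fixed(scale):
    """Convert a float scale to an integer with SHIFT fractional bits."""
    return int(round(scale * (1 << SHIFT)))


def scale_float(acc, scale):
    """Reference path: rescale an integer accumulator with float math."""
    return int(round(acc * scale))


def scale_fixed(acc, scale_fx):
    """Integer-only path: multiply, add half for rounding, then shift right."""
    return (acc * scale_fx + (1 << (SHIFT - 1))) >> SHIFT


scale = 0.0123
scale_fx = to_fixed(scale)
for acc in (1000, -2500, 40000):
    # Print the accumulator, the float result, and the fixed-point result.
    print(acc, scale_float(acc, scale), scale_fixed(acc, scale_fx))
```

The integer path trades a small quantization error in the scale for cheaper arithmetic, so accuracy should be checked when using this kind of approach.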
Optimization for ARM MCUs with DSP instructions (Cortex-M4, M7, etc.); supports INT8 acceleration
STM32F411CE running mbnet 0.25, 128x128x3 input
Options | infer time |
---|---|
TM_ARCH_OPT0 && INT8 | 1199ms |
TM_ARCH_ARM_SIMD && INT8 | 840ms |
Optimization for ARM MCUs with MVEI instructions (Cortex-M55, etc.); supports INT8 acceleration.
Experimental; no test data yet.
Optimization for ARM MPUs with NEON instructions (Cortex-A7 and newer); supports INT8/FP32 acceleration
Raspberry Pi 4, single core, running mbnet 1.0, 224x224x3 input
(NEON INT8 is not yet well optimized)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 860ms | 821ms |
TM_ARCH_CPU | FP32 | 2307ms | 2271ms |
TM_ARCH_ARM_NEON | FP32 | 1275ms | 1223ms |
TM_ARCH_ARM_NEON | INT8 | 959ms | 923ms |
Optimization for RISC-V MCUs with P-extension instructions (like T-Head E907); supports INT8 acceleration
BL808 E907 core running mbnet 0.25, 128x128x3 input (model in PSRAM, CPU at 320M, O2)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 443ms | 283ms |
TM_ARCH_RV32P | INT8 | 345ms | 188ms |
Optimization for RISC-V MCUs with V-extension instructions (like T-Head C906); supports INT8/FP32 acceleration
BL808 C906 core running mbnet 0.25, 128x128x3 input (model in PSRAM, VLEN=128, CPU at 480M, O2)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 153ms | 125ms |
TM_ARCH_CPU | FP32 | 215ms | 177ms |
TM_ARCH_RV64V | INT8 | 123ms | 95ms |
TM_ARCH_RV64V | FP32 | 160ms | 121ms |
TM_ARCH_RV64V | FP16 | 129ms | 81ms |
Optimization for CskyV2 MCUs with DSP instructions (like ck804ef); supports INT8 acceleration
W801 ck804ef core running mbnet 0.25, 128x128x3 input (model in PSRAM, CPU at 240M, PSRAM overclocked to 120M, O3, data cache on)
ARCH | MDL_TYPE | OPT0 time | OPT1 time |
---|---|---|---|
TM_ARCH_CPU | INT8 | 950ms | 816ms |
TM_ARCH_CSKYV2 | INT8 | 724ms | 606ms |
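To compare the rows of a table like the one above, it can help to express each result as a speedup over the plain `TM_ARCH_CPU` OPT0 baseline. The helper below is a hypothetical convenience for reading the table, using the W801 figures shown above.

```python
def speedup(baseline_ms, optimized_ms):
    """Return how many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# W801 ck804ef, INT8, times in ms copied from the table above.
cpu_opt0, cpu_opt1 = 950, 816
csky_opt0, csky_opt1 = 724, 606

print(f"OPT1 only:        {speedup(cpu_opt0, cpu_opt1):.2f}x")
print(f"CSKYV2 only:      {speedup(cpu_opt0, csky_opt0):.2f}x")
print(f"CSKYV2 plus OPT1: {speedup(cpu_opt0, csky_opt1):.2f}x")
```

The same calculation applies to the other architecture tables in this section by substituting their OPT0 and OPT1 times.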
Uses the SmallCifar model. For TinyMaix, the time measured with stride=2 is multiplied by 4.
NNoM and TinyMaix ran on an STM32H750 at 218M; the others ran on an STM32F746 at 216M.
InferLib | time(ms) |
---|---|
TFlite-micro | 393 |
MicroTVM untuned | 294 |
TinyMaix CPU O0 | 224 |
TinyMaix CPU O1 | 204 |
TinyMaix SIMD O0 | 176 |
NNoM | 159 |
MicroTVM tuned | 157 |
CMSIS-NN | 136 |
TinyMaix SIMD O1 | 132 |
tinyengine | 129 |