[Tutorial] Deploy Quantized Model on CUDA #4667
Conversation
LGTM with some minor comments.
I was expecting to see some accuracy and performance characteristics, but I also realize that might be out of scope of this user tutorial.
Later, it might also be useful to add a developer tutorial to help TVM developers add new operators for quantization.
================================

**Author**: `Wuwei Lin <https://github.com/vinx13>`_

This article is an introductory tutorial of automatic quantization with TVM.
Maybe add a link to this discuss forum - https://discuss.tvm.ai/t/quantization-story/3920 - to give a high-level idea of what automatic quantization is.
###############################################################################
# The calibration dataset should be a iterable object. We define the
# calibration dataset as a generator object in Python. In this tutorials, we
tutorials -> tutorial
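The quoted snippet describes the calibration dataset as an iterable generator. A minimal sketch of what such a generator might look like (hypothetical names; random data stands in for the pre-processed ImageNet batches the tutorial uses):

```python
import numpy as np

batch_size = 8

def calibrate_dataset(num_batches=10):
    # Yield dicts mapping input names to batches; the quantization pass
    # iterates over these to collect activation statistics.
    # Random tensors here are a placeholder for real calibration data.
    for _ in range(num_batches):
        data = np.random.uniform(size=(batch_size, 3, 224, 224)).astype("float32")
        yield {"data": data}
```

Because it is a generator, the dataset is produced lazily, one batch at a time, rather than held in memory all at once.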
# When the scales are not power of two, fixed point multiplications will
# be used.
#
# For outputs, we can find the scales with data-aware quantization.
outputs --> intermediate feature maps
import tvm
from tvm import relay
from tvm.relay import quantize as qtz
Not used
###############################################################################
# Import the model
# ----------------
# We use the Relay MxNet frontent to import a model from the Gluon model zoo.
frontend
Thanks @vinx13 ; this is a great tutorial! I've added some nits on language.
**Author**: `Wuwei Lin <https://github.com/vinx13>`_

This article is an introductory tutorial of automatic quantization with TVM.
Automatic quantization is one of the quantization mode in TVM. More details of the quantization story in TVM can be found `here <https://discuss.tvm.ai/t/quantization-story/3920>`_.
mode -> modes
details of -> details on
also long line
# Prepare the Dataset
# -------------------
# We will demonstrate how to prepare the calibration dataset for quantization.
# We first download the validate set of ImageNet and pre-process the dataset.
validation set
###############################################################################
# The calibration dataset should be a iterable object. We define the
"should be an"
# intermediate feature maps are power of two, we can leverage bit shifting for
# multiplications. This make it computationally more efficient. In `max` mode,
# the maximum is used as the scale. Without rounding, `max` mode might have
# better accuracy in some cases. When the scales are not power of two, fixed
powers
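The two scale choices the snippet contrasts can be sketched in plain NumPy (illustrative only; TVM's actual calibration logic differs): `max` mode takes the maximum absolute value as the scale, while a power-of-two mode rounds that value up to the nearest power of two so the later multiplications can become bit shifts.

```python
import numpy as np

def scale_max(x):
    # `max` mode: scale is the largest absolute value observed.
    return float(np.max(np.abs(x)))

def scale_power2(x):
    # power-of-two mode: round the max scale up to the nearest power of two.
    m = scale_max(x)
    return float(2.0 ** np.ceil(np.log2(m)))

x = np.array([0.1, -0.6, 0.3])
# scale_max(x) gives 0.6; scale_power2(x) gives 1.0
```

Rounding the scale up to a power of two loses a little resolution (here the int8 range now covers [-1, 1] instead of [-0.6, 0.6]), which is the accuracy trade-off the snippet mentions.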
(requesting changes on those minor typos)
Thanks, LGTM
Thanks @vinx13 @anijain2305 @tmoreau89
* [Tutorial] Deploy Quantized Model on CUDA * update * update * address comments
This tutorial demonstrates how to import a model using the Relay frontend, run the quantization and calibration passes, and perform quantized inference.
ref #4435
cc @tqchen @masahi @anijain2305 @ZihengJiang @tmoreau89
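As a rough illustration of what "quantized inference" means in the PR description above (plain NumPy, not the TVM API): float tensors are mapped to int8 with a scale, arithmetic runs on integers, and results are rescaled back to float.

```python
import numpy as np

def quantize(x, scale):
    # Map float values to int8 codes; clip to the symmetric int8 range.
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def dequantize(q, scale):
    # Map int8 codes back to (approximate) float values.
    return q.astype(np.float32) * scale

x = np.array([0.5, -1.0, 0.25], dtype=np.float32)
scale = 1.0 / 127
# The round trip recovers x to within one quantization step (the scale).
x_hat = dequantize(quantize(x, scale), scale)
```

TVM's automatic quantization additionally rewrites the graph so that whole operators (e.g. conv2d) run on the integer representation, which is where the CUDA speedups come from.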