[OPT] Low-bit Quantization #2116
Conversation
This is all assuming a symmetric quantization scheme, correct? Have you considered generalizing this slightly to an asymmetric quantization scheme like the one used in GEMMLOWP, QNNPACK, FBGEMM, NNAPI, etc? |
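For readers skimming the thread, here is a minimal NumPy sketch of the two schemes being contrasted above; the function names and 8-bit ranges are illustrative only, not code from this PR:

import numpy as np

def quantize_symmetric(x, scale):
    # Symmetric int8: real zero maps exactly to integer 0, range [-127, 127].
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8)

def quantize_asymmetric(x, scale, zero_point):
    # Asymmetric uint8 (GEMMLOWP/QNNPACK-style): an explicit zero_point shifts
    # the grid so an arbitrary real interval maps onto [0, 255].
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)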
Since quantization is a major feature, it is better to send an RFC first |
I will propose an RFC next week. Thanks @ajtulloch @tqchen. |
Has there been an RFC posted btw? This comment probably belongs there. FWIW I'm a little concerned about some directions this PR is taking, or at least there are some use cases that would be good to see handled and that I don't see fitting in currently. For background on my perspective, a standard training flow for quantized models in TF/C2 (at least the frameworks I'm familiar with that implement this) is to:
Does this workflow make sense to folks? If not, could folks please elaborate on where we differ? Given this flow, we'd like to insert TVM into this process. One key use case that I'd like TVM to consider supporting is to allow frameworks to continue to use their existing approaches for Steps 1-5, and involve TVM in Step 6. There are several reasons for this, such as that calibration-based quantization isn't always sufficient, and that we'd like to support importing from existing int8 graph IRs like TFLite or C2. I think requiring TVM to take on Steps 4 and 5 in order to implement quantized models is unnecessarily opinionated, and moves it towards being a fully-fledged framework in its own right (which I thought was not the goal). I would have thought one natural (and minimalistic) direction for TVM to support quantized models (which isn't precluded by this diff, but I want to see what folks think about this) would be something like:
Concretely, my concerns with this approach (assuming the goal is to be 'the one true way' to execute quantized models in TVM) are that it a) integrates too early in the pipeline, which unnecessarily requires some assumptions, b) these assumptions aren't the most general ones (i.e. it requires symmetric quantization as used by e.g. MKLDNN), which precludes asymmetric quantization as in TF, TFLite, C2, GEMMLOWP, QNNPACK, and channel-wise quantization as in TF/C2, which is very useful for pushing bitwidths lower (see e.g. https://arxiv.org/pdf/1806.08342.pdf), and c) is less modular than other approaches, which makes it harder to target from existing frameworks that already support quantization. I don't think our goals are in conflict, I just thought that I should put this on the radar. Happy to send out an RFC (and dedicate engineering effort) for the alternative approach as well if folks are on board? |
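To make the channel-wise point above concrete, a hedged NumPy sketch of per-output-channel symmetric weight quantization (purely illustrative; nothing here comes from the PR):

import numpy as np

def quantize_weights_per_channel(w):
    # One scale per output channel of an OIHW weight tensor; this usually
    # preserves more accuracy at low bitwidths than a single per-tensor scale.
    scales = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / 127.0
    q = np.round(w / scales[:, None, None, None])
    return np.clip(q, -127, 127).astype(np.int8), scales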
@ajtulloch an RFC needs to be sent out and we won't merge the PR before the RFC gets discussed, so we can move the discussion there after it gets posted |
Hi @ajtulloch, I have a paper deadline so I pushed this PR forward in a hurry to get a workable quantization workflow. Let me send out an RFC tomorrow. This PR won't be merged before we have a discussion in the community. |
Currently, it seems NNVM requires the inputs of an op to have the same data type. But a quantization scheme may produce inputs of different types. Any suggestion about that? |
@lixiaoquan there's no such requirement today AFAIK, it's user-controlled in the implementation of |
topi/python/topi/util.py
Outdated
@@ -213,3 +214,16 @@ def select_array(i, j):
             return now
 
     return tvm.compute(matrix.shape, select_array, name=name)
+
+
+@tvm.register_func("print_tensor")
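The hunk above only shows the decorator line. A hedged guess at what a debugging hook registered this way might look like; the body is an assumption, not the PR's actual code:

import tvm

@tvm.register_func("print_tensor")
def print_tensor(x):
    # Assumed body: dump the NDArray handed in from instrumented code and
    # return it unchanged so it can be spliced into a compute pipeline.
    print(x.asnumpy())
    return x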
sure, maybe we can add it as a util later in a separate PR, but we need documentation for these
@liangfu Thanks for catching this outdated test |
@@ -124,7 +124,7 @@ def _bind_params_by_name(func, params):
     return expr.bind(func, bind_dict)
 
 
-def optimize(func, target, params=None):
+def optimize(func, target=None, params=None):
Seems this API changed recently? It breaks some code @tqchen
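A minimal usage sketch under the new signature; only the signature comes from the diff, and the call site below is hypothetical:

# With target now defaulting to None, callers that want a specific backend
# should pass it explicitly rather than relying on the default.
func_opt = optimize(func, target="llvm", params=params)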
Here is an evaluation script: https://gist.github.com/ZihengJiang/bcabe46a712a417a01a6967d4430b6b5 |
@antinucleon @hlu1 @anijain2305 please also help take a look when you have time |
@ZihengJiang sorry, this is a basic question, but is there support for mixed quantization levels? It looks like we currently specify only a global weight and activation precision. Since we can already skip the first k conv layers, this seems like it would be a useful generalization. |
typo
@eqy Users can override the rewrite function to implement mixed-precision quantization, but that is not included in this PR |
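For context on the global-precision knobs mentioned above, a hedged sketch of how they are typically set; the option names (nbit_input, nbit_weight, nbit_activation, global_scale, skip_conv_layers) are assumptions drawn from later relay.quantize releases and may not match this PR exactly:

from tvm import relay

# Assumed API shape: a context manager carrying the global quantization
# settings, followed by a single quantize() pass over the function.
with relay.quantize.qconfig(nbit_input=8,
                            nbit_weight=8,
                            nbit_activation=32,
                            global_scale=8.0,
                            skip_conv_layers=[0]):
    qfunc = relay.quantize.quantize(func, params=params)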
In ResNet, we use int32 for the residual addition. But I found that saving intermediate int32 results to global memory is much slower; is it possible to use int8 in this case (we would need to modify the annotation of add)? I'm not sure about the impact on model precision. |
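To make the trade-off in the previous comment concrete, a NumPy-only sketch (not TVM code) of requantizing the int32 accumulator back to int8 before the residual add; scale_in and scale_out stand in for calibration-derived scales:

import numpy as np

def requantize_to_int8(acc_int32, scale_in, scale_out):
    # Dequantize the int32 accumulator, rescale onto the int8 output grid,
    # then saturate. Writing int8 instead of int32 to global memory cuts
    # bandwidth 4x, at the cost of the extra rounding introduced here.
    real = acc_int32.astype(np.float32) * scale_in
    q = np.round(real / scale_out)
    return np.clip(q, -127, 127).astype(np.int8)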
* [QUANTIZE] Quantization implementation. * Update. * Update. * Update. * Update.
Hey guys, I'm wondering whether or not TVM supports any INT16 quantization? If the answer is yes, is it quantization-aware training or post-training quantization? Thanks! |
Thanks for contributing to TVM! Please refer to the guidelines at https://docs.tvm.ai/contribute/ for useful information and tips. After the pull request is submitted, please request code reviews from Reviewers.