-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[AutoTVM] Mali autotuning fails for ResNet #3980
Comments
@merrymercy @ZihengJiang can you followup a bit on this? |
I tried our official benchmark script. And found it is also broken after #3368. The error message is
It seems to be very similar to the problem in @mbarrett97 Does your error message also look like this? How to reproduce your error? Reproduce my error without mali
Apply diff diff --git a/apps/benchmark/mobile_gpu_imagenet_bench.py b/apps/benchmark/mobile_gpu_imagenet_bench.py
index c889b3d..be6051d 100644
--- a/apps/benchmark/mobile_gpu_imagenet_bench.py
+++ b/apps/benchmark/mobile_gpu_imagenet_bench.py
@@ -31,8 +31,9 @@ from util import get_network, print_progress
def evaluate_network(network, target, target_host, dtype, repeat):
# connect to remote device
- tracker = tvm.rpc.connect_tracker(args.host, args.port)
- remote = tracker.request(args.rpc_key)
+# tracker = tvm.rpc.connect_tracker(args.host, args.port)
+# remote = tracker.request(args.rpc_key)
+ remote = tvm
print_progress(network)
net, params, input_shape, output_shape = get_network(network, batch_size=1, dtype=dtype)
|
I found that the shapes are not simplified here for some convs I am not sure why |
I also encountered this issue in some recent runs on Mali with the tutorial script. All workloads except for the last layer in resnet18 have this problem. |
@zhiics, can you try to Simplify the |
@kimishpatel Thanks for your response. I tried it but it didn't work. |
@zhiics, hmm., ir_visitor_with_analyzer right now visits only a few exprs, like For and something else. It might be that to simplify r->extent it needs more? I am not sure right now, but thats the path I can suggest. |
Yes, that's also something I think as well as per https://discuss.tvm.ai/t/topi-winograd-test-topi-conv2d-winograd-py-fails-for-a-given-shape/4107/5 |
@zhiics, oh absolutely, this makes total sense. Realize also has bounds that we should populate as you suggested on the thread, and for that matter other Stmts too. |
This error reveals two things to be improved:
|
@tqchen @kimishpatel Yeah, I am actually trying to improve the simplifier to cover more exprs. One thing I am not quite sure for For item 2, I think we can probably log warning when |
@tqchen I now found where the problem was started. The issue is actually because the bound is not correctly simplified during InferBound. The old version of the simplifier can simplify the following bound to 1 (((((((((((blockIdx.x*128) + threadIdx.x)/16)/16)/64)*16)16) + ((((4 + (((((blockIdx.x128) + threadIdx.x)/16) % 16)*4)) - 1)/4)16)) + (((4 + ((((blockIdx.x128) + threadIdx.x) %16)4)) - 1)/4)) + 1) - (((((((((blockIdx.x128) + threadIdx.x)/16)/16)/64)*16)16) + (((((((blockIdx.x128) + threadIdx.x)/16) % 16)*4)/4)16)) + (((((blockIdx.x128) + threadIdx.x) % 16)*4)/4))) but the new simplifier here can only simplify it to: (((((((((((blockIdx.x*128) + threadIdx.x) % 256)/16)*4) + 3)/4)16) + ((((((blockIdx.x128) + threadIdx.x) % 16)4) + 3)/4)) + 1) - (((blockIdx.x128) + threadIdx.x) % 256)) And then this bound will be propagated down to all other passes and eventually cause the mentioned error. Could you please take a look to see or advise why it is not simplified? Thanks. |
The main problem is likely due to the fact that we dd not fully bind the range of the IterVar. This is a new requirement because we are being strict about the division mode and needs to know the sign of the expression. |
Thanks. I tried to change So this will then be considered together with https://discuss.tvm.ai/t/discuss-embed-more-bound-information-into-var-or-expr/4079 ? |
Upgrading the simplification will alleviate the problem, but would still be great for us to improve the bound context in the current setting. Both are directions for improvements |
Thanks. I think I probably need much more time to work on it because I need to understand the simplifier first. Will you have cycles to have a fix? It is currently blocking our internal merge. |
It would be great if you can look into it a bit, so that we have more developers who can hack into the simplifier :) reducing to unit-test cases (like the case you suggested) might help the process. |
@tqchen I dug deeper into the simplifier. Should we relax the following by removing I think |
@tqchen @zhiics my prove for the equation above. 1). if
2). if
To sum up, no matter whether b is a multiple of b or not, we have |
This is something has to do with the division semantics. Because the semantics is truncdiv in here. b = 2, a * x = 8, y = -3. The change of sign will make the result being different because of the truncdiv. see more about the division modes: #3977 |
@mbarrett97 This problem is caused by the used division mode because division in the schedule will use truncdiv in the TVM IR which needs context information about the sign. Tianqi's pending PR #4008 migrate indexdiv to floordiv. It solves my minimal example to reproduce the problem. Can you check if it solves your problem here as well? |
It does appear to resolve my issue. However, I'm now getting a new issue when I try to run InceptionV3 on Mali:
This looks like it could potentially be related to the changes (?), I'll see if I can find a minimal example to reproduce it. |
close due to inactive status, feel free to bring new trouble shooting threads to the discuss forum |
Autotuning on Mali fails for ResNet18_v1 (from gluon) with error:
3809 and 4072 appear to describe the same issue.
I've been able to resolve this by forcibly disabling the use of tophub fallback configs during task discovery under
tvm/python/tvm/autotvm/task/relay_integration.py
This doesn't seem like an elegant solution, but is there any reason why tophub needs to be used during this step? Disabling tophub here would also make it easier to add new targets which aren't yet represented in tophub.
The text was updated successfully, but these errors were encountered: