Parallelize cumsum in get_valid_counts #7123
Conversation
@mbrookhart Can you revive the disabled topi test? tvm/tests/python/topi/python/test_topi_vision.py Lines 124 to 129 in 76b4ad0
ping @Laurawly, any chance you could take a look?
@Laurawly The plan is that after we merge this first, we will generalize the cumsum IR in this PR into a reusable, exclusive scan primitive. After that, we can update our CUDA
Sure, I can merge this first.
Hi @mbrookhart, thanks for this performance improvement! I found that this PR is causing
Thanks, Trevor. If you can share the model script you're using, I can also work to debug today. |
I can reproduce the issue by running ssd test in tensorflow/test_forward.py with cuda target (I looked at this test yesterday for my PR, so I have a fresh memory):
@trevor-m Are you sure this is caused by this PR?
hmm strange, after running the ssd test on GPU a few times, I cannot reproduce the error anymore. Could this error be random? One annoying thing about this model is that compilation time is extremely slow. It also requires increasing the stack size limit, otherwise it segfaults.
Yeah, ouch. I don't need to increase the stack limit, and I haven't gotten this test to fail yet.
Yeah the error is a bit random. However, I was able to reproduce it 100% of the time with TRT offload enabled. I can share a script shortly.
Yeah, I did a git bisect to determine this PR was the source of the issue, and #7172 was fine.
Maybe it depends on the input data. Trevor and I ran it across a bunch of models, and it fails for a few of them (not all). I believe it could be because of the input data (as the number of boxes etc. changes with the input image).
@trevor-m I'm in mountain time, so I'll need to leave in about half an hour. If you can post the script that consistently fails tonight, I'll jump in first thing tomorrow morning and start hunting for which line causes the issue.
@anijain2305 @trevor-m We should definitely use a fixed, real image for CI testing, like the pytorch MaskRCNN test does. Please send a PR. tvm/tests/python/frontend/pytorch/test_object_detection.py Lines 90 to 95 in 4c13ae9
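Something along these lines: a minimal sketch of pinning the test input to a fixed, downloaded image instead of random data. The URL, filename, and preprocessing below are placeholders for illustration, not what the existing MaskRCNN test uses.

```python
# Sketch only: fetch a fixed, cached test image instead of generating random input.
# The URL and resize shape are illustrative placeholders.
import numpy as np
from PIL import Image
from tvm.contrib.download import download_testdata

img_url = "https://raw.githubusercontent.com/dmlc/web-data/main/gluoncv/detection/street_small.jpg"
img_path = download_testdata(img_url, "street_small.jpg", module="data")

img = Image.open(img_path).resize((640, 640))
img = np.asarray(img).astype("float32")[np.newaxis, ...]  # fixed NHWC input, batch of 1
```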
I ran the model that was failing (ssd_mobilenet_v1_fpn_shared_box_predictor_640x640_coco14_sync_2018_07_03) under
@masahi Any thoughts?
Does this mean NMS and not get_valid_counts is the problem?
Looking at the code, assuming you have thrust enabled, this should be kernel0: tvm/python/tvm/topi/cuda/nms.py Lines 798 to 811 in 9815ae2
the thrust argsort won't get a kernel number: tvm/python/tvm/topi/cuda/nms.py Lines 818 to 820 in 9815ae2
And this should be kernel1: tvm/python/tvm/topi/cuda/nms.py Lines 543 to 579 in 9815ae2
That could have threads. So possibly it's failing because my changes to get_valid_count are returning the wrong valid_count. @trevor-m any chance we can dump the inputs/attrs for get_valid_count so I can make a unit test to check that hypothesis? I haven't been able to get it to fail with random inputs, but possibly there's an edge case in my exclusive_scan algorithm for this input data.
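Roughly the kind of relay-level unit test I have in mind; the shape and threshold below are placeholders until we can dump the real inputs/attrs from the failing model, and it exercises one hypothetical edge case (every score below the threshold, so valid_count should come back as 0).

```python
# Rough sketch of such a unit test; shapes/threshold are placeholders.
import numpy as np
import tvm
from tvm import relay
from tvm.contrib import graph_executor

batch, num_anchors, elem = 1, 32, 6
score_threshold = 0.0

np_data = np.random.uniform(-1, 1, (batch, num_anchors, elem)).astype("float32")
np_data[:, :, 1] = -1.0  # force every score below the threshold -> valid_count should be 0

data = relay.var("data", shape=(batch, num_anchors, elem), dtype="float32")
out = relay.vision.get_valid_counts(data, score_threshold, id_index=0, score_index=1)
func = relay.Function([data], out.astuple())

dev = tvm.cuda(0)
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(tvm.IRModule.from_expr(func), target="cuda")
mod = graph_executor.GraphModule(lib["default"](dev))
mod.set_input("data", np_data)
mod.run()
valid_count = mod.get_output(0).numpy()

np.testing.assert_equal(valid_count, np.zeros((batch,), dtype="int32"))
```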
Thanks for looking into it and finding that info @mbrookhart ! Here is the relevant relay graph:
The input shape is |
Ooh, interesting, doing NMS on no boxes. I'll take a look with that idea.
I don't think this is valid if num_anchors is zero; it could lead to undefined behavior. Could you wrap that in an if? tvm/python/tvm/topi/cuda/nms.py Lines 209 to 214 in 9815ae2
This reverts commit c02c9c5.
I tried this yesterday. Unfortunately, this is not the source. The test still failed.
Alas. I would still very much appreciate the script to reproduce this so I can hunt it down.
tvm/python/tvm/topi/cuda/nms.py Lines 177 to 179 in 9815ae2
Log(0) is undefined; we should probably just wrap the entire thing in an if num_anchors > 0 check.
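Roughly this shape of guard, shown on a toy IR rather than the real nms.py code (the names and shapes here are illustrative only):

```python
# Toy illustration of guarding IR with if_scope so nothing runs when num_anchors == 0.
# This is not the actual nms.py code; it just mirrors the suggested pattern.
import tvm
from tvm import te

def guarded_copy(batch, num_anchors):
    data = te.placeholder((batch, num_anchors), "float32", name="data")

    def gen_ir(data_buf, out_buf):
        ib = tvm.tir.ir_builder.create()
        src = ib.buffer_ptr(data_buf)
        dst = ib.buffer_ptr(out_buf)
        n_anchors = data_buf.shape[1]  # a PrimExpr, like num_anchors in nms.py
        with ib.if_scope(n_anchors > 0):  # skip the body entirely for zero anchors
            with ib.for_range(0, batch, name="b") as b:
                with ib.for_range(0, n_anchors, name="a") as a:
                    dst[b * n_anchors + a] = src[b * n_anchors + a]
        return ib.get()

    out = te.extern(
        (batch, num_anchors), [data],
        lambda ins, outs: gen_ir(ins[0], outs[0]),
        dtype="float32", name="guarded_copy",
    )
    return data, out
```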
Yes, Trevor is working on it. It needs the TRT workflow, and that's why the delay.
Let me try this as well.
Thanks @mbrookhart, I tried that but it didn't fix the error. Here is a script to reproduce the error: https://gist.github.com/trevor-m/f44d3d0e7edcaee12722e518e5959b82 I also noticed this line in the kernel where cuda-gdb found a crash: https://github.com/apache/tvm/blob/main/python/tvm/topi/cuda/nms.py#L545
Thanks, I'll try to reproduce. You're building with thrust and TRT, right? You can't compile a CUDA kernel with zero threads, so we always make sure it's at least 1: tvm/python/tvm/tir/ir_builder.py Lines 205 to 206 in 9815ae2
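In other words, roughly this pattern when binding thread extents that come from a possibly-zero shape. This is an illustrative sketch, not the actual ir_builder or nms.py source; the helper name and max_threads value are made up.

```python
# Illustrative sketch of the "at least 1 thread" pattern, not the ir_builder source.
import tvm
from tvm import te

def bind_anchor_threads(ib, num_anchors, max_threads=1024):
    """Bind blockIdx.x/threadIdx.x over num_anchors, never with a zero extent."""
    nthread_tx = max_threads
    # ceil_div(num_anchors, max_threads), clamped so the block extent is never 0
    nthread_bx = tvm.tir.max(1, (num_anchors + max_threads - 1) // max_threads)
    tx = te.thread_axis("threadIdx.x")
    bx = te.thread_axis("blockIdx.x")
    ib.scope_attr(tx, "thread_extent", nthread_tx)
    ib.scope_attr(bx, "thread_extent", nthread_bx)
    return bx * max_threads + tx  # flat thread id
```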
Yes, that's right, thrust + TRT. Thank you for your help with debugging this. |
* Parallelize cumsum in get_valid_counts
* make the scan loop exclusive
* switch to directly using exclusive scan
* perform inner loop of final writes on anchor threads
* fix flaky test, fix lint
* remove final cuda kernel

Co-authored-by: masa <[email protected]>
As a follow-up to #6839, this parallelizes the cumsum in get_valid_counts using an upsweep/downsweep tree-based prefix sum algorithm, similar to what I did in #7099.
On my 1070 Ti, testing deploy_ssd_gluoncv.py, I previously reported that get_valid_counts took 3674.62 microseconds; this reduces that to ~~495.825~~ 8.497 microseconds. @masahi has expressed interest in implementing a more general prefix scan for other ops; as future work I expect we'll refactor this and do possible cache optimization.
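For anyone following along, here is a minimal NumPy sketch of the upsweep/downsweep (Blelloch-style) exclusive scan idea. In the actual CUDA IR each level of the tree runs as a parallel pass over threads; the inner Python loops below stand in for those threads, and the power-of-two length is a simplifying assumption of this sketch.

```python
import numpy as np

def exclusive_scan(values):
    """Blelloch-style exclusive prefix sum (upsweep/downsweep), power-of-two length."""
    data = values.astype(np.int64).copy()
    n = len(data)
    assert n & (n - 1) == 0, "sketch assumes a power-of-two length"

    # Upsweep (reduce): build partial sums up the tree.
    stride = 1
    while stride < n:
        # In the CUDA kernel this inner loop runs as one thread per pair.
        for i in range(2 * stride - 1, n, 2 * stride):
            data[i] += data[i - stride]
        stride *= 2

    # Downsweep: clear the root, then push prefixes back down the tree.
    data[n - 1] = 0
    stride = n // 2
    while stride >= 1:
        for i in range(2 * stride - 1, n, 2 * stride):
            tmp = data[i - stride]
            data[i - stride] = data[i]
            data[i] += tmp
        stride //= 2
    return data

flags = np.array([1, 0, 1, 1, 0, 1, 0, 1])
print(exclusive_scan(flags))  # [0 1 1 2 3 3 4 4]
np.testing.assert_array_equal(
    exclusive_scan(flags), np.concatenate(([0], np.cumsum(flags)[:-1]))
)
```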
Thanks
cc @Laurawly @zhiics @kevinthesun