[TOPI][CUDA] Add faster-rcnn proposal op #2420
Conversation
Thank you for working on this! Currently we are using hybrid script developed by @were to replace ir_builder, which makes the code much more readable and easier to debug. It would be nice if you could use hybrid script instead of ir_builder. You can take a look at the generic ssd operators PR to see how hybrid script works: #2353
There are still some blockers to moving to hybrid.
cc @were
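For readers unfamiliar with hybrid script, the style being suggested looks roughly like this (a minimal element-wise sketch, not the proposal op; it assumes the tvm.hybrid API referenced in #2353):

    import tvm
    from tvm import hybrid

    @hybrid.script
    def hybrid_add(a, b):
        # declare the output buffer; loops are written as plain Python
        c = output_tensor(a.shape, "float32")
        for i in range(a.shape[0]):
            c[i] = a[i] + b[i]
        return c

    # usage: call with placeholders, then schedule like any other op
    a = tvm.placeholder((128,), name="a")
    b = tvm.placeholder((128,), name="b")
    c = hybrid_add(a, b)
    s = tvm.create_schedule(c.op)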
topi/python/topi/cuda/vision.py
Outdated
Parameters
----------
outs: Array of Tensor
    The computation graph description of roi_align
roi_align -> proposal op
topi/python/topi/generic/vision.py
Outdated
Parameters
----------
outs: Array of Tensor
    The computation graph description of roi_align
roi_align -> proposal op
@vinx13 is this implementation supposed to mirror the mxnet one?
for i in const_range(a_compilation_time_const):
    # do something
Then the …
offset = start + 2 * tid + (k % 2)
with ib.if_scope(
        tvm.all(offset + 1 < num_bbox, p_data[offset] < p_data[offset + 1])):
    temp_data[0] = p_data[offset]
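(For context, the rest of the compare-and-swap that this snippet begins looks roughly like the following; this is a hedged reconstruction rather than the exact PR code, with p_index as a stand-in name for the index buffer being argsorted.)

    # ... continuing inside the if_scope above: swap data and indices using
    # temp_data / temp_index, which are intended as per-thread scratch
    p_data[offset] = p_data[offset + 1]
    p_data[offset + 1] = temp_data[0]
    temp_index[0] = p_index[offset]
    p_index[offset] = p_index[offset + 1]
    p_index[offset + 1] = temp_index[0]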
Because different offsets are executed in parallel, if two different offsets both satisfy the condition, they'll compete for temp_data[0] (the same happens for temp_index[0]). In that case, the argsort result is wrong.
temp_data is local scoped, so each thread will have its own temp_data.
Have you tested your argsort on a slightly larger dataset, such as data_buf with shape (1, 500)? I tested it locally and it failed. The testing script can be found here: https://gist.github.com/Laurawly/66e8105c8db300bbce0771c1e58853ad
This is a potential bug in tvm: temp_index and temp_data are in global memory, though I declared them to be local. In fact, if the allocate statement is the first one emitted by ir builder, the memory goes to global because the allocate is outside the produce of the extern op. The ir is like
// attr .. storage_scope = 'local'
allocate ...
produce extern {
    ...
}
cc @tqchen
hmm, it would be great if you could look into it a bit
@Laurawly tvm can correctly allocate local memory as long as you put the allocate statement inside the thread scope (the scope_attr statement in ir builder). Otherwise the storage rewrite pass cannot find the attach point of the allocation and puts it at the beginning (which will be allocated as global, but we may add an assertion). I think the current argsort works after we fix the global barrier #2473
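A minimal sketch of that pattern (assuming the tvm.ir_builder API used in this PR; the names mirror the snippets above): enter the thread scope first, then allocate, so the buffer is attached inside the kernel and the 'local' scope is honored.

    import tvm

    def sketch_ir(num_bbox, max_threads=64):
        ib = tvm.ir_builder.create()
        tx = tvm.thread_axis("threadIdx.x")
        bx = tvm.thread_axis("blockIdx.x")
        # enter the thread scope first ...
        ib.scope_attr(tx, "thread_extent", max_threads)
        ib.scope_attr(bx, "thread_extent", num_bbox // max_threads + 1)
        # ... then allocate, so the allocation has an attach point inside the
        # kernel and becomes per-thread local scratch rather than global memory
        temp_data = ib.allocate("float32", (1,), name="temp_data", scope="local")
        temp_index = ib.allocate("int32", (1,), name="temp_index", scope="local")
        temp_data[0] = 0.0
        temp_index[0] = 0
        return ib.get()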
@Laurawly Seems I somehow dropped some commits here. After adding the global barrier, I can get the right result, but sometimes there is a deadlock, possibly because of a bug in the global barrier.
@vinx13 Yeah, it works for me up to (1, 6000). But it's good enough for the test cases.
Yes, this is a mirror of the mxnet one.
nthread_tx = max_threads
nthread_bx = num_bbox // max_threads + 1
ib.scope_attr(tx, "thread_extent", nthread_tx)
ib.scope_attr(bx, "thread_extent", nthread_bx)
@Laurawly In nms_ir, sometimes valid bboxes are dropped due to conflicts. The only thing I do to fix this issue is binding bx to virtual threads. Could you take a look here?
@vinx13 Your nms_ir looks very similar to mine. I haven't faced test cases that drop bboxes. But one solution I have in mind: instead of parallelizing on the i axis, shall we parallelize on l and put i into a for loop for serialized writing to p_out?
@Laurawly I don't know how l can be parallelized, since a sequential dependency on l is required.
@vinx13 I recently tested nms on a Mali GPU and the data race occurred; when I changed blockIdx.x to vthread and added synchronization, it worked. I was also able to use blockIdx.x if I initialized p_out to -1, but I still needed the synchronization to make it work.
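The virtual-thread binding being discussed looks roughly like this in ir_builder (a sketch under the same API assumptions; the lowering pass serializes virtual threads, which avoids the conflicting writes seen with a real blockIdx.x binding):

    import tvm

    def bind_axes(ib, nthread_tx, nthread_bx):
        # bind the inner axis to real CUDA threads ...
        tx = tvm.thread_axis("threadIdx.x")
        ib.scope_attr(tx, "thread_extent", nthread_tx)
        # ... and the outer axis to virtual threads instead of blockIdx.x
        vx = tvm.thread_axis("vthread")
        ib.scope_attr(vx, "virtual_thread", nthread_bx)
        return tx, vx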
@masahi please follow up to moderate this PR per https://docs.tvm.ai/contribute/committer_guide.html :)
@vinx13 is this PR still WIP?
@masahi yes, I'm trying to solve the data race. Sorry, I was on vacation; I will pick this up soon.
@Laurawly I have double-checked and found that the dropped boxes are due to floating point precision loss. When the number of boxes increases, there is a slight chance that the iou is very close to the threshold, so tvm and the reference implementation in mxnet produce different results.
thanks @vinx13 @Laurawly @kevinthesun this is merged.
* [TOPI][CUDA] Add faster-rcnn proposal op
* Fix doc
* Add global barrier
* Use vthread in argsort
* Update sort and nms ir
* Fix lint
* Update sort ir in ssd nms
Please review @masahi @FrozenGene @tqchen @kevinthesun