
[TOPI][CUDA] Add faster-rcnn proposal op #2420

Merged (7 commits) on Feb 14, 2019

Conversation

@vinx13 (Member) commented Jan 11, 2019

@kevinthesun (Contributor) commented:

Thank you for working on this! Currently we are using the hybrid script developed by @were to replace ir_builder, which makes the code much more readable and easier to debug. It would be nice if you could use hybrid script instead of ir_builder. You can take a look at the generic SSD operators PR to see how hybrid script works: #2353
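For reference, a minimal sketch of what a hybrid-script kernel looks like compared to ir_builder (the operator and names are illustrative, assuming the tvm.hybrid API of this period):

import tvm
from tvm import hybrid

# Hypothetical toy kernel: plain Python loops and indexing are parsed into
# TVM IR by the decorator, so no manual IR construction is needed.
@hybrid.script
def add_one(a):
    b = output_tensor(a.shape, a.dtype)
    for i in range(a.shape[0]):
        b[i] = a[i] + 1.0
    return b

A = tvm.placeholder((128,), name="A", dtype="float32")
B = add_one(A)
s = tvm.create_schedule(B.op)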

@vinx13 force-pushed the feature/rpn branch 2 times, most recently from cef8b92 to fe9830c on January 11, 2019 07:15
@vinx13 (Member, Author) commented Jan 11, 2019

Parameters
----------
outs: Array of Tensor
The computation graph description of roi_align
Member (review comment):

roi_align -> proposal op

Parameters
----------
outs: Array of Tensor
The computation graph description of roi_align
Member (review comment):

roi_align -> proposal op

@masahi (Member) commented Jan 11, 2019

@vinx13 is this implementation supposed to mirror the mxnet one?

@were (Contributor) commented Jan 12, 2019

@vinx13

1. This is supported by [Hybrid Script] Unify the symbol tables to one; support tvm.container.Array #2366. You can use

for i in const_range(a_compile_time_const):
    # do something

Then the # do something body will be unrolled by the hybrid compiler.

2. Casting is easy for me to support later. I apologize; I should have done it earlier.

3. My suggestion for getting rid of this global function is to wrap the value with tvm.const and pass it to the hybrid function as a parameter (see the sketch below).
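A minimal sketch of that suggestion (names are hypothetical; it assumes a hybrid function can take a tvm.const scalar argument as described above):

import tvm
from tvm import hybrid

# The scale is no longer read as a module-level global inside the kernel; it
# is wrapped with tvm.const at the call site and passed in as a parameter.
@hybrid.script
def scale_rows(data, scale):
    out = output_tensor(data.shape, data.dtype)
    for i in range(data.shape[0]):
        out[i] = data[i] * scale
    return out

data = tvm.placeholder((256,), name="data", dtype="float32")
out = scale_rows(data, tvm.const(16.0, "float32"))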

offset = start + 2 * tid + (k % 2)
with ib.if_scope(
        tvm.all(offset + 1 < num_bbox, p_data[offset] < p_data[offset + 1])):
    temp_data[0] = p_data[offset]
Contributor (review comment):

Because different offsets are executed in parallel, if two different offsets both satisfy the condition, they will compete for temp_data[0] (and likewise for temp_index[0]). In that case, the argsort result is wrong.

Member Author (review comment):

temp_data is local-scoped, so each thread will have its own temp_data.

Contributor (review comment):

Have you tested your argsort on a slightly larger dataset, e.g. data_buf with shape (1, 500)? I tested it locally and it failed. The testing script can be found here: https://gist.github.com/Laurawly/66e8105c8db300bbce0771c1e58853ad

@vinx13 (Member, Author) commented Jan 14, 2019:

This is a potential bug in tvm: temp_index and temp_data end up in global memory even though I declared them as local. If the allocate statement is the first one emitted by ir_builder, the memory goes to global because the allocate sits outside the produce of the extern op. The IR looks like

// attr ... storage_scope = 'local'
allocate ...
produce extern {
   ...
}

cc @tqchen

Member (review comment):

Hmm, it would be great if you could look into it a bit.

@vinx13 (Member, Author) commented Jan 22, 2019:

@Laurawly tvm can correctly allocate local memory as long as you put the allocate statement inside the thread scope (i.e. after the scope_attr statement in ir_builder). Otherwise the storage rewrite pass cannot find an attach point for the allocation and puts it at the beginning, where it gets allocated as global (we may want to add an assertion for this). I think the current argsort works after we fix the global barrier in #2473.
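A minimal ir_builder sketch of the ordering described above (extent and names are illustrative): the scope_attr binding is emitted first, so the allocation has an attach point inside the thread scope and stays thread-local.

import tvm

ib = tvm.ir_builder.create()
tx = tvm.thread_axis("threadIdx.x")
ib.scope_attr(tx, "thread_extent", 64)   # enter the thread scope first
# Allocated after scope_attr, so the allocation attaches inside the thread
# scope and each thread gets its own copy.
temp_data = ib.allocate("float32", (1,), name="temp_data", scope="local")
temp_index = ib.allocate("int32", (1,), name="temp_index", scope="local")
temp_data[0] = 0.0
temp_index[0] = 0
stmt = ib.get()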

Contributor (review comment):

@vinx13 I pulled the changes in #2473 and tested your argsort locally but it still fails on test cases with shape (1,500)

@vinx13 (Member, Author) commented Jan 23, 2019:

@Laurawly It seems I somehow dropped some commits here. After adding the global barrier, I get the right result, but sometimes there is a deadlock, possibly because of the bug in the global barrier.

Contributor (review comment):

@vinx13 Yeah, it works for me for shapes up to (1, 6000), which is good enough for the test cases.

Member Author (review comment):

@Laurawly I changed it to use a single block + vthreads because I ran into some deadlocks when I repeated the test.
@masahi Now we don't need to use the global barrier. But we still need storage_sync; @were, can we support this in hybrid?

@vinx13 (Member, Author) commented Jan 13, 2019

> @vinx13 is this implementation supposed to mirror the mxnet one?

Yes, this is a mirror of the mxnet MultiProposal op.

@masahi masahi self-assigned this Jan 20, 2019
@vinx13 vinx13 changed the title [TOPI][CUDA] Add faster-rcnn proposal op [WIP][TOPI][CUDA] Add faster-rcnn proposal op Jan 24, 2019
nthread_tx = max_threads
nthread_bx = num_bbox // max_threads + 1
ib.scope_attr(tx, "thread_extent", nthread_tx)
ib.scope_attr(bx, "thread_extent", nthread_bx)
Member Author (review comment):

@Laurawly In nms_ir, sometimes valid bboxes are dropped due to conflicts. The only thing I did to fix this issue was binding bx to virtual threads. Could you take a look here?
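A minimal sketch (extents are illustrative) of the binding change being described, relative to the excerpt above: the outer axis is bound to a virtual thread instead of blockIdx.x, so it is expanded inside the kernel rather than mapped to independent blocks whose writes can conflict.

import tvm

max_threads = 64   # illustrative extents
nthread_bx = 8

ib = tvm.ir_builder.create()
tx = tvm.thread_axis("threadIdx.x")
bx = tvm.thread_axis("vthread")          # was tvm.thread_axis("blockIdx.x")
ib.scope_attr(tx, "thread_extent", max_threads)
ib.scope_attr(bx, "virtual_thread", nthread_bx)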

@Laurawly (Contributor) commented Jan 31, 2019:

@vinx13 Your nms_ir looks very similar to mine. I haven't run into test cases that drop bboxes. But one solution I have in mind: instead of parallelizing on the i axis, shall we parallelize on l and put i in a for loop, so that writes to p_out are serialized?

Member Author (review comment):

@Laurawly I don't see how l can be parallelized, since a sequential dependency on l is required.

Contributor (review comment):

@vinx13 I recently tested nms on a Mali GPU and the data race occurred; when I changed blockIdx.x to vthread and added synchronization, it worked. I was also able to use blockIdx.x if I initialized p_out to -1, but I still needed the synchronization to make it work.

@tqchen (Member) commented Feb 12, 2019

@masahi please follow up and moderate this PR, per https://docs.tvm.ai/contribute/committer_guide.html :)

@masahi (Member) commented Feb 12, 2019

@vinx13 is this PR still WIP?

@vinx13 (Member, Author) commented Feb 13, 2019

@masahi Yes, I'm trying to solve the data race. Sorry, I was on vacation; I will pick this up soon.

@vinx13 (Member, Author) commented Feb 14, 2019

@Laurawly I have double-checked and found that the dropped boxes are due to floating-point precision loss. When the number of boxes increases, there is a slight chance that the IoU is very close to the threshold, so tvm and the reference implementation in mxnet produce different results.
@masahi This PR is ready.

@vinx13 vinx13 changed the title [WIP][TOPI][CUDA] Add faster-rcnn proposal op [TOPI][CUDA] Add faster-rcnn proposal op Feb 14, 2019
@masahi masahi merged commit d20646c into apache:master Feb 14, 2019
@masahi (Member) commented Feb 14, 2019

thanks @vinx13 @Laurawly @kevinthesun this is merged.

libing4752 pushed a commit to libing4752/tvm that referenced this pull request Feb 18, 2019
* [TOPI][CUDA] Add faster-rcnn proposal op

* Fix doc

* Add global barrier

* Use vthread in argsort

* Update sort and nms ir

* Fix lint

* Update sort ir in ssd nms
wweic pushed a commit to neo-ai/tvm that referenced this pull request Feb 20, 2019
* [TOPI][CUDA] Add faster-rcnn proposal op

* Fix doc

* Add global barrier

* Use vthread in argsort

* Update sort and nms ir

* Fix lint

* Update sort ir in ssd nms
@yzhliu yzhliu mentioned this pull request Mar 2, 2019