[BugFix] SSD fully supported on GPUs, updated deploy_ssd tutorial #2510
tutorials/nnvm/deploy_ssd.py
#ctx = tvm.gpu(0)
# Use these commented settings to build for opencl.
#target = 'opencl'
#ctx = tvm.gpu(0)
If I remember correctly, for opencl it should be tvm.opencl(0) or tvm.cl(0), shouldn't it?
Yes, sorry, I forgot to change it.
I agree with putting sort in a common file, and we can add a unit test for it as well.
with ib.for_range(0, batch, for_type="unroll") as b:
    start = b * num_anchors
    with ib.if_scope(tid < num_anchors):
        p_out[start + tid] = tid
It seems storage_sync is missing here. I will update my PR.
@vinx13 Would you like to move argsort into a separate file so that we can share it? I can add a unit test for it if needed.
@Laurawly What's needed in SSD? It seems that you changed num_bbox in my PR to p_index[0]. Why is only the first element of p_index used?
Maybe we can make argsort a normal topi op? I'll add a CPU implementation later.
@vinx13 p_index is the valid_count variable, a 1-D array produced by the multibox operators. So instead of sorting all data.shape[1] numbers, we only need to sort the first p_index[0] numbers.
@Laurawly Shouldn't it be p_index[batch_id]? Are you assuming batch = 1?
@vinx13 p_index only has one dimension, so it should be p_index[0].
@kevinthesun @Laurawly The difficulty of sharing argsort (or extracting it as a topi operator) is that we want sort_num to be either a tvm.Tensor or a constant array, but we can't use a tvm.Expr to subscript a Python array. Do you have any ideas?
with ib.else_scope():
    start = sizes[tid-1]
p_out[base_idx + k * axis_mul_after] = tvm.if_then_else(
    k < p_index[tid], index_new[k+start], k)
@Laurawly I'm still confused. If batch > 1, it should enter this if branch (since axis_mul_before * axis_mul_after > 1). Does p_index[tid] here mean that each batch has a different valid count?
@vinx13 From here https://github.com/dmlc/tvm/blob/master/topi/python/topi/cuda/nms.py#L368 axis is always 1, so axis_mul_before and axis_mul_after are both 1.
@Laurawly Since ndim == 2 and axis == 1, the actual loop is like
for i in range(0, 2):
    if i < 1:
        axis_mul_before *= data.shape[i]
I assume that axis_mul_after == 1 and axis_mul_before == data.shape[0], which is the batch size, right?
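The claim in this comment can be checked with a small pure-Python sketch. The helper name `axis_products` is hypothetical and not part of TOPI. It mirrors the quoted loop for the dimensions before `axis`; the handling of dimensions after `axis` is an assumption, since the comment only shows the `i < axis` branch.

```python
def axis_products(shape, axis):
    """Return (axis_mul_before, axis_mul_after) for a shape and a sort axis.

    Hypothetical helper for illustration only; not a TOPI function.
    """
    axis_mul_before = 1
    axis_mul_after = 1
    for i, dim in enumerate(shape):
        if i < axis:
            # Same accumulation as the loop quoted in the comment.
            axis_mul_before *= dim
        elif i > axis:
            # Assumption: trailing dimensions multiply into axis_mul_after.
            axis_mul_after *= dim
    return axis_mul_before, axis_mul_after


# A 2-D input of shape (batch, num_anchors), sorted along axis 1.
before, after = axis_products((4, 100), axis=1)
print(before, after)  # 4 1
```

With ndim == 2 and axis == 1, axis_mul_before equals the batch size, so it is 1 only when the batch size is 1.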
@vinx13 Yeah, that's right. I see what you mean. So each batch could have a different valid count when batch_size > 1. I shouldn't have assumed batch_size = 1. I just pushed the changes.
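The point settled in this thread, that each batch sorts only its own number of valid entries, can be sketched in plain Python. This is not the GPU IR from the PR: the function name `argsort_valid` is hypothetical, and descending order is an assumption made for the illustration. Entries past the valid count keep their own index, following the `if_then_else(k < p_index[tid], ..., k)` fallback in the snippet quoted earlier.

```python
def argsort_valid(data, valid_count):
    """Argsort each row of `data` over its first valid_count[b] entries only.

    Hypothetical reference sketch; not the TOPI implementation.
    """
    out = []
    for b, row in enumerate(data):
        n = valid_count[b]  # per-batch count, rather than valid_count[0]
        # Assumption: scores are ranked in descending order.
        order = sorted(range(n), key=lambda k: row[k], reverse=True)
        # Positions at or beyond the valid count map to their own index.
        out.append(order + list(range(n, len(row))))
    return out


scores = [[0.2, 0.9, 0.5, 0.0],
          [0.7, 0.1, 0.0, 0.0]]
print(argsort_valid(scores, [3, 2]))  # [[1, 2, 0, 3], [0, 1, 2, 3]]
```

Using valid_count[0] for every row would give the same result only when batch_size = 1 or when all rows happen to share one valid count.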
@Laurawly Btw have you checked the data race in nms ir? Seems __syncthreads and global barrier (maybe we should rewrite the ir to avoid global barrier) are needed on CUDA. I sometimes get incorrect nms results in my pr.
@vinx13 Does the conflict happen in argsort_ir?
@Laurawly the conflict happens in nms_ir, I replaced
@vinx13 I don't see conflicts in my nms_ir using
@Laurawly If the data written by other threads is needed (probably this line
@vinx13 There's no data conflict for
@Laurawly the writing
@vinx13 No, because there's a condition that
@Laurawly I see, thanks for your clarification
thanks @Laurawly @vinx13 @kevinthesun @zhiics this is merged.
…ache#2510)
* nms fixed for gpu, tested on cuda and opencl devices, ssd now can run fully on the gpu
* sort updated to use virtual thread
* typo fixed
* fix lint
* fix lint
* add support when batch_size > 1
* intel graphics conv2d bugs fixed for inception_v3
* intel conv2d api updated, nn input size 4 condition added
* review addressed
* move conv_tags to attributes
* opencl ctx fixed
* nms_ir index simplified
Thanks to @vinx13's PR #2420, argsort now works on GPUs.
Tested the full SSD pipeline on NVIDIA K80c and Intel HD graphics. Performance improved compared with the heterogeneous results.
Please review @masahi @kevinthesun @zhiics