Optimize where_index_op(fused kernel) #30556
Closed
PR types
Performance optimization
PR changes
OPs
Describe
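Background:
The timeline shows that where_index_op performs several memcpy calls and spends a large amount of time on the CPU. The code confirms the cause: its use of thrust::vector triggers several copies on the default stream, and true_index is also computed on the CPU.
Optimization goals:
Eliminate the copies on the default stream (required).
No warp divergence (high priority) and no bank conflicts (high priority).
Third optimization: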
commit:de2131f
Changes:
Results:
Second optimization:
commit:436b64a
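Changes:
KeGetTrueIndex
Results:
The improvement is fairly significant.
Remaining issues:
KeGetTrueIndex has not been optimized at all and takes a very long time; it urgently needs further work.
Copying true_num requires synchronizing the entire stream, which could perhaps be optimized.
It is worth comparing against another PR, PR30601. That PR's biggest problem is that it depends heavily on how thrust::inclusive_scan performs on the default stream; with a hand-written prefix sum kernel the result would surely be much better.
First optimization: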
commit:9344b79
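Changes:
Moved the CPU computation of true_index to the GPU, in the KeGetTrueIndex kernel.
In KeGetTrueIndex, each thread counts the true_num of its own region, a single thread is then launched to compute the true_num of each region, and finally true_index is written to the corresponding positions.
Replaced the thrust library's default stream copies with asynchronous copies via memory::Copy.
Results, based on mask_rcnn_r50_fpn_1x_coco + coco17 + statistics from the first 18 entries: the improvement is not very noticeable, and further optimization is needed.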
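The three phases described for KeGetTrueIndex can be modeled on the host as follows. This is only a sketch under stated assumptions, not the PR's kernel: the function name WhereIndexReference and the num_regions parameter are hypothetical, the input is treated as a flattened condition array, the sequential loops stand in for GPU threads, and the single-thread phase is assumed to turn the per-region counts into starting offsets (an exclusive prefix sum).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model of the three phases; names are illustrative
// and not taken from Paddle.
std::vector<int64_t> WhereIndexReference(const std::vector<bool>& cond,
                                         std::size_t num_regions) {
  assert(num_regions > 0);
  const std::size_t n = cond.size();
  // Each simulated thread owns one contiguous region of this size.
  const std::size_t region = (n + num_regions - 1) / num_regions;

  // Phase 1: every "thread" counts the true elements in its own region.
  std::vector<int64_t> true_num(num_regions, 0);
  for (std::size_t t = 0; t < num_regions; ++t) {
    const std::size_t begin = std::min(t * region, n);
    const std::size_t end = std::min(begin + region, n);
    for (std::size_t i = begin; i < end; ++i) {
      if (cond[i]) ++true_num[t];
    }
  }

  // Phase 2 (assumed): one "thread" converts the counts into starting
  // offsets with an exclusive prefix sum.
  std::vector<int64_t> offset(num_regions, 0);
  int64_t total = 0;
  for (std::size_t t = 0; t < num_regions; ++t) {
    offset[t] = total;
    total += true_num[t];
  }

  // Phase 3: every "thread" rescans its region and writes the flat indices
  // of true elements starting at its offset.
  std::vector<int64_t> true_index(static_cast<std::size_t>(total));
  for (std::size_t t = 0; t < num_regions; ++t) {
    const std::size_t begin = std::min(t * region, n);
    const std::size_t end = std::min(begin + region, n);
    int64_t pos = offset[t];
    for (std::size_t i = begin; i < end; ++i) {
      if (cond[i]) true_index[static_cast<std::size_t>(pos++)] = static_cast<int64_t>(i);
    }
  }
  return true_index;
}
```

Because each region writes into its own slice of the output, phases 1 and 3 need no coordination between threads; in this model only phase 2 is serial.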
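Remaining issues:
KeGetTrueIndex does not achieve coalesced memory access at all, nor the goals of no warp divergence and no bank conflicts.
true_num still requires a default stream copy to ensure that out gets the correct value; could this be removed and replaced with a stream synchronization?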
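The hand-written prefix sum kernel suggested in the notes on the second optimization could take several forms. One common option is a Hillis-Steele inclusive scan. The sketch below models it on the host and is an illustration only: the name InclusiveScanSketch is hypothetical, each outer iteration corresponds to one synchronized step of an assumed GPU kernel, and the inner loop stands in for the threads that would run in parallel.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side model of a Hillis-Steele inclusive scan (illustrative name).
// After the step with a given stride, element i holds the sum of the
// last 2 * stride inputs ending at i (or of all inputs up to i, if fewer).
std::vector<int64_t> InclusiveScanSketch(std::vector<int64_t> in) {
  std::vector<int64_t> out(in.size());
  for (std::size_t stride = 1; stride < in.size(); stride *= 2) {
    // On a GPU, each i would be handled by its own thread.
    for (std::size_t i = 0; i < in.size(); ++i) {
      out[i] = (i >= stride) ? in[i] + in[i - stride] : in[i];
    }
    // Double buffering: the output of this step is the input of the next.
    std::swap(in, out);
  }
  return in;
}
```

Applied to per-region true_num counts, the scanned array gives the end position of each region's slice in the output, and its last element is the total number of true elements. This variant takes about log2(n) steps but performs more additions in total than a serial scan, so whether it beats thrust::inclusive_scan would need to be measured.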