Optimize where_index_op(fused kernel) #30556

Closed

@thisjiang thisjiang commented Jan 19, 2021

PR types

Performance optimization

PR changes

OPs

Describe

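Cause:
The timeline shows that where_index_op performs multiple memcpy operations and has many CPU hot spots. The code confirms this: because it uses thrust::vector, several copies are issued on the default stream, and true_index is computed on the CPU.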

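Optimization approach:

  1. Move the CPU computation to the GPU (required) and remove the copies on the default stream (required)
  2. Coalesced memory access (high priority), no warp divergence (high priority), and no bank conflicts (high priority)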

Third optimization:
commit: de2131f
Changes:

  1. Keep the ptr_stride values and the intermediate values of the out_ptr computation in registers to reduce global memory accesses
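The idea of holding the stride and the intermediate coordinate in registers can be sketched in plain C++, where local variables stand in for registers. This is not the code from the PR: the helper name `FlatToCoords` and its parameters are hypothetical, and it assumes that the op converts each flat true index into per-dimension coordinates by dividing by row-major strides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): converts flat indices into
// per-dimension coordinates, stored row-major as [num_true x rank].
std::vector<int64_t> FlatToCoords(const std::vector<int64_t>& true_index,
                                  const std::vector<int64_t>& stride) {
  const std::size_t rank = stride.size();
  std::vector<int64_t> out(true_index.size() * rank);
  for (std::size_t k = 0; k < true_index.size(); ++k) {
    int64_t remaining = true_index[k];  // kept local, never re-read from out
    for (std::size_t j = 0; j < rank; ++j) {
      const int64_t s = stride[j];  // stride is read once per dimension
      assert(s > 0);
      const int64_t coord = remaining / s;  // intermediate value stays local
      remaining -= coord * s;
      out[k * rank + j] = coord;  // one store per output element
    }
  }
  return out;
}
```

For a tensor of shape (2, 3) the row-major strides are {3, 1}, so `FlatToCoords({1, 2, 4}, {3, 1})` yields the coordinate pairs (0, 1), (0, 2), and (1, 1). The point of the sketch is that the output buffer is only written, never read back, while the running remainder lives in a local variable.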

Result:

Version ips
Before 6.083488889
After 6.628144444

Second optimization:
commit: 436b64a
Changes:

  1. Fuse all the kernels into a single kernel, KeGetTrueIndex
  2. Change the copy of true_num to an asynchronous copy

Result:

Version ips
Before 6.083488889
After 6.590188889

The improvement is fairly significant.

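Still to optimize:

  1. Only one block is launched, so the kernel takes a long time; this urgently needs further optimization
  2. The fused kernel KeGetTrueIndex has not been optimized at all and takes a very long time; this urgently needs further optimization
  3. Copying true_num requires synchronizing the whole stream, which could perhaps be optimized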

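It is worth comparing this with another PR, PR30601. The biggest problem with that PR is that it depends heavily on how thrust::inclusive_scan performs on the default stream; with a hand-written prefix sum kernel, the result should be much better than this one.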
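As a rough illustration of what a hand-written prefix sum could look like, here is a plain C++ simulation of a step-doubling (Hillis-Steele style) inclusive scan. It is not code from either PR: the function name `InclusiveScanSteps` is hypothetical, the inner loop stands in for threads running in parallel, and the swap between steps stands in for a barrier plus double buffering.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not from either PR): inclusive prefix sum computed
// in log2(n) passes, the way a simple parallel scan would proceed.
std::vector<int64_t> InclusiveScanSteps(const std::vector<int64_t>& in) {
  std::vector<int64_t> cur = in;
  std::vector<int64_t> next(in.size());
  for (std::size_t offset = 1; offset < cur.size(); offset *= 2) {
    // On a GPU, each i would be handled by its own thread.
    for (std::size_t i = 0; i < cur.size(); ++i) {
      next[i] = cur[i] + (i >= offset ? cur[i - offset] : 0);
    }
    // Double buffering: the next pass reads what this pass wrote.
    std::swap(cur, next);
  }
  return cur;
}
```

Applied to a 0/1 condition mask, the scan value at a true position minus one is that element's slot in the output, which is how a scan-based where_index would place its results.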

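First optimization:
commit: 9344b79
Changes:

  1. Moved the CPU computation of true_index to the GPU, in KeGetTrueIndex
  2. In KeGetTrueIndex, each thread counts the true_num of its own region, then a single thread computes the true_num of each region, and finally true_index is written to the corresponding positions
  3. Replaced the thrust library's default stream copies with asynchronous copies via memory::Copy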
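The count, accumulate, and write scheme described in item 2 can be sketched in plain C++ as three phases. This is a sketch under assumptions, not the kernel from the PR: the name `TrueIndexByRegions` is hypothetical, each region stands in for one GPU thread, and the single-thread step is assumed to turn the per-region counts into starting offsets.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the PR): returns the flat indices of all
// true elements, processing `cond` in `num_regions` contiguous regions.
std::vector<int64_t> TrueIndexByRegions(const std::vector<bool>& cond,
                                        std::size_t num_regions) {
  const std::size_t n = cond.size();
  if (num_regions == 0) num_regions = 1;
  const std::size_t chunk = (n + num_regions - 1) / num_regions;

  // Phase 1: each region (one thread on the GPU) counts its true elements.
  std::vector<int64_t> region_count(num_regions, 0);
  for (std::size_t r = 0; r < num_regions; ++r) {
    const std::size_t begin = std::min(n, r * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    for (std::size_t i = begin; i < end; ++i) {
      if (cond[i]) ++region_count[r];
    }
  }

  // Phase 2: a single thread turns the counts into starting offsets.
  std::vector<int64_t> region_offset(num_regions, 0);
  int64_t true_num = 0;
  for (std::size_t r = 0; r < num_regions; ++r) {
    region_offset[r] = true_num;
    true_num += region_count[r];
  }

  // Phase 3: each region writes its indices starting at its own offset.
  std::vector<int64_t> true_index(static_cast<std::size_t>(true_num));
  for (std::size_t r = 0; r < num_regions; ++r) {
    const std::size_t begin = std::min(n, r * chunk);
    const std::size_t end = std::min(n, begin + chunk);
    int64_t pos = region_offset[r];
    for (std::size_t i = begin; i < end; ++i) {
      if (cond[i]) true_index[static_cast<std::size_t>(pos++)] = static_cast<int64_t>(i);
    }
    assert(pos == region_offset[r] + region_count[r]);
  }
  return true_index;
}
```

Because every region knows its starting offset before writing, the regions never write to the same output slot, so phases 1 and 3 can run fully in parallel; only phase 2 is serial.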

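Result, based on mask_rcnn_r50_fpn_1x_coco + coco17, using statistics from the first 18 records:

Version ips
Before 6.083488889
After 5.9189

The effect is not very noticeable; further optimization is needed.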

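Still to optimize:

  1. KeGetTrueIndex achieves none of the goals: coalesced memory access, no warp divergence, no bank conflicts
  2. true_num still needs a default stream copy to make sure out gets the correct value; could this be removed and replaced with a stream synchronization?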

@paddle-bot-old

Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@thisjiang thisjiang changed the title Optimize where_index_op Optimize where_index_op(fused kernel) Jan 20, 2021
@paddle-bot-old

Sorry to inform you that de2131f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@thisjiang thisjiang closed this Mar 15, 2021
@thisjiang thisjiang deleted the optimize-whereindexop branch December 10, 2021 03:50