Add tf32 support for A100 tensor core acceleration for cuBLAS #28732
Conversation
Update forked PaddlePaddle
Update my fork
update from PaddlePaddle
Update forked paddle repo
Update USERNAME/paddle
update Paddle USERNAME repo
update username repo
Thanks for your contribution!
void tf32_switch_on_off(bool active) { allow_tf32_cublas = active; }

bool get_tf32_switch() { return allow_tf32_cublas; }
Wouldn't it be better to put this inside the cuda device context?
Sure. We will consider changing this together when the cudnn switch is added, and the variable and function names will be adjusted accordingly.
python/paddle/fluid/tf32_switch.py
Outdated
@@ -0,0 +1,78 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
2020
Done.
@@ -0,0 +1,54 @@
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
2020
Done.
Regarding the function and variable names, they will be changed in the cudnn PR.
python/paddle/fluid/tf32_switch.py
Outdated
"""
get the state of tf32 switch.

Args:
For both of these docstrings, please refer to how other APIs document theirs.
The API has been removed for now.
Fixed unit test when no CUDA device is available
LGTM
@@ -57,6 +57,10 @@ struct GpuDevice;
namespace paddle {
namespace platform {

#ifdef PADDLE_WITH_CUDA
static bool allow_tf32_cublas{true};
Why isn't this variable a member of device_context?
With it as a global variable, the Python side can use a temporary CUDADeviceContext to flip the switch and change the value of this global. Because the value changes globally, any other CUDADeviceContext object that calls AllowTF32Cublas() will see the current value of allow_tf32_cublas, so TF32 can be turned on or off at any time.
Keeping it global keeps this logic straightforward.
If you use member functions of a class to set and get a variable, then that variable should be defined as a member of the class. If it is a global variable, it should be set and get through global functions; don't mix the two in such an odd way.
Besides, even if it were a class member, there are better ways to expose it to the Python side; take a look at how pytorch does it.
Fixed: the functions are now written as global functions. If the switch were a member of CUDADeviceContext, the CUDADeviceContext instance holding the switch could not enable or disable TF32 for the instance in which the matmul actually runs, so the switch would have no effect. That is why the switch-related code lives at global scope.
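For reference, a minimal standalone sketch of the global-flag pattern described above, with free set/get functions as the review suggested (names mirror the diff; the exact file placement and linkage are assumptions):

// device_context.cc-style sketch: one global switch plus free helpers to
// set and read it.
namespace paddle {
namespace platform {

// On by default. Because it is a single global, any CUDADeviceContext that
// later calls AllowTF32Cublas() observes the current value, so the switch
// can be flipped at any point during a run.
static bool allow_tf32_cublas{true};

void SetAllowTF32Cublas(bool active) { allow_tf32_cublas = active; }
bool AllowTF32Cublas() { return allow_tf32_cublas; }

}  // namespace platform
}  // namespace paddle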
@@ -361,6 +361,12 @@ CUDADeviceContext::~CUDADeviceContext() {
#endif
}

void CUDADeviceContext::SetTF32Cublas(bool active) {
This function should be named SetAllowTF32Cublas.
Done
input_array1 = np.random.rand(4, 12, 64, 88).astype("float32")
input_array2 = np.random.rand(4, 12, 88, 512).astype("float32")
data1 = fluid.dygraph.to_variable(input_array1)
data2 = fluid.dygraph.to_variable(input_array2)
In the unit tests it is also recommended to consistently use the 2.0 API: fluid.dygraph.to_variable -> paddle.to_tensor
Done
#if CUDA_VERSION >= 11000
    if (AllowTF32Cublas()) {
      cublas_handle_.reset(
          new CublasHandleHolder(RawStream(), CUBLAS_TF32_TENSOR_OP_MATH));
So every time CublasCall is invoked, a new CublasHandleHolder is created?
Yes. When the initialization function InitCuBlasContext() is called, one is created, and the handle that cublas_handle_ points to uses CUBLAS_DEFAULT_MATH. Here, when the if condition holds, cublas_handle_ needs to point to a new handle that uses CUBLAS_TF32_TENSOR_OP_MATH.

If this if check were moved into InitCuBlasContext(), the situation would become: InitCuBlasContext() is only called when a new CUDADeviceContext object is constructed, so the state of the switch would be frozen at the moment that CUDADeviceContext is created; unless a new CUDADeviceContext were constructed later, toggling the switch would have no effect.

So the if check was placed inside CublasCall instead, which lets the switch be toggled at any time within a single CUDADeviceContext object.

If we did not create a new CublasHandleHolder on every CublasCall, we would need two different handles.
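To make that trade-off concrete, here is a rough, self-contained analogue of the per-call check (the PR itself resets cublas_handle_ to a new CublasHandleHolder, as the diff above shows; this sketch swaps in cublasSetMathMode so it can stand alone, and the wrapper name is hypothetical):

#include <cublas_v2.h>
#include <cuda.h>
#include <utility>

// Global switch consulted on every call, as in the PR.
static bool g_allow_tf32_cublas = true;

// Hypothetical CublasCall-style wrapper: the math mode is chosen per call,
// so flipping the switch takes effect without rebuilding the device context.
template <typename Callback>
void CublasCallSketch(cublasHandle_t handle, Callback&& callback) {
#if CUDA_VERSION >= 11000
  cublasSetMathMode(handle, g_allow_tf32_cublas ? CUBLAS_TF32_TENSOR_OP_MATH
                                                : CUBLAS_DEFAULT_MATH);
#endif
  std::forward<Callback>(callback)(handle);
}
// The alternative raised above: create two handles up front (one with
// CUBLAS_DEFAULT_MATH, one with CUBLAS_TF32_TENSOR_OP_MATH) and pick
// between them here, instead of reconfiguring a handle on every call.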
LGTM
PR types
New features
PR changes
Others
Describe
Feature: support a TF32 switch for cuBLAS. The switch is on by default; when the user sets it to false, TF32 is turned off.
Usage:
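The usage snippet from the original description was not captured here. As a hypothetical sketch, the switch described above could be flipped from C++ like this (function names follow the review discussion; the header path and namespace are assumptions):

#include "paddle/fluid/platform/device_context.h"  // assumed header for the switch helpers

void RunWithoutTF32() {
  // TF32 for cuBLAS is on by default in this PR; turn it off when full
  // float32 precision is needed ...
  paddle::platform::SetAllowTF32Cublas(false);

  // ... run matmul / GEMM workloads here ...

  // ... and turn it back on afterwards to recover the tensor-core speedup.
  paddle::platform::SetAllowTF32Cublas(true);
}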
Effect:
For multiplying two [10240, 10240] matrices:
execution time with the switch off: 0.113 s
execution time with the switch on: 0.017 s
with the switch on, performance is about 6.6x that of the switch off.
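As a rough illustration of how such a comparison could be reproduced outside Paddle, here is a hypothetical CUDA/cuBLAS micro-benchmark timing one FP32 GEMM of two 10240 x 10240 matrices with TF32 off and then on (matrix contents are left uninitialized since only the timing matters; error checking omitted for brevity):

#include <cublas_v2.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <cstdio>

// Times a single FP32 GEMM using whatever math mode `handle` is set to.
static float TimeGemmMs(cublasHandle_t handle, int n, const float* a,
                        const float* b, float* c) {
  const float alpha = 1.0f, beta = 0.0f;
  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);
  // Warm-up call so one-time setup cost is not measured.
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n,
              &beta, c, n);
  cudaEventRecord(start);
  cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n, &alpha, a, n, b, n,
              &beta, c, n);
  cudaEventRecord(stop);
  cudaEventSynchronize(stop);
  float ms = 0.0f;
  cudaEventElapsedTime(&ms, start, stop);
  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return ms;
}

int main() {
  const int n = 10240;
  float *a, *b, *c;
  cudaMalloc(&a, sizeof(float) * n * n);  // inputs left uninitialized:
  cudaMalloc(&b, sizeof(float) * n * n);  // only the timing is of interest
  cudaMalloc(&c, sizeof(float) * n * n);

  cublasHandle_t handle;
  cublasCreate(&handle);

#if CUDA_VERSION >= 11000
  cublasSetMathMode(handle, CUBLAS_DEFAULT_MATH);  // switch off
  printf("TF32 off: %.3f ms\n", TimeGemmMs(handle, n, a, b, c));

  cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);  // switch on
  printf("TF32 on : %.3f ms\n", TimeGemmMs(handle, n, a, b, c));
#endif

  cublasDestroy(handle);
  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
  return 0;
}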
Effect on the fixed unit tests:
Unit tests that previously failed because of precision:
After turning off the cuBLAS and cuDNN TF32 switches (the cuDNN switch is in a separate PR):