Optimize the error messages of paddle CUDA API #23816

Merged: 7 commits, Apr 20, 2020

@zhwesky2010 zhwesky2010 commented Apr 13, 2020

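Current problems

Current format of error messages from Nvidia-related APIs:
1. Users have to look the error up on the Nvidia website themselves. That site is hard to reach for users in China, it contains so much material that the relevant entry is hard to find, and most users will not click the link anyway.
2. Messages of the form "cudaFunction failed!" only say that a function call failed; they have little reference value for users and are hard to understand.

As a result the error messages are unfriendly, and users cannot analyze Nvidia-related API problems on their own. This is a significant problem that has produced many issues (more than ten in total).

  1. User issue 1: Urgent: Installed on Windows 10 by following the official website step by step. I installed CUDD10, cuDNN7.6.5 (also tried 7.3.1) and the other components, plus the matching python and pip versions. The installation went smoothly with no red text or errors, but verification never reports a successful installation. #21913
  2. User issue 2: On win10 (laptop with GTX1050TI), after installing cuda10.0 and the matching cudnn, the check reports an error; changing the graphics card sleep setting and rebooting did not help #22701
  3. User issue 3: fluid W0224 Compiled with WITH_GPU, but no GPU found in runtime. #22749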

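Upgrade plan

1. Refactored PADDLE_ENFORCE_CUDA_SUCCESS. Developers simply call PADDLE_ENFORCE_CUDA_SUCCESS(error), where error can be any of five types: cudaError_t (CUDA API), curandStatus_t (cuRAND API), cudnnStatus_t (cuDNN API), cublasStatus_t (cuBLAS API), or ncclResult_t (NCCL API). This covers 314 Nvidia-related API calls in Paddle. Developers no longer write the message by hand, because for external errors from Nvidia-related APIs they cannot provide useful information anyway. What they write is typically:

cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl
cudaEventRecord raises unexpected exception

that is, a statement that a function call failed. That information is already available in the C++ call stack, so it does not need to be exposed in the Error Summary, the most important part of the report; it gives users no real help and makes the report harder to understand.

2. The new error messages are filled in automatically by the system from the error code. They are either scraped from the Nvidia website by a crawler or obtained automatically through APIs such as ncclGetErrorString. The final message format is also unified across the five Nvidia APIs.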

Error message preview

Before

1. CUDA API
--------------------------------------------
Error Message Summary:
--------------------------------------------
Error: cudaGetDeviceCount failed in paddle::platform::GetCUDADeviceCountImpl, error code : 35, Please see detail in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038 at (D:\1.6.2\paddle\paddle\fluid\platform\gpu_info.cc:67)

After

1. CUDA API
----------------------
Error Message Summary:
----------------------
ExternalError: Cuda error(35), CUDA driver version is insufficient for CUDA runtime version.
[Advise: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1244)

2. CURAND API:
----------------------
Error Message Summary:
----------------------
ExternalError: Curand error, CURAND_STATUS_OUT_OF_RANGE : unspecified launch failure at (/Paddle/paddle/fluid/pybind/pybind.cc:1247)

3. CUDNN API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cudnn error, CUDNN_STATUS_INTERNAL_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1250)

4. CUBLAS API:
----------------------
Error Message Summary:
----------------------
ExternalError: Cublas error, CUBLAS_STATUS_LICENSE_ERROR at (/Paddle/paddle/fluid/pybind/pybind.cc:1253)

5. NCCL API:
----------------------
Error Message Summary:
----------------------
ExternalError: Nccl error, unhandled system error at (/Paddle/paddle/fluid/pybind/pybind.cc:1256)

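Note: For the first API (CUDA), a crawler was used; there are more than 100 error codes, and the detailed information scraped from the Nvidia website is provided. The other four APIs have fewer error codes, so no crawler was used and only brief information is provided. Competing frameworks currently provide only the briefest information; later we can decide whether Paddle should scrape detailed information for these APIs as well.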
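
One way to picture the code-to-message mapping described in this note is a lookup table keyed by error code, with a fallback for codes that have no entry. The sketch below is only an illustration: `ErrorDetail`, `CudaErrorTable` and `DescribeCudaError` are hypothetical names, not Paddle's implementation, and the single entry reuses the text for code 35 quoted in the preview above.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical record for one error code: a short summary plus optional advice.
struct ErrorDetail {
  std::string summary;
  std::string advice;  // left empty when only brief information is available
};

// Hypothetical table. Only code 35 is filled in, using text quoted in this PR.
inline const std::unordered_map<int, ErrorDetail>& CudaErrorTable() {
  static const std::unordered_map<int, ErrorDetail> table = {
      {35, ErrorDetail{
               "CUDA driver version is insufficient for CUDA runtime version.",
               "Users should install an updated NVIDIA display driver to "
               "allow the application to run."}},
  };
  return table;
}

// Builds the text that follows "ExternalError: ".
// Codes without a table entry fall back to the bare error code.
inline std::string DescribeCudaError(int code) {
  std::string msg = "Cuda error(" + std::to_string(code) + ")";
  const auto& table = CudaErrorTable();
  const auto it = table.find(code);
  if (it == table.end()) return msg + ".";
  msg += ", " + it->second.summary;
  if (!it->second.advice.empty()) {
    msg += "\n[Advise: " + it->second.advice + "]";
  }
  return msg;
}
```

A table like this keeps the scraped text out of the call sites, so adding detailed entries for the other four APIs later would not change how the macro is used.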

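Calling example:

The error message is generated automatically and fully encapsulated; developers only need to pass in the return value of the Nvidia API, so the call is simple: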
PADDLE_ENFORCE_CUDA_SUCCESS(cudaGetDeviceCount(&count));
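
To show the caller-side shape of such a check without a CUDA toolchain, here is a minimal sketch with mock types. `MockCudaError`, `MockGetDeviceCount`, `BuildMockMessage` and `SKETCH_ENFORCE_SUCCESS` are all hypothetical stand-ins, and the sketch throws `std::runtime_error` rather than Paddle's own exception type.

```cpp
#include <stdexcept>
#include <string>

// Mock status type standing in for cudaError_t so the sketch builds without CUDA.
enum MockCudaError { kMockSuccess = 0, kMockInsufficientDriver = 35 };

// Mock API call that always fails, standing in for cudaGetDeviceCount.
inline MockCudaError MockGetDeviceCount(int* count) {
  *count = 0;
  return kMockInsufficientDriver;
}

// Hypothetical helper: the message is derived from the status alone,
// so the caller never writes any error text.
inline std::string BuildMockMessage(MockCudaError e, const char* file, int line) {
  return "ExternalError: Cuda error(" + std::to_string(static_cast<int>(e)) +
         ") at (" + file + ":" + std::to_string(line) + ")";
}

// Hypothetical macro: evaluates the call once and throws on any non-success status.
#define SKETCH_ENFORCE_SUCCESS(CALL)                                \
  do {                                                              \
    auto status_ = (CALL);                                          \
    if (status_ != kMockSuccess) {                                  \
      throw std::runtime_error(                                     \
          BuildMockMessage(status_, __FILE__, __LINE__));           \
    }                                                               \
  } while (0)

// Caller side: only the return value of the API call is passed in.
inline std::string TryCountDevices() {
  int count = 0;
  try {
    SKETCH_ENFORCE_SUCCESS(MockGetDeviceCount(&count));
  } catch (const std::runtime_error& e) {
    return e.what();
  }
  return "ok";
}
```

Because the macro records `__FILE__` and `__LINE__` itself, the location in the report points at the failing call site without any input from the developer.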

@chenwhql chenwhql left a comment

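The error messages now all follow a unified format:

ErrorType: key error message.
  [Additional hint: XXX] at (file:line)
  [Failing Op (if any)]

So I wonder whether the format below could be brought in line with it as well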

--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.

Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run. at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)

changed to

--------------------------------------------
Error Message Summary:
--------------------------------------------
ExternalError: CUDA runtime error(35): CUDA driver version is insufficient for CUDA runtime version.
  [Recommended Solution: This indicates that the installed NVIDIA CUDA driver is older than the CUDA runtime library. This is not a supported configuration.Users should install an updated NVIDIA display driver to allow the application to run.] at (/Paddle/paddle/fluid/pybind/pybind.cc:1243)

Also, I feel the title "Recommended Solution" is a bit long. Could it just be called Solution, Advice, or some other single word?
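
The unified layout discussed in this comment (error type, key message, an optional bracketed hint, then the source location) can be sketched as a small formatting helper. `FormatErrorSummary` and its parameters are hypothetical and only illustrate the layout; the optional failing-Op line is left out.

```cpp
#include <string>

// Hypothetical helper that assembles the unified layout:
//   ErrorType: key message.
//     [Hint label: hint text] at (file:line)
// When no hint is available, the location follows the key message directly.
inline std::string FormatErrorSummary(const std::string& error_type,
                                      const std::string& message,
                                      const std::string& hint_label,
                                      const std::string& hint,
                                      const std::string& file, int line) {
  const std::string head = error_type + ": " + message;
  const std::string location =
      " at (" + file + ":" + std::to_string(line) + ")";
  if (hint.empty()) return head + location;
  return head + "\n  [" + hint_label + ": " + hint + "]" + location;
}
```

Passing the hint label as a parameter keeps the wording question raised here ("Recommended Solution" versus "Advice") separate from the layout itself.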

@chenwhql

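This actually touches on an interface-boundary problem that has to be solved sooner or later: the conceptual boundary of the PADDLE_ENFORCE_CUDA_SUCCESS macro has become ambiguous. By design, an error-checking macro should clearly belong to one of the following kinds:

  1. Macros where the developer must ensure that the error type and error message are accurate, e.g. PADDLE_THROW, PADDLE_ENFORCE_EQ/NOT_NULL
  2. Macros where the developer does not need to care about the error type or message; the macro guarantees both, and the developer only fills in the blanks as required, e.g. OP_INOUT_CHECK, GET_DATA_SAFELY

However, once this PR's excellent automatic error-filling feature is integrated, PADDLE_ENFORCE_CUDA_SUCCESS guarantees the error type and message for cuda errors, while for cudnn, cublas and curand it still requires the developer to ensure them. That is not a design with a clear boundary.

The direction of improvement is clear: the boundary of the checking macro must be well defined. There are two options:

  1. Add a new checking macro that wraps this automatic cuda error-filling logic. PADDLE_ENFORCE_CUDA_SUCCESS keeps its original boundary, i.e. developers must ensure the error is compliant, and it can eventually be removed
  2. Also support automatic filling of compliant error types and messages for cudnn, cublas and curand, so that PADDLE_ENFORCE_CUDA_SUCCESS becomes a macro for which developers never need to care about the error type or content

I personally lean toward option 1, defining a new macro, for these reasons:

  • Defining a new macro does introduce a new check, but such a macro is extremely easy to use, so rolling it out costs little, and there are no compatibility problems along the way
  • Problems with option 2:
    • Only cuda is supported so far; cudnn, cublas and curand are still pending. So for some period the meaning of this macro will be unclear to developers. If developers get confused when using it, we will likely be criticized; and if a developer follows the usage in this PR, which is compliant, but gets blocked by CI, we will be criticized as well
    • Once everything is supported, we still have to explain to developers that the macro's usage has changed, which is also troublesome; it would be better to introduce a new macro directly

With a new macro, the implementation would be roughly as follows (the names can be improved):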

// To be implemented
inline std::string get_nvida_error_msg(cudnnStatus_t e) {
  return GetCudnnErrorMessage(e);
}

// To be implemented
inline std::string get_nvida_error_msg(curandStatus_t e) {
  return GetCurandErrorMessage(e);
}

// To be implemented
inline std::string get_nvida_error_msg(cublasStatus_t e) {
  return GetCublasErrorMessage(e);
}

// Already implemented
inline std::string get_nvida_error_msg(cudaError_t e) {
  return GetCudaErrorMessage(e);
}

#ifdef PADDLE_WITH_CUDA
#define CUDA_SUCCESS_CHECK(COND)                                             \
  do {                                                                       \
    auto __cond__ = (COND);                                                  \
    using __CUDA_STATUS_TYPE__ = decltype(__cond__);                         \
    constexpr auto __success_type__ =                                        \
        ::paddle::platform::details::CudaStatusType<                         \
            __CUDA_STATUS_TYPE__>::kSuccess;                                 \
    if (UNLIKELY(__cond__ != __success_type__)) {                            \
      try {                                                                  \
        ::paddle::platform::throw_on_error(                                  \
            __cond__, ::paddle::platform::errors::External(                  \
                          ::paddle::platform::get_nvida_error_msg(__cond__)) \
                          .ToString());                                      \
      } catch (...) {                                                        \
        HANDLE_THE_ERROR                                                     \
        throw ::paddle::platform::EnforceNotMet(std::current_exception(),    \
                                                __FILE__, __LINE__);         \
        END_HANDLE_THE_ERROR                                                 \
      }                                                                      \
    }                                                                        \
  } while (0)
#endif  // PADDLE_WITH_CUDA
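
The macro above relies on a per-type trait (`CudaStatusType<T>::kSuccess`) together with overloads selected by status type. The sketch below shows how those two pieces can fit together, using mock enums so it builds without the Nvidia headers. `FakeCudaStatus`, `FakeCudnnStatus`, `StatusTraits`, `ErrorMessage` and `CheckStatus` are hypothetical names, and the numeric values of the mock enums are arbitrary.

```cpp
#include <string>

// Mock enums standing in for cudaError_t and cudnnStatus_t (values are arbitrary).
enum FakeCudaStatus { FAKE_CUDA_SUCCESS = 0, FAKE_CUDA_FAILURE = 35 };
enum FakeCudnnStatus { FAKE_CUDNN_SUCCESS = 0, FAKE_CUDNN_FAILURE = 1 };

// Primary template is declared but not defined, so an unsupported
// status type fails at compile time instead of being silently accepted.
template <typename T>
struct StatusTraits;

template <>
struct StatusTraits<FakeCudaStatus> {
  static constexpr FakeCudaStatus kSuccess = FAKE_CUDA_SUCCESS;
};

template <>
struct StatusTraits<FakeCudnnStatus> {
  static constexpr FakeCudnnStatus kSuccess = FAKE_CUDNN_SUCCESS;
};

// One overload per library; overload resolution picks the right one.
inline std::string ErrorMessage(FakeCudaStatus e) {
  return "Cuda error(" + std::to_string(static_cast<int>(e)) + ")";
}

inline std::string ErrorMessage(FakeCudnnStatus e) {
  return "Cudnn error(" + std::to_string(static_cast<int>(e)) + ")";
}

// Returns an empty string on success, otherwise the library-specific message.
template <typename T>
std::string CheckStatus(T status) {
  if (status == StatusTraits<T>::kSuccess) return "";
  return ErrorMessage(status);
}
```

With this arrangement, supporting another library means adding one trait specialization and one overload, while the checking logic itself stays unchanged.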

@zhwesky2010 zhwesky2010 reopened this Apr 15, 2020

luotao1 commented Apr 15, 2020

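> Only cuda is supported so far; cudnn, cublas and curand are still pending

For cudnn and cublas to reach the same level as the cuda errors in this PR, i.e. a guaranteed-correct error type and message, is it enough to unify the function wrappers and the message wording, without necessarily obtaining the detailed information from the NVIDIA website?

If so, can we solve this in one go with option 2?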

@chenwhql

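> > Only cuda is supported so far; cudnn, cublas and curand are still pending
>
> For cudnn and cublas to reach the same level as the cuda errors in this PR, i.e. a guaranteed-correct error type and message, is it enough to unify the function wrappers and the message wording, without necessarily obtaining the detailed information from the NVIDIA website?
>
> If so, can we solve this in one go with option 2?

Do you mean wrapping the other libraries in a simple way first, and converting PADDLE_ENFORCE_CUDA_SUCCESS in one go into a macro that needs no type or message? Then it probably won't make 2.0, because every PADDLE_ENFORCE_CUDA_SUCCESS in paddle would have to be changed, and the CI rules need updating too

If we aim for 2.1 instead, I think that is OK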


zhwesky2010 commented Apr 16, 2020

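> > > Only cuda is supported so far; cudnn, cublas and curand are still pending
> >
> > For cudnn and cublas to reach the same level as the cuda errors in this PR, i.e. a guaranteed-correct error type and message, is it enough to unify the function wrappers and the message wording, without necessarily obtaining the detailed information from the NVIDIA website?
> > If so, can we solve this in one go with option 2?
>
> Do you mean wrapping the other libraries in a simple way first, and converting PADDLE_ENFORCE_CUDA_SUCCESS in one go into a macro that needs no type or message? Then it probably won't make 2.0, because every PADDLE_ENFORCE_CUDA_SUCCESS in paddle would have to be changed, and the CI rules need updating too
>
> If we aim for 2.1 instead, I think that is OK

Outcome of the discussion: go with option 2 and try to make it into 2.0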

@chenwhql chenwhql left a comment

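Actually, we could keep build_ex_string for now and create a new build_nvidia_error_msg function. If build_ex_string is deleted, existing places in paddle that use PADDLE_ENFORCE to check cuda errors may break, and we don't currently know how many such legacy checks remain

See this PR: #21994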

@@ -33,6 +33,9 @@ class ErrorSummary {
// Note(chenweihang): Final deprecated constructor
// This constructor is only used to be compatible with
// current existing no error message PADDLE_ENFORCE_*
// Note(zhouwei): PADDLE_ENFORCE_CUDA_SUCCESS error message
// can be get automatically, error message from developer
// is not necessary
Contributor

This part can probably be removed now

Contributor Author

Done

chenwhql previously approved these changes Apr 17, 2020

@chenwhql chenwhql left a comment

Excellent!

liupluswei previously approved these changes Apr 17, 2020

@liupluswei liupluswei left a comment

LGTM


@raindrops2sea raindrops2sea left a comment

LGTM

@liupluswei liupluswei merged commit 7817003 into PaddlePaddle:develop Apr 20, 2020
zhwesky2010 added a commit to zhwesky2010/Paddle that referenced this pull request Apr 21, 2020
Shixiaowei02 added a commit to Shixiaowei02/Paddle that referenced this pull request Apr 22, 2020
Shixiaowei02 added a commit that referenced this pull request Apr 23, 2020
* cherry-pick of DeviceContext Split, test=develop (#23737)

* New feature: thread local allocator, test=develop (#23989)

* add the thread_local_allocator, test=develop

* refactor the thread_local_allocator, test=develop

* provides option setting strategy, test=develop

* add boost dependency to cuda_stream, test=develop

* declare the stream::Priority as enum class, test=develop

* deal with PADDLE_ENFORCE_CUDA_SUCCESS macro in pr #23816