[nvidia] reorder primitive fails correctness check #1703

dzarukin · 2023-08-14T21:53:49Z

Summary

oneDNN validation on the Nvidia backend hits correctness failures in the reorder primitive when run under benchdnn.

Version

Latest master.

Environment

Hardware:

NVIDIA A100 80GB PCIe
(A10 should also work for most cases).

Software

SYCL Compiler with Nvidia support: any version that compiles without issues, preferably no later than April.
[Optional] TBB: any version.
[Optional] OpenCL CPU: the latest version is preferable.

"Optional" means that the CPU backend can be enabled if the dependency is satisfied. Otherwise, it should be switched off.

Steps to reproduce

Build

mkdir -p build
cd build
# CMAKE_BUILD_TYPE may also be debug; DNNL_CPU_RUNTIME may be NONE to switch the CPU backend off.
cmake .. -DCMAKE_BUILD_TYPE=release -DDNNL_CPU_RUNTIME=DPCPP -DDNNL_GPU_RUNTIME=DPCPP -DDNNL_GPU_VENDOR=NVIDIA -DONEDNN_BUILD_GRAPH=OFF
cmake --build . --target benchdnn

Run (from the repository root)

<env_vars> ./build/tests/benchdnn/benchdnn --reorder --engine=gpu --batch=test_reorder_gpu

Helper env vars:

CUDA_LOGINFO_DBG=1 CUDA_LOGDEST_DBG=stdout -- enables CUDA API dump
CUDNN_LOGINFO_DBG=1 CUDNN_LOGDEST_DBG=stdout -- enables cuDNN API dump
DNNL_VERBOSE=all (or desired level) -- enables oneDNN execution information
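
The helper variables above can be combined with the run command programmatically. The following is a minimal sketch, assuming Python 3.8+ and that it is launched from the repository root. The function name `benchdnn_invocation` and the `dumps` keys are made up for illustration; only the variable names, their values, the binary path, and the benchdnn arguments come from this report.

```python
import os
import shlex

# Debug variables quoted in this report, grouped by what they dump.
# The group names ("cuda", "cudnn", "onednn") are hypothetical labels.
DEBUG_ENV = {
    "cuda": {"CUDA_LOGINFO_DBG": "1", "CUDA_LOGDEST_DBG": "stdout"},
    "cudnn": {"CUDNN_LOGINFO_DBG": "1", "CUDNN_LOGDEST_DBG": "stdout"},
    "onednn": {"DNNL_VERBOSE": "all"},
}

# Binary path as given in the "Run" step.
BENCHDNN = "./build/tests/benchdnn/benchdnn"


def benchdnn_invocation(dumps=(), extra_args=()):
    """Return (argv, env) for the batch run with the selected dumps enabled."""
    env = dict(os.environ)  # start from the current environment
    for name in dumps:
        env.update(DEBUG_ENV[name])
    argv = [
        BENCHDNN,
        "--reorder",
        "--engine=gpu",
        "--batch=test_reorder_gpu",
        *extra_args,
    ]
    return argv, env


argv, env = benchdnn_invocation(dumps=("onednn", "cudnn"))
print(shlex.join(argv))
# To actually launch it: subprocess.run(argv, env=env, check=False)
```

The sketch only builds the command and environment; it does not start benchdnn, so it can be checked on a machine without a GPU.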

Helper tips:

benchdnn supports verbosity through -vX. Most information is available at -v6. The destination can be dumped with -v99 when really needed.
benchdnn documentation is here: https://github.com/oneapi-src/oneDNN/tree/master/tests/benchdnn (scroll down). The reorder doc and others can be found through the links there.
The benchdnn binary also supports --help, which suggests using --reorder --help to list all supported options.

Observed behavior

Failures may not reproduce in a single run; running the batch is the only reliable way to hit the issue (at least so far).

This is the most impactful issue so far because it affects many other primitives that are reorder-based.
Preliminary analysis suggests that the cross-engine reorder implementation is likely what corrupts the final result. The main suspect is the synchronization for the out-of-order queue (the default queue type for the Nvidia backend).
An in-order queue seems to work fine (at this point it can only be enabled manually inside the library; benchdnn will get an option for it soon).

Most failures look like this:

[ 704][DST][0:0:11:0] exp_f32:         0.5 exp:         0.5 got:           0 diff:     0.5 rdiff:       1
[ 705][DST][0:0:11:1] exp_f32:           1 exp:           1 got:           0 diff:       1 rdiff:       1
[ 706][DST][0:0:11:2] exp_f32:         1.5 exp:         1.5 got:           0 diff:     1.5 rdiff:       1
[ 707][DST][0:0:11:3] exp_f32:           2 exp:           2 got:           0 diff:       2 rdiff:       1
[ 708][DST][0:0:11:4] exp_f32:          16 exp:          16 got:           0 diff:      16 rdiff:       1
[ 709][DST][0:0:11:5] exp_f32:          64 exp:          64 got:           0 diff:      64 rdiff:       1
[ 710][DST][0:0:11:6] exp_f32: 1.67772e+07 exp: 1.67772e+07 got:           0 diff:1.67772e+07 rdiff:       1
[ 711][DST][0:0:11:7] exp_f32:-1.67772e+07 exp:-1.67772e+07 got:           0 diff:1.67772e+07 rdiff:       1
[ 713][DST][0:0:11:9] exp_f32:        0.25 exp:        0.25 got:           0 diff:    0.25 rdiff:       1
22:FAILED (errors:21245 total:25165824) __REPRO: --reorder --engine=gpu --sdt=f32 --ddt=f32 --stag=abcd --dtag=acbd 64x16x384x64
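
To check a pattern like this across a full log, the failure lines can be parsed mechanically. The following is a minimal sketch, assuming Python 3.8+ and the line layout shown above; the regular expressions and helper names are illustrative and are not part of benchdnn. It verifies three things on a few lines copied from this report: every failing point has `got` equal to 0, the number in the first bracket equals the row-major logical offset of the position (an inference from these sample lines, not a documented guarantee), and `total` equals the product of the problem dimensions.

```python
import re
from math import prod

# A few lines copied verbatim from the failure output above.
LOG = """\
[ 704][DST][0:0:11:0] exp_f32:         0.5 exp:         0.5 got:           0 diff:     0.5 rdiff:       1
[ 708][DST][0:0:11:4] exp_f32:          16 exp:          16 got:           0 diff:      16 rdiff:       1
[ 710][DST][0:0:11:6] exp_f32: 1.67772e+07 exp: 1.67772e+07 got:           0 diff:1.67772e+07 rdiff:       1
[ 711][DST][0:0:11:7] exp_f32:-1.67772e+07 exp:-1.67772e+07 got:           0 diff:1.67772e+07 rdiff:       1
22:FAILED (errors:21245 total:25165824) __REPRO: --reorder --engine=gpu --sdt=f32 --ddt=f32 --stag=abcd --dtag=acbd 64x16x384x64
"""

# One mismatching point: [linear index][DST][a:b:c:d] followed by the values.
POINT_RE = re.compile(
    r"\[\s*(?P<idx>\d+)\]\[DST\]\[(?P<pos>[\d:]+)\]\s*"
    r"exp_f32:\s*(?P<exp_f32>\S+)\s+exp:\s*(?P<exp>\S+)\s+"
    r"got:\s*(?P<got>\S+)\s+diff:\s*(?P<diff>\S+)\s+rdiff:\s*(?P<rdiff>\S+)"
)

# Summary line: error count, total point count, and the trailing problem dims.
SUMMARY_RE = re.compile(
    r"errors:(?P<errors>\d+) total:(?P<total>\d+)\).*?(?P<dims>\d+(?:x\d+)+)\s*$",
    re.M,
)


def parse_points(log):
    """Extract index, position, expected and actual values of each failure."""
    points = []
    for m in POINT_RE.finditer(log):
        points.append({
            "idx": int(m["idx"]),
            "pos": tuple(int(p) for p in m["pos"].split(":")),
            "exp": float(m["exp"]),
            "got": float(m["got"]),
            "rdiff": float(m["rdiff"]),
        })
    return points


def row_major_offset(pos, dims):
    """Offset of a logical position in a dense row-major (abcd) layout."""
    off = 0
    for p, d in zip(pos, dims):
        off = off * d + p
    return off


points = parse_points(LOG)
summary = SUMMARY_RE.search(LOG)
dims = tuple(int(d) for d in summary["dims"].split("x"))

all_zero = all(p["got"] == 0.0 for p in points)
idx_is_logical = all(p["idx"] == row_major_offset(p["pos"], dims) for p in points)
total_matches = int(summary["total"]) == prod(dims)
error_rate = int(summary["errors"]) / int(summary["total"])

print(f"points parsed: {len(points)}, all got == 0: {all_zero}")
print(f"index is row-major offset: {idx_is_logical}, total == prod(dims): {total_matches}")
print(f"error rate: {error_rate:.4%}")
```

On the lines above, every mismatching point reads back as 0 and fewer than 0.1% of the points fail. That is consistent with the hypothesis that part of the destination is read before it has been written, although it does not prove it.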

Expected behavior

The issue does not appear during batch validation.

dzarukin · 2023-08-14T21:54:57Z

+@mehdi-goli

dzarukin · 2023-09-14T20:32:12Z

Since this commit, the failures are no longer observed, but the out-of-order workflow still does not work properly. I will submit another issue to focus on out-of-order.

dzarukin added the sighting label (suspicious library behavior; should be promoted to a bug when confirmed) Aug 14, 2023

dzarukin self-assigned this Aug 14, 2023

dzarukin added the bug label (a confirmed library bug) and removed the sighting label Aug 14, 2023

vpirogov assigned mehdi-goli and unassigned dzarukin Aug 28, 2023

dzarukin closed this as completed Sep 14, 2023

densamoilov mentioned this issue Oct 3, 2023

[nvidia|amd] Add missing synchronization #1732

Open
