Resolve racecheck errors in ORC kernels #9916

vuule · 2021-12-16T00:54:21Z

Running ORC Python tests with compute-sanitizer --tool racecheck results in a number of errors/warnings.
This PR resolves the errors originating in ORC kernels. Remaining errors come from gpu_inflate.

Adds a few missing block/warp syncs and minor clean up in the affected code.

Causes 42% slowdown on average in ORC reader benchmarks. Not negligible, will double check whether the changes are required, or just resolving false positives in racecheck.
Ran the benchmarks many more times, and the average time difference is smaller than variations between runs.

codecov · 2021-12-16T02:31:43Z

Codecov Report

Merging #9916 (79c3eee) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

❗ Current head 79c3eee differs from pull request most recent head 76cc2b0. Consider uploading reports for the commit 76cc2b0 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.02    #9916      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20480     +175     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18346     +171

Impacted Files	Coverage Δ
python/dask_cudf/dask_cudf/sorting.py	`92.30% <0.00%> (-0.61%)`	⬇️
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/io/parquet.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/series.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/utils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/dtypes.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/utils/ioutils.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/dataframe.py	`0.00% <0.00%> (ø)`
... and 14 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d69ea61...76cc2b0. Read the comment docs.

…bug-racecheck-orc

elstehle

Looks fine. Maybe you can double-check whether that syncwarp after the shfl_sync is really needed.

elstehle · 2021-12-17T13:04:30Z

cpp/src/io/orc/stripe_data.cu

@@ -782,6 +780,7 @@ static __device__ uint32_t Integer_RLEv2(orc_bytestream_s* bs,
    pos  = shuffle(pos);
    n    = shuffle(n);
    w    = shuffle(w);
+    __syncwarp();


Not sure this one is needed here, as our shuffle is an alias for __shfl_sync, which, to my understanding, would converge threads participating in the shuffle (in our case: there is no mask, so all threads participate).
If, despite, __syncwarp should be required, we should leave a note that clarifies why we need __syncwarp here.

I'll add a comment. Really want to go towards error-free memcheck/racecheck reports.

This one resolves the following racecheck warnings, presumably because the tool does not recognize shuffle_sync as a sync point.

Warning: Race reported between Write access at 0x19520 in /cudf/cpp/src/io/orc/stripe_data.cu:735:unsigned int Integer_RLEv2 and Read access at 0x19be0 in /cudf/cpp/src/io/orc/stripe_data.cu:807:unsigned int Integer_RLEv2[488 hazards] Warning: Race reported between Write access at 0x19050 in /cudf/cpp/src/io/orc/stripe_data.cu:773:unsigned int Integer_RLEv2 and Read access at 0x196d0 in /cudf/cpp/src/io/orc/stripe_data.cu:816:unsigned int Integer_RLEv2 [16 hazards]

cpp/src/io/orc/stripe_data.cu

…bug-racecheck-orc

vuule · 2021-12-17T23:04:45Z

cpp/src/io/orc/stripe_data.cu

        baseval = rle->baseval.u32[r];
      else
        baseval = rle->baseval.u64[r];
      for (uint32_t j = tr; j < n; j += 32) {
        vals[base + j] += baseval;
      }
    }
+    __syncwarp();


This one fixes the following warning:

Warning: Race reported between Write access at 0x19520 in /cudf/cpp/src/io/orc/stripe_data.cu:735:unsigned int Integer_RLEv2 and Read access at 0x1a4e0 in /cudf/cpp/src/io/orc/stripe_data.cu:865:unsigned int Integer_RLEv2 [8 hazards]

vuule · 2021-12-17T23:07:35Z

cpp/src/io/orc/stripe_data.cu

-        if (s->chunk.type_kind == TIMESTAMP) {
-          s->top.data.buffered_count = s->top.data.max_vals - numvals;
+      if (t == 0 && numvals + vals_skipped > 0) {
+        auto const max_vals = s->top.data.max_vals;


Workaround for a presumable false positive:

Warning: Race reported between Write access at 0x19520 in /cudf/cpp/src/io/orc/stripe_data.cu:735:unsigned int Integer_RLEv2 and Read access at 0x1a4e0 in /cudf/cpp/src/io/orc/stripe_data.cu:865:unsigned int Integer_RLEv2 [8 hazards]

vuule · 2021-12-17T23:16:39Z

cpp/src/io/orc/stripe_enc.cu

-    s->nnz     = 0;
-    s->numvals = 0;
-  }
+  if (t == 0) { s->nnz = 0; }


Fixes the error:

Error: Race reported between Read access at 0x2ce0 in /cudf/cpp/src/io/orc/stripe_enc.cu:629: encode_null_mask and Write access at 0x2d30 in /cudf/cpp/src/io/orc/stripe_enc.cu:709:encode_null_mask [8 hazards]

Resetting numvals can be skipped because it is guaranteed to be zero after the loop above.

…bug-racecheck-orc

devavret · 2022-01-04T20:19:50Z

cpp/src/io/comp/gpuinflate.cu

-    pos      = min((__ffs(lit_mask) - 1) & 0xff, 32);
+    auto const symt     = (t < batch_len) ? b[t] : 256;
+    auto const lit_mask = ballot(symt >= 256);
+    auto pos            = min((__ffs(lit_mask) - 1) & 0xff, 32);


I can't spot the fix in this file. Is this code cleanup only?

vuule · 2022-01-05T09:20:06Z

rerun tests

vuule · 2022-01-07T01:58:54Z

rerun tests

vuule · 2022-01-07T04:07:19Z

@gpucibot merge

vuule added 2 commits December 15, 2021 16:47

resolve race conditions in ORC kernels

cac1f58

resolve a race condition in inflate

c9db3c7

vuule added bug Something isn't working cuIO cuIO issue labels Dec 16, 2021

vuule self-assigned this Dec 16, 2021

github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Dec 16, 2021

vuule added the non-breaking Non-breaking change label Dec 16, 2021

vuule requested a review from elstehle December 16, 2021 06:46

vuule added 2 commits December 16, 2021 15:48

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

eba4778

…bug-racecheck-orc

revert extra sync

eefe523

elstehle approved these changes Dec 17, 2021

View reviewed changes

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

48273db

…bug-racecheck-orc

vuule commented Dec 17, 2021

View reviewed changes

vuule marked this pull request as ready for review December 17, 2021 23:17

vuule requested a review from a team as a code owner December 17, 2021 23:17

vuule requested review from trxcllnt and devavret December 17, 2021 23:17

vuule added 3 commits January 4, 2022 10:53

add comment about one added sync

6116a12

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

188f7b2

…bug-racecheck-orc

if constexpr

76cc2b0

devavret approved these changes Jan 4, 2022

View reviewed changes

vuule added the 5 - Ready to Merge Testing and reviews complete, ready to merge label Jan 4, 2022

rapids-bot bot merged commit de8c0b8 into rapidsai:branch-22.02 Jan 7, 2022

vuule deleted the bug-racecheck-orc branch January 7, 2022 04:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve racecheck errors in ORC kernels #9916

Resolve racecheck errors in ORC kernels #9916

vuule commented Dec 16, 2021 •

edited

Loading

codecov bot commented Dec 16, 2021 •

edited

Loading

elstehle left a comment

elstehle Dec 17, 2021

vuule Dec 17, 2021

vuule Dec 17, 2021

vuule Dec 17, 2021

vuule Dec 17, 2021

vuule Dec 17, 2021

devavret Jan 4, 2022

vuule commented Jan 5, 2022

vuule commented Jan 7, 2022

vuule commented Jan 7, 2022

Resolve racecheck errors in ORC kernels #9916

Resolve racecheck errors in ORC kernels #9916

Conversation

vuule commented Dec 16, 2021 • edited Loading

codecov bot commented Dec 16, 2021 • edited Loading

Codecov Report

elstehle left a comment

Choose a reason for hiding this comment

elstehle Dec 17, 2021

Choose a reason for hiding this comment

vuule Dec 17, 2021

Choose a reason for hiding this comment

vuule Dec 17, 2021

Choose a reason for hiding this comment

vuule Dec 17, 2021

Choose a reason for hiding this comment

vuule Dec 17, 2021

Choose a reason for hiding this comment

vuule Dec 17, 2021

Choose a reason for hiding this comment

devavret Jan 4, 2022

Choose a reason for hiding this comment

vuule commented Jan 5, 2022

vuule commented Jan 7, 2022

vuule commented Jan 7, 2022

vuule commented Dec 16, 2021 •

edited

Loading

codecov bot commented Dec 16, 2021 •

edited

Loading