-
Notifications
You must be signed in to change notification settings - Fork 751
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL][CUDA] Add missing barrier to collectives #2990
Conversation
SYCL sub-group and group functions should act as synchronization points. Group collectives need a barrier at the end to ensure that back-to-back collectives do not lead to a race condition. Note that the barrier at the beginning of each collective occurs after each work-item writes its partial results to the scratch space. This is assumed safe because only the collective functions can access the space, and collective functions must be encountered in uniform control flow; any work-item encountering a collective function can assume it is safe to use the scratch space, because all work-items in the same work-group must have either executed no collective functions or the barrier at the end of a previous collective function. Signed-off-by: John Pennycook <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks fine for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a regression test covering this code?
There isn't, and I've struggled to write one. Because it's a race condition, the problem wouldn't show up on every run of the test. I'm also having trouble reproducing the failure at all with a simple test-case (with two back-to-back reductions) on the NVIDIA GPU that I use for testing, suggesting it might be quite hard to trigger in practice. |
I realize the it reproducer might be not 100% reliable, but a simple test-case sounds better than nothing. I see value in adding such test case as I suppose there is a chance it should be able to catch issues in collectives considering that we run regression tests multiple times. In addition to that it will catch issues not related to data races. |
Calls reduce, exclusive scan and inclusive scan multiple times back-to-back. Note that since we are testing for a race condition, it is possible for this test to pass even with an incorrect implementation. Signed-off-by: John Pennycook <[email protected]>
14438de
Makes sense. I've added a simple regression test in 14438de. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except one minor comment.
Thanks!
// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out | ||
// RUN: %HOST_RUN_PLACEHOLDER %t.out | ||
// RUN: %CPU_RUN_PLACEHOLDER %t.out | ||
// RUN: %GPU_RUN_PLACEHOLDER %t.out |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be located inside sycl/test/on-device directory according to the guidelines from Get Started Guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops. I'll go and read the new guidelines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the test should be even moved to https://github.com/intel/llvm-test-suite
Signed-off-by: John Pennycook <[email protected]>
Device query may return a value too large for a specific kernel; kernel query is required in order to respect local memory usage. Signed-off-by: John Pennycook <[email protected]>
* sycl: (378 commits) [sycl-post-link][NFC] Extracted the code into a subroutine (intel#3042) [SYCL][NFC] Remove commented out code (intel#3029) [CODEOWNERS] Fix ownership of DPC++ tools tests (intel#3047) [SYCL][NFC] Make tests insensitive to dso_local (intel#3037) [SYCL] Fix acquiring a mutex in _pi_context::finalize (intel#3001) [SYCL] Fix various compilation warnings in plugins (intel#2979) [SYCL][ESIMD] Add simd class conversion ctor and operator (intel#3028) [sycl-post-link][NFC] Use range-based for loop. (intel#3033) [SYCL][NFC] Fix warning in self-build (intel#3023) [NFC] Fix sycl-post-link tests to avoid potential race (intel#3031) [SYCL][CUDA] Add missing barrier to collectives (intel#2990) [SYCL] Make Intel attributes consistent with clang attributes. (intel#3022) [SYCL] Bump SYCL minor version (intel#3026) [SYCL][Doc] Added requirement on reference to test PR in commit message (intel#3010) [SYCL] Put constant initializer list data in non-generic addr space. (intel#3005) [SYCL][L0] Fix memory leak in PiDeviceCache and ZeCommandList (intel#2974) [SYCL] Fix detection of free function calls (intel#3003) [SYCL][NFC] Clean up the builder_dir argument description (intel#3021) [SYCL][ESIMD] Fix LowerESIMD crash on a scalar fptoui LLVM instruction (intel#2699) [NFC] Remove redundant call to getMainExecutable() (intel#3018) ...
SYCL sub-group and group functions should act as synchronization points.
Group collectives need a barrier at the end to ensure that back-to-back
collectives do not lead to a race condition.
Note that the barrier at the beginning of each collective occurs after
each work-item writes its partial results to the scratch space. This is
assumed safe because only the collective functions can access the space,
and collective functions must be encountered in uniform control flow; any
work-item encountering a collective function can assume it is safe to use
the scratch space, because all work-items in the same work-group must have
either executed no collective functions or the barrier at the end of a previous
collective function.
Signed-off-by: John Pennycook [email protected]