[SYCL][CUDA] Add missing barrier to collectives #2990

Merged
6 commits merged into intel:sycl from cuda-collective-barriers on Jan 14, 2021

Conversation

Pennycook
Contributor

SYCL sub-group and group functions should act as synchronization points.
Group collectives need a barrier at the end to ensure that back-to-back
collectives do not lead to a race condition.

Note that the barrier at the beginning of each collective occurs after
each work-item writes its partial results to the scratch space. This is
assumed safe because only the collective functions can access the space,
and collective functions must be encountered in uniform control flow; any
work-item encountering a collective function can assume it is safe to use
the scratch space, because all work-items in the same work-group must have
either executed no collective functions or the barrier at the end of a previous
collective function.

Signed-off-by: John Pennycook <[email protected]>
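
For illustration, the pattern being fixed looks roughly like the sketch below. This is not the actual DPC++/libclc implementation; the function name and fence calls are illustrative, and a power-of-two work-group size is assumed.

```cpp
#include <sycl/sycl.hpp>

// Sketch of a work-group reduction over shared scratch space; the point is
// the placement of the two barriers, not the reduction itself.
template <typename T, typename BinaryOp>
T group_reduce_sketch(sycl::nd_item<1> it, T x, BinaryOp op, T *scratch) {
  size_t lid = it.get_local_id(0);

  scratch[lid] = x; // each work-item writes its partial result first...
  it.barrier(sycl::access::fence_space::local_space); // ...then the leading barrier

  // Simple tree reduction over the scratch space (assumes power-of-two size).
  for (size_t offset = it.get_local_range(0) / 2; offset > 0; offset /= 2) {
    if (lid < offset)
      scratch[lid] = op(scratch[lid], scratch[lid + offset]);
    it.barrier(sycl::access::fence_space::local_space);
  }
  T result = scratch[0];

  // The trailing barrier this PR adds: without it, a back-to-back collective
  // could start overwriting scratch while other work-items still read it.
  it.barrier(sycl::access::fence_space::local_space);
  return result;
}
```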

@Pennycook Pennycook added bug Something isn't working cuda CUDA back-end labels Jan 4, 2021
@Pennycook Pennycook requested a review from bader as a code owner January 4, 2021 21:22
Contributor

@anton-v-gorshkov anton-v-gorshkov left a comment

Looks fine to me.

bader
bader previously approved these changes Jan 11, 2021
Contributor

@bader bader left a comment

Is there a regression test covering this code?

@Pennycook
Contributor Author

> Is there a regression test covering this code?

There isn't, and I've struggled to write one. Because it's a race condition, the problem wouldn't show up on every run of the test. I'm also having trouble reproducing the failure at all with a simple test-case (with two back-to-back reductions) on the NVIDIA GPU that I use for testing, suggesting it might be quite hard to trigger in practice.

@bader
Contributor

bader commented Jan 11, 2021

> Is there a regression test covering this code?
>
> There isn't, and I've struggled to write one. Because it's a race condition, the problem wouldn't show up on every run of the test. I'm also having trouble reproducing the failure at all with a simple test-case (with two back-to-back reductions) on the NVIDIA GPU that I use for testing, suggesting it might be quite hard to trigger in practice.

I realize the reproducer might not be 100% reliable, but a simple test case sounds better than nothing. I see value in adding such a test case: since we run regression tests multiple times, there is a chance it will catch issues in the collectives. In addition, it will catch issues that are not related to data races.
What do you think about it?

Calls reduce, exclusive scan and inclusive scan multiple times back-to-back.
Note that since we are testing for a race condition, it is possible for this
test to pass even with an incorrect implementation.

Signed-off-by: John Pennycook <[email protected]>
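
For reference, a minimal back-to-back-collectives test along these lines could look like the following. This is a sketch using SYCL 2020 spellings, not the exact test added in 14438de; the sizes and checks are illustrative.

```cpp
#include <sycl/sycl.hpp>

#include <cassert>
#include <numeric>
#include <vector>

int main() {
  constexpr size_t N = 128; // single work-group
  sycl::queue q;
  std::vector<int> in(N), out(N, 0);
  std::iota(in.begin(), in.end(), 0);
  {
    sycl::buffer<int> in_buf(in.data(), sycl::range<1>{N});
    sycl::buffer<int> out_buf(out.data(), sycl::range<1>{N});
    q.submit([&](sycl::handler &h) {
      sycl::accessor din(in_buf, h, sycl::read_only);
      sycl::accessor dout(out_buf, h, sycl::write_only);
      h.parallel_for(sycl::nd_range<1>{sycl::range<1>{N}, sycl::range<1>{N}},
                     [=](sycl::nd_item<1> it) {
                       auto g = it.get_group();
                       size_t i = it.get_global_id(0);
                       int x = din[i];
                       // Back-to-back collectives: a missing trailing barrier
                       // lets a later call clobber the scratch space while an
                       // earlier result is still being read.
                       int sum1 = sycl::reduce_over_group(g, x, sycl::plus<>());
                       int sum2 = sycl::reduce_over_group(g, x, sycl::plus<>());
                       int scan =
                           sycl::inclusive_scan_over_group(g, x, sycl::plus<>());
                       dout[i] = (sum1 == sum2) ? scan : -1;
                     });
    });
  }
  // With correct barriers, each element is the inclusive prefix sum 0 + ... + i.
  for (size_t i = 0; i < N; ++i)
    assert(out[i] == static_cast<int>(i * (i + 1) / 2));
  return 0;
}
```

As noted above, because the bug is a race, such a test can still pass with an incorrect implementation; repeated runs only increase the chance of catching it.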
@Pennycook Pennycook dismissed stale reviews from bader and anton-v-gorshkov via 14438de January 11, 2021 19:02
@Pennycook Pennycook requested a review from a team as a code owner January 11, 2021 19:02
@Pennycook
Contributor Author

> What do you think about it?

Makes sense. I've added a simple regression test in 14438de.

bader
bader previously approved these changes Jan 11, 2021
Contributor

@bader bader left a comment

LGTM except one minor comment.
Thanks!

// RUN: %clangxx -fsycl -fsycl-targets=%sycl_triple %s -o %t.out
// RUN: %HOST_RUN_PLACEHOLDER %t.out
// RUN: %CPU_RUN_PLACEHOLDER %t.out
// RUN: %GPU_RUN_PLACEHOLDER %t.out
Contributor

This should be located inside the sycl/test/on-device directory, according to the guidelines in the Get Started Guide.

Contributor Author

Oops. I'll go and read the new guidelines.

Contributor

@romanovvlad romanovvlad Jan 12, 2021

It seems the test should even be moved to https://github.com/intel/llvm-test-suite.

bader
bader previously approved these changes Jan 11, 2021
alexbatashev
alexbatashev previously approved these changes Jan 12, 2021
@bader bader dismissed stale reviews from alexbatashev and themself via 73a8788 January 13, 2021 09:13
Device query may return a value too large for a specific kernel;
kernel query is required in order to respect local memory usage.

Signed-off-by: John Pennycook <[email protected]>
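
In SYCL 2020 terms, the difference between the two queries is roughly the following. This is only a sketch: ScratchKernel is a placeholder kernel name, and the in-tree test may use older API spellings for the same queries.

```cpp
#include <sycl/sycl.hpp>

#include <algorithm>
#include <iostream>

class ScratchKernel; // placeholder kernel name, for illustration only

int main() {
  sycl::queue q;
  sycl::device dev = q.get_device();

  // Device-wide limit: ignores per-kernel resources such as local memory and
  // registers, so it can exceed what a particular kernel actually supports.
  size_t dev_max = dev.get_info<sycl::info::device::max_work_group_size>();

  // Kernel-specific limit: accounts for this kernel's own resource usage.
  auto kid = sycl::get_kernel_id<ScratchKernel>();
  auto bundle = sycl::get_kernel_bundle<sycl::bundle_state::executable>(
      q.get_context(), {kid});
  sycl::kernel k = bundle.get_kernel(kid);
  size_t krn_max =
      k.get_info<sycl::info::kernel_device_specific::work_group_size>(dev);

  // Launch with the smaller of the two limits.
  size_t wg = std::min(dev_max, krn_max);
  std::cout << "work-group size: " << wg << "\n";

  q.parallel_for<ScratchKernel>(
      sycl::nd_range<1>{sycl::range<1>{wg}, sycl::range<1>{wg}},
      [=](sycl::nd_item<1>) { /* group collectives would go here */ });
  q.wait();
  return 0;
}
```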
@romanovvlad romanovvlad merged commit 2b6f2cd into intel:sycl Jan 14, 2021
alexbatashev pushed a commit to alexbatashev/llvm that referenced this pull request Jan 19, 2021
* sycl: (378 commits)
  [sycl-post-link][NFC] Extracted the code into a subroutine (intel#3042)
  [SYCL][NFC] Remove commented out code (intel#3029)
  [CODEOWNERS] Fix ownership of DPC++ tools tests (intel#3047)
  [SYCL][NFC] Make tests insensitive to dso_local (intel#3037)
  [SYCL] Fix acquiring a mutex in _pi_context::finalize (intel#3001)
  [SYCL] Fix various compilation warnings in plugins (intel#2979)
  [SYCL][ESIMD] Add simd class conversion ctor and operator (intel#3028)
  [sycl-post-link][NFC] Use range-based for loop. (intel#3033)
  [SYCL][NFC] Fix warning in self-build (intel#3023)
  [NFC] Fix sycl-post-link tests to avoid potential race (intel#3031)
  [SYCL][CUDA] Add missing barrier to collectives (intel#2990)
  [SYCL] Make Intel attributes consistent with clang attributes. (intel#3022)
  [SYCL] Bump SYCL minor version (intel#3026)
  [SYCL][Doc] Added requirement on reference to test PR in commit message (intel#3010)
  [SYCL] Put constant initializer list data in non-generic addr space. (intel#3005)
  [SYCL][L0] Fix memory leak in PiDeviceCache and ZeCommandList (intel#2974)
  [SYCL] Fix detection of free function calls (intel#3003)
  [SYCL][NFC] Clean up the builder_dir argument description (intel#3021)
  [SYCL][ESIMD] Fix LowerESIMD crash on a scalar fptoui LLVM instruction (intel#2699)
  [NFC] Remove redundant call to getMainExecutable() (intel#3018)
  ...
@Pennycook Pennycook deleted the cuda-collective-barriers branch January 28, 2021 18:23