thread-scoped barriers do not work as expected #2825

admbbs · 2024-11-15T06:00:15Z

admbbs
Nov 15, 2024

Hi, I am trying to understand the concept of a thread-scoped barrier, and here is some code I wrote:

#include <stdio.h>
#include <cstdint>
#include <cuda/barrier>

__global__ void k() {
  using barrier = cuda::barrier<cuda::thread_scope_thread>;

  // Each thread declares its own barrier, expecting a single arrival.
  barrier bar;
  init(&bar, 1);

  // Dump the raw 64-bit barrier state before and after each phase.
  printf("[%u] phase 0 data 0x%lx\n", threadIdx.x, *reinterpret_cast<uint64_t *>(&bar));
  bar.arrive_and_wait();
  printf("[%u] phase 1 data 0x%lx\n", threadIdx.x, *reinterpret_cast<uint64_t *>(&bar));
  bar.arrive_and_wait();
  printf("[%u] phase 2 data 0x%lx\n", threadIdx.x, *reinterpret_cast<uint64_t *>(&bar));
}

int main(int argc, char **argv){
  k<<<1, 2>>>();  // one block of two threads, matching the output below
  cudaDeviceSynchronize();
  return 0;
}

The code above is compiled with the command nvcc s.cu -arch sm_80.

I thought thread-scoped means that each thread gets its own copy and that the copies do not interfere with each other, so the code above should run to completion and print something like the following (the order may differ, since the 2 threads run concurrently):

[0] phase 0 data 0x7fffffff7fffffff
[1] phase 0 data 0x7fffffff7fffffff
[0] phase 1 data 0xffffffff7fffffff
[1] phase 1 data 0xffffffff7fffffff
[0] phase 2 data 0x7fffffff7fffffff
[1] phase 2 data 0x7fffffff7fffffff

But what I get is the output below, and then the program just hangs. It seems that the two thread-scoped barriers from different threads interfere with each other.

[0] phase 0 data 0x7fffffff7fffffff
[1] phase 0 data 0x7fffffff7fffffff
[0] phase 1 data 0xfffffffe7fffffff

Did I understand the concept of thread-scoped correctly, or am I missing something?

My environment, in case it matters:

nvcc --version shows

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Wed_Apr_17_19:19:55_PDT_2024
Cuda compilation tools, release 12.5, V12.5.40
Build cuda_12.5.r12.5/compiler.34177558_0

nvidia-smi shows

A100-PCIE-40GB


griwes · 2024-11-15T13:41:38Z

griwes
Nov 15, 2024
Collaborator

This is probably caused by #2585. Can you test with #2586 to see if that resolves the problem?

In all fairness we should probably redefine the thread-scope barrier to not use atomics in the first place, at least in non-shared memory, but until then, this smells like the above issue to me.
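
The idea of a thread-scope barrier that avoids atomics can be illustrated with a minimal sketch. `private_barrier`, `phases_kernel`, and `all_threads_reach_phase_two` are hypothetical names made up for this illustration; they are not part of libcu++ and do not reflect how `cuda::barrier` is or will be implemented. Because only the owning thread ever touches the object, plain loads and stores are enough, and each thread advances through its own phases independently. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical type, NOT part of libcu++: a barrier visible to one thread only,
// so plain (non-atomic) reads and writes are sufficient.
struct private_barrier {
  unsigned expected;  // arrivals needed to complete a phase
  unsigned pending;   // arrivals still missing in the current phase
  unsigned phase;     // number of completed phases

  __device__ explicit private_barrier(unsigned count)
      : expected(count), pending(count), phase(0) {}

  // Record one arrival; the last expected arrival completes the phase.
  __device__ void arrive() {
    if (--pending == 0) {
      pending = expected;
      ++phase;
    }
  }
};

// Every thread owns its own barrier and completes two phases on it.
__global__ void phases_kernel(unsigned *out) {
  private_barrier bar(1);
  bar.arrive();
  bar.arrive();
  out[threadIdx.x] = bar.phase;
}

// Host helper (hypothetical): true if every thread observed phase == 2.
bool all_threads_reach_phase_two(int threads) {
  unsigned *d_out = nullptr;
  cudaMalloc(&d_out, threads * sizeof(unsigned));
  phases_kernel<<<1, threads>>>(d_out);
  std::vector<unsigned> h_out(threads, 0);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(h_out.data(), d_out, threads * sizeof(unsigned), cudaMemcpyDeviceToHost);
  cudaFree(d_out);
  for (unsigned p : h_out) {
    if (p != 2) return false;
  }
  return true;
}
```

Calling `all_threads_reach_phase_two(2)` from a host `main` mirrors the two-thread launch in the original question: no thread's arrivals can affect another thread's barrier, because no memory is shared between them.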

2 replies

admbbs Nov 16, 2024
Author

Thanks for your reply! I believe it to be true. Trying to narrow it down and understand the real cause, I wrote another code snippet:


#include <cuda/atomic>

__global__ void kernel() {
    // x lives in this thread's local memory (its stack).
    cuda::atomic<int, cuda::thread_scope_block> x;
    x.fetch_add(1, cuda::memory_order_seq_cst);
}

In the PTX code, I saw something like

cvta.local.u64  %SP, %SPL;

and in the SASS code, something like

IMAD.MOV.U32 R1, RZ, RZ, c[0x0][0x28] 
 S2R R3, SR_LANEID

 VOTEU.ANY UR6, UPT, PT 
 FLO.U32 R4, UR6 

 ISETP.EQ.U32.AND P0, PT, R4, R3, PT 

 @P0 ATOM.E.ADD.STRONG.SM PT, R3, [R2.64], R5

I GUESS that, since the address of the atomic is converted to a generic address with cvta, the PTX compiler feels safe to combine all the fetch_adds from different threads into a single one, issued by a leader selected by voting, hence the problem.

But what confuses me is that different threads clearly should have different %SPLs, so why does the PTX compiler combine the operations like this?

You may refer to the godbolt link for the full code.
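
For readers unfamiliar with the pattern in the SASS above, here is a hand-written sketch of a warp-aggregated atomic increment: the active lanes elect a leader, and only the leader issues one atomic on behalf of the group. This illustrates the general technique and is not the compiler's actual transformation; `aggregated_increment`, `count_kernel`, and `run_shared_counter` are hypothetical names, and picking the lowest active lane as leader is an arbitrary choice. The sketch assumes a one-dimensional block and, crucially, that every participating lane targets the same address. That assumption is exactly what would not hold if each lane's atomic lived at its own local address.

```cuda
#include <cuda_runtime.h>

// Hand-written warp aggregation (illustrative only).
// ASSUMPTION: all active lanes pass the SAME `counter` pointer.
__device__ int aggregated_increment(int *counter) {
  unsigned mask = __activemask();            // lanes executing this call together
  int leader = __ffs(mask) - 1;              // lowest active lane acts as leader
  int lane = threadIdx.x % warpSize;         // lane index, assuming a 1D block
  int base = 0;
  if (lane == leader) {
    // One atomic for the whole group, adding the number of active lanes.
    base = atomicAdd(counter, __popc(mask));
  }
  base = __shfl_sync(mask, base, leader);    // broadcast the leader's old value
  // Each lane derives its own "old" value from its rank among the active lanes.
  return base + __popc(mask & ((1u << lane) - 1));
}

__global__ void count_kernel(int *counter) {
  aggregated_increment(counter);
}

// Host helper (hypothetical): launches `threads` threads in one block and
// returns the final counter value. Error checking is omitted for brevity.
int run_shared_counter(int threads) {
  int *d_counter = nullptr;
  int h_counter = 0;
  cudaMalloc(&d_counter, sizeof(int));
  cudaMemcpy(d_counter, &h_counter, sizeof(int), cudaMemcpyHostToDevice);
  count_kernel<<<1, threads>>>(d_counter);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&h_counter, d_counter, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_counter);
  return h_counter;
}
```

Calling `run_shared_counter(64)` from a host `main` should return 64, since each of the 64 threads contributes exactly one increment to the shared counter, even though far fewer atomic instructions are executed.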

admbbs Nov 16, 2024
Author

> Can you test with #2586 to see if that resolves the problem?

I will test this as soon as I get access to my CUDA machine.
