Optimize custom all reduce #130

iotamudelta · 2024-08-12T21:38:21Z

remove volatile keywords (they have a different meaning on CUDA than HIP)
optimize and scope locking
increase cutoff for custom all reduce to 16 MB (based on perf data)
tune block number for MI300X
roughly 10% performance increase for non-latency bound sizes, cutoff vis-a-vis RCCL at 16 MB
while there, add missing MPI_Finalize() to test

While there, add missing finalize.

Increase sampling area to capture crossover.
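The size cutoff mentioned in the description can be pictured as a simple dispatch on message size. This is only a minimal sketch: `pick_backend`, `AllReduceBackend`, and `kCustomAllReduceCutoffBytes` are hypothetical names, and it assumes "16 MB" means 16 * 1024 * 1024 bytes with an inclusive boundary, neither of which the PR text confirms.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: "16 MB" is read as 16 MiB and the boundary is inclusive.
constexpr std::size_t kCustomAllReduceCutoffBytes = 16u * 1024u * 1024u;

// Hypothetical labels for the two code paths discussed in the PR.
enum class AllReduceBackend { Custom, Rccl };

// Use the custom all reduce up to the cutoff, and RCCL above it.
inline AllReduceBackend pick_backend(
    std::size_t bytes, std::size_t cutoff = kCustomAllReduceCutoffBytes) {
  assert(cutoff > 0);  // a zero cutoff would disable the custom path entirely
  return bytes <= cutoff ? AllReduceBackend::Custom : AllReduceBackend::Rccl;
}
```

With this sketch, a 4 MiB buffer would take the custom path and a 32 MiB buffer would go to RCCL.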

gshtras · 2024-08-12T21:53:52Z

Looks good. Just need to fix the linters

dllehr-amd · 2024-08-14T17:06:23Z

Hey folks, wanted to make a quick comment. This CAR will require ROCm 6.2 to compile. The scoping intrinsics were introduced in LLVM in the 6.2 release, so we may see compile errors on __MEMORY_SCOPE_DEVICE etc.
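One way to avoid a hard compile failure on older toolchains is to detect the scoped builtins at preprocessing time and fall back to the unscoped `__atomic_*` builtins. The sketch below is host-side C++ under stated assumptions: `CAR_HAS_SCOPED_ATOMICS`, `car_store_relaxed`, and `car_load_relaxed` are hypothetical names, and this is not necessarily how #137 restored ROCm 6.1 compatibility. The fallback uses system scope, which is at least as strong as device scope but may cost some performance.

```cpp
#include <cassert>
#include <cstdint>

// Detect the scoped atomic builtins; older compilers lack both the builtin
// and the __MEMORY_SCOPE_* macros.
#if defined(__has_builtin)
#  if __has_builtin(__scoped_atomic_store_n) && defined(__MEMORY_SCOPE_DEVICE)
#    define CAR_HAS_SCOPED_ATOMICS 1
#  endif
#endif
#ifndef CAR_HAS_SCOPED_ATOMICS
#  define CAR_HAS_SCOPED_ATOMICS 0
#endif

// Relaxed store, device-scoped when the compiler supports it.
inline void car_store_relaxed(uint32_t* addr, uint32_t value) {
  assert(addr != nullptr);
#if CAR_HAS_SCOPED_ATOMICS
  __scoped_atomic_store_n(addr, value, __ATOMIC_RELAXED,
                          __MEMORY_SCOPE_DEVICE);
#else
  __atomic_store_n(addr, value, __ATOMIC_RELAXED);  // fallback: system scope
#endif
}

// Relaxed load, device-scoped when the compiler supports it.
inline uint32_t car_load_relaxed(const uint32_t* addr) {
  assert(addr != nullptr);
#if CAR_HAS_SCOPED_ATOMICS
  return __scoped_atomic_load_n(addr, __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE);
#else
  return __atomic_load_n(addr, __ATOMIC_RELAXED);  // fallback: system scope
#endif
}
```

The same guard could wrap device code; the wrappers keep the call sites identical regardless of which branch is compiled.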

wenkaidu · 2024-08-15T00:56:33Z

csrc/custom_all_reduce.cuh

  }
-  __syncthreads();
-  // use one thread to update flag
-  if (threadIdx.x == 0) self_sg->_flag[blockIdx.x] = flag;


If this line is removed, who will update self_sg->_flag?

wenkaidu · 2024-08-15T00:56:53Z

csrc/custom_all_reduce.cuh

  }
  __syncthreads();
-  // use one thread to update flag
-  if (threadIdx.x == 0) self_sg->_flag[blockIdx.x] = flag;


If this line is removed, who will update self_sg->_flag?

wenkaidu · 2024-08-15T01:00:00Z

csrc/custom_all_reduce.cuh

  if (threadIdx.x < ngpus) {
+    // reset flag for next time
+    __scoped_atomic_store_n(&self_sg->start[blockIdx.x][threadIdx.x], 0,
+                            __ATOMIC_RELAXED, __MEMORY_SCOPE_DEVICE);


The original implementation reset the flag, which is prone to race conditions; as a result, we have seen occasional hangs during long-running workloads. self_sg->_flag was introduced to make the flag increment instead. I would prefer to keep this new mechanism for stability.
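The incrementing-flag idea can be illustrated with a host-side simulation in which threads stand in for GPUs. This is a sketch under stated assumptions, not the kernel code: `Signal`, `barrier_wait`, and `run_rounds` are hypothetical names, the comparison is `<` rather than `!=` because a fast peer may already have published the next round, and wrap-around of the 32-bit counter is ignored. Because no slot is ever reset, a late reset cannot overwrite a newer signal.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

constexpr int kRanks = 4;  // hypothetical number of participants

struct Signal {
  // slots[i][j] is written by rank j and read by rank i.
  std::atomic<uint32_t> slots[kRanks][kRanks];
  Signal() {
    for (auto& row : slots)
      for (auto& slot : row) slot.store(0, std::memory_order_relaxed);
  }
};

// Publish the current round to every peer, then wait until every peer has
// published at least that round. Flags only grow; nothing is reset.
inline void barrier_wait(Signal& sg, int rank, uint32_t round) {
  assert(rank >= 0 && rank < kRanks);
  for (int peer = 0; peer < kRanks; ++peer)
    sg.slots[peer][rank].store(round, std::memory_order_release);
  for (int peer = 0; peer < kRanks; ++peer)
    while (sg.slots[rank][peer].load(std::memory_order_acquire) < round)
      std::this_thread::yield();
}

// Run several rounds and check that no rank leaves a barrier before every
// peer has reached the same round.
inline bool run_rounds(uint32_t rounds) {
  Signal sg;
  std::atomic<uint32_t> progress[kRanks];
  for (auto& p : progress) p.store(0, std::memory_order_relaxed);
  std::atomic<bool> ok{true};
  std::vector<std::thread> workers;
  for (int rank = 0; rank < kRanks; ++rank) {
    workers.emplace_back([&, rank] {
      for (uint32_t round = 1; round <= rounds; ++round) {
        progress[rank].store(round, std::memory_order_relaxed);
        barrier_wait(sg, rank, round);
        for (int peer = 0; peer < kRanks; ++peer)
          if (progress[peer].load(std::memory_order_relaxed) < round)
            ok.store(false);
      }
    });
  }
  for (auto& w : workers) w.join();
  return ok.load();
}
```

Calling `run_rounds(1000)` exercises the barrier repeatedly; in this sketch it should return `true` without hanging.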

mawong-amd · 2024-08-15T05:12:17Z

csrc/custom_all_reduce_test.cu

@@ -330,10 +330,17 @@ int main(int argc, char** argv) {
  //     run<half>(myRank, nRanks, comm, threads, block_limit, 4096 * 1024);
  //   }
  // }
+#ifdef USE _ROCM


I understand this is fixed in #137, but this should clearly not have passed review. Let's spend more time ensuring code quality and that tests pass before merging.

mawong-amd · 2024-08-15T05:19:39Z

> Hey folks, wanted to make a quick comment. This CAR will require ROCm 6.2 to compile. The scoping intrinsics were introduced in LLVM in the 6.2 release, so we may see compile errors on __MEMORY_SCOPE_DEVICE etc.

We should not push out releases where the default settings (ROCm 6.1) do not compile. Again, I understand this is already fixed by #137, but hotfixes should be kept to a minimum, especially when the issue is so readily detectable.

* Per @iotamudelta suggestion until the deadlocks issue is better understood Revert "Make CAR ROCm 6.1 compatible. (#137)" This reverts commit 4d2dda6.
* Per @iotamudelta suggestion until the deadlocks issue is better understood Revert "Optimize custom all reduce (#130)" This reverts commit 636ff01.

iotamudelta added 4 commits August 8, 2024 22:25

First version

f0bf622

Revert error.

2881529

While there, add missing finalize.

Use the correct defaults for ROCm.

f7e671a

Increase sampling area to capture crossover.

Scope end_sync as well.

9724cf6

iotamudelta requested a review from gshtras August 12, 2024 21:38

iotamudelta and others added 3 commits August 12, 2024 16:38

Merge branch 'main' into car

fba5fb2

Guard only volatile keyword for ifndef USE_ROCM

2e4cc9d

Document crossover

214a668

iotamudelta and others added 3 commits August 13, 2024 19:08

Apply clang-format

6dcea07

Merge branch 'main' into car

541b5a5

Apply Python linter

94e2eca

gshtras approved these changes Aug 14, 2024

gshtras merged commit 636ff01 into ROCm:main Aug 14, 2024
13 checks passed

wenkaidu reviewed Aug 15, 2024

mawong-amd reviewed Aug 15, 2024

gshtras added a commit that referenced this pull request Aug 15, 2024

Per @iotamudelta suggestion until the deadlocks issue is better understood Revert "Optimize custom all reduce (#130)" This reverts commit 636ff01.

c2a7bfa
