Allreduce cpu example fails with CCL_WORKER_COUNT > 1 #109

Open
piotrchmiel opened this issue Jan 16, 2024 · 3 comments

piotrchmiel commented Jan 16, 2024

I started playing with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp

I modified it slightly, increasing the buffer size by a factor of 100:

diff --git a/examples/cpu/cpu_allreduce_test.cpp b/examples/cpu/cpu_allreduce_test.cpp
index 6e9ac4d..5dfe2d9 100644
--- a/examples/cpu/cpu_allreduce_test.cpp
+++ b/examples/cpu/cpu_allreduce_test.cpp
@@ -22,7 +22,7 @@
 using namespace std;

 int main() {
-    const size_t count = 4096;
+    const size_t count = 4096*100;

     size_t i = 0;

When I run it with the CCL_WORKER_COUNT environment variable set to a value > 1, it fails with the following errors:

piotrc@machine:~/ws/oneCCL/build$ CCL_WORKER_COUNT=2 mpirun -np 2 examples/cpu/cpu_allreduce_test
[1705415958.879795729] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
[1705415958.879801821] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 559315 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 9 (Killed)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 559316 RUNNING AT gbnwp-pod023-1
=   KILLED BY SIGNAL: 6 (Aborted)
===================================================================================

With CCL_WORKER_COUNT=1 it works perfectly.

piotrc@machine:~/ws/oneCCL/build$ mpirun -np 2 examples/cpu/cpu_allreduce_test
PASSED

What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set some specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.

Attached CCL_LOG_LEVEL=info logs.txt
Attached CCL_LOG_LEVEL=debug logs_debug.txt

@piotrchmiel (Author)

Possible workaround:

FI_PROVIDER=verbs CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED

FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test
PASSED
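
Presumably the crash is specific to the psm3 provider's shared-memory path (the assertion comes from psm3/ptl_am/ptl.c), so forcing libfabric onto another provider sidesteps it. To double-check which provider is actually selected at runtime, libfabric's own logging can be enabled; a sketch (log verbosity and paths may need adjusting for your install):

FI_LOG_LEVEL=info FI_PROVIDER=tcp CCL_WORKER_COUNT=2 ../../install/bin/mpirun -np 2 ../../install/examples/cpu/cpu_allreduce_test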

nikitaxgusev (Contributor) commented Jan 24, 2024

@piotrchmiel Hi. fi_info should report whether psm3 is available on your system; do you see it listed? Please run it and check: https://github.com/oneapi-src/oneCCL/tree/master/deps/ofi/bin
Also, can you give a hint on how you compile oneCCL?
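
A quick way to check, assuming the fi_info binary shipped under deps/ofi/bin (or any fi_info already on PATH; the exact path depends on your build):

./deps/ofi/bin/fi_info -p psm3
./deps/ofi/bin/fi_info | grep -i provider

If the first command reports that no providers match, psm3 is not usable on this machine.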

@yao-matrix commented:

@piotrchmiel, you can try this: echo 0 > /proc/sys/kernel/yama/ptrace_scope.
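
This is likely related to the "Reading from remote process' memory failed. Disabling CMA support" messages above: psm3's intra-node path uses cross-memory attach (process_vm_readv), which Yama's ptrace restriction can block when ptrace_scope is non-zero. A sketch of checking and relaxing the setting (requires root; not persistent across reboots unless added to sysctl configuration):

# show the current Yama ptrace restriction level (0 = classic, unrestricted between same-uid processes)
cat /proc/sys/kernel/yama/ptrace_scope
# relax it so CMA reads between the MPI ranks are allowed
sudo sysctl -w kernel.yama.ptrace_scope=0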
