I modified it slightly by increasing the buffer size by a factor of 100:
diff --git a/examples/cpu/cpu_allreduce_test.cpp b/examples/cpu/cpu_allreduce_test.cpp
index 6e9ac4d..5dfe2d9 100644
--- a/examples/cpu/cpu_allreduce_test.cpp
+++ b/examples/cpu/cpu_allreduce_test.cpp
@@ -22,7 +22,7 @@
using namespace std;
int main() {
- const size_t count = 4096;
+ const size_t count = 4096*100;
size_t i = 0;
When I run it with the CCL_WORKER_COUNT environment variable set to a value > 1, it fails with the following errors:
piotrc@machine:~/ws/oneCCL/build$ CCL_WORKER_COUNT=2 mpirun -np 2 examples/cpu/cpu_allreduce_test
[1705415958.879795729] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
[1705415958.879801821] machine:rank1.cpu_allreduce_test: Reading from remote process' memory failed. Disabling CMA support
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
machine:rank1: Assertion failure at psm3/ptl_am/ptl.c:196: nbytes == req->req_data.recv_msglen
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 559315 RUNNING AT gbnwp-pod023-1
= KILLED BY SIGNAL: 9 (Killed)
===================================================================================
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 559316 RUNNING AT gbnwp-pod023-1
= KILLED BY SIGNAL: 6 (Aborted)
===================================================================================
What am I doing wrong? Why does it fail? Should I use specific flags when compiling, set a specific environment variable, or pass a specific option to mpirun? It is worth mentioning that with a smaller buffer size (for example 4096 * 10) everything works fine, even with CCL_WORKER_COUNT set to a value > 1.
I started playing with the allreduce example from the main repository: https://github.com/oneapi-src/oneCCL/blob/master/examples/cpu/cpu_allreduce_test.cpp
With CCL_WORKER_COUNT=1 it works perfectly.
Attached CCL_LOG_LEVEL=info logs.txt
Attached CCL_LOG_LEVEL=debug logs_debug.txt