SLURM script: Let MPIPoolExecutor manage processes
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood. When you launch your script with `mpirun`, it spawns multiple MPI processes. If you then use `MPIPoolExecutor`, it tries to spawn additional worker processes for each of those MPI processes. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, which leads to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid nested parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (send/receive) or use the `MPIPoolExecutor`, but not both.

2. **Use `MPIPoolExecutor` without `mpirun`**: The way `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the executor manages the creation of worker processes and the distribution of tasks across them. Adjust your SLURM script:

   ```bash
   #!/bin/bash
   # Other directives...

   # You don't need mpirun here
   python my_model.py > py_test.log
   ```

3. **Adjust the code**: Since you're launching without `mpirun`, you don't have to worry about the world size or about checking it against the number of jobs; the `MPIPoolExecutor` manages the tasks for you (a minimal sketch is included at the end of this answer).

4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, consider calling `MPI.COMM_WORLD.Spawn()` yourself to launch the worker processes instead of using the `MPIPoolExecutor` (a rough sketch also follows below).

Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Always test on a smaller subset of cores/nodes, and monitor your jobs to make sure they're behaving as expected.
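To make point 3 concrete, here is a minimal sketch of what `my_model.py` could look like when driven by `MPIPoolExecutor`. The `run_case` task function, the task count, and the `max_workers` value are hypothetical placeholders, not taken from your code:

```python
# my_model.py -- minimal sketch; launch with `python my_model.py`, not mpirun.
from mpi4py.futures import MPIPoolExecutor

def run_case(x):
    # Placeholder for one model evaluation (hypothetical).
    return x * x

if __name__ == "__main__":
    # The executor spawns MPI worker processes on demand and distributes the
    # tasks across them; no manual check of the world size is needed.
    with MPIPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(run_case, range(16)))
    print(results)
```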
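And for point 4, a rough sketch of dynamic process management with `Spawn()`, assuming a hypothetical `worker.py` script and worker count:

```python
# parent.py -- minimal Spawn sketch (file name and process count are assumptions).
import sys
from mpi4py import MPI

# Launch 4 copies of a separate worker script under MPI's control.
intercomm = MPI.COMM_WORLD.Spawn(sys.executable, args=["worker.py"], maxprocs=4)

# Send a work description to the workers over the intercommunicator.
# worker.py (not shown) would call MPI.Comm.Get_parent() and a matching bcast.
intercomm.bcast({"n_tasks": 16}, root=MPI.ROOT)
intercomm.Disconnect()
```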