SLURM script: Let MPIPoolExecutor manage processes
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood.

When you launch your script with `mpirun`, it starts multiple MPI processes. Each of those processes then creates its own `MPIPoolExecutor`, which tries to spawn additional worker processes (processes, not threads). This is the conflict: the workers are requested on cores that the `mpirun` ranks already occupy, which leads to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid Nested Parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (using send/receive) or use the `MPIPoolExecutor`.

2. **Using `MPIPoolExecutor` without `mpirun`**: With `MPIPoolExecutor`, you run your Python script normally (i.e., without `mpirun`), and the executor spawns the worker processes itself and distributes the tasks across them.

   Adjust your SLURM script:
   ```bash
   #!/bin/bash
   # Other directives...

   # You don't need mpirun here
   python my_model.py > py_test.log
   ```

3. **Adjust the Code**: In your code, since you're launching without `mpirun`, you don't have to worry about the world size or checking against the number of jobs. The `MPIPoolExecutor` will automatically manage the tasks for you.

4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, you can consider using `MPI.COMM_WORLD.Spawn()` to launch the worker processes instead of using the `MPIPoolExecutor`.
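
For step 3, a minimal sketch of what the script could look like when it is launched with plain `python`. The names `simulate`, `run_all`, and `make_executor` are hypothetical and not taken from `my_model.py`, and the thread-pool fallback is an assumption added only so the script can be smoke-tested on a machine without mpi4py.

```python
from concurrent.futures import Executor, ThreadPoolExecutor


def simulate(n: int) -> int:
    """Hypothetical stand-in for one model run: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_all(executor: Executor, inputs: list[int]) -> list[int]:
    """Map the task over the inputs; results come back in input order."""
    return list(executor.map(simulate, inputs))


def make_executor() -> Executor:
    """Prefer MPIPoolExecutor; fall back to threads when mpi4py is missing."""
    try:
        from mpi4py.futures import MPIPoolExecutor
    except ImportError:
        # Assumption: a local fallback is handy for smoke tests off-cluster.
        return ThreadPoolExecutor(max_workers=2)
    # With max_workers left unset, mpi4py decides how many workers to spawn.
    return MPIPoolExecutor()


if __name__ == "__main__":
    # The guard matters: spawned workers load this file to find simulate().
    with make_executor() as executor:
        print(run_all(executor, [1, 2, 3, 4]))
```

Because the executor is passed in, the same `run_all` works with any `concurrent.futures`-style pool, and the script needs no rank or world-size checks.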

Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Test on a small subset of cores/nodes first, and monitor your jobs to make sure they behave as expected.
EwoutH committed Aug 29, 2023
1 parent 63ce62c commit 1a94011
Showing 1 changed file with 1 addition and 1 deletion.
2 changes: 1 addition & 1 deletion scripts/test_script.sh
@@ -14,4 +14,4 @@ module load python
module load py-numpy
module load py-mpi4py

-mpirun python my_model.py > py_test.log
+python my_model.py > py_test.log
