SLURM script: Let MPIPoolExecutor manage processes
The error you're encountering is related to how mpi4py's `MPIPoolExecutor` works under the hood. When you launch your script with `mpirun`, it spawns multiple MPI processes. If you then use `MPIPoolExecutor`, it tries to spawn additional worker processes for each of those MPI processes. This is why you see a conflict: you're essentially trying to spawn processes on cores that are already allocated, which leads to the "All nodes which are allocated for this job are already filled" error.

Here's how to address this:

1. **Avoid nested parallelism**: Don't combine `mpirun` with `MPIPoolExecutor`. Either use the typical MPI approach (send/receive) or use the `MPIPoolExecutor`, but not both.

2. **Use `MPIPoolExecutor` without `mpirun`**: The way `MPIPoolExecutor` works is that you run your Python script normally (i.e., without `mpirun`), and the executor manages the creation of worker processes and the distribution of tasks across them. Adjust your SLURM script:

   ```bash
   #!/bin/bash
   # Other directives...

   # You don't need mpirun here
   python my_model.py > py_test.log
   ```

3. **Adjust the code**: Since you're launching without `mpirun`, you don't have to worry about the world size or about checking it against the number of jobs; the `MPIPoolExecutor` manages the tasks for you (a minimal sketch is included at the end of this answer).

4. **Optionally, use `MPI.COMM_WORLD.Spawn`**: If you want more control, consider calling `MPI.COMM_WORLD.Spawn()` yourself to launch the worker processes instead of using the `MPIPoolExecutor` (a rough sketch also follows below).

Lastly, always be cautious when working in an HPC environment. Nested parallelism can exhaust resources and potentially harm other users' jobs. Always test on a smaller subset of cores/nodes, and monitor your jobs to make sure they're behaving as expected.
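To make point 3 concrete, here is a minimal sketch of what `my_model.py` could look like when driven by `MPIPoolExecutor`. The `run_case` task function, the task count, and the `max_workers` value are hypothetical placeholders, not taken from your code:

```python
# my_model.py -- minimal sketch; launch with `python my_model.py`, not mpirun.
from mpi4py.futures import MPIPoolExecutor

def run_case(x):
    # Placeholder for one model evaluation (hypothetical).
    return x * x

if __name__ == "__main__":
    # The executor spawns MPI worker processes on demand and distributes the
    # tasks across them; no manual check of the world size is needed.
    with MPIPoolExecutor(max_workers=4) as executor:
        results = list(executor.map(run_case, range(16)))
    print(results)
```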
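And for point 4, a rough sketch of dynamic process management with `Spawn()`, assuming a hypothetical `worker.py` script and worker count:

```python
# parent.py -- minimal Spawn sketch (file name and process count are assumptions).
import sys
from mpi4py import MPI

# Launch 4 copies of a separate worker script under MPI's control.
intercomm = MPI.COMM_WORLD.Spawn(sys.executable, args=["worker.py"], maxprocs=4)

# Send a work description to the workers over the intercommunicator.
# worker.py (not shown) would call MPI.Comm.Get_parent() and a matching bcast.
intercomm.bcast({"n_tasks": 16}, root=MPI.ROOT)
intercomm.Disconnect()
```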