solved: how to launch a slurm executor from an interactive slurm job #248

Open
stas00 opened this issue Jul 12, 2024 · 0 comments


stas00 commented Jul 12, 2024

I forget where in the docs/code I saw it, but somewhere it said not to launch a slurm executor from an srun interactive session. Avoiding that is not always possible.

There is a simple workaround: unset the SLURM_* env vars before launching, and it works just fine.

unset $(printenv | grep '^SLURM_' | cut -d= -f1)
./my_datatrove_slurm.py

Of course, your srun session will now be without its env vars, which you may or may not care about.

To help others find this solution, the error you are likely to see is:

srun: error: CPU binding outside of job step allocation, allocated CPUs are: 0x0000000000000FFF80000000000000000000000FFF8000000000.
srun: error: Task launch for StepId=120986.0 failed on node xxx-yyy-11: Unable to satisfy cpu bind request
srun: error: Application launch failed: Unable to satisfy cpu bind request
srun: Job step aborted

There is also this discussion, which proposes unsetting just the SLURM_CPU_BIND* env vars, so you would then run:

unset $(printenv | grep '^SLURM_CPU_BIND' | cut -d= -f1)
./my_datatrove_slurm.py

If you want to clear them just for the datatrove launcher, use this one-liner syntax (it sets them to empty strings for that one command only):

SLURM_CPU_BIND= SLURM_CPU_BIND_VERBOSE= SLURM_CPU_BIND_LIST= SLURM_CPU_BIND_TYPE= ./my_datatrove_slurm.py

Or you could, of course, unset them inside your script as well, which would make launching even simpler.

That way all SLURM_* env vars will remain intact in your shell environment if you need them for something else.
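
For the "inside your script" variant, here is a minimal sketch of removing only the CPU-binding variables from Python. The helper name `drop_env_prefix` and the demo values are made up for illustration; in a real script you would call it with no second argument so it acts on `os.environ`.

```python
import os
from typing import List, MutableMapping


def drop_env_prefix(prefix: str, environ: MutableMapping[str, str] = os.environ) -> List[str]:
    """Remove every variable whose name starts with `prefix`; return the removed names."""
    # Collect the names first so the mapping is not mutated while it is being iterated.
    doomed = [name for name in list(environ) if name.startswith(prefix)]
    for name in doomed:
        del environ[name]
    return doomed


# Demo on a plain dict with made-up values, so the behaviour is easy to see.
demo = {
    "SLURM_CPU_BIND": "quiet,mask_cpu:0xF",
    "SLURM_CPU_BIND_TYPE": "mask_cpu:",
    "SLURM_JOB_ID": "120986",
    "HOME": "/home/user",
}
removed = drop_env_prefix("SLURM_CPU_BIND", demo)
print(sorted(removed))  # ['SLURM_CPU_BIND', 'SLURM_CPU_BIND_TYPE']
print(sorted(demo))     # ['HOME', 'SLURM_JOB_ID']
```

Changes to `os.environ` only affect the Python process and anything it launches, so the shell you started the script from keeps all of its SLURM_* variables.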

edit:

I added:

        import os
        # datatrove fails to start slurm jobs from an interactive slurm job,
        # so hack to pretend we aren't inside an interactive slurm job by removing SLURM env vars
        # (iterate over a snapshot of the keys, since we remove entries inside the loop)
        for key in list(os.environ):
            if key.startswith("SLURM_"):
                os.environ.pop(key)

at the top of my script to make it always work.
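
If the script itself still needs some SLURM_* values later (say, for logging the job id), a possible variation is to hide them only while the executor is being launched and restore them afterwards. This is a sketch, not something datatrove provides: `without_env_prefix` and `SLURM_DEMO_VAR` are hypothetical names, and where exactly you submit the executor is up to your script.

```python
import os
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def without_env_prefix(prefix: str = "SLURM_") -> Iterator[None]:
    """Temporarily remove env vars starting with `prefix`, restoring them on exit."""
    # Save the matching variables before deleting them.
    saved = {k: v for k, v in os.environ.items() if k.startswith(prefix)}
    for k in saved:
        del os.environ[k]
    try:
        yield
    finally:
        # Put the saved variables back, even if the body raised.
        os.environ.update(saved)


os.environ["SLURM_DEMO_VAR"] = "1"  # made-up variable, just for the demo
with without_env_prefix():
    # Launch the slurm executor here; child processes will not see SLURM_* vars.
    inside = "SLURM_DEMO_VAR" in os.environ
after = os.environ.get("SLURM_DEMO_VAR")
print(inside, after)  # False 1
```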

@stas00 stas00 changed the title solved: how to launch a slurm executor from an already running slurm job solved: how to launch a slurm executor from an interactive slurm job Jul 12, 2024