
Cannot setup environment on server to run code #1

Closed
ryabhmd opened this issue Aug 6, 2024 · 5 comments

@ryabhmd
Collaborator

ryabhmd commented Aug 6, 2024

To run scilons_pipeline.py, I've been trying to build an image on Slurm and install the datatrove[all] package (as per the instructions in the README).
I've tried to re-use several images from /netscratch/enroot (e.g. python+3.10.4-bullseye.sqsh, ubuntu20+conda.sqsh) and then install the packages on top, but I always end up with incompatibilities among the installed packages, which make the image unusable.

For example, when I build on ubuntu20+conda.sqsh and install the datatrove library, I get:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
conda-repo-cli 1.0.75 requires requests==2.31.0, but you have requests 2.32.3 which is incompatible.

However, the required version of requests is incompatible with the datasets package.
And once I save the image and use it to run the code, it cannot find any of the modules.

Any ideas on how to build an image to run the code? Maybe I need to use another image to install the package in?

@malteos

malteos commented Aug 6, 2024

Do you need requests for anything in the pipeline? My best guess is that you can simply ignore this error message.

You can also use one of my images: /netscratch/mostendorff/enroot/malteos_eulm_podman.sqsh

It has datatrove==0.2.0 installed.
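Since the earlier symptom was that a saved image "cannot find any of the modules", a quick stdlib-only sanity check run inside the container can confirm whether a package was actually installed into the interpreter being used. This is a generic sketch (only the package name datatrove comes from this thread):

```python
from importlib import metadata, util

def installed_version(pkg: str):
    """Return the installed version string for pkg, or None if it is absent."""
    try:
        return metadata.version(pkg)
    except metadata.PackageNotFoundError:
        return None

def importable(module: str) -> bool:
    """True if `import module` would succeed from the current interpreter."""
    return util.find_spec(module) is not None

# Inside the container, importable("datatrove") returning False while pip
# reported a successful install usually means pip installed into a different
# Python environment than the one running the pipeline.
```

If `importable("datatrove")` is False but `pip show datatrove` succeeds, comparing `sys.executable` with the interpreter pip belongs to is the usual next step.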

@ryabhmd
Collaborator Author

ryabhmd commented Aug 7, 2024

Thanks! Your image works. :)
However, now when I run the pipeline and it gets to the slurm execution part to launch a job from within the script, I get the following error:

FileNotFoundError: [Errno 2] No such file or directory: 'sbatch'
srun: error: serv-3317: task 0: Exited with exit code 1
I tried to look at similar issues (e.g. this one) but they didn't solve the issue.
Any ideas?

@malteos

malteos commented Aug 7, 2024

Slurm commands are not available within a containerized compute job. See https://github.com/scilons/datatrove/blob/main/src/datatrove/executor/slurm.py#L35

You need to start the Slurm pipeline from a login node or rewrite it to use a local execution pipeline.
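The failure mode above can be detected up front: `sbatch` is only on PATH where the Slurm client tools are installed (e.g. a login node), not inside the container. A minimal stdlib sketch of that check, with the executor branching from datatrove's README shown only as comments (pipeline contents omitted, as they depend on the actual job):

```python
import shutil

def slurm_available() -> bool:
    """True only if the `sbatch` binary is on PATH, i.e. we are on a host
    with Slurm client tools (login node), not inside a compute container."""
    return shutil.which("sbatch") is not None

# Sketch of the branching; LocalPipelineExecutor / SlurmPipelineExecutor are
# the executor classes named in datatrove's documentation:
#
# if slurm_available():
#     executor = SlurmPipelineExecutor(pipeline=..., ...)  # submits via sbatch
# else:
#     executor = LocalPipelineExecutor(pipeline=..., ...)  # runs in-process
# executor.run()
```

Failing fast on this check gives a clearer message than the `FileNotFoundError: 'sbatch'` raised mid-run.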

@lfoppiano

The local pipeline works fine and can be run within an interactive job. If we want to use Slurm, should I create the environment directly on the login machine and run it there?

@lfoppiano

I've installed mamba, set up a virtual environment, and ran the pipeline from there. I'm closing this; feel free to let me know if you have further questions.
