srun --cpus-per-task=1 causing job to be run twice. #6482
I've done more studying/learning on this matter, which gets me closer to resolution, and most things are making sense, except for the simple hostname command I talk about below. I've been able to reproduce my problem of multiple jobs starting on a multi-core machine when I specify --cpus-per-task=1 with a simple usage of xterm:

srun --time=11:00:00 --job-name=srun_test --cpus-per-task=1 --mem=0 --partition=sp-32-gb --exclusive --pty --x11 xterm

The partition sp-32-gb can choose from three compute resources: sp-32-gb-dy-sp-32gb-4-cores-[2-10], sp-32-gb-dy-sp-32gb-8-cores-[1-10], sp-32-gb-dy-sp-32gb-16-cores-[1-10], so 4, 8, or 16 cores are available.

If I change the command in the above srun from 'xterm' to 'hostname', I was expecting four instances of the hostname to be spit out. Rather, only one instance comes out:

$ srun --job-name=srun_test --cpus-per-task=1 --mem=0 --partition=sp-32-gb --exclusive --pty hostname

I don't understand why hostname does not spit out four times. ChatGPT tells me that since hostname is a simple command, srun is being smart and squashing additional output. Odd.

I went back to my user's command and I see it actually did start up four identical jobs, one on each CPU. So that is consistent. Great. That said, I made an assumption that --ntasks=1 is a default value. Clearly the slurm srun documentation proves me wrong:

My user is now happy running his makefile with:
I'm comfortable with the srun behavior now and understand that one really needs to use --ntasks=1 if you want the job to run only once when a single task doesn't use all the CPUs on a node.
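A hedged summary of why the interactive example above behaves this way, based on the srun documentation: --cpus-per-task changes the default task count, so an --exclusive allocation on a 4-core node with --cpus-per-task=1 and no --ntasks launches four tasks; and --pty executes task zero in a pseudo terminal and redirects the other tasks' stdout/stderr to /dev/null, which would be why only one hostname is printed even though four tasks run.

# Four tasks on a 4-core node (one per allocated CPU); --pty shows only task zero's output:
$ srun --job-name=srun_test --cpus-per-task=1 --mem=0 --partition=sp-32-gb --exclusive --pty hostname

# Explicitly requesting one task runs the command exactly once:
$ srun --job-name=srun_test --ntasks=1 --cpus-per-task=1 --mem=0 --partition=sp-32-gb --exclusive --pty hostname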
This is with ParallelCluster 3.9.1 and 3.11.0.
My examples below are from my 3.11.0 test cluster.
My cluster has multiple instance types for a partition called od-32-gb:
$ sinfo | grep "^od-32-gb"
od-32-gb up infinite 40 idle~ od-c7a-4xl-dy-od-c7a-4xl-[1-10],od-c7i-4xl-dy-od-c7i-4xl-[1-10],od-m7a-2xl-dy-od-m7a-2xl-[1-10],od-r7a-xl-dy-od-r7a-xl-[1-10]
You can see the instance types in the compute resource names. All the compute resources have multi-threading disabled in the configuration file, which I verified with scontrol show node (all nodes show this same output for ThreadsPerCore):
State=IDLE+CLOUD+POWERED_DOWN ThreadsPerCore=1 TmpDisk=0 Weight=4105 Owner=N/A MCS_label=N/A
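One way to spot-check this across the whole partition at once (the %c and %z fields are standard sinfo format specifiers for CPU count and socket:core:thread layout; a trailing ":1" means one thread per core):

# Lists every node in the partition with its CPU count and S:C:T layout:
$ sinfo -N -p od-32-gb -o "%N %c %z"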
In fact, all the ?7a* nodes have hyperthreading disabled by AWS, so I'm suspicious of the c7i instance.
When a user starts a job with srun --cpus-per-task=1, Slurm sometimes starts two jobs.
I cannot reproduce this with a simple command such as hostname or echo hello.
If I replace --cpus-per-task=1 with --ntasks=1, srun does the right thing, i.e. only starts up one job.
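A minimal illustration of the two invocations (partition name from the sinfo output above; ./my_command is a stand-in for the user's real command, and --exclusive reflects the usage described below):

# Sometimes starts the command twice on a multi-core node:
$ srun --partition=od-32-gb --exclusive --cpus-per-task=1 ./my_command

# Starts the command exactly once:
$ srun --partition=od-32-gb --exclusive --ntasks=1 ./my_command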
It's not unlike what is seen here: https://groups.google.com/g/slurm-users/c/L4nCXtZLlTo
except that that post relates to hyperthreading being enabled, which I don't have. I have also logged onto the nodes where the jobs are running in duplicate, and lscpu confirms that hyperthreading is disabled.
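For reference, the on-node check in question (standard lscpu fields; "Thread(s) per core: 1" indicates hyperthreading is off):

$ lscpu | grep -E '^CPU\(s\)|Thread\(s\) per core|Core\(s\) per socket'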
I am using --cpus-per-task to ensure I get a machine with the right core/CPU count when I use partitions where I'm selecting instance types based on their memory configuration. I don't want to use --mem=XXX because I use --exclusive: I want the entire machine for the user, and I don't want them to have to know how much memory is really available on a 32 GB instance type.
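So the intended submission pattern is roughly the following sketch (the core count of 8 is only illustrative; --cpus-per-task steers placement onto a compute resource with at least that many cores, while --exclusive hands the whole node to the user without anyone having to specify memory):

$ srun --partition=od-32-gb --exclusive --ntasks=1 --cpus-per-task=8 --pty bash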
Before I dig further in narrowing this problem down, I'm posting here in case someone has some insights or guidance on this issue.
describe-cluster output:
$ pcluster describe-cluster -n tsi4
{
  "creationTime": "2024-10-17T21:03:33.056Z",
  "headNode": {
    "launchTime": "2024-10-17T21:08:39.000Z",
    "instanceId": "i-0fdbb2d8be1e83a9d",
    "instanceType": "m7a.medium",
    "state": "running",
    "privateIpAddress": "10.6.3.120"
  },
  "version": "3.11.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-93a06c12efe5c398-v1-do-not-delete.s3.us-west-2.amazonaws.com/parallelcluster/3.11.0/clusters/tsi4-mfnuhjkosy8ub3ol/configs/cluster-config.yaml?versionId=flMbPNVjuSxFoK5Yh6Jo3y_VBSqq0Gyi&AWSAccessKeyId=AKIAUYAYZG3JPXZ2AMIC&Signature=8AHJdhyk7K2cob3jUrbTRlQSTio%3D&Expires=1729362107"
  },
  "tags": [
    {
      "value": "3.11.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "tsi4",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "true",
      "key": "parallelcluster-ui"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "tsi4",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:326469498578:stack/tsi4/44996cd0-8ccb-11ef-ac1e-069a9ba2d89d",
  "lastUpdatedTime": "2024-10-17T21:03:33.056Z",
  "region": "us-west-2",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}