Question about max-worker-count #654
Comments
Hi, I'll answer these separately:
The number of running HyperQueue jobs is determined purely by the amount of available resources. If each job has a single task that requires 1 CPU, and you have a single worker with 128 cores, you could have 128 HQ jobs running even though you only have a single worker. So without more context, it is not enough to just look at the number of HQ jobs.
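For illustration, here is a minimal sketch of that situation; the 128 submissions and the sleep payload are just placeholders, and it assumes a single 128-core worker is connected:

```bash
# Submit 128 independent jobs, each consisting of a single task that
# needs 1 CPU.
for i in $(seq 1 128); do
    hq submit --cpus=1 sleep 600
done

# With one 128-core worker connected, all 128 jobs can be RUNNING at
# the same time: the limit is the worker's CPU count, not the number
# of workers or allocations.
hq job list
```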
This is what you should see. There should be at most 10 running workers.
This either means that you have multiple allocation queues, or have started workers by some other means, or there is a bug in HQ.
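One way to rule out the first two possibilities is to inspect the server state directly (a minimal sketch; the queue IDs and worker names will of course differ in your deployment):

```bash
# List all allocation queues registered with the automatic allocator;
# more than one row would mean several queues are submitting
# allocations independently.
hq alloc list

# List the currently connected workers; if only the automatic allocator
# starts workers, their number should not exceed max-worker-count.
hq worker list
```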
In this case (on Barbora), I have only one allocation queue, and one HQ job takes a whole worker node (36 cores). I do not start jobs in any way other than through that one allocation queue. I can provide a debug log if that helps.
Yes, please send the debug log.
OK, the debug log is here (128MB): https://www.fzu.cz/~svatosm/hq-debug-output.log
Hmm, something quite weird is happening here. Here is the part of the log for one Slurm allocation with ID 115741:
The worker … The second weird thing is that about 17 hours after the worker … It almost seems like Slurm executed the allocation, then stopped it for some reason, and then restarted it from scratch (running the same original command) under the same allocation ID. If this is indeed what happened, then it breaks many assumptions made by the automatic allocator and is the probable cause of your issue.
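One way to check this hypothesis on the Slurm side is sketched below; it assumes standard Slurm accounting is enabled, and the fields available depend on the Slurm version:

```bash
# Show the accounting record for allocation 115741. A requeued job
# keeps its original job ID, so a restart typically shows up as a
# start time far later than the submit time, or as additional entries
# for the same ID.
sacct -j 115741 --format=JobID,Submit,Start,End,Elapsed,State,NodeList

# While the job is still known to the controller, scontrol also
# reports an explicit restart counter.
scontrol show job 115741 | grep -o "Restarts=[0-9]*"
```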
According to the Slurm details of allocation 115741: …
Ok, thank you, that confirms my suspicion. Well, the automatic allocator has no notion of allocation restarts at the moment (I had no idea that Slurm/PBS are even allowed to do that... they should just give the allocation a new ID, IMO), so this will need more complex design and implementation work to be fixed. I'll think about it.
Hi,
I am trying to limit the rate at which our allocation is depleted. For that, I thought the max-worker-count option of an allocation queue would be the way to do it. Now, watching it run, I have a few questions about it.
1.) Is there a way to see that the option was propagated to HQ? The hq alloc list command does not show it, and hq alloc info 5 does not show the details of the allocation queue settings either.

2.) How does it actually work? According to the docs (https://it4innovations.github.io/hyperqueue/v0.16.0/deployment/allocation/#max-worker-count), it should set the maximum number of workers that can be queued or running. So I set the allocation queue like this:
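(A sketch only; the actual command is not shown here, and the time limit, account, and partition below are placeholders.)

```bash
# Sketch of an allocation queue matching the description below: one
# worker per Slurm allocation, at most 10 workers queued or running.
# The time limit, ACCOUNT, and PARTITION are placeholders.
hq alloc add slurm \
    --time-limit 2h \
    --workers-per-alloc 1 \
    --max-worker-count 10 \
    -- --account=ACCOUNT --partition=PARTITION
```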
With one allocation queue running 1 worker per allocation and a maximum of 10 workers, I assumed that I would have at most 10 running workers/jobs. But looking at the number of running jobs, I see that I have 24 of them:
I see the same when listing workers:
But I do see the 10 in hq alloc info 5 (dropping the previously finished workers from the list to make it shorter):

I am rather confused by this situation. So, can max-worker-count be used to limit the number of running jobs?
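For reference, a minimal way to cross-check the numbers being compared here is to look at the three views side by side (a sketch; the queue ID 5 is taken from the commands above):

```bash
# Jobs as seen by HyperQueue; the State column shows which are RUNNING.
hq job list

# Workers currently connected to the server; with --max-worker-count=10
# and one worker per allocation, at most 10 should be listed here.
hq worker list

# Allocations created by allocation queue 5, together with their
# states; queued and running allocations count against
# max-worker-count.
hq alloc info 5
```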