Question about max-worker-count #654

Open
svatosFZU opened this issue Jan 5, 2024 · 7 comments

@svatosFZU

Hi,
I am trying to limit the rate at which our allocation is depleted. For that, I thought the max-worker-count option of the allocation queue would be the way to do it. Now, watching it run, I have a few questions about it.
1.) Is there a way to see that the option was propagated to HQ? The hq alloc list does not show it:

[[email protected] ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq alloc list
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+
| ID | State   | Backlog size | Workers per alloc | Timelimit | Manager | Name | Args                    |
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+
| 5  | RUNNING | 1            | 1                 | 2days     | SLURM   |      | -AOPEN-29-6,-pqcpu,-c36 |
+----+---------+--------------+-------------------+-----------+---------+------+-------------------------+

and hq alloc info 5 does not show the details of the allocation queue settings either.

2.) How does it actually work? According to the docs (https://it4innovations.github.io/hyperqueue/v0.16.0/deployment/allocation/#max-worker-count), it should set the maximum number of workers that can be queued or running. So, I set up the allocation queue like this:

/home/svatosm/hq-v0.16.0-linux-x64/hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h --worker-time-limit 47h59m --worker-start-cmd "/scratch/project/open-29-6/session/ACmount.sh" --worker-stop-cmd "/scratch/project/open-29-6/session/ACumount.sh" -- -AOPEN-29-6 -pqcpu -c36

With one allocation queue running 1 worker per allocation and a maximum of 10 workers, I assumed that I would have at most 10 running workers/jobs. But looking at the number of running jobs, I see I have 24 of them:

[[email protected] ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq job list
+------+------+---------+-------+
| ID   | Name | State   | Tasks |
+------+------+---------+-------+
| 3444 | bash | RUNNING | 1     |
| 3445 | bash | RUNNING | 1     |
| 3446 | bash | RUNNING | 1     |
| 3448 | bash | RUNNING | 1     |
| 3449 | bash | RUNNING | 1     |
| 3450 | bash | RUNNING | 1     |
| 3451 | bash | RUNNING | 1     |
| 3452 | bash | RUNNING | 1     |
| 3453 | bash | WAITING | 1     |
| 3454 | bash | RUNNING | 1     |
| 3455 | bash | RUNNING | 1     |
| 3456 | bash | RUNNING | 1     |
| 3457 | bash | RUNNING | 1     |
| 3458 | bash | RUNNING | 1     |
| 3459 | bash | RUNNING | 1     |
| 3461 | bash | WAITING | 1     |
| 3462 | bash | RUNNING | 1     |
| 3463 | bash | RUNNING | 1     |
| 3464 | bash | RUNNING | 1     |
| 3465 | bash | RUNNING | 1     |
| 3466 | bash | WAITING | 1     |
| 3467 | bash | WAITING | 1     |
| 3468 | bash | RUNNING | 1     |
| 3469 | bash | RUNNING | 1     |
| 3470 | bash | WAITING | 1     |
| 3472 | bash | WAITING | 1     |
| 3476 | bash | WAITING | 1     |
| 3478 | bash | WAITING | 1     |
| 3479 | bash | RUNNING | 1     |
| 3480 | bash | WAITING | 1     |
| 3481 | bash | RUNNING | 1     |
| 3482 | bash | WAITING | 1     |
| 3483 | bash | RUNNING | 1     |
| 3484 | bash | RUNNING | 1     |
| 3485 | bash | WAITING | 1     |
| 3486 | bash | WAITING | 1     |
| 3487 | bash | WAITING | 1     |
| 3488 | bash | WAITING | 1     |
| 3489 | bash | WAITING | 1     |
| 3490 | bash | WAITING | 1     |
| 3491 | bash | WAITING | 1     |
| 3492 | bash | WAITING | 1     |
| 3493 | bash | WAITING | 1     |
| 3494 | bash | WAITING | 1     |
| 3495 | bash | WAITING | 1     |
| 3496 | bash | WAITING | 1     |
| 3497 | bash | WAITING | 1     |
+------+------+---------+-------+
There are 3497 jobs in total. Use `--all` to display all jobs.

I see the same when listing workers:

[[email protected] ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq worker list
+------+---------+-----------------------+---------------------------+---------+----------------+
| ID   | State   | Hostname              | Resources                 | Manager | Manager Job ID |
+------+---------+-----------------------+---------------------------+---------+----------------+
| 1530 | RUNNING | cn34.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115785         |
| 1531 | RUNNING | cn16.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115209         |
| 1532 | RUNNING | cn73.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115378         |
| 1534 | RUNNING | cn71.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115741         |
| 1535 | RUNNING | cn88.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115316         |
| 1536 | RUNNING | cn84.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115331         |
| 1537 | RUNNING | cn42.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115272         |
| 1538 | RUNNING | cn44.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115374         |
| 1539 | RUNNING | cn54.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115649         |
| 1541 | RUNNING | cn99.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 115302         |
| 1542 | RUNNING | cn181.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115397         |
| 1543 | RUNNING | cn168.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115532         |
| 1544 | RUNNING | cn191.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115748         |
| 1545 | RUNNING | cn119.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 115422         |
| 1546 | RUNNING | cn111.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116091         |
| 1547 | RUNNING | cn124.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116112         |
| 1548 | RUNNING | cn129.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116113         |
| 1549 | RUNNING | cn187.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116218         |
| 1550 | RUNNING | cn55.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 116233         |
| 1551 | RUNNING | cn22.barbora.it4i.cz  | cpus 2x18; mem 187.36 GiB | SLURM   | 116234         |
| 1552 | RUNNING | cn114.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116235         |
| 1553 | RUNNING | cn8.barbora.it4i.cz   | cpus 2x18; mem 187.36 GiB | SLURM   | 116236         |
| 1554 | RUNNING | cn188.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116246         |
| 1555 | RUNNING | cn186.barbora.it4i.cz | cpus 2x18; mem 187.36 GiB | SLURM   | 116249         |
+------+---------+-----------------------+---------------------------+---------+----------------+

But I see the 10 in hq alloc info 5 (dropping previously finished workers from the list to make it shorter):

[[email protected] ~]$ /home/svatosm/hq-v0.16.0-linux-x64/hq alloc info 5
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+
| ID     | State    | Working directory                            | Worker count | Queue time          | Start time          | Finish time         |
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+
| 116091 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/161 | 1            | 04.01.2024 11:52:34 | 04.01.2024 13:20:58 |                     |
| 116112 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/162 | 1            | 04.01.2024 13:20:58 | 04.01.2024 13:21:15 |                     |
| 116113 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/163 | 1            | 04.01.2024 13:21:15 | 04.01.2024 13:21:31 |                     |
| 116114 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/164 | 1            | 04.01.2024 13:21:31 | 04.01.2024 17:53:43 | 04.01.2024 17:53:44 |
| 116218 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/165 | 1            | 04.01.2024 18:01:05 | 04.01.2024 19:11:05 |                     |
| 116233 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/166 | 1            | 04.01.2024 19:11:05 | 04.01.2024 19:14:06 |                     |
| 116234 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/167 | 1            | 04.01.2024 19:14:06 | 04.01.2024 19:14:19 |                     |
| 116235 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/168 | 1            | 04.01.2024 19:19:19 | 04.01.2024 19:19:39 |                     |
| 116236 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/169 | 1            | 04.01.2024 19:24:39 | 04.01.2024 19:25:00 |                     |
| 116237 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/170 | 1            | 04.01.2024 19:30:00 | 04.01.2024 19:30:00 | 04.01.2024 19:30:01 |
| 116241 | FAILED   | /home/svatosm/.hq-server/003/autoalloc/5/171 | 1            | 04.01.2024 19:45:00 | 04.01.2024 20:02:26 | 04.01.2024 20:02:27 |
| 116246 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/172 | 1            | 04.01.2024 20:30:00 | 04.01.2024 20:30:32 |                     |
| 116249 | RUNNING  | /home/svatosm/.hq-server/003/autoalloc/5/173 | 1            | 04.01.2024 21:00:32 | 04.01.2024 21:00:46 |                     |
+--------+----------+----------------------------------------------+--------------+---------------------+---------------------+---------------------+

I am rather confused by this situation. So, can max-worker-count be used to limit the number of running jobs?

@Kobzol
Collaborator

Kobzol commented Jan 5, 2024

Hi, I'll answer these separately:

  1. This option currently isn't shown in the alloc queue table, but it will be once this PR is merged.

  2. In general, your understanding of the option is correct. If you set the max worker count to N, the given allocation queue will never allow more than N of its workers to run in parallel. However, note that this parameter only applies to a single allocation queue. If you have multiple allocation queues, you can have more than N workers in total (and you can of course also start additional workers manually); see the sketch below.
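
To make the per-queue scope concrete, here is a rough sketch (the queue parameters and project names below are made up for illustration, not taken from this issue):

# Hypothetical example: each queue enforces its own limit independently.
hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h -- -Aproject-a -pqcpu -c36
hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h -- -Aproject-b -pqcpu -c36
# The first queue can have up to 10 workers queued or running, and so can the
# second one, so up to 20 autoalloc workers in total may be active at once.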

But looking at the number of running jobs, I see I have 24 of them:

The number of running HyperQueue jobs is determined purely by the amount of available resources. If each job has a single task that requires 1 CPU and you have a single worker with 128 cores, you could have 128 HQ jobs running even though you only have a single worker. So without more context, it's not enough to just look at the number of HQ jobs.
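
As a quick illustration (a sketch, not from this issue; it assumes the --cpus resource request flag of hq submit and 36-core worker nodes as on Barbora):

# Each job below has a single task; its CPU request decides how many can run per worker.
hq submit --cpus=36 -- ./task.sh   # at most 1 such job runs on a 36-core worker at a time
hq submit --cpus=1 -- ./task.sh    # up to 36 such jobs can run on the same worker concurrently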

But I see the 10 in hq alloc info 5 (dropping previously finished workers from the list to make it shorter):

This is what you should see. There should be at most 10 running workers.

I see the same when listing workers:

This means that either you have multiple allocation queues, or you have started workers by some other means, or there is a bug in HQ.

@svatosFZU
Author

In this case (on Barbora), I have only one allocation queue, and one HQ job takes a whole worker node (36 cores). I do not start the jobs in any other way than through that one allocation queue. I can provide a debug log if that helps.

@Kobzol
Collaborator

Kobzol commented Jan 5, 2024

You have sent hq alloc info 5, so the previous 4 allocation queues are paused or removed, right? Yeah, a debug log and the sequence of commands that you have used since the start of the HQ server (since it seems to have some non-trivial state, e.g. the 5 allocation queues) would be helpful.
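
For reference, a debug log can be produced roughly along these lines (a sketch only; it assumes debug logging is enabled through the RUST_LOG environment variable and that the log is written to stderr):

# Sketch: start the server with debug-level logging and capture it to a file.
RUST_LOG=hyperqueue=debug hq server start 2> hq-debug.log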

@svatosFZU
Author

OK, the debug log is here (128MB): https://www.fzu.cz/~svatosm/hq-debug-output.log
The previous allocation queues were started with different parameters and have all been removed (for some time now).
For the sequence of commands, I would need more details on what exactly is needed. As I have around 3k jobs and there are multiple checks of job statuses, there were probably tens of thousands of HQ commands performed since this server was started.

@Kobzol
Collaborator

Kobzol commented Jan 5, 2024

Hmm, something quite weird is happening here. Here is the part of the log for one Slurm allocation with ID 115741:

[2024-01-03T12:20:00.433Z DEBUG hyperqueue::server::autoalloc::process] Worker 1522 connected to allocation 115741
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerLost(WorkerId(1522), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172785s) }, LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s })
[2024-01-03T17:40:40.434Z DEBUG hyperqueue::server::autoalloc::process] Removing worker 1522 from allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Handling message WorkerConnected(WorkerId(1534), ManagerInfo { manager: Slurm, allocation_id: "115741", time_limit: Some(172779s) })
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Registering worker 1534 for allocation 115741
[2024-01-04T10:52:41.832Z DEBUG hyperqueue::server::autoalloc::process] Worker 1534 connected to allocation 115741
[2024-01-04T10:52:41.832Z WARN  hyperqueue::server::autoalloc::process] Allocation 115741 has status Finished { started_at: SystemTime { tv_sec: 1704284400, tv_nsec: 433489969 }, finished_at: SystemTime { tv_sec: 1704303640, tv_nsec: 434765520 }, disconnected_workers: DisconnectedWorkers { workers: Map({WorkerId(1522): LostWorkerDetails { reason: HeartbeatLost, lifetime: 19240.00128616s }}) } } and does not expect new workers

Worker 1522 connected from this allocation on Wednesday at 12:20. It later disconnected at 17:40 because of a lost heartbeat (either the worker crashed or the network connection between the worker and the server was disrupted). Around 17:00, Slurm started claiming that this allocation had moved from the running state back to either PENDING (PD) or CONFIGURING, which HQ interprets as being queued. So the first weird thing is why a running allocation switched back to the pending state, which apparently confused HQ.

The second weird thing is that about 17 hours after worker 1522 crashed, a different worker 1534 connected to HQ, claiming to be from the same allocation! That is very strange, because the autoalloc start script never starts another worker inside the same allocation (unless your scripts are starting a worker, which I doubt). After this happened, HQ was quite confused: it thought that the allocation was finished and therefore started a new allocation, even though worker 1534 was apparently still running (and that's why you see more than 10 workers being online).

It almost seems like Slurm executed the allocation, stopped it for some reason, and then restarted it from scratch (running the same original command) under the same allocation ID. If that is indeed what happened, it breaks many assumptions made by the automatic allocator and is the probable cause of your issue.
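
One way to check this on the Slurm side is to look at the Requeue/Restarts fields reported by scontrol (a quick sketch):

scontrol show job 115741 | grep -E 'JobState|Requeue|Restarts'
# Restarts > 0 together with Requeue=1 would mean Slurm requeued and re-executed
# the allocation batch script under the same Slurm job ID.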

@svatosFZU
Author

According to the Slurm details of job 115741, it was indeed restarted.

[[email protected] ~]$ scontrol show job 115741
JobId=115741 JobName=hq-alloc-5
   UserId=svatosm(4517) GroupId=svatosm(4647) MCS_label=N/A
   Priority=200268119 Nice=0 Account=open-29-6 QOS=2478_3949
   JobState=COMPLETED Reason=None Dependency=(null)
   Requeue=1 Restarts=1 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=1-02:46:50 TimeLimit=2-00:00:00 TimeMin=N/A
   SubmitTime=2024-01-03T17:04:45 EligibleTime=2024-01-03T17:06:46
   AccrueTime=2024-01-03T17:06:46
   StartTime=2024-01-04T11:52:20 EndTime=2024-01-05T14:39:10 Deadline=N/A
   PreemptEligibleTime=2024-01-04T11:52:20 PreemptTime=None
   SuspendTime=None SecsPreSuspend=0 LastSchedEval=2024-01-04T11:52:20 Scheduler=Main
   Partition=qcpu AllocNode:Sid=login1:4137603
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=cn71
   BatchHost=cn71
   NumNodes=1 NumCPUs=36 NumTasks=1 CPUs/Task=36 ReqB:S:C:T=0:0:*:*
   ReqTRES=cpu=36,mem=190000M,node=1,billing=1
   AllocTRES=cpu=36,node=1,billing=1
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=36 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/svatosm/.hq-server/003/autoalloc/5/154/hq-submit.sh
   WorkDir=/home/svatosm/.hq-server/003/autoalloc/5/154
   StdErr=/home/svatosm/.hq-server/003/autoalloc/5/154/stderr
   StdIn=/dev/null
   StdOut=/home/svatosm/.hq-server/003/autoalloc/5/154/stdout
   Power=

@Kobzol
Collaborator

Kobzol commented Jan 5, 2024

Ok, thank you, that confirms my suspicion. Well, the automatic allocator has no notion of allocation restarts at the moment (I had no idea that Slurm/PBS was even allowed to do that... they should just give the allocation a new ID, IMO), so this will need more complex design and implementation work in order to be fixed. I'll think about it.
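
In the meantime, a possible mitigation to sketch (just an idea, not verified for this setup): since the trailing arguments of hq alloc add are forwarded to sbatch, Slurm could be asked not to requeue the allocations at all:

# Sketch: append sbatch's --no-requeue so a preempted/failed allocation is not
# restarted under the same Slurm job ID (whether this is acceptable depends on site policy).
hq alloc add slurm --max-worker-count 10 --backlog 1 --time-limit 48h --worker-time-limit 47h59m \
    -- -AOPEN-29-6 -pqcpu -c36 --no-requeue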
