Make safe_interval more dynamic for quick transport tasks #6544

Open
GeigerJ2 opened this issue Jul 22, 2024 · 3 comments

GeigerJ2 (Contributor) commented Jul 22, 2024

As realized together with @giovannipizzi while debugging things for our new cluster at PSI: when submitting a simple calculation (execution takes about 10s) for testing purposes, with the default safe_interval=30 in the Computer configuration, one has to wait an additional 90s until the job is done (30s each for the upload, submit, and retrieve tasks). This is to be expected, of course, and one could just reduce the safe_interval (albeit at the increased risk of overloading SSH).

However, the upload task in that case is truly the first Transport task being executed by the daemon worker, so it could, in principle, start immediately (the same holds if jobs were run previously, but longer ago than the safe_interval). I locally implemented a first version (thanks to @giovannipizzi's input) that does this by adding a last_close_time attribute (currently stored in the authinfo metadata for a first PoC). In the request_transport method of the TransportQueue, the time difference between the current time and the last_close_time is then checked, and if it is larger than safe_interval, the Transport is opened immediately via:

open_callback_handle = self._loop.call_later(0, do_open, context=contextvars.Context())  # or use 1 for safety?

bypassing the safe_interval (or safe_open_interval as it is called in transports.py).
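
To make this concrete, here is a minimal sketch of the delay computation, assuming the last close time is tracked on the TransportQueue in a (hypothetical) last_close_time attribute; the function name and placement are illustrative, not the actual aiida-core code, and the sketch also includes the refinement discussed further below of waiting only for the remaining part of the interval:

    import time

    def seconds_until_open(last_close_time, safe_open_interval):
        """How long to wait before opening a new transport.

        If no transport was closed yet, or the last close happened more than
        ``safe_open_interval`` seconds ago, open immediately (return 0.0);
        otherwise wait only for the remaining part of the interval.
        """
        if last_close_time is None:
            return 0.0
        elapsed = time.monotonic() - last_close_time
        return max(0.0, safe_open_interval - elapsed)

    # Schematically, inside TransportQueue.request_transport one would then do:
    #   delay = seconds_until_open(self.last_close_time, safe_open_interval)
    #   open_callback_handle = self._loop.call_later(delay, do_open, context=contextvars.Context())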

In addition, the waiting times for the submit and retrieve tasks could also be reduced. It seems that currently the safe_interval is imposed on all of them, even if they finish very quickly (I assume because all of them open a transport connection via SSH). So we were wondering whether it's possible to make this a bit more sophisticated, e.g. by adding special transport requests that could make use of the already open transport, and by keeping a transport whose task has finished open for a short while longer (also quickly discussed with @mbercx). Of course, one would still need to make sure SSH doesn't get overloaded, that the implementation works under heavy load (not just for individual test calculations), and one would also have to consider how this all works with multiple daemon workers. Again with @giovannipizzi, I had a quick look, but it seems like the implementation would be a bit more involved. So I'm wondering what the others think, and whether this is feasible and worth investing more time into. Pinging @khsrali, who has looked a bit more into transports.

giovannipizzi (Member) commented:

Thanks for the nice write-up @GeigerJ2! Just some minor additional comments/clarifications:

  • Actually, even for a millisecond run, the time to wait is 120s (or generally 4 times the safe_interval), rather than 3×30s = 90s:

    • [30s] initial wait to upload
    • [30s] time to wait to submit
    • [30s] time for the first check of the scheduler queue (which, for jobs that do not get queued but run immediately and finish in < 30s, would already report that the job has finished)
    • [30s] to retrieve the calculation outputs and set the calculation as finished

    (and I guess one adds another 30s if there is also stashing involved)

  • The initial implementation keeps the time in the metadata of the authinfo, but already while discussing with Julian we realized it's better not to put it there, as this is shared by all daemon workers and could lead to wrong results, collisions, and exceptions when multiple workers write to the same DB row, etc. It's better to just keep it in another local attribute self.last_close_time, alongside self._transport_requests. On the other hand, I just realized that if you are running from a local interpreter, and maybe submitting run() from a bash "for" loop (e.g. of verdi run commands), this might bypass the limit, as each process will think that nothing was submitted before. But probably this is OK with the current implementation? Fixing it properly would require making the whole concept of a safe_interval not specific to a worker, but global to an AiiDA profile.

  • In the implementation discussed above, in addition to setting the first parameter of call_later to zero if more than safe_interval seconds have passed since the last call, I would also set the waiting time to the remaining part of the interval, i.e. safe_interval - (current_time - last_close_time), so that e.g. you only wait 10 seconds if you closed the transport 20 seconds ago (with a 30-second safe_interval).

  • The points above solve the waiting of the first 30 seconds. For the other 3×30 seconds, the idea is that in this case the connection was probably just closed less than a second before, i.e. the time for AiiDA to change state. If we could keep the connection open for a configurable time after the last command (say with a default of 5 or 10 seconds), a full single submission could go down to probably < 5 seconds, rather than 120s as now. However, I am not 100% sure how to do this properly. My idea was that, when we get here:

        if transport_request.count == 0:
            if transport_request.future.done():

    we actually inject another task of 5-10 seconds, and somehow mark that the last task is of this special type ("last_wait=True"). As soon as any other task is executed, it instead sets "last_wait=False". When the special last_wait task finishes (5-10 seconds later), it does nothing if in the meantime "last_wait" was set to False; otherwise it performs the closing operations (transport_request.future.result().close()).
    However, while thinking about how to implement this, I wasn't sure how to do it correctly, as we want to return from request_transport immediately and not wait for these 5-10 seconds; otherwise the whole task is blocked and we haven't solved the problem, we've just made the process wait 5-10 seconds more :-) Suggestions welcome (pinging also @sphuber). A rough sketch of this idea is shown right after this list.
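
For reference, here is a rough sketch of this delayed-close idea; the names GRACE_PERIOD, schedule_delayed_close, and queue_state are hypothetical and only illustrate the mechanism under the assumptions above (they are not existing aiida-core API). The key point is that call_later returns a handle immediately, so request_transport is never blocked during the grace period:

    import time

    GRACE_PERIOD = 5  # seconds to keep the transport open after the last task (hypothetical default)

    def schedule_delayed_close(loop, transport_request, queue_state):
        """Schedule the transport close after a short grace period instead of
        closing it as soon as the request count drops to zero.

        ``loop`` is the worker's asyncio event loop (``self._loop`` in the
        TransportQueue); ``queue_state`` stands for the object holding the
        hypothetical ``last_close_time`` attribute.
        """

        def maybe_close():
            # A new task may have grabbed the transport during the grace period
            # (the equivalent of "last_wait" having been reset to False); in
            # that case, do nothing.
            if transport_request.count == 0 and transport_request.future.done():
                transport_request.future.result().close()
                queue_state.last_close_time = time.monotonic()

        # call_later returns immediately; the caller is never blocked here.
        return loop.call_later(GRACE_PERIOD, maybe_close)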

khsrali (Contributor) commented Oct 11, 2024

Regarding this, I did a test to see how significantly this issue impacts our performance.

I configured a computer with core.ssh as the transport plugin and core.direct as the scheduler (SSH to my own machine). I then submitted 225 simple jobs; each job does an arithmetic addition and creates 4 files, each 1 KB in size. Finally, I recorded the frequency of calls to a number of important functions and methods in core.ssh and BashCliScheduler.

First, safe_interval is set to 30 seconds. In this case, it takes 7 minutes to execute everything. During this time, transport.open() was called 8 times.
[Figure: histogram of call frequencies for the 225 jobs with safe_interval = 30 seconds]

Then I set safe_interval to 0.1 seconds. Now it takes 4 minutes and 28 seconds, and during this time transport.open() was called 7 times.
[Figure: histogram of call frequencies for the 225 jobs with the reduced safe_interval]

Since transport.open() was called only a few times, I think we may not need to change the design further; there may be some room for improvement, but the benefit relative to the effort seems minor.

Note 1: In longer job scenarios, this volume of requests tends to be spread out over time. As a result, transport.open() may be called many more times, but I don't believe that would change the conclusion of this comment.
Note 2: Something worth investigating is the two long gaps in the second plot, which theoretically should not be there! In any case, that is certainly not related to safe_interval.

unkcpz (Member) commented Oct 30, 2024

Quoting @giovannipizzi: "The points above solve the waiting of the first 30 seconds."

After discussion and pair coding with @GeigerJ2, I made #6596, which may solve the problem by having a transport_request with the lifetime of the CalcJob.
The problem detected by @giovannipizzi is that, if there is only one calcjob running, the transport_request count goes to zero after every waiting.execute() call, which makes sense given the transport_request context manager implementation.
The solution is to treat "upload", "submit", "monitoring" and "retrieve" as 4 operations bundled within the lifetime of a single allocated transport_request (see the sketch at the end of this comment).

One downside of this implementation is that when the calcjob takes a long time to go through all four steps (which it will, since the calcjob may be submitted to the remote machine and wait in the remote scheduler queue), the SSH connection might be closed from the server side, and this will then trigger the exponential backoff mechanism anyway.
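
For context, here is a conceptual sketch of that idea; the async-with request_transport pattern mirrors how the engine requests transports, while the wrapper function and its internals are purely illustrative of #6596, not its actual code:

    # One transport request held for the entire CalcJob lifetime, so the
    # safe_interval is paid at most once per calcjob rather than once per
    # upload/submit/monitor/retrieve task.
    async def run_calcjob_with_single_request(node, transport_queue):
        authinfo = node.get_authinfo()  # AuthInfo of the calcjob's computer/user
        async with transport_queue.request_transport(authinfo) as request:
            transport = await request  # opened once, subject to safe_interval
            # ... perform upload, submit, scheduler monitoring and retrieval
            # here, all reusing the same `transport` object ...
        # The request is released here; if the remote side dropped the SSH
        # connection while the job sat in the scheduler queue, the engine's
        # exponential backoff mechanism handles the reconnection attempts.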
