Make `safe_interval` more dynamic for quick transport tasks #6544
Comments
Thanks for the nice write-up, @GeigerJ2! Just some minor additional comments/clarifications.
After discussion and pair coding with @GeigerJ2, I made #6596, which may solve the problem by having a `transport_request` with the lifetime of the [...] One downside of this implementation: when the calcjob takes a long time to go through all four steps (which it will, since the calcjob can be submitted to the remote and then wait in the remote scheduler queue), the SSH connection might be closed from the SSH server side, which will then trigger the exponential backoff mechanism anyway.
As realized together with @giovannipizzi while debugging things for our new cluster at PSI: when submitting a simple calculation (execution takes about 10s) for testing purposes, with the default `safe_interval=30` in the `Computer` configuration, one has to wait an additional 90s until the job is done (30s each for the `upload`, `submit`, and `retrieve` tasks). This is to be expected, of course, and one could just reduce the `safe_interval` (albeit increasing the risk of SSH overloads).

However, the `upload` task in that case is truly the first `Transport` task that is being executed by the daemon worker, so it could, in principle, enter immediately (the same holds if jobs were run previously, but longer ago than the `safe_interval`). I locally implemented a first version (thanks to @giovannipizzi's input) that does this, by adding a `last_close_time` attribute (currently added to the `authinfo` metadata for a first PoC). In the `request_transport` method of the `TransportQueue`, the time difference between the current time and the `last_close_time` is then checked, and if it is larger than `safe_interval`, the `Transport` is opened immediately, bypassing the `safe_interval` (or `safe_open_interval`, as it is called in `transports.py`).

In addition, the waiting times for the `submit` and `retrieve` tasks could also be reduced. It seems that currently the `safe_interval` is imposed on all of them, even if they finish very quickly (I assume because each of them opens a transport connection via SSH). So we were wondering whether it is possible to make this a bit more sophisticated, e.g. by adding special transport requests that could make use of the already open transport, and by keeping a transport open for a short time after its task has finished (also quickly discussed with @mbercx). Of course, one would still need to make sure that SSH doesn't get overloaded and that the implementation works under heavy load (not just for individual testing calculations), and one would also have to consider how this all works with multiple daemon workers. Again with @giovannipizzi, I had a quick look, but it seems like the implementation would be a bit more involved. So I am wondering what the others think: is this feasible and worth investing more time into? Pinging @khsrali, who has looked a bit more into transports.
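To make the `last_close_time` check described above more concrete, here is a minimal, standalone sketch of the decision logic only. It is not the actual PoC code and does not use the AiiDA API: the helper name `open_delay` and its signature are hypothetical, and it assumes that a missing `last_close_time` (no transport closed yet) is represented as `None`.

```python
from typing import Optional


def open_delay(now: float, last_close_time: Optional[float], safe_interval: float) -> float:
    """Return how many seconds to wait before opening the transport.

    Hypothetical helper, not part of AiiDA. All times are in seconds.
    """
    # Assumption: None means no transport has been closed yet, i.e. this is
    # the first transport task of the daemon worker, so nothing to wait for.
    if last_close_time is None:
        return 0.0

    # The last connection was closed longer ago than safe_interval:
    # open immediately, bypassing the wait.
    if now - last_close_time > safe_interval:
        return 0.0

    # Otherwise keep the current behaviour and wait the full safe_interval.
    return safe_interval
```

With `safe_interval=30`, a call such as `open_delay(now=100.0, last_close_time=None, safe_interval=30.0)` gives no wait for the first `upload`, while `submit` and `retrieve` shortly afterwards would still wait 30s each, so the overhead for the test calculation would drop from about 90s to roughly 60s under these assumptions.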