tl;dr of this issue: a Dask `Client` can create an executor and get activated by it, but it does not get shut down by it. AMLTK never sees the `Client`, so it's on the user to close it, or the `Client` only closes when the program shuts down.
Not an issue in a single Python interpreter with a single `Client`, but a big problem in a test suite where many `Client`s are being made.
When running the entire test suite recently, I've been getting errors along the lines of "too many file descriptors open". This should not be a problem in a typical setup but good to get fixed sooner rather than later.
A quick lookup of that issue and pytest found this (notably old) issue about pytest and objects not getting garbage collected, meaning resources were not freed.
I found simonw's investigations helpful, and from following the trick he mentions in his blog, it's mostly a lot of dask workers being left open. With more tests recently added that include `Scheduler`s, this seems closely related. One thing is that I've been relying on explicitly calling `__enter__()` and then the `close()` and `shutdown()` mechanisms of executors, rather than using context managers. I don't remember my reasoning for this, but it should definitely be changed. The existing setup seemed good enough for Python's native executors and `Loky`, but seemingly not for `dask`.
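For contrast, a minimal sketch of the context-manager style that should be used instead (using a stand-in `ThreadPoolExecutor` and an illustrative `run_tasks` helper, not AMLTK's actual API):

```python
from concurrent.futures import ThreadPoolExecutor

def run_tasks(fns):
    # Entering the `with` block calls __enter__(); leaving it always
    # calls shutdown(), even if submitting or collecting results raises.
    with ThreadPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(fn) for fn in fns]
        return [f.result() for f in futures]
```

The point is that teardown is guaranteed on every exit path, which explicit `__enter__()`/`shutdown()` calls do not give you if an exception fires in between.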
```
(Pdb) import psutil
(Pdb) for f in psutil.Process().open_files(): print(f)
popenfile(path='/tmp/dask-worker-space/worker-0c3gtful.dirlock', fd=24, position=0, mode='w', flags=557057)
# + 50 more similar lines
```

Dask seems to suggest there should be at most 2-10 of these per worker, with 10 being rare, so it seems they're not getting shut down properly.
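The pdb session above could be turned into an automated leak check around a test. A rough sketch (Linux-only; `open_fd_count` is a hypothetical helper, not something AMLTK ships):

```python
import os

def open_fd_count() -> int:
    # On Linux, /proc/self/fd holds one entry per open file descriptor.
    return len(os.listdir("/proc/self/fd"))

# Example: an unclosed file shows up in the count, just like the
# lingering dask worker lock files above would.
before = open_fd_count()
leaked = open("/dev/null")
assert open_fd_count() == before + 1
leaked.close()
assert open_fd_count() == before
```

Comparing the count before and after a test run points straight at the fixture that leaks.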
eddiebergman changed the title from "[Bug] Scheduler might not be closing resources properly" to "[Bug] Dask Executor `__exit__` does not close its Client" on Jan 28, 2024.
After some more testing, it is very much just a dask issue and not a `Scheduler` one. The issue is best illustrated this way:
```python
# Userland, creating the Client
client = Client(...)
executor: cf.ClientExecutor = client.get_executor()

# What AMLTK can see
scheduler = Scheduler(executor)
scheduler.run()

# Approximation of inside the `Scheduler`
def run():
    with self.executor:  # This starts the executor and client
        ...
    # And now it's closed the executor

# Userland
# The Client is still active here despite the executor that was
# created from it being closed.
client.close()
```
The problem is that `cf.ClientExecutor.__exit__()`/`shutdown()` doesn't actually shut down the `Client` it was created from. This means AMLTK has no direct way to close the `Client`, and the responsibility is on the user to do so. This should be documented somewhere, as I do not want to implement hacks to force this behaviour.
In the meantime, to fix the tests, we will simply make sure to close the Client.
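The workaround amounts to closing the `Client` ourselves once the executor is done with it. A rough sketch of the pattern, written as a plain context manager rather than the actual committed pytest fixture (`processes=False` just gives a cheap in-process cluster for illustration):

```python
from contextlib import contextmanager
from distributed import Client

@contextmanager
def dask_executor():
    # Start a local in-process cluster for the test.
    client = Client(processes=False)
    try:
        yield client.get_executor()
    finally:
        # The executor's shutdown() will NOT close the Client,
        # so we have to do it here ourselves.
        client.close()
```

In a pytest fixture the same shape applies: yield the executor, then `client.close()` after the yield.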
Side note: this information is good to know and means we should actually implement this on the dask-jobqueue side of things, as there we do have access to the creating client.
The issue was fixed by just closing the clients at the end of each fixture call. I accidentally committed the changes directly to main but the two prominent ones to consider are: