
[Bug] Dask Executor __exit__ does not close its Client #239

Closed
eddiebergman opened this issue Jan 28, 2024 · 2 comments
Assignees
Labels
bug Something isn't working

Comments


eddiebergman commented Jan 28, 2024

tl;dr of this issue: a dask Client can create an executor, be activated by it, but not be shut down by it. AMLTK never sees the Client, so it's on the user to close it; otherwise the Client only closes when the program exits.

This is not an issue in a single Python interpreter with a single Client, but it is a big problem in a test suite where many Clients are created.


When running the entire test suite recently, I've been getting errors along the lines of "too many file descriptors open". This shouldn't be a problem in a typical setup, but it's good to get fixed sooner rather than later.

A quick search for that error together with pytest turned up this (notably old) issue about pytest and objects not getting garbage collected, meaning resources were not freed.

I found simonw's investigations helpful, and following the trick he mentions in his blog here, it turns out it's mostly a lot of dask workers being left open. With more tests that include Schedulers being added recently, this seems closely related.

One thing is that I've been relying on calling `__enter__()` explicitly and then the `close()`/`shutdown()` mechanisms of executors, rather than using them as context managers. I don't remember my reasoning for this, but it should definitely be changed. The existing setup seemed good enough for Python's native executors and Loky, but seemingly not for dask.


```
(Pdb) import psutil
(Pdb) for f in psutil.Process().open_files(): print(f)
popenfile(path='/tmp/dask-worker-space/worker-0c3gtful.dirlock', fd=24, position=0, mode='w', flags=557057)
... plus ~50 more similar lines
```

Dask seems to suggest there should be at most 2-10 open files per worker, with 10 being rare, so it seems they're not getting shut down properly.
@eddiebergman eddiebergman added the bug Something isn't working label Jan 28, 2024
@eddiebergman eddiebergman self-assigned this Jan 28, 2024
@eddiebergman eddiebergman changed the title [Bug] Scheduler might not be closing resources properly [Bug] Dask Executor __exit__ does not close its Client Jan 28, 2024
eddiebergman commented Jan 28, 2024

After some more testing, it is very much just a dask issue and not a Scheduler one. The issue is best illustrated this way:

```python
# Userland: creating the Client
client = Client(...)
executor: cf.ClientExecutor = client.get_executor()

# What AMLTK can see
scheduler = Scheduler(executor)
scheduler.run()

# Approximation of what happens inside the `Scheduler`
def run():
    with self.executor:  # __enter__ starts the executor (and the client)
        ...              # work is submitted here
    # __exit__ has now shut down the executor, but not the Client

# Userland: the Client is still active here, despite the executor
# created from it having been closed.
client.close()
```

The problem is that `cf.ClientExecutor.__exit__`/`shutdown()` doesn't actually shut down the Client it was created from. This means AMLTK has no direct way to close the Client, and the responsibility is on the user to do so. This should be documented somewhere, as I do not want to implement hacks that force this behaviour.

In the meantime, to fix the tests, we will simply make sure to close the Client.


Side note: this is good to know, and it means we should actually implement this on the dask-jobqueue side of things, as there we do have access to the creating Client.

eddiebergman commented Jan 28, 2024

The issue was fixed by just closing the clients at the end of each fixture call. I accidentally committed the changes directly to main but the two prominent ones to consider are:

  • 844ce6a - Fixing the tests
  • 6615af6 - Accessing the client and closing it with dask-jobqueue

I re-removed all exceptions to the branch protection rules so I don't accidentally push to main again... whoops 👀
