
Handle connection closure for Dask workers #78

Closed
jacksund opened this issue Feb 7, 2022 · 2 comments · Fixed by #503
Labels: bug (Something isn't working), help wanted (Extra attention is needed)

Comments

jacksund (Owner) commented Feb 7, 2022:

For long-running Prefect Agents and Dask clusters, the database connection (through Django) sometimes drops, causing all subsequent database transactions to fail with a "connection already closed" error.
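The failure mode is generic: once a connection object is closed, every later query on it raises, so a wrapper that reconnects on that error keeps later tasks alive. A minimal stand-alone sketch using sqlite3 as a stand-in for the Django-managed connection (illustrative only, not Simmate code):

```python
import sqlite3

class ReconnectingDB:
    """Toy stand-in for the behavior described above: if the underlying
    connection has been closed, reopen it instead of letting every
    later query fail."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = sqlite3.connect(path)

    def query(self, sql):
        try:
            return self.conn.execute(sql).fetchall()
        except sqlite3.ProgrammingError:
            # "Cannot operate on a closed database" -- reconnect and
            # retry, rather than failing every task from here on out.
            self.conn = sqlite3.connect(self.path)
            return self.conn.execute(sql).fetchall()

db = ReconnectingDB()
print(db.query("SELECT 1"))  # [(1,)]
db.conn.close()              # simulate the server dropping us
print(db.query("SELECT 1"))  # reconnects transparently: [(1,)]
```

In Django itself, `django.db.close_old_connections()` serves a similar purpose: calling it before each unit of work discards dead or expired connections so the ORM opens a fresh one on the next query.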

jacksund added the bug (Something isn't working) label on Feb 7, 2022
jacksund (Owner, Author) commented:
This is a common issue; I've seen several Stack Overflow posts about it, such as this one.

A while back, I tried fixing this with a WorkerPlugin for Dask, but struggled with connection leaks. I should revisit my commented-out code here:

# BUG: do I want to make a connection first? It looks like if the connection
# closes at any point, then all following tasks will fail -- that is, once
# the connection closes, no following tasks reopen it. Maybe if I don't establish
# the connection upfront, I can just let every task make its own.
# The error I get is...
# Unexpected error: InterfaceError('connection already closed') dask
# This has only happened when I submit >5,000 flow runs, so there may be too
# many connections opened at that point. I'm not sure if that's actually the
# case though.
# BUG: for scaling Dask to many workers, I initially ran into issues of "too many
# files open". This is addressed in Dask's FAQ:
# https://distributed.dask.org/en/latest/faq.html#too-many-open-file-descriptors
# To summarize the fix...
# (1) check the current soft limit (soft = no sudo permissions) for files
# ulimit -Sn
# (2) to increase the soft limit, edit the limits.conf file and add one line
# sudo nano /etc/security/limits.conf
# # add this line below
# * soft nofile 10240
# (3) close and reopen the terminal
# (4) confirm we changed the limit
# ulimit -Sn
#
# This may also be a leak of sockets being left open by Dask:
# (1) get the PID of the running process with
# ps -aef | grep python
# (2) look at the fd's (file opened) by the given process
# cd /proc/<PID>/fd; ls -l
# (3) count the number of files opened by the given process
# ls /proc/<PID>/fd/ | wc -l
# (4) view overall stats with
# cat /proc/<PID>/net/sockstat
# (5) another option to list open files is
# lsof -p <PID> | wc -l
#
# Whenever I see a heartbeat fail, I also see a massive jump in the number of
# files opened by the process. I believe zombie prefect runs are creating
# a socket leak.
#
# --------------------------------------------------------------------------------------
# THIS REPRESENTS A PLUGIN FOR A DASK WORKER. I MAY WANT TO SWITCH TO THIS APPROACH
# IN THE FUTURE. See here: https://distributed.dask.org/en/latest/plugins.html
# from dask.distributed import WorkerPlugin
#
# class DjangoPlugin(WorkerPlugin):
#     def setup(self, worker):
#         print("PRELOADING DJANGO TO DASK WORKER")
#         # First set up the django settings for simmate
#         from simmate.configuration.django import setup_full  # ensures setup
#         # The settings (including the database) are all set up now, but
#         # django doesn't actually connect to the database until a query is
#         # made. So here, we run a very simple query that should work for any
#         # django database. We don't actually need the output -- we just want
#         # to make a call that establishes a connection. Let's use the
#         # ContentType table because it's typically small.
#         from django.contrib.contenttypes.models import ContentType
#         ContentType.objects.count()
#
#     def teardown(self, worker):
#         print("SHUTTING DOWN DASK WORKER CONNECTIONS TO DJANGO DATABASE")
#         # grab all of the existing database connections for this thread
#         from django.db import connections
#         # close all of these connections to prevent connection leaking
#         connections.close_all()
#
# REGISTER WITH...
#     def dask_setup(scheduler):
#         plugin = DjangoPlugin()
#         client.register_worker_plugin(plugin)  # doctest: +SKIP
# OR...
#     from dask.distributed import Client
#     client = Client(cluster.scheduler.address)
#     from simmate.configuration.dask.connect_to_database import DjangoPlugin
#     plugin = DjangoPlugin()
#     client.register_worker_plugin(plugin)
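The /proc inspection steps in the notes above can be scripted, so a worker can log its own descriptor count over time and catch a leak before hitting the ulimit. A small Linux-only sketch (the function name here is my own, not a Dask or Simmate API):

```python
import os
import resource

def open_fd_count(pid: str = "self") -> int:
    # Mirrors `ls /proc/<PID>/fd/ | wc -l` from the notes above.
    # Linux-only: /proc is not available on macOS or Windows.
    return len(os.listdir(f"/proc/{pid}/fd"))

# Compare against the soft limit that `ulimit -Sn` reports.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"{open_fd_count()} fds open, soft limit {soft}")
```

Logging this periodically from a worker (e.g. inside a plugin's `setup` via a timer) would show whether the jump in open files really correlates with failed heartbeats.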

jacksund (Owner, Author) commented:
This may be relevant to this issue: DigitalOcean applies automatic updates to its managed database servers:

Updates are automatically applied to this cluster Tuesday after 9AM - 1PM (EST).

How would long-lived connections behave here? Could this be a cause of the connections closing?
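One knob worth checking regardless of the cause: Django's `CONN_MAX_AGE` setting controls how long a connection may be reused, so a connection dropped server-side (e.g. during a maintenance window) gets replaced instead of reused forever. A sketch of the relevant settings fragment (the database name and engine below are placeholders, not Simmate's actual config):

```python
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "example_db",  # placeholder, not the real database name
        # 0 closes the connection after each request/task; None keeps it
        # open indefinitely; a finite value (in seconds) recycles it
        # periodically, which bounds how stale a connection can become.
        "CONN_MAX_AGE": 0,
    }
}
```

A finite `CONN_MAX_AGE` shorter than the server's idle timeout would rule maintenance restarts in or out as the cause fairly quickly.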

jacksund added the help wanted (Extra attention is needed) label on Jul 12, 2022