
Ensure garbage collection of distributed.scheduler.TaskState instances #6364

Conversation

hendrikmakait
Member

Partially addresses #6250

  • Tests added / passed
  • Passes pre-commit run --all-files

@hendrikmakait changed the title from "Ensure garbage collection of scheduler `TaskState instances" to "Ensure garbage collection of scheduler TaskState instances" on May 18, 2022
@hendrikmakait changed the title from "Ensure garbage collection of scheduler TaskState instances" to "Ensure garbage collection of distributed.scheduler.TaskState instances" on May 18, 2022
Comment on lines +1798 to +1799
wait_profiler()
gc.collect()
Member

I've used this myself already, but it is actually not safe. wait_profiler merely polls until the profiler thread is not currently running. If we rely on ordinary refcount-based object collection, that is sufficient: as soon as the profiler thread pauses, refcounting cleans up all references and we're good.

Most of the TaskState instances, however, are part of a densely connected, self-referencing data structure. These reference cycles are not necessarily a problem, since gc.collect() can detect them, break them, and clean up. However, the background thread may already be running again by the time wait_profiler returns, because we are only polling and not actually stopping the thread, so the collection can still race with the profiler.
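
For illustration, a polling wait of roughly this shape (a simplified sketch, not the actual wait_profiler implementation) only observes that the profiler thread is momentarily idle; nothing keeps it idle afterwards:

import time

def wait_until_profiler_idle(profiler_is_sampling, timeout=2.0):
    # Hypothetical polling helper: spin until the profiler thread is
    # observed to be idle, or until the timeout expires.
    deadline = time.monotonic() + timeout
    while profiler_is_sampling() and time.monotonic() < deadline:
        time.sleep(0.01)
    # From here on nothing holds the profiler back: it may call
    # sys._current_frames() again before the caller's gc.collect() runs,
    # grabbing fresh references to frames that keep the TaskState cycles alive.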

Member

What we'd need instead is something like

with no_profiler():
    # Some magic (e.g. a lock) ensures that the profile thread cannot watch while this ctx manager is held
    gc.collect()
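
A minimal sketch of what that could look like, assuming the profiler's sampling loop acquires the same lock around sys._current_frames() (no_profiler and _profile_lock are illustrative names here, not existing distributed APIs):

import gc
import threading
from contextlib import contextmanager

# Hypothetical module-level lock; the profiler's watch loop would acquire it
# around each sample it takes of the other threads' frames.
_profile_lock = threading.Lock()

@contextmanager
def no_profiler():
    # While this context manager is held, the profiler thread cannot take a
    # new sample, so gc.collect() cannot race with it.
    with _profile_lock:
        yield

# usage in a test
with no_profiler():
    gc.collect()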

Member

Maybe this?

# in the test
from distributed import profile

with profile.lock:
    gc.collect()

# profile.py
import threading

lock = threading.Lock()

def _watch(...):
    ...
    # take the lock around each sample so tests can exclude the profiler
    with lock:
        frame = sys._current_frames()[thread_id]

Member

(it seems simpler to reference a lock directly rather than make a context manager)

Member

Yes, I was typing out the context manager first and realized while typing that this is not occult magic but simply a lock :)

@github-actions
Contributor

Unit Test Results

    15 files ±0,  15 suites ±0,  9h 4m 59s run time (+1h 59m 40s)
 2 801 tests ±0:     1 229 passed (-1 493),     90 skipped (+12),  1 302 failed (+1 301),    180 errored (+180)
21 079 runs (+309):  9 567 passed (-10 280), 1 002 skipped (+80),  9 242 failed (+9 241),  1 268 errored (+1 268)

For more details on these failures and errors, see this check.

Results for commit 97c862d. Comparison against base commit 36e9946.
