
Use dependency injection for proc memory mocks #6004

Closed
fjetter wants to merge 3 commits

Conversation

@fjetter (Member) commented on Mar 25, 2022

The mocks introduced in #5878 are unstable in tests like test_fail_to_pickle_spill
because the mocks only register once the test is entered. With the extremely small memory limit and the extremely small monitor-interval, the event loop appears to be stressed enough that the workers never come up in time and the test times out. Increasing the interval is one option. I encountered this during #5910.

The mocks themselves are not to blame here; rather, the problem is that we're patching an already-initialized instance after it has been started. IMO, the proper way would be to use dependency injection with a fake object instead of a monkeypatch.

This introduces a dependency injection pattern for the WorkerMemoryManager and defines a specific mock target, as already motivated in #5870 (comment).
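
As a rough illustration of what such a pattern could look like (the class and parameter names below, such as memory_manager_cls and FakeWorkerMemoryManager, are hypothetical sketches, not the actual code in this PR): a fake manager overrides only how process memory is measured, and the worker receives the manager type before it starts.

    import psutil

    class WorkerMemoryManager:
        """Production manager: reads the real process RSS."""

        def __init__(self, worker, *, memory_limit):
            self.worker = worker
            self.memory_limit = memory_limit

        def get_process_memory(self) -> int:
            return psutil.Process().memory_info().rss

    class FakeWorkerMemoryManager(WorkerMemoryManager):
        """Fake for tests: reports whatever memory value the test dictates."""

        fake_memory: int = 0

        def get_process_memory(self) -> int:
            return self.fake_memory

    class Worker:
        def __init__(self, *, memory_manager_cls=WorkerMemoryManager, memory_limit="auto"):
            # Dependency injection: the manager type is chosen before the worker
            # starts, so nothing has to be monkeypatched on a live instance.
            self.memory_manager = memory_manager_cls(self, memory_limit=memory_limit)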

@crusaderky (Collaborator) left a comment:

IMHO, the whole thing feels over-designed here.
The failures (for reference: https://github.com/dask/distributed/runs/5693287345?check_suite_focus=true) are both caused by:

  1. the test setting "memory_limit": "1 GB",
  2. the worker already being beyond 1 GB of RAM when it starts, since it runs in the main process,
  3. the test being about pausing,
  4. hence, the pause may kick in before the monkeypatch of the instance can happen.

The solution is very simple: just make memory_limit and all mocked measures safely larger than anything the whole test suite will ever consume, e.g. 10 GB.
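
A rough sketch of what the suggested fix might look like in a gen_cluster-based test (the decorator arguments and mocked values here are illustrative, not taken from the PR):

    from distributed.utils_test import gen_cluster

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        # Far above anything the main test process will realistically use,
        # so a pause cannot trigger before the test body runs.
        worker_kwargs={"memory_limit": "10 GB"},
    )
    async def test_fail_to_pickle_spill(c, s, a):
        # Any mocked memory readings should be scaled up accordingly,
        # e.g. report 9.5 GB instead of 0.95 GB.
        ...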

Comment on lines +346 to +347
or self.memory_limit is None
):
crusaderky (Collaborator) commented:

Suggested change:

    - or self.memory_limit is None
    - ):
    + ):
    +     assert self.memory_limit is not None
    +     assert self.memory_terminate_fraction is not False

fjetter (Member, Author) replied:

A few lines below we're calculating memory / self.memory_limit, i.e. if memory_limit is None we can skip memory_monitor / return early.

So why remove the self.memory_limit is None check from the if clause, and why add these asserts? If these asserts guard a concerning edge case, we should probably add a unit test for it instead.
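
For context, the guard under discussion looks roughly like this (a sketch based on the snippet above, not the exact source; get_process_memory() stands in for however the process RSS is actually obtained):

    def memory_monitor(self) -> None:
        if (
            self.memory_terminate_fraction is False
            or self.memory_limit is None  # the clause the suggestion removes
        ):
            return  # nothing to monitor; also avoids dividing by None below
        memory = self.get_process_memory()
        frac = memory / self.memory_limit
        ...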

crusaderky (Collaborator) replied:

It's not an edge case; the whole method will never be scheduled if memory_limit is disabled. See lines ~140 in __init__. There are already several tests for it.
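
Illustrating that point, an approximate sketch (attribute and argument names are approximations, not the exact source): the monitor callback is only registered in __init__ when a memory limit exists, so memory_monitor never runs with memory_limit disabled.

    from tornado.ioloop import PeriodicCallback

    class NannyMemoryManager:
        def __init__(self, nanny, *, memory_limit, monitor_interval_ms=100):
            self.memory_limit = memory_limit
            if self.memory_limit is not None:
                # memory_monitor is only ever scheduled when a limit is set;
                # the callback is registered here and started with the nanny.
                pc = PeriodicCallback(self.memory_monitor, monitor_interval_ms)
                nanny.periodic_callbacks["memory_monitor"] = pc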

@@ -302,32 +313,46 @@ def __init__(
dask.config.get("distributed.worker.memory.monitor-interval"),
default=None,
)
self.nanny = weakref.ref(nanny)
crusaderky (Collaborator) commented:

This is self._worker above; could you make it consistent?
Also, please add it to the attribute declarations above in both classes.
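
A hypothetical illustration of that request (names are only for illustration): declare the attribute alongside the others and use the same underscore-prefixed weakref style in both classes.

    import weakref

    class NannyMemoryManager:
        # Declared with the other class-level attribute annotations:
        _nanny: "weakref.ReferenceType"

        def __init__(self, nanny) -> None:
            self._nanny = weakref.ref(nanny)  # matches self._worker in WorkerMemoryManager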

@fjetter (Member, Author) commented on Mar 28, 2022

IMHO, the whole thing feels over-designed here.

I understand this concern, but I still think we should go through with this. I don't think mocking something mid-flight is a stable way of testing. The fact that the pause may already kick in before the test even starts, regardless of how the settings are configured, is not stable and is harder to maintain. Even if setting the parameters properly avoids this issue, there is a lot of knowledge required to A) identify the situation and B) set the proper values. Instead of a monkeypatch, a fake object, i.e. a fully functional object with a few simplifications (e.g. for the process memory), is typically a more robust design, and this is what I am proposing here.

Providing the type of the WorkerMemoryManager to the worker appears to be the simplest way to achieve this, since the memory-related code was factored out. If there is another way to achieve this, I'm open to exploring it, but I really would like to have our patch/fake in place before the server starts.
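
For example, with the fake class from the earlier sketch, a test could look roughly like this (again hypothetical; it assumes the Worker grows a memory_manager_cls parameter as proposed):

    from distributed.utils_test import gen_cluster

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        worker_kwargs={"memory_manager_cls": FakeWorkerMemoryManager},
    )
    async def test_pause_on_high_memory(c, s, a):
        # The fake is in place before the worker starts, so there is no race
        # with the memory monitor; the test fully controls the reported memory.
        a.memory_manager.fake_memory = 2**40  # pretend we are at 1 TiB
        ...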

@github-actions (Contributor) commented:

Unit Test Results

11 files (-1), 11 suites (-1), 6h 3m 0s duration (-33m 14s)
2,670 tests (+1): 2,580 passed (-7), 81 skipped (-1), 8 failed (+8), 1 errored (+1)
14,734 runs (-1,184): 13,889 passed (-1,166), 810 skipped (-53), 34 failed (+34), 1 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 0ec2a91. Comparison against base commit 6dd928b.

@crusaderky (Collaborator) commented:

Even if setting the parameters properly avoids this issue, there is a lot of knowledge required to A) identify the situation and B) to set the proper values.

The log of the failed tests (https://github.com/dask/distributed/runs/5693287345?check_suite_focus=true) states:

2022-03-25 14:54:22,158 - distributed.worker_memory - WARNING - Worker is at 102% memory usage. Pausing worker.  Process memory: 0.95 GiB -- Worker memory limit: 0.93 GiB
2022-03-25 14:54:51,752 - distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Cluster creation timeout; retrying

It looks very explicit to me.

regardless of how the settings are configured is not stable and harder to maintain.

I think it's a safe statement that the main process of our test suite will never reach 10 GB of permanent RAM. You can use 100 GB or 1 TB if you want to remove all ambiguity.

The fact that the pause may already kick in before the test even starts, regardless of how the settings are configured

The pause may kick in on any test decorated by @gen_cluster if enough memory is leaked by previous tests.

a fully functional object with a few simplifications (e.g. the process memory) is typically a more robust design and this is what I am proposing here.

I would agree for the actual production code. I disagree for unit tests; you're just making them harder to maintain.

Providing the type of WorkerMemoryManager to the worker appears to be the simplest way to achieve this since the factoring out of the memory related code. If there is another method to achieve this, I'm open to exploring but I really would like to have our patch/fake in place before the server starts.

You're adding complexity to Worker for the sole purpose of unit testing. To me, this is an antipattern.

@fjetter (Member, Author) commented on Apr 1, 2022

Closing in favour of #6055

@fjetter fjetter closed this Apr 1, 2022