
Use dependency injection for proc memory mocks #6004

Closed
fjetter wants to merge 3 commits

Conversation

@fjetter (Member) commented on Mar 25, 2022

The mocks introduced in #5878 are unstable in tests like test_fail_to_pickle_spill
because the mocks only register once the test is entered. With the extremely small memory limit and the extremely small monitor-interval, the event loop appears to be stressed enough that the workers never come up in time and the test times out. Increasing the interval is one option. I encountered this during #5910.

The mocks themselves are not to blame here; rather, the problem is that we're patching an already-initialized instance after it has been started. IMO, the proper way would be to use dependency injection with a fake object instead of a monkeypatch.

This introduces a dependency injection pattern for the WorkerMemoryManager and defines a specific mock target, as already motivated in #5870 (comment).
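
As a rough illustration of what such a pattern could look like (the class and parameter names below, such as memory_manager_cls and FakeWorkerMemoryManager, are hypothetical sketches, not the actual code in this PR): a fake manager overrides only how process memory is measured, and the worker receives the manager type before it starts.

    import psutil

    class WorkerMemoryManager:
        """Production manager: reads the real process RSS."""

        def __init__(self, worker, *, memory_limit):
            self.worker = worker
            self.memory_limit = memory_limit

        def get_process_memory(self) -> int:
            return psutil.Process().memory_info().rss

    class FakeWorkerMemoryManager(WorkerMemoryManager):
        """Fake for tests: reports whatever memory value the test dictates."""

        fake_memory: int = 0

        def get_process_memory(self) -> int:
            return self.fake_memory

    class Worker:
        def __init__(self, *, memory_manager_cls=WorkerMemoryManager, memory_limit="auto"):
            # Dependency injection: the manager type is chosen before the worker
            # starts, so nothing has to be monkeypatched on a live instance.
            self.memory_manager = memory_manager_cls(self, memory_limit=memory_limit)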

@crusaderky (Collaborator) left a comment:

IMHO, the whole thing feels over-designed here.
The failures (for reference: https://github.com/dask/distributed/runs/5693287345?check_suite_focus=true) are both caused by:

  1. the test setting "memory_limit": "1 GB",
  2. the worker already being beyond 1 GB of RAM when it starts, since it runs in the main process,
  3. the test being about pausing,
  4. hence, the pause may kick in before the monkeypatch of the instance can happen.

The solution is very simple: just make memory_limit and all mocked measures safely larger than anything the whole test suite will ever consume, e.g. 10 GB.
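
A rough sketch of what the suggested fix might look like in a gen_cluster-based test (the decorator arguments and mocked values here are illustrative, not taken from the PR):

    from distributed.utils_test import gen_cluster

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        # Far above anything the main test process will realistically use,
        # so a pause cannot trigger before the test body runs.
        worker_kwargs={"memory_limit": "10 GB"},
    )
    async def test_fail_to_pickle_spill(c, s, a):
        # Any mocked memory readings should be scaled up accordingly,
        # e.g. report 9.5 GB instead of 0.95 GB.
        ...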

Comment on lines +346 to +347
or self.memory_limit is None
):
crusaderky (Collaborator) commented:

Suggested change:

    - or self.memory_limit is None
    - ):
    + ):
    +     assert self.memory_limit is not None
    +     assert self.memory_terminate_fraction is not False

fjetter (Member, Author) replied:

A few lines below we're calculating memory / self.memory_limit, i.e. if memory_limit is None we can skip memory_monitor / return early.

So why remove the self.memory_limit is None check from the if clause, and why add these asserts? If these asserts guard a concerning edge case, we should probably add a unit test for it instead.
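
For context, the guard under discussion looks roughly like this (a sketch based on the snippet above, not the exact source; get_process_memory() stands in for however the process RSS is actually obtained):

    def memory_monitor(self) -> None:
        if (
            self.memory_terminate_fraction is False
            or self.memory_limit is None  # the clause the suggestion removes
        ):
            return  # nothing to monitor; also avoids dividing by None below
        memory = self.get_process_memory()
        frac = memory / self.memory_limit
        ...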

crusaderky (Collaborator) replied:

It's not an edge case; the whole method will never be scheduled if memory_limit is disabled. See lines ~140 in __init__. There are already several tests for it.
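
Illustrating that point, an approximate sketch (attribute and argument names are approximations, not the exact source): the monitor callback is only registered in __init__ when a memory limit exists, so memory_monitor never runs with memory_limit disabled.

    from tornado.ioloop import PeriodicCallback

    class NannyMemoryManager:
        def __init__(self, nanny, *, memory_limit, monitor_interval_ms=100):
            self.memory_limit = memory_limit
            if self.memory_limit is not None:
                # memory_monitor is only ever scheduled when a limit is set;
                # the callback is registered here and started with the nanny.
                pc = PeriodicCallback(self.memory_monitor, monitor_interval_ms)
                nanny.periodic_callbacks["memory_monitor"] = pc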

@@ -302,32 +313,46 @@ def __init__(
dask.config.get("distributed.worker.memory.monitor-interval"),
default=None,
)
self.nanny = weakref.ref(nanny)
crusaderky (Collaborator) commented:

This is self._worker above; could you make it consistent?
Also, please add it to the attribute declarations above in both classes.
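
A hypothetical illustration of that request (names are only for illustration): declare the attribute alongside the others and use the same underscore-prefixed weakref style in both classes.

    import weakref

    class NannyMemoryManager:
        # Declared with the other class-level attribute annotations:
        _nanny: "weakref.ReferenceType"

        def __init__(self, nanny) -> None:
            self._nanny = weakref.ref(nanny)  # matches self._worker in WorkerMemoryManager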

@fjetter (Member, Author) commented on Mar 28, 2022

IMHO, the whole thing feels over-designed here.

I understand this concern, but I still think we should go through with this. I don't think mocking something mid-flight is a stable way of testing. The fact that the pause may already kick in before the test even starts, regardless of how the settings are configured, is not stable and is harder to maintain. Even if setting the parameters properly avoids this issue, there is a lot of knowledge required to A) identify the situation and B) set the proper values. Instead of a monkeypatch, a fake object, i.e. a fully functional object with a few simplifications (e.g. for the process memory), is typically a more robust design, and this is what I am proposing here.

Providing the type of the WorkerMemoryManager to the worker appears to be the simplest way to achieve this, since the memory-related code was factored out. If there is another way to achieve this, I'm open to exploring it, but I really would like to have our patch/fake in place before the server starts.
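
For example, with the fake class from the earlier sketch, a test could look roughly like this (again hypothetical; it assumes the Worker grows a memory_manager_cls parameter as proposed):

    from distributed.utils_test import gen_cluster

    @gen_cluster(
        client=True,
        nthreads=[("127.0.0.1", 1)],
        worker_kwargs={"memory_manager_cls": FakeWorkerMemoryManager},
    )
    async def test_pause_on_high_memory(c, s, a):
        # The fake is in place before the worker starts, so there is no race
        # with the memory monitor; the test fully controls the reported memory.
        a.memory_manager.fake_memory = 2**40  # pretend we are at 1 TiB
        ...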

@github-actions (Contributor) commented:

Unit Test Results

11 files (-1), 11 suites (-1), 6h 3m 0s duration (-33m 14s)
2,670 tests (+1): 2,580 passed (-7), 81 skipped (-1), 8 failed (+8), 1 errored (+1)
14,734 runs (-1,184): 13,889 passed (-1,166), 810 skipped (-53), 34 failed (+34), 1 errored (+1)

For more details on these failures and errors, see this check.

Results for commit 0ec2a91. Comparison against base commit 6dd928b.

@crusaderky (Collaborator) commented:

Even if setting the parameters properly avoids this issue, there is a lot of knowledge required to A) identify the situation and B) to set the proper values.

The log of the failed tests (https://github.com/dask/distributed/runs/5693287345?check_suite_focus=true) states:

2022-03-25 14:54:22,158 - distributed.worker_memory - WARNING - Worker is at 102% memory usage. Pausing worker.  Process memory: 0.95 GiB -- Worker memory limit: 0.93 GiB
2022-03-25 14:54:51,752 - distributed.utils_test - ERROR - Failed to start gen_cluster: TimeoutError: Cluster creation timeout; retrying

It looks very explicit to me.

regardless of how the settings are configured is not stable and harder to maintain.

I think it's a safe statement that the main process of our test suite will never reach 10 GB of permanent RAM. You can use 100 GB or 1 TB if you want to remove all ambiguity.

The fact that the pause may already kick in before the test even starts, regardless of how the settings are configured

The pause may kick in on any test decorated by @gen_cluster if enough memory is leaked by previous tests.

a fully functional object with a few simplifications (e.g. the process memory) is typically a more robust design and this is what I am proposing here.

I would agree for the actual production code. I disagree for unit tests; you're just making them harder to maintain.

Providing the type of WorkerMemoryManager to the worker appears to be the simplest way to achieve this since the factoring out of the memory related code. If there is another method to achieve this, I'm open to exploring but I really would like to have our patch/fake in place before the server starts.

You're adding complexity to Worker for the sole purpose of unit testing. To me, this is an antipattern.

@fjetter (Member, Author) commented on Apr 1, 2022

Closing in favour of #6055

@fjetter fjetter closed this Apr 1, 2022