FIX-#7346: Handle execution on Dask workers to avoid creating conflicting Clients #7347
base: main
Conversation
FIX-#7346: Handle execution on Dask workers to avoid creating conflicting Clients. Signed-off-by: Michael Akerman <[email protected]>
Thanks @data-makerman, great patch! Would you be able to add a new test with the breaking code from your issue #7346 to the end of modin/tests/core/storage_formats/pandas/test_internals.py? I tested this and it works in Ray, so we shouldn't need to exclude any engine from the test.
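As a purely illustrative sketch of what such a test might look like (the test name, DataFrame contents, and the pd.Series-inside-apply pattern are assumptions drawn from this thread, not the test that was actually added):

import modin.pandas as pd


def test_apply_does_not_create_conflicting_dask_client():
    # Building a Modin object inside the applied function is the pattern that
    # previously triggered Modin initialization on a Dask worker and the
    # attempt to create a second, conflicting Client.
    df = pd.DataFrame({"a": list(range(8))})
    result = df["a"].apply(lambda x: pd.Series([x, x + 1]).sum())
    # Materialize the result so the remote work actually executes.
    assert len(result) == 8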
@@ -30,6 +30,17 @@
 def initialize_dask():
     """Initialize Dask environment."""
     from distributed.client import default_client
+    from distributed.worker import get_worker
+
+    try:
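The quoted hunk is cut off above. As a rough sketch of the overall idea (simplified and assumed, not the verbatim patch), the guard could look something like this:

from distributed.client import default_client
from distributed.worker import get_worker


def initialize_dask():
    """Initialize Dask environment (sketch only)."""
    try:
        # get_worker() raises ValueError when we are not inside a worker.
        get_worker()
        # Running on a Dask worker: reuse the client that owns this worker
        # instead of constructing a new (conflicting) one.
        from distributed import get_client

        return get_client()
    except ValueError:
        pass
    try:
        # Main process: reuse an existing client if one is already running.
        return default_client()
    except ValueError:
        # No client yet: this is where Modin would normally start one.
        from distributed import Client

        return Client()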
I wonder why initialize_dask is called in a worker. Could you verify whether this is the case?
I definitely don't fully understand it. I can verify that this PR stops the issue and prevents multiple instances of a Dask cluster from starting under any of the operations where I was observing the behavior. Since it occurs in an apply context on a remote worker, it was beyond my available time (or existing technical skill) to debug exactly what was happening on the Dask worker. It seems possible that there's some other root cause leading to a call to initialize_dask.

I can verify by inference that initialize_dask is being called inside a worker, because it appears to be the only place in Modin 0.31 where the distributed.Client class is ever instantiated, and I can observe in stdout that multiple Clients are being created as daemonic processes on Dask during the apply operation demonstrated in #7346, but only when working with Modin (not with the equivalent operation in Pandas).

I can hazard a partial guess as to what might be happening, which would require further study, based on some very confusing behavior I observed: sometimes, while attempting to use client.submit(lambda x: x.apply(foo), pandas_df) directly on a Pandas dataframe (not Modin), I saw the same error, but only if Modin had been imported using "import modin.pandas as pd". It made me wonder whether Dask was calling a pd function while pd had been masked in the worker's namespace by Modin.

I think I can probably create a working example of that if I have enough time later, which might help find the root cause.
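For reference, a small script along the lines of what is described above might look like this (foo, the DataFrame contents, and the exact submit call are illustrative assumptions, not the original reproduction):

import pandas
from distributed import Client

import modin.pandas as pd  # noqa: F401  imported only for its side effects here


def foo(x):
    return x * 2


if __name__ == "__main__":
    client = Client()  # local Dask cluster
    pandas_df = pandas.DataFrame({"a": range(10)})
    # The comment above reports that a call like this sometimes raised the
    # same conflicting-Client error, but only when Modin had been imported.
    result = client.submit(lambda df: df.apply(foo), pandas_df).result()
    print(result)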
I think I have a decent understanding of what is going on, but there's still something weird happening that I can't explain.
Modin is never fully initialized on the workers: Modin's initialization code is never run there, so unless a worker runs a task that requires subscribing to an engine, it will never have this problem. Link:
modin/modin/core/execution/dispatching/factories/dispatcher.py
Lines 99 to 118 in f5f9ae9
class FactoryDispatcher(object):
    """
    Class that routes IO-work to the factories.

    This class is responsible for keeping selected factory up-to-date and dispatching
    calls of IO-functions to its actual execution-specific implementations.
    """

    __factory: factories.BaseFactory = None

    @classmethod
    def get_factory(cls) -> factories.BaseFactory:
        """Get current factory."""
        if cls.__factory is None:
            from modin.pandas import _update_engine

            Engine.subscribe(_update_engine)
            Engine.subscribe(cls._update_factory)
            StorageFormat.subscribe(cls._update_factory)
        return cls.__factory
The Series constructor ends up in this path via from_pandas. So when you call pd.Series(...) from within an apply function, cloudpickle will serialize all of the dependencies of that call and then unpack it within a worker. This could potentially explain why @data-makerman is seeing this happening with pandas after using Modin (but that still seems like a bug to me and is worth investigating more).
Now, what I don't understand is that in ipython and jupyter, instead of hitting this problem we have something else entirely, which is that Modin starts a Ray cluster inside a Dask cluster.

I don't really even understand why it's different, because if I use %run issue-7346.py within ipython I get the same issue as before with Dask.

So I think this patch is absolutely correct in detecting that we are in a worker and avoiding initializing a second Dask cluster, but I will follow on with a patch for this weird ipython issue.
Also, the reason this doesn't happen with Ray is because Ray uses a global variable that all workers share to ensure that one and only one Ray cluster is initialized in the same client/worker. We bypass initialization if Ray is already initialized here:

if not ray.is_initialized() or override_is_cluster:

Ray will always be initialized from within a worker.
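For comparison, the Ray-side behavior described here amounts to something like the following (a simplified, assumed shape, not Modin's actual initialization code):

import ray


def initialize_ray_like(override_is_cluster=False):
    # On a Ray worker the runtime is already initialized, so this check
    # makes initialization a no-op there and no second cluster is started.
    if not ray.is_initialized() or override_is_cluster:
        ray.init()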
> Now, what I don't understand is that in ipython and jupyter, instead of hitting this problem we have something else entirely, which is that Modin starts a Ray cluster inside a Dask cluster.
This probably has to do with the fact that workers know nothing about Modin configs set in the main process.
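A concrete way to see what is meant (assumed usage of modin.config; whether and how these values reach the workers is exactly the open question):

import modin.config as cfg

# Setting the engine here only affects the process that executes this line.
# A Dask worker that lazily initializes Modin does not see this value and can
# fall back to the default engine (Ray, if it is installed) instead of Dask.
cfg.Engine.put("Dask")
print(cfg.Engine.get())  # "Dask" in this process, but not necessarily on a worker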
@YarShev I think that is correct, but do you know why it would be different in ipython vs the python script?
I would guess there is a different strategy for importing modules in ipython.
Signed-off-by: Michael Akerman <[email protected]>
a360f93 to 3d17773
Added, although I'm not 100% sure I matched the desired test format and standard, so please let me know if any changes are needed! The new test is passing on my branch. That said, I don't understand how to test that the test properly fails. If I revert my fix commit and run it as-is with the …
@data-makerman you can use …
Co-authored-by: Iaroslav Igoshev <[email protected]>
Signed-off-by: Michael Akerman <[email protected]>
Signed-off-by: Michael Akerman <[email protected]>
Signed-off-by: Michael Akerman <[email protected]>
…Dask-workers-to-avoid-creating-conflicting-Clients
At this point the pipeline is failing on tests which don't seem to have anything to do with the changes in this PR, and I'm frankly a bit at a loss as to what they might mean. I have merged the current state of the main branch into this branch. The failures are variations on:
What do these changes do?
Check whether execution is happening on a Dask worker node and, if so, avoid creating conflicting clients; worker nodes are not allowed to create additional Clients.
- flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- git commit -s
- pytest modin/tests/core/storage_formats/pandas/test_internals.py is passing
- docs/development/architecture.rst is up-to-date