[client][placement groups] Client placement group hooks, attempt #3 #15382
Conversation
LGTM! A few minor comments. Thanks so much for doing this!!!!
@DmitriGekhtman is this ready to merge?
@ijrsvt yep!
Left some comments. Two main things:
- Can we try to keep the diffs smaller? It makes it much easier to review.
- Generally confused about the client_mode_wrap thing and whether it's needed.
This is useful for functions where the goal isn't to delegate
module calls to the ray client equivalent, but to instead implement
ray client features that can be executed by tasks on the server side.
"""
I'm having a little trouble parsing this doc string. So is this meant for ray client's internal functionality that needs to run on the cluster?
Correct. I think the doc string is supposed to distinguish this wrapper from client_mode_hook, which delegates public Ray APIs to Ray client's RayApiStub. I'll update the doc string.
Let me know if the new doc string is clearer.
(1) Will try to restructure the tests so that a fixture is used instead of a context manager, to avoid indentation. (2) I think it's necessary to be able to use all of the placement group public APIs.
I don't immediately see a clean way of replacing the context managers in these particular tests with fixtures. The issue is that the placement group tests currently do a bunch of custom cluster setup logic in the body of the test (not in a fixture) and then call ray.init. Maybe it would be possible to do client setup in the yield line of a fixture? Not sure if that's a good idea. In other contexts, a fixture would definitely be the way to go.
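For reference, a minimal sketch of the kind of fixture being discussed here, with hypothetical fixture and test names (not code from this PR):

import pytest
import ray

@pytest.fixture
def ray_session():
    # Any custom cluster setup would have to happen here, before ray.init.
    ray.init(num_cpus=4)
    yield
    ray.shutdown()

def test_simple_task(ray_session):
    @ray.remote
    def f():
        return 1

    assert ray.get(f.remote()) == 1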
The current diffs in the placement group tests are mostly whitespace/indentation changes from the added context managers. The following tests are still not tested with Ray client in this PR: test_ready_warning_suppressed, test_automatic_cleanup_job, test_create_placement_group_after_gcs_server_restart.
@DmitriGekhtman I think leaving them as is is fine, and reviewing with 'ignore whitespaces' is sufficient here. I'm more concerned that refactoring to fixtures may introduce actual changes to the tests.
@wraps(func)
def wrapper(*args, **kwargs):
    if client_mode_should_convert():
        f = ray.remote(num_cpus=0)(func)
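For context, a minimal self-contained sketch of how a wrapper like this could look; the import path and the return/fallback details are assumptions for illustration, not the literal diff:

from functools import wraps

import ray
from ray._private.client_mode_hook import client_mode_should_convert  # assumed import path

def client_mode_wrap(func):
    """Run func as a zero-CPU Ray task when called through a Ray client connection."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if client_mode_should_convert():
            # Ship the call to the cluster as a lightweight task; a server-side
            # worker then runs the real (non-client) implementation.
            f = ray.remote(num_cpus=0)(func)
            return ray.get(f.remote(*args, **kwargs))
        # Not in client mode: call the function directly in this process.
        return func(*args, **kwargs)
    return wrapper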
I'm still concerned about why we need to submit these tasks to the cluster; it just seems kind of unnecessary and a pretty decent chunk of overhead/complication. I understand that we can't run these functions on the client, but the client server is a driver in the cluster, so it should be fine, right?
The goal of this PR is to make public placement group APIs available on the client.
So a user should be able to do the following locally:

import ray
from ray.util.placement_group import placement_group

ray.util.connect(...)
pg = placement_group(...)
...

Currently, that would throw an error here.
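For illustration, a fuller version of that flow with this PR in place might look like the following; the address, bundle shape, and strategy are placeholder values:

import ray
from ray.util.placement_group import placement_group, remove_placement_group

# Connect to a running cluster through Ray client (address is a placeholder).
ray.util.connect("127.0.0.1:10001")

# Reserve a single 1-CPU bundle and block until the reservation is fulfilled.
pg = placement_group([{"CPU": 1}], strategy="PACK")
ray.get(pg.ready())

# ... schedule tasks/actors into the placement group here ...

remove_placement_group(pg)
ray.util.disconnect()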
@wuisawesome I agree that this is slower than having a gRPC message for each of these functions, but I don't think that the minor performance difference is worth the additional complexity added to the client + server (namely, having to create and manage ClientPlacementGroups).
@wuisawesome Is this ok to resolve for now?
Does this seem like an abuse of Client tasks? Is the iffiness here that tasks should be used for heavy computations (rather than for simple RPCs, which is pretty much what's going on here)?
I guess there's a reason num_cpus=1 is the default for a task, rather than num_cpus=0? What's the correct use case (if any) for a num_cpus=0 task?
"tasks should be used for heavy computations"

There is definitely more overhead with submitting a Ray task (as opposed to an RPC call). That being said, I think the extra overhead isn't that big of a deal, given that if people are trying to create many, many placement groups, the real bottleneck will in all likelihood be waiting for the cluster to scale to the appropriate size to support all the placement groups.

"What's the correct use case (if any) for a num_cpus=0 task?"

For computationally light tasks. The functions being wrapped are either getter/setter methods (tiny overhead) or are waiting for PG creation (which can be thought of as a 'node-startup'-bound operation).
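As a small illustration of a num_cpus=0 task (the function name and body are made up for the example):

import ray

ray.init()

# A num_cpus=0 task does not reserve a CPU slot, so it is cheap to schedule.
# That makes it a reasonable fit for thin wrappers that mostly forward a call
# or wait on something, rather than doing heavy computation themselves.
@ray.remote(num_cpus=0)
def fetch_setting(key):
    # Placeholder body; a real wrapper would call into cluster-side state.
    return {"max_pending_pgs": 100}.get(key)

print(ray.get(fetch_setting.remote("max_pending_pgs")))  # -> 100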
Yeah, this is fair, and in that context I think it makes sense to hold off on making it a fixture for now. We probably should've done the ray.init calls in fixtures to begin with :(
Why are these changes needed?
Allows usage of the new Placement Group API in client mode.
Basically the same as this PR: #14060
The strategy is to use num_cpus=0 remote functions for placement group operations, which invoke ray.worker.global_worker.core_worker (see the rough sketch below).
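As a rough sketch of that pattern (the decorator name follows the diff above; the import path, the core_worker method, and its arguments are assumptions for illustration):

import ray
from ray._private.client_mode_hook import client_mode_wrap  # assumed import path

@client_mode_wrap
def remove_placement_group(placement_group):
    """Remove a placement group; runs as a num_cpus=0 task when in client mode."""
    worker = ray.worker.global_worker
    worker.check_connected()
    # Illustrative call into the core worker; the real method name/signature
    # may differ.
    worker.core_worker.remove_placement_group(placement_group.id)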
Other changes:
Related issue number
Closes #13147
Checks
- I've run scripts/format.sh to lint the changes in this PR.