Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[client][placement groups] Client placement group hooks, attempt #3 #15382

Merged
merged 22 commits into from
Apr 23, 2021

Conversation

DmitriGekhtman
Copy link
Contributor

Why are these changes needed?

Allows usage of the new Placement Group API in client mode.
Basically the same as this PR:
#14060

The strategy is to use cpu=0 remote functions for placement group operations which invoke ray.worker.global_worker.core_worker.

Other changes:

  • Ray Client uses json to serialize and deserialize options, so PlacementGroup objects are converted to and from json-serializable format when passed into and retrieved from options.
  • Added a context manager for Ray client tests and used it in most of the existing placement group tests.
  • Modified conversion of normal actors/remote functions to client actors/remote functions to work when client session is disconnected and reconnected.
  • Corrected behavior of client mode ray.wait with timeout 0.

Related issue number

Closes #13147

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ericl ericl removed their assignment Apr 19, 2021
Copy link
Contributor

@ijrsvt ijrsvt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! A few minor comments. Thanks so much for doing this!!!!

@ijrsvt
Copy link
Contributor

ijrsvt commented Apr 20, 2021

@DmitriGekhtman is this ready to merge?

@DmitriGekhtman
Copy link
Contributor Author

@ijrsvt yep!

python/ray/util/placement_group.py Outdated Show resolved Hide resolved
Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments. 2 main things

  1. Can we try to keep the diffs smaller? It makes it much easier to review
  2. Generally confused about the client_mode_wrap thing and if it's needed.

This is useful for functions where the goal isn't to delegate
module calls to the ray client equivalent, but to instead implement
ray client features that can be executed by tasks on the server side.
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a little trouble parsing this doc string. So is this meant for ray client's internal functionality that needs to run on the cluster?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. I think the doc string is supposed to distinguish this wrapper from client_mode_hook which delegates public ray APIs to Ray client's RayApiStub

I'll update the doc string.

Copy link
Contributor Author

@DmitriGekhtman DmitriGekhtman Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know if the new doc string is clearer.

python/ray/util/client/ray_client_helpers.py Show resolved Hide resolved
python/ray/util/client/common.py Outdated Show resolved Hide resolved
python/ray/util/client/server/server.py Outdated Show resolved Hide resolved
python/ray/util/placement_group.py Show resolved Hide resolved
python/ray/util/placement_group.py Show resolved Hide resolved
@DmitriGekhtman
Copy link
Contributor Author

Left some comments. 2 main things

  1. Can we try to keep the diffs smaller? It makes it much easier to review
  2. Generally confused about the client_mode_wrap thing and if it's needed.

(1.) Will try to restructure the tests so that a fixture is used instead of a context manager, to avoid indentation.
If for some reason that doesn't work, it's possible to view diffs with whitespace ignored:
dear-github/dear-github#91 (comment)

(2.) I think it's necessary to be able to use all of the placement group public APIs.
An alternative is the strategy of PR #13310 which is to make a new ClientPlacementGroup class.

@DmitriGekhtman
Copy link
Contributor Author

I don't immediately see a clean way of replacing the context managers in these particular tests with fixtures.
(Open to concrete suggestions, though.)

The issue is that the placement group tests currently do a bunch of custom cluster setup logic in the body of the test (not in a fixture) and then call ray.init.
A client connection has to be established after all the custom setup logic -- not sure if it makes sense right now to restructure each test so that a client connection can be established in a fixture executed before the body of the test.

Maybe it would be possible to do client setup in the yield line of a fixture? Not sure if that's a good idea.

In other contexts, a fixture would definitely be the way to go.

@DmitriGekhtman
Copy link
Contributor Author

The current diffs in test_placement_group.py are mostly logic being indented into a context manager.
There's a unit test of PlacementGroup.to_dict and PlacementGroup.from_dict added to the end.

The following tests are still not tested with Ray client in this PR:

test_ready_warning_suppressed
[Uses an internal API]

test_automatic_cleanup_job
test_automatic_cleanup_detached_actors
test_detached_placement_group
test_named_placement_group
[Test behavior of normal ray drivers]

test_create_placement_group_after_gcs_server_restart
test_create_actor_with_placement_group_after_gcs_server_restart
test_placement_group_wait_api
[Just inconvenient for me to test locally because they're flaky on MacOS.]

@ijrsvt
Copy link
Contributor

ijrsvt commented Apr 21, 2021

@DmitriGekhtman I think leaving them as is is fine, and reviewing with 'ignore whitepsaces' is sufficient here. I'm more concerned that refactoring to fixtures may introduce actual changes to the tests.

@wraps(func)
def wrapper(*args, **kwargs):
if client_mode_should_convert():
f = ray.remote(num_cpus=0)(func)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still concerned about why we need to submit these tasks to the cluster, it just seems kinda unnecessary and pretty decent chunk of overhead/complication. I understand that we can't run these functions on the client, but the client server is a driver in the cluster so it should be fine right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal of this PR is to make public placement group APIs available on the client.

Copy link
Contributor Author

@DmitriGekhtman DmitriGekhtman Apr 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So a user should be able to do the following locally

import ray
from ray.util.placement_group import placement_group
ray.util.connect(...)
pg = placement_group(...)
...

Currently, that would throw an error here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wuisawesome I agree that this is slower than having a gRPC message for each of these functions, but I don't think that the minor performance difference is worth the additional complexity added to the client + server (namely having to create and manage ClientPlacementGroups).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wuisawesome Is this ok to resolve for now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this seem like an abuse of Client tasks? Is this iffiness of this that tasks should be used for heavy computations
(rather than for simple RPCs which is pretty much what's going on here) ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess there's a reason num_cpus=1 is the default for a task, rather than num_cpus=0? What's the correct use case (if any) for a num_cpus=0 task?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

tasks should be used for heavy computations

There is definitely more overhead with submitting a Ray task (as opposed to a RPC call). That being said I think the extra overhead isn't that big of a deal given that if people are trying to create many, many placement groups, the real bottleneck will all likelihood be in waiting for the cluster to scale to the appropriate size to support all the placement groups.

What's the correct use case (if any) for a num_cpus=0 task?

For computationally light tasks. The functions being wrapped are either getter/setter methods (tiny overhead) or are waiting for PG creation (which can be thought of as a 'node-startup'-bound operation).

@wuisawesome wuisawesome added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Apr 21, 2021
@wuisawesome
Copy link
Contributor

I'm more concerned that refactoring to fixtures may introduce actual changes to the tests.

Yeah this is fair and in that context, I think it makes sense to hold off on making it a fixture for now.

We probably should've done the ray.inits in fixtures to begin with :(

@DmitriGekhtman DmitriGekhtman added tests-ok The tagger certifies test failures are unrelated and assumes personal liability. and removed @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. labels Apr 21, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
tests-ok The tagger certifies test failures are unrelated and assumes personal liability.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support placement groups in Ray client
6 participants