[client][placement groups] Client placement group hooks, attempt #3 #15382
Conversation
LGTM! A few minor comments. Thanks so much for doing this!!!!
@DmitriGekhtman is this ready to merge?
@ijrsvt yep!
Left some comments. Two main things:
- Can we try to keep the diffs smaller? It makes it much easier to review.
- Generally confused about the client_mode_wrap thing and whether it's needed.
This is useful for functions where the goal isn't to delegate
module calls to the ray client equivalent, but to instead implement
ray client features that can be executed by tasks on the server side.
"""
I'm having a little trouble parsing this doc string. So is this meant for ray client's internal functionality that needs to run on the cluster?
Correct. I think the doc string is supposed to distinguish this wrapper from client_mode_hook, which delegates public Ray APIs to Ray client's RayApiStub. I'll update the doc string.
Let me know if the new doc string is clearer.
(1) Will try to restructure the tests so that a fixture is used instead of a context manager, to avoid indentation. (2) I think it's necessary to be able to use all of the placement group public APIs.
I don't immediately see a clean way of replacing the context managers in these particular tests with fixtures. The issue is that the placement group tests currently do a bunch of custom cluster setup logic in the body of the test (not in a fixture) and then call ray.init. Maybe it would be possible to do client setup in the yield line of a fixture? Not sure if that's a good idea. In other contexts, a fixture would definitely be the way to go.
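For reference, a minimal sketch of the kind of fixture being discussed here, with hypothetical fixture and test names (not code from this PR):

import pytest
import ray

@pytest.fixture
def ray_session():
    # Any custom cluster setup would have to happen here, before ray.init.
    ray.init(num_cpus=4)
    yield
    ray.shutdown()

def test_simple_task(ray_session):
    @ray.remote
    def f():
        return 1

    assert ray.get(f.remote()) == 1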
The current diffs in the placement group tests are mostly whitespace/indentation changes from the added context managers. The following tests are still not tested with Ray client in this PR: test_ready_warning_suppressed, test_automatic_cleanup_job, test_create_placement_group_after_gcs_server_restart.
@DmitriGekhtman I think leaving them as is is fine, and reviewing with 'ignore whitespaces' is sufficient here. I'm more concerned that refactoring to fixtures may introduce actual changes to the tests.
@wraps(func)
def wrapper(*args, **kwargs):
    if client_mode_should_convert():
        f = ray.remote(num_cpus=0)(func)
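For context, a minimal self-contained sketch of how a wrapper like this could look; the import path and the return/fallback details are assumptions for illustration, not the literal diff:

from functools import wraps

import ray
from ray._private.client_mode_hook import client_mode_should_convert  # assumed import path

def client_mode_wrap(func):
    """Run func as a zero-CPU Ray task when called through a Ray client connection."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        if client_mode_should_convert():
            # Ship the call to the cluster as a lightweight task; a server-side
            # worker then runs the real (non-client) implementation.
            f = ray.remote(num_cpus=0)(func)
            return ray.get(f.remote(*args, **kwargs))
        # Not in client mode: call the function directly in this process.
        return func(*args, **kwargs)
    return wrapper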
I'm still concerned about why we need to submit these tasks to the cluster; it just seems kind of unnecessary and a pretty decent chunk of overhead/complication. I understand that we can't run these functions on the client, but the client server is a driver in the cluster, so it should be fine, right?
The goal of this PR is to make public placement group APIs available on the client.
So a user should be able to do the following locally:

import ray
from ray.util.placement_group import placement_group

ray.util.connect(...)
pg = placement_group(...)
...

Currently, that would throw an error here.
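For illustration, a fuller version of that flow with this PR in place might look like the following; the address, bundle shape, and strategy are placeholder values:

import ray
from ray.util.placement_group import placement_group, remove_placement_group

# Connect to a running cluster through Ray client (address is a placeholder).
ray.util.connect("127.0.0.1:10001")

# Reserve a single 1-CPU bundle and block until the reservation is fulfilled.
pg = placement_group([{"CPU": 1}], strategy="PACK")
ray.get(pg.ready())

# ... schedule tasks/actors into the placement group here ...

remove_placement_group(pg)
ray.util.disconnect()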
@wuisawesome I agree that this is slower than having a gRPC message for each of these functions, but I don't think that the minor performance difference is worth the additional complexity added to the client + server (namely, having to create and manage ClientPlacementGroups).
@wuisawesome Is this ok to resolve for now?
Does this seem like an abuse of Client tasks? Is the iffiness here that tasks should be used for heavy computations (rather than for simple RPCs, which is pretty much what's going on here)?
I guess there's a reason num_cpus=1 is the default for a task, rather than num_cpus=0? What's the correct use case (if any) for a num_cpus=0 task?
"tasks should be used for heavy computations"

There is definitely more overhead with submitting a Ray task (as opposed to an RPC call). That being said, I think the extra overhead isn't that big of a deal, given that if people are trying to create many, many placement groups, the real bottleneck will in all likelihood be waiting for the cluster to scale to the appropriate size to support all the placement groups.

"What's the correct use case (if any) for a num_cpus=0 task?"

For computationally light tasks. The functions being wrapped are either getter/setter methods (tiny overhead) or are waiting for PG creation (which can be thought of as a 'node-startup'-bound operation).
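As a small illustration of a num_cpus=0 task (the function name and body are made up for the example):

import ray

ray.init()

# A num_cpus=0 task does not reserve a CPU slot, so it is cheap to schedule.
# That makes it a reasonable fit for thin wrappers that mostly forward a call
# or wait on something, rather than doing heavy computation themselves.
@ray.remote(num_cpus=0)
def fetch_setting(key):
    # Placeholder body; a real wrapper would call into cluster-side state.
    return {"max_pending_pgs": 100}.get(key)

print(ray.get(fetch_setting.remote("max_pending_pgs")))  # -> 100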
Yeah, this is fair, and in that context I think it makes sense to hold off on making it a fixture for now. We probably should've done the ray.init calls in fixtures to begin with :(
Why are these changes needed?
Allows usage of the new Placement Group API in client mode.
Basically the same as this PR: #14060
The strategy is to use num_cpus=0 remote functions for placement group operations, which invoke ray.worker.global_worker.core_worker (see the rough sketch below).
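As a rough sketch of that pattern (the decorator name follows the diff above; the import path, the core_worker method, and its arguments are assumptions for illustration):

import ray
from ray._private.client_mode_hook import client_mode_wrap  # assumed import path

@client_mode_wrap
def remove_placement_group(placement_group):
    """Remove a placement group; runs as a num_cpus=0 task when in client mode."""
    worker = ray.worker.global_worker
    worker.check_connected()
    # Illustrative call into the core worker; the real method name/signature
    # may differ.
    worker.core_worker.remove_placement_group(placement_group.id)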
Other changes:
Related issue number
Closes #13147
Checks
- I've run scripts/format.sh to lint the changes in this PR.