[Core] Enable task to return object as if it's returned by its parent #26774
Conversation
Hi @stephanie-wang, I ended up with this prototype. With it, I can forward the returned result successfully, and that object is accessible later in Python when run on a single-node cluster. But when I run it on a multi-node cluster, the object cannot be retrieved in Python; the call hangs indefinitely. I'm not sure which part I missed. Any help would be appreciated!
Do you have a test case showcasing what you are doing here?
Hi @jjyao, here is an example:

import ray

@ray.remote
class a:
    def b(self):
        return 1

@ray.remote
def c():
    h = ray.get_actor("actor_a")
    return h.b.options(forward_to_parent=1).remote()

if __name__ == "__main__":
    h = a.options(name="actor_a").remote()
    ref = ray.get(c.remote())

Normally we can call the b method directly. In our case, we want the value returned by b to be fault tolerant. However, the actor of class a cannot be restarted in our case. This PR is to implement assigning the ownership of the object returned by b so that it behaves as if it were returned by the calling task c and can be recovered through c's lineage.
@kira-lin Is it possible to solve it with ray-project/enhancements#10?
It seems that could solve our problem. But it'll need an external HA storage, if I understand correctly? How is it different from saving our Spark results to HDFS and using Ray to read them?
Yes, it needs external HA storage. I don't think there's much difference if you only consider the results and availability. However, Ray makes it transparent to Ray users. You don't need to design and maintain key / object ref / HDFS path mappings.
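For comparison, a hypothetical sketch of the manual bookkeeping in question, where the application itself maintains the key-to-HDFS-path mapping; the host, paths, and helper below are placeholders, not RayDP code:

import ray
import pyarrow.fs as pafs
import pyarrow.parquet as pq

# Hypothetical manual approach: the application keeps its own
# key -> HDFS path mapping and re-reads from HDFS on object loss.
hdfs = pafs.HadoopFileSystem(host="namenode", port=8020)  # placeholder address
key_to_path = {"train_data": "/checkpoints/train_data.parquet"}  # maintained by hand

def recover(key):
    # Recovery means re-reading from the external store and re-putting the
    # data into the Ray object store; Object HA would make this transparent.
    table = pq.read_table(key_to_path[key], filesystem=hdfs)
    return ray.put(table)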
I see. This PR aims to apply Ray and Spark's lineage-based recovery to handle object loss. I think there are no conflicts between this PR and the Object HA proposal. If our users are OK with checkpointing all the training data, then going with Object HA is a very good choice, while some users may want better performance and choose lineage-based recovery.
@kira-lin do I understand correctly that this only applies to stateless actors? Also, it would be nice to add a unit test/integration test to show the problem it solves.
@scv119 Yes, as long as the actor's task can be executed by another actor and yield the same output. I'll add a unit test soon.
Hi @kira-lin, sorry for the late reply and thanks for your response. Now I understand your use case and motivations. My current concern is how generic the solution is: it seems you would only need it when the actor is not restartable and the task can be run by another equivalent actor that generates the same result. I'm not sure who else will need it besides your use case. I may not know RayDP very well, but is it possible to make the stateless actor restartable (e.g. by creating a new Spark executor)?
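For reference, a minimal sketch of the alternative being suggested here, assuming the Spark executor could be re-created and re-registered inside the actor constructor; the Executor class is hypothetical and not part of RayDP or this PR:

import ray

# Hypothetical: if a new Spark executor could be created and re-registered
# in __init__, Ray's built-in actor restart and task retry options would
# cover object reconstruction without needing forward_to_parent.
@ray.remote(max_restarts=-1, max_task_retries=-1)
class Executor:
    def __init__(self):
        # Re-create / re-register the underlying Spark executor here.
        pass

    def b(self):
        return 1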
I feel it might be worth a REP (https://github.com/ray-project/enhancements) to flesh out the problem statement, proposal, semantics of the proposed APIs, design, etc. WDYT? @stephanie-wang @scv119
if (parent_num_returns < 0) {
  return ObjectID::FromIndex(TaskId(), return_index + 1);
} else {
  return ObjectID::FromIndex(ParentTaskId(), parent_num_returns + return_index + 1);
}
This means we can only have one forward_to_parent task call inside one task, right?
Also, does this API align with the ownership transfer thing we might want to do in the future? @stephanie-wang
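A hypothetical back-of-the-envelope illustration of this concern, using Python stand-ins for the index math in the snippet above (none of this is code from the PR):

# Assume the parent task declares one return object of its own.
parent_num_returns = 1

def forwarded_object_index(return_index):
    # Mirrors ObjectID::FromIndex(ParentTaskId(), parent_num_returns + return_index + 1)
    return parent_num_returns + return_index + 1

# The first forwarded call inside the parent starts at return_index 0.
first_call = forwarded_object_index(0)   # -> 2
# A second forwarded call inside the same parent also starts at
# return_index 0, so it maps to the same slot in the parent's ID space.
second_call = forwarded_object_index(0)  # -> 2, colliding with the first
assert first_call == second_call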
@jjyao, as you said, this PR is not that generic. I'm trying to see if the executor can be restarted in RayDP now. I thought this would be simpler when I started, but it turned out to be a large PR.
Why are these changes needed?
This change is crucial for RayDP to implement fault tolerance.
It solves the following issue: an object returned by an actor task may need to be recovered after the actor has died and cannot be restarted (in our case, the actors are Spark executors, and an executor cannot be re-registered after it restarts). By adding a task that calls the actor's task, and forwarding the lineage to that task, the task can be resubmitted and choose another actor to run the work again.
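A minimal, hypothetical sketch of the wrapper-task pattern described above; forward_to_parent is the option proposed in this PR, while the actor names, retry setting, and fallback loop are illustrative assumptions:

import ray

@ray.remote
class Executor:
    def b(self):
        return 1

@ray.remote(max_retries=3)
def wrapper():
    # If the forwarded object is lost, Ray can resubmit this wrapper task,
    # which is free to pick any live, equivalent executor actor.
    for name in ("executor_0", "executor_1"):
        try:
            h = ray.get_actor(name)
        except ValueError:
            continue  # this executor is gone; try the next one
        return h.b.options(forward_to_parent=1).remote()
    raise RuntimeError("no live executor available")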
Related issue number
Checks
I've run scripts/format.sh to lint the changes in this PR.