
[Core] Enable task to return object as if it's returned by its parent #26774

Closed · wants to merge 6 commits

Conversation

kira-lin
Contributor

Why are these changes needed?

This feature is crucial for implementing fault tolerance in raydp.

It solves the following issue: an object returned by an actor needs to be recovered, but the actor is dead and cannot be restarted. (In our case, the actors are Spark executors, and a restarted executor cannot re-register with Spark.) By adding a normal task that calls the actor's method, and forwarding the lineage to that task, the task can be resubmitted and choose another actor to run the method again.
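To make the failure mode concrete, here is a minimal sketch (the Executor class and its method are illustrative, not the actual raydp executor): the actor's return object is lost together with its node, and because the actor cannot be restarted, lineage reconstruction has nothing to resubmit.

import ray

@ray.remote(max_restarts=0)  # like a Spark executor: cannot come back
class Executor:
    def produce(self):
        return list(range(1000))

executor = Executor.remote()
ref = executor.produce.remote()
# If the node holding the result dies after the actor is gone, lineage
# reconstruction would need to resubmit produce() on the same actor,
# which no longer exists, so the object is unrecoverable.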

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@kira-lin
Contributor Author

kira-lin commented Jul 20, 2022

Hi @stephanie-wang,
I realized that it's possible to return Spark RDD data from executor actors through Ray calls, because RDD and Partition are serializable. I have implemented this, but in order for it to be fault tolerant, I need to implement this feature, which is very similar to what you suggested to me before.

I ended up with this prototype. With it, I can forward the returned result successfully, and the object is accessible later in Python when run on a single-node cluster. But when I run it on a multi-node cluster, the object cannot be retrieved in Python: the get hangs indefinitely.

I'm not sure which part I missed. Any help would be appreciated!

@jjyao
Collaborator

jjyao commented Aug 1, 2022

Do you have a test case showcasing what you are doing here?

@kira-lin
Contributor Author

kira-lin commented Aug 2, 2022

Hi @jjyao,
I'm having some problems making this work in a cluster right now, so I haven't added the tests yet.
But the effect we want is something like this:

import ray

@ray.remote
class a:
    def b(self):
        return 1

@ray.remote
def c():
    h = ray.get_actor("actor_a")
    # forward_to_parent is the new option proposed in this PR
    return h.b.options(forward_to_parent=1).remote()

if __name__ == "__main__":
    h = a.options(name="actor_a").remote()
    ref = ray.get(c.remote())

Normally we could call the b method directly. In our case, we want the value returned by b to be fault tolerant, but an actor of class a cannot be restarted. (If Spark loses an executor, it requests a new one.) So b cannot be resubmitted.
It's also important to note that Spark executors are interchangeable: we can resubmit the task to another Spark executor.
Therefore, we want to introduce a normal task c, which finds a Spark executor to submit the method call to, and which can itself be resubmitted when the returned value is lost.

This PR implements assigning the ownership of a.b's returned value to c's caller (the driver in this case), and using c's task ID to generate the object ID, so that c can be resubmitted when the object is lost.
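For contrast, a sketch of the closest workaround available today without this feature (illustrative, not part of this PR): materialize the value inside the wrapper task and re-return it. The result then becomes an ordinary return of the wrapper and is reconstructable by resubmitting it, at the cost of copying the data through the wrapper's process, which forward_to_parent is meant to avoid.

@ray.remote
def c_workaround():
    h = ray.get_actor("actor_a")
    # ray.get materializes the value here; returning it makes the result
    # an ordinary return of c_workaround, so a lost copy triggers a
    # resubmission of c_workaround. The cost is an extra copy of the
    # data through this worker.
    return ray.get(h.b.remote())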

@kfstorm
Member

kfstorm commented Aug 9, 2022

@kira-lin Is it possible to solve it with ray-project/enhancements#10?

@kira-lin
Contributor Author

It seems able to solve our problem. But it'll need an external HA storage, if I understand correctly?

How is it different from saving our Spark results to HDFS and using Ray to read them back?

@kfstorm
Member

kfstorm commented Aug 10, 2022

> It seems able to solve our problem. But it'll need an external HA storage, if I understand correctly?
>
> How is it different from saving our Spark results to HDFS and using Ray to read them back?

Yes, it needs external HA storage.

I don't think there's much difference if you only consider the results and availability. However, Ray makes it transparent to Ray users: you don't need to design and maintain the key / object ref / HDFS path mappings yourself.
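For example (a hypothetical sketch, not raydp or Ray code; the checkpoint/recover helpers and paths are made up), without Object HA the application has to own this kind of bookkeeping itself:

import pickle

# Hypothetical manual bookkeeping without Object HA: the application
# maintains its own key -> storage-path mapping.
result_paths = {}

def checkpoint(key, obj, root="/mnt/hdfs/results"):
    # Stand-in for an HDFS write; a real system would use an HDFS client.
    path = f"{root}/{key}.pkl"
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    result_paths[key] = path

def recover(key):
    with open(result_paths[key], "rb") as f:
        return pickle.load(f)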

@kira-lin
Contributor Author

I see. This PR aims to apply Ray and Spark's lineage-based recovery to handle object loss. I think there is no conflict between this PR and the Object HA PR: if our users are OK with checkpointing all the training data, then going with Object HA is a very good choice, while some users may want better performance and choose lineage-based recovery.

@scv119
Contributor

scv119 commented Aug 15, 2022

@kira-lin do I understand correctly that this only applies to stateless actors? Also, it would be nice to add a unit test / integration test to show the problem it solves.

@scv119 added the @author-action-required label Aug 15, 2022
@kira-lin
Contributor Author

@scv119 Yes, as long as the actor's task can be executed by another actor and yields the same output.

I'll add a unit test soon.

@jjyao
Collaborator

jjyao commented Aug 16, 2022

Hi @kira-lin,

Sorry for the late reply, and thanks for your response. Now I understand your use case and motivation. My current concern is how generic the solution is: it seems you would only need it when the actor is not restartable and the task can be run by another equivalent actor that generates the same result. I'm not sure who else, besides your use case, would need it.

I may not know raydp very well, but is it possible to make the stateless actor restartable (e.g., by creating a new Spark executor)?
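For reference, a minimal sketch of the built-in restart path this alludes to (assuming the executor could re-register with the Spark driver on restart, which is exactly the part raydp currently lacks):

import ray

@ray.remote(max_restarts=-1, max_task_retries=-1)
class Executor:
    def __init__(self):
        # Hypothetical: a restarted replica would need to re-register
        # with the Spark driver here; raydp executors cannot do this
        # today, which is what motivates this PR.
        pass

    def b(self):
        return 1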

@jjyao
Collaborator

jjyao left a comment


I feel it might be worth a REP (https://github.com/ray-project/enhancements) to flesh out the problem statement, proposal, semantics of the proposed APIs, design, etc. WDYT? @stephanie-wang @scv119

if (parent_num_returns < 0) {
  return ObjectID::FromIndex(TaskId(), return_index + 1);
} else {
  return ObjectID::FromIndex(ParentTaskId(), parent_num_returns + return_index + 1);
}

This means we can only have one forward_to_parent task call inside one task, right?
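To make the potential collision concrete (my reading of the snippet above, with tuples standing in for real ObjectIDs):

def forwarded_object_id(parent_task_id, parent_num_returns, return_index):
    # Mirrors ObjectID::FromIndex(ParentTaskId(),
    #                             parent_num_returns + return_index + 1)
    return (parent_task_id, parent_num_returns + return_index + 1)

# Two single-return forwarded calls made from the same task c
# (which itself has one return, so parent_num_returns == 1):
first = forwarded_object_id("task_c", 1, 0)   # ("task_c", 2)
second = forwarded_object_id("task_c", 1, 0)  # ("task_c", 2) -- same ID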

@jjyao
Collaborator

jjyao commented Aug 16, 2022

Also, does this API align with the ownership transfer work we might want to do in the future? @stephanie-wang

@kira-lin
Contributor Author

@jjyao, as you said, this PR is not that generic. I'm now looking into whether the executor can be restarted in raydp. I initially thought this feature would be simpler, but it has turned out to be a large PR.

@stephanie-wang self-assigned this Aug 18, 2022
@kira-lin closed this Aug 19, 2022