
Graceful error handling in future_lapply #32

Open
jburos opened this issue Jun 23, 2017 · 6 comments


@jburos

jburos commented Jun 23, 2017

I've been using future_lapply to submit jobs to a cluster of gcloud compute nodes - which is awesome!! Thanks so much for adding this user-friendly function.

One issue I'm running into, however, concerns the behavior when one of my gcloud compute nodes goes down unexpectedly.

Here's what I see:

Error in unserialize(node$con) :
  Failed to retrieve the value of ClusterFuture from cluster node #1 (on ‘35.190.151.35’).  The reason reported was ‘error reading from connection’
Calls: future_lapply ... FutureRegistry -> collectValues -> value -> value.ClusterFuture
Execution halted

Is there a good way to catch these errors for individual nodes, so that I can still access results from the nodes which are still executing? I.e., can we pass an on.error function to the values(fs) call?

For now I have wrapped these in a call to withCallingHandlers() to catch & ignore this error. I haven't tested it in a real-life situation, but assuming it works, this may be something to include in your vignette Common Issues with Solutions.

The code currently looks something like the following:

withCallingHandlers({
    remote_jobs <- future_lapply(seq_len(num_data_sims),
                                 execute_simtest,
                                 future.seed = 0xBEEF,
                                 future.scheduling = 2)
}, FutureError = function(e) NULL)
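As a base-R aside (no cluster needed; the worker failure is simulated with a plain stop()): a calling handler only *observes* an error, which keeps propagating after the handler returns unless a restart is invoked, whereas tryCatch() unwinds the stack and substitutes the handler's value. A minimal sketch of the difference:

```r
# The calling handler runs, but the error continues propagating,
# so the outer tryCatch still fires.
res1 <- tryCatch(
  withCallingHandlers(
    stop("simulated 'error reading from connection'"),
    error = function(e) NULL  # runs, but does not stop the error
  ),
  error = function(e) "error still propagated"
)

# tryCatch, in contrast, catches the error and returns the handler's value.
res2 <- tryCatch(
  stop("simulated 'error reading from connection'"),
  error = function(e) "error caught and ignored"
)
```

Here res1 is "error still propagated" while res2 is "error caught and ignored", which suggests the wrapper above would need to be tryCatch() rather than withCallingHandlers() to actually swallow the FutureError.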
@HenrikBengtsson
Collaborator

I agree. For starters, future_lapply() could recover the results for the elements that were processed by the other workers, and return a FutureError object for the elements processed by the failed node(s). I'll think more about this.

The next level of fanciness is to automatically redo the failed elements on workers that are still running. However, that opens up lots of potential problems, because it could be that the worker dies because of problematic elements - then trying to re-evaluate those will bring down more workers.
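A capped-retry policy along these lines could be sketched in base R (hypothetical helper; a real implementation would resubmit the future to a live worker rather than call the function directly):

```r
# Hypothetical sketch: retry a unit of work at most `max_tries` times,
# returning the last error object (rather than raising it) if all attempts fail.
retry <- function(f, max_tries = 3L) {
  res <- NULL
  for (i in seq_len(max_tries)) {
    res <- tryCatch(f(), error = identity)
    if (!inherits(res, "error")) return(res)
  }
  res  # give up: hand back the error object for the caller to inspect
}
```

A caller could then scan the results for error objects and decide whether an element is "poisonous", e.g. by refusing to retry an element whose evaluation has already killed a worker.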


Related to the above, I just found a note to myself from Jan 2016 to investigate whether it's possible to decide whether a worker has died or not, e.g.

> library("future")
> plan(multisession)
> f <- future({ tools::pskill(Sys.getpid()); TRUE })
> value(f)
Error in unserialize(node$con) : 
  Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost').  The reason reported was 'error reading from connection'

It would be nice to have a more informative error message here - one that suggests that the worker has died. It would also be nice to be able to exclude a non-working worker. On the other hand, that's a whole other level of framework to support, which might fit better in the underlying parallel package that we're using here, or be implemented in a "parallel2" package that future would then rely on.
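In the meantime, the low-level connection error can at least be converted into an inspectable object by wrapping value() in tryCatch() (a sketch, assuming the future package is installed and a multisession plan whose worker we deliberately kill, as in the example above):

```r
library(future)
plan(multisession)

# Deliberately kill the worker process, as in the reproducible example above.
f <- future({ tools::pskill(Sys.getpid()); TRUE })

# Convert the connection error into a value we can inspect, instead of halting.
v <- tryCatch(value(f), error = function(e) e)
if (inherits(v, "error")) {
  message("worker appears to have died: ", conditionMessage(v))
}
```

This doesn't diagnose *why* the worker died, but it keeps the master session alive so the remaining futures can still be collected.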

@jburos
Author

jburos commented Jun 28, 2017

I agree - some option to restart a failed job a maximum of X times would be really helpful. In the case of a node going down (say, because it's preemptible), as a user I would be comfortable setting X to a higher value than if node or process failures were more likely due to job-level features.

I hear you re: this might belong at a lower level of the code, perhaps removing failed nodes & re-routing jobs appropriately. At the same time, I can see some utility in being able to skip over or ignore errors when retrieving results from the nodes, even if those errors are not due to the nodes being down.

Just a side note: I wasn't able to get the withCallingHandlers code to ignore these failures as I had desired. Perhaps the call to values(fs) within future_lapply needs to be wrapped in a withRestarts? I'm a little out of my depth with these more advanced error-handling methods in R, but I'm thinking it might be time for me to learn more. I may play around with this to see what I can come up with.

For now, I'm back to using my own kludgy list-handlers of future jobs, which I can manually wrap in a tryCatch. Though I'm hesitant to improve my RNG & scheduling portions since I do plan to switch back to future_lapply. (In my use case, since I'm working with simulated data, the reproducible RNG is super helpful! So thanks for that.)

@HenrikBengtsson
Collaborator

Just a side note, I wasn't able to get the withCallingHandlers code to ignore these failures as I had desired. Perhaps the call to values(fs) within future_lapply needs to be wrapped in a withRestarts?

Yes, the error occurs outside the future itself, so you won't be able to work around it by embedding handlers in the future expression.

At some point in the past I was also considering a future_lapply_as_futures() function that would return a list of futures and let you call values() yourself, so you could do your own exception handling. Then there have been thoughts / requests about a future_apply(X, MARGIN, FUN) implementation, and so on. On top of this, one needs to think carefully about "the identity" of the Future API - whatever is implemented should conceptually work for all types of futures (including ones implemented in the future by someone other than me).
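The list-of-futures pattern is already possible by hand today, which is roughly what such a function would wrap (a sketch; future_lapply_as_futures() itself is hypothetical, and plan(sequential) is used here only to keep the example self-contained):

```r
library(future)
plan(sequential)  # any plan works; sequential keeps the sketch self-contained

# One future per element; then collect values with per-element exception
# handling, so one failure does not discard the other results.
fs <- lapply(1:3, function(i) future({
  if (i == 2L) stop("simulated worker failure")
  i * 10L
}))
vs <- lapply(fs, function(f) tryCatch(value(f), error = identity))

ok <- !vapply(vs, inherits, logical(1), what = "error")
# vs[ok] holds the successful results; vs[!ok] holds the error objects
```

This is essentially the per-element exception handling requested above, at the cost of giving up future_lapply()'s load balancing and RNG bookkeeping.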

To step back a bit, these types of issues and feature requests were the reason why I was hesitant to make future_lapply() part of the future package itself - maybe it belongs in a higher-level package, i.e. at the level where foreach, plyr, and purrr live. The future package and its Future API could be considered a low-level API that supports those (but does not re-implement their features). This also explains why I've been trying to "stay under the radar" with future_lapply().

Hopefully the above explains why it might take some time to move forward on these types of requests; I want to tread carefully here so I'm not adding features that need to be reworked later.

@HenrikBengtsson
Collaborator

FYI, I created #159 to discuss where future_lapply() and similar functions should live, i.e. moving them to a separate future.apply package.

HenrikBengtsson referenced this issue in futureverse/future Feb 23, 2018
@HenrikBengtsson HenrikBengtsson transferred this issue from futureverse/future Nov 14, 2018
@achetverikov

Is there any progress on this? It would be really nice to be able to capture errors from unsuccessful jobs without losing all successful ones.

@riedel

riedel commented Apr 24, 2020

There seems to have been progress in future regarding this. So is #44 the only remaining piece to get the full thing?
