
Graceful error handling in future_lapply #32

Open
jburos opened this issue Jun 23, 2017 · 6 comments


@jburos

jburos commented Jun 23, 2017

I've been using future_lapply to submit jobs to a cluster of gcloud compute nodes - which is awesome!! Thanks so much for adding this user-friendly function.

One issue I'm running into, however, concerns the behavior when one of my gcloud compute nodes goes down unexpectedly.

Here's what I see:

Error in unserialize(node$con) :
  Failed to retrieve the value of ClusterFuture from cluster node #1 (on ‘35.190.151.35’).  The reason reported was ‘error reading from connection’
Calls: future_lapply ... FutureRegistry -> collectValues -> value -> value.ClusterFuture
Execution halted

Is there a good way to catch these errors for individual nodes, so that I can still access results from the nodes which are still executing? I.e., can we pass an on.error function to the values(fs) call?

For now I have wrapped these in a call to withCallingHandlers() to catch & ignore this error. I haven't tested it in a real-life situation, but assuming it works, this may be something to include in your vignette Common Issues with Solutions.

The code currently looks something like the following:

withCallingHandlers({
    remote_jobs <- future_lapply(seq_len(num_data_sims),
                                 execute_simtest,
                                 future.seed = 0xBEEF,
                                 future.scheduling = 2)
}, FutureError = function(e) NULL)
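As a base-R aside (no cluster needed; the worker failure is simulated with a plain stop()): a calling handler only *observes* an error, which keeps propagating after the handler returns unless a restart is invoked, whereas tryCatch() unwinds the stack and substitutes the handler's value. A minimal sketch of the difference:

```r
# The calling handler runs, but the error continues propagating,
# so the outer tryCatch still fires.
res1 <- tryCatch(
  withCallingHandlers(
    stop("simulated 'error reading from connection'"),
    error = function(e) NULL  # runs, but does not stop the error
  ),
  error = function(e) "error still propagated"
)

# tryCatch, in contrast, catches the error and returns the handler's value.
res2 <- tryCatch(
  stop("simulated 'error reading from connection'"),
  error = function(e) "error caught and ignored"
)
```

Here res1 is "error still propagated" while res2 is "error caught and ignored", which suggests the wrapper above would need to be tryCatch() rather than withCallingHandlers() to actually swallow the FutureError.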
@HenrikBengtsson
Collaborator

I agree. For starters, future_lapply() could recover the results for the elements that were processed by the other workers, and return a FutureError object for the elements processed by the failed node(s). I'll think more about this.

The next level of fanciness is to automatically redo the failed elements on workers that are still running. However, that opens up lots of potential problems, because it could be that the worker dies because of problematic elements - then trying to re-evaluate those will bring down more workers.
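A capped-retry policy along these lines could be sketched in base R (hypothetical helper; a real implementation would resubmit the future to a live worker rather than call the function directly):

```r
# Hypothetical sketch: retry a unit of work at most `max_tries` times,
# returning the last error object (rather than raising it) if all attempts fail.
retry <- function(f, max_tries = 3L) {
  res <- NULL
  for (i in seq_len(max_tries)) {
    res <- tryCatch(f(), error = identity)
    if (!inherits(res, "error")) return(res)
  }
  res  # give up: hand back the error object for the caller to inspect
}
```

A caller could then scan the results for error objects and decide whether an element is "poisonous", e.g. by refusing to retry an element whose evaluation has already killed a worker.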


Related to the above, I just found a note to myself from Jan 2016 to investigate whether it's possible to decide whether a worker has died or not, e.g.

> library("future")
> plan(multisession)
> f <- future({ tools::pskill(Sys.getpid()); TRUE })
> value(f)
Error in unserialize(node$con) : 
  Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost').  The reason reported was 'error reading from connection'

It would be nice to have a more informative error message here - one that suggests that the worker has died. It would also be nice to be able to exclude a non-working worker. On the other hand, that's a whole other level of framework to support, which might fit better in the underlying parallel package that we're using here, or be implemented in a "parallel2" package that future would then rely on.
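In the meantime, the low-level connection error can at least be converted into an inspectable object by wrapping value() in tryCatch() (a sketch, assuming the future package is installed and a multisession plan whose worker we deliberately kill, as in the example above):

```r
library(future)
plan(multisession)

# Deliberately kill the worker process, as in the reproducible example above.
f <- future({ tools::pskill(Sys.getpid()); TRUE })

# Convert the connection error into a value we can inspect, instead of halting.
v <- tryCatch(value(f), error = function(e) e)
if (inherits(v, "error")) {
  message("worker appears to have died: ", conditionMessage(v))
}
```

This doesn't diagnose *why* the worker died, but it keeps the master session alive so the remaining futures can still be collected.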

@jburos
Author

jburos commented Jun 28, 2017

I agree - some option to restart a failed job a maximum of X times would be really helpful. In the case of a node going down (say, because it's preemptible), as a user I would be comfortable setting X to a higher value than if node or process failures were more likely due to job-level features.

I hear you re: this might belong at a lower level of the code, perhaps removing failed nodes & re-routing jobs appropriately. At the same time, I can see some utility in being able to skip over or ignore errors when retrieving results from the nodes, even if those errors are not due to the nodes being down.

Just a side note: I wasn't able to get the withCallingHandlers code to ignore these failures as I had desired. Perhaps the call to values(fs) within future_lapply needs to be wrapped in a withRestarts? I'm a little out of my depth with these more advanced error-handling methods in R, but I'm thinking it might be time for me to learn more. I may play around with this to see what I can come up with.

For now, I'm back to using my own kludgy list-handlers of future jobs, which I can manually wrap in a tryCatch. Though I'm hesitant to improve my RNG & scheduling portions since I do plan to switch back to future_lapply. (In my use case, since I'm working with simulated data, the reproducible RNG is super helpful! So thanks for that.)

@HenrikBengtsson
Collaborator

Just a side note, I wasn't able to get the withCallingHandlers code to ignore these failures as I had desired. Perhaps the call to values(fs) within future_lapply needs to be wrapped in a withRestarts?

Yes, the error occurs outside the future itself, so you won't be able to work around it by embedding handlers in the future expression.

At some point in the past I was also considering a future_lapply_as_futures() function that would return a list of futures and let you call values() yourself, so you could do your own exception handling. Then there have been thoughts / requests about a future_apply(X, MARGIN, FUN) implementation, and so on. On top of this, one needs to think carefully about "the identity" of the Future API - whatever is implemented should conceptually work for all types of futures (including ones implemented in the future by someone other than me).
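The list-of-futures pattern is already possible by hand today, which is roughly what such a function would wrap (a sketch; future_lapply_as_futures() itself is hypothetical, and plan(sequential) is used here only to keep the example self-contained):

```r
library(future)
plan(sequential)  # any plan works; sequential keeps the sketch self-contained

# One future per element; then collect values with per-element exception
# handling, so one failure does not discard the other results.
fs <- lapply(1:3, function(i) future({
  if (i == 2L) stop("simulated worker failure")
  i * 10L
}))
vs <- lapply(fs, function(f) tryCatch(value(f), error = identity))

ok <- !vapply(vs, inherits, logical(1), what = "error")
# vs[ok] holds the successful results; vs[!ok] holds the error objects
```

This is essentially the per-element exception handling requested above, at the cost of giving up future_lapply()'s load balancing and RNG bookkeeping.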

To step back a bit, these types of issues and feature requests were the reason why I was hesitant to make future_lapply() part of the future package itself - maybe it belongs in a higher-level package, i.e. at the level where foreach, plyr, and purrr live. The future package and its Future API could be considered a low-level API that supports those (but does not re-implement their features). This also explains why I've been trying to "stay under the radar" with future_lapply().

Hopefully the above explains why it might take some time to move forward on these types of requests; I want to tread carefully here so I'm not adding features that need to be reworked later.

@HenrikBengtsson
Collaborator

FYI, I created #159 to discuss where future_lapply() and similar functions should live, i.e. moving them to a separate future.apply package.

HenrikBengtsson referenced this issue in futureverse/future Feb 23, 2018
@HenrikBengtsson HenrikBengtsson transferred this issue from futureverse/future Nov 14, 2018
@achetverikov

Is there any progress on this? It would be really nice to be able to capture errors from unsuccessful jobs without losing all successful ones.

@riedel

riedel commented Apr 24, 2020

There seems to have been progress in future regarding this. So is #44 the only remaining piece to get the full thing?
