Graceful error handling in future_lapply #32
I agree. For a starter, …

The next level of fanciness is to automatically redo the failed elements on workers that are still running. However, that opens up lots of potential problems, because it could be that the worker died because of problematic elements - then trying to re-evaluate those will bring down more workers.

Related to the above, I just found a note to myself from Jan 2016 to investigate whether it's possible to decide whether a worker has died or not, e.g.

```
> library("future")
> plan(multisession)
> f <- future({ tools::pskill(Sys.getpid()); TRUE })
> value(f)
Error in unserialize(node$con) :
  Failed to retrieve the value of MultisessionFuture from cluster node #1 (on 'localhost').
  The reason reported was 'error reading from connection'
```

It would be nice to have a more informative error message here - one that suggests that the worker has died. It would also be nice to be able to exclude a non-working worker. On the other hand, that's a whole other level of framework to support, which might fit better in the underlying parallel package that we're using here, or be implemented in a "parallel2" package, which future relies on.
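For illustration, a minimal sketch of how such a check could be done from the outside today, by catching the low-level connection error around `value()`. The `value_or_dead` name is made up and matching on the error message is fragile; it is only meant to show the idea.

```r
library(future)
plan(multisession)

## Hypothetical wrapper (not part of the future API): catch the
## connection error that currently surfaces when a worker process has
## died and re-signal it with a more informative message.
value_or_dead <- function(f) {
  tryCatch(
    value(f),
    error = function(e) {
      if (grepl("error reading from connection", conditionMessage(e))) {
        stop("The worker evaluating this future appears to have died: ",
             conditionMessage(e), call. = FALSE)
      }
      stop(e)
    }
  )
}

f <- future({ tools::pskill(Sys.getpid()); TRUE })
value_or_dead(f)
```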
I agree - some option to restart a failed job a max of X times would be really helpful in the case of a node going down (say, because it's …).

I hear you re: this might belong at a lower level of the code, perhaps removing failed nodes & re-routing jobs appropriately. At the same time, I can see some utility in being able to skip over or ignore errors when retrieving results from the nodes, even if those errors are not due to the nodes being down.

Just a side note: I wasn't able to get the … For now, I'm back to using my own kludgy list-handlers of …
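A rough sketch of the kind of retry option being asked for, written as a user-level helper. The `value_with_retry` name and `max_tries` argument are made up, and for simplicity each future is resolved immediately rather than keeping all of them running in parallel:

```r
library(future)
plan(multisession)

## Hypothetical helper: evaluate FUN(x) in a future and retry up to
## `max_tries` times if retrieving the value fails (e.g. because the
## worker died or the code errored).  After the last attempt the error
## object is returned instead of aborting the whole run, so results
## from the other elements are not lost.
value_with_retry <- function(x, FUN, max_tries = 3L) {
  res <- NULL
  for (i in seq_len(max_tries)) {
    f <- future(FUN(x))
    res <- tryCatch(value(f), error = function(e) e)
    if (!inherits(res, "error")) return(res)
  }
  res
}

results <- lapply(list(1, 4, "oops"), function(x) value_with_retry(x, sqrt))
```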
Yes, the error occurs outside the future itself, so you won't be able to work around it by embedding handlers in the future expression. At some point in the past I was also elaborating on / considering a …

To step back a little bit here, these types of issues and feature requests were the reason why I was hesitant about making …

Hopefully the above explains why it might take some time to move forward on these types of requests; I want to tread carefully here so I'm not adding features that need to be reworked later.
FYI, created #159 to discuss where …
Is there any progress on this? It would be really nice to be able to capture errors from unsuccessful jobs without losing all successful ones.
There seems to have been progress on futures wrt this. So is only #44 required to get the full thing?
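In the meantime, for errors raised by the element's own code (as opposed to a worker dying), one pattern that already works with the current future.apply package is to wrap the function body in `tryCatch`, so each element returns either its result or the captured error object; a small sketch:

```r
library(future.apply)
plan(multisession)

res <- future_lapply(list(1, 4, "oops", 9), function(x) {
  ## Return the error object instead of stopping the whole map
  tryCatch(sqrt(x), error = function(e) e)
})

failed <- vapply(res, inherits, logical(1), what = "error")
res[!failed]  # successful results
res[failed]   # captured errors
```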
I've been using `future_lapply` to submit jobs to a cluster of gcloud compute nodes - which is awesome!! Thanks so much for adding this user-friendly function.

One issue I'm running into, however, concerns the behavior when one of my gcloud compute nodes goes down unexpectedly. Here's what I see: …

Is there a good way to catch these errors for individual nodes, so that I can still access results from the nodes which are still executing? I.e., can we pass an `on.error` function to a call to `values(fs)`?

For now I have wrapped these in a call to `withCallingHandlers()` to catch & ignore this error. I haven't tested it in a real-life situation, but assuming it works, this may be something to include in your vignette "Common Issues with Solutions". The code currently looks something like the following: …
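A rough sketch of the kind of wrapper being described, retrieving each future's value inside `tryCatch` so that one dead node does not discard the results from the nodes that are still alive. The `slow_computation` function is a made-up stand-in for the real per-element work:

```r
library(future)
plan(multisession)  # or a cluster plan pointing at the gcloud nodes

## slow_computation() stands in for the real work done on each node
slow_computation <- function(x) { Sys.sleep(1); x^2 }

fs <- lapply(1:10, function(x) future(slow_computation(x)))

## Fetch what we can: a failed retrieval is kept as the error object
results <- lapply(fs, function(f) {
  tryCatch(value(f), error = function(e) e)
})

ok <- !vapply(results, inherits, logical(1), what = "error")
results[ok]   # values from nodes that completed
results[!ok]  # errors, e.g. "Failed to retrieve the value of ..."
```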