-
Notifications
You must be signed in to change notification settings - Fork 1.2k
Better error handling in daemon + gateway #1325
Comments
@fsdiogo as you're the current OKR owner for this, I've assigned this to you |
Another error that happens directly after starting daemon right now:
|
@victorbjelkholm I can't seem to be able to reproduce the error in your last comment, can you share the steps or configs you have that produces the error? |
@fsdiogo this issue (#1325) is not about a specific issue but rather having having a error handler for uncaught exceptions from within the daemon. Currently, any error that happens down the stack (in js-bitswap for example) crashes the daemon, which is why it's so unstable currently. We need to make it so the daemon won't fully be crashed and needing a restart just because a error in one stream or for one operation. |
Alright, so it's best to start off on the daemon itself, i.e. try to reproduce an error and make it so that the daemon doesn't crash? |
@fsdiogo I think just adding |
I've been thinking about a better approach to this problem. Instead of a band aid approach, we should really understand how we're dealing with errors and have a robust solution to propagate them. We are using /*
With async the stack trace gets lost: it only tells the line
where we throw the error, but it came from one of the calls
in the series, we don't have a way to know from which one.
*/
const async = require('async')
async.series([
(cb) => cb(null, 'async x'),
(cb) => cb('async error'),
(cb) => cb(null, 'async xxx'),
], (err, res) => {
if (err) {
throw new Error(err)
}
})
/*
> node async.js
Users/diogo/async-promises/async.js:15
throw new Error(err);
^
Error: async error
at async.series (/Users/diogo/async-promises/async.js:15:10)
at /Users/diogo/async-promises/node_modules/async/dist/async.js:3874:9
at /Users/diogo/async-promises/node_modules/async/dist/async.js:473:16
at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:976:17)
at /Users/diogo/async-promises/node_modules/async/dist/async.js:958:16
at /Users/diogo/async-promises/node_modules/async/dist/async.js:3871:13
at async.series (/Users/diogo/async-promises/async.js:11:11)
at /Users/diogo/async-promises/node_modules/async/dist/async.js:3866:24
at replenish (/Users/diogo/async-promises/node_modules/async/dist/async.js:998:17)
at iterateeCallback (/Users/diogo/async-promises/node_modules/async/dist/async.js:983:17)
*/ A solution for this is Promises: /*
With promises the stack trace is intact, being possible
to observe all the lines and modules that lead to the error.
*/
new Promise((resolve, reject) => {
setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))
const randomFn = () => {
throw new Error('randomFn error')
}
/*
> node promises.js
Error: randomFn error
at randomFn (/Users/diogo/async-promises/promises.js:14:9)
at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
at <anonymous>
*/ Promises will vastly improve our debug capability and therefore help us with the handling of future errors. Moreover, we standardise the way we handle errors and improve the codebase by handling errors that are being lost along the way right now. PR #1335 by @hugomrdias is a good example of that. I had already opened the issue #1311 to start using promises, but I think this KR is the perfect case of why we should really do that. It's not a matter of callbacks vs promises, but of what works vs what doesn't. I've been thinking of possible solutions but this is the best way that I've found, I've talked with @vasco-santos and @hugomrdias and they agree with this approach. What do you guys think? @victorbjelkholm @diasdavid |
@dryajov want to chime in as you've dealt with this too? |
@fsdiogo so the issue here is not really with async.series([
(cb) => cb(null, 'async x'),
(cb) => cb('async error'),
(cb) => cb(null, 'async xxx'),
], (err, res) => {
if (err) {
throw new Error(err) // --> error instantiated here
}
}) The const async = require('async')
async.series([
(cb) => cb(null, 'async x'),
(cb) => cb(new Error('async error')), // --> error instantiated here
(cb) => cb(null, 'async xxx'),
], (err, res) => {
if (err) {
throw err
}
}) We get the correct stacktrace:
Note that in you're promise example you do instantiate the error at the place of rejection, so the stack comes up correctly as well: /*
With promises the stack trace is intact, being possible
to observe all the lines and modules that lead to the error.
*/
new Promise((resolve, reject) => {
setTimeout(() => resolve('promise x'), 100)
})
.then((res) => res + 'x')
.then((res) => randomFn())
.then((res) => res + 'xxx')
.catch((err) => console.log(err))
const randomFn = () => {
throw new Error('randomFn error')
}
/*
> node promises.js
Error: randomFn error
at randomFn (/Users/diogo/async-promises/promises.js:14:9)
at Promise.then.then (/Users/diogo/async-promises/promises.js:9:16)
at <anonymous>
*/ |
@dryajov sure ur correct, but the problem isn't where you create the errors. It's propagation, with async, errors instances are just passed around (or teleported as i say it). With promises, errors are automatically re throwed and stack-traces follow the code path and with a lot less noise from async internals. Not having full stack traces is really painful because we have to manually search the code base, commenting or whatever until we find the line we need. In contrast with complete stacks we can cmd+click BOOM editor jumped to the line we need perfect developer experience and full productivity. |
Aside from my comment above, this is still obviously a legitimate issue overall. Promises do help with that somewhat, but they are IMO not the final solution for this particular issue and I would argue that bringing them up here as an end all solution for this problem, might end up doing more harm than good. I though about it for some time and I think some of the issues we have boil down to mixing error handling styles. We rely heavily on callback style async constructs in our code base, which occasionally get mixed with IMO, in order to handle errors correctly we need to:
|
This stuff below also is tricky because callback might not be called, in some places we even have const done = (err, res) => {
if (err) {
self.emit('error', err)
return callback(err)
}
self.state.initialized()
self.emit('init')
callback(null, res)
} What do you think about removing event emitters ? moving stuff like boot() inside ipfs constructor to a instance method or a static factory method where we can return a promise or use callbacks. |
Ah, yeah - Yeah, I think removing |
Yes it's true for promises, that's why I mention removing event emitters. IMO this should be one of the first things to do then all the stuff you said and finally add promises to help with stacks, debugging , back track cancelables, retry/timeouts etc |
Thanks for the input @dryajov and @hugomrdias!
|
So, I realize that removing event emitting can simplify things (no argument from me that any critical error during startup should translate to an error in the callback/rejection in a promise), but we should keep in mind that the events are part of the public API. Various users may be relying on them. They at least deserve notice of the removal well before it happens, and ideally a release or two where the events still work but log a warning when used. (That said, event emitters should all be wrapped in |
@Mr0grog agree, this change should be treated as a breaking change and possibly a major version bump (0.3.x?). |
@fsdiogo before refactoring the code, make sure to have tests that check for the situations where the daemon errors out today. Try things like opening a connection and write trash at different levels of the dialing (transport -> secio -> stream muxing -> protocols). |
Yeah @Mr0grog @dryajov, it will be a breaking change and apps using ipfs will probably have to be refactored, so that's why in its first iteration we should offer both versions. I've opened a small WIP PR with that: #1359. But as @diasdavid pointed out, I shall start with a different task. I've updated #1325 (comment) with the tasks order. |
I have a couple more suggestions for your consideration:
To address @victorbjelkholm's original request: In Node.js the best practice on uncaught exceptions is to crash the process. We need to track down and fix any issues that cause the process to crash. In Node.js you can't meaningfully reason about the state of the system in the case of an unhandled exception so you need to restart the process to avoid leaking memory, holding open connections unnecessarily, corrupting shared state etc. etc. It's worth having a read of https://www.joyent.com/node-js/production/design/errors - relevant section:
|
I'm not clear on the benefits of removing To avoid crashing the process on an I get the point about handlers throwing and not allowing the callback to be called, but they are unhandled exceptions - you don't really want to catch them (what would you do? - it's not your code that err'd) and if you put them on the next tick then it's just going to be an unhandled exception there too. Our code should be doing any cleanup and/or running critical code paths before emit is called. BTW I'm +1 on all the other points that @fsdiogo has listed. |
👍
I guess I read this issue more as “there are a lot of errors we could be catching and handling so as not to leave ourselves in a bad state, so we don’t need to crash the daemon as often.” Because otherwise I totally agree that simply ignoring any old error is a recipe for really confusing and possibly more dangerous disasters, like data loss, sometime later in the process’s lifetime.
In any other large system I’ve worked on, we’ve generally followed the logic that, if you catch them, you do you clean-up and then re-throw (or just use a
Exactly! Same as above; that’s the point. It’s not your error, so you shouldn’t be preventing/stopping it. You should only be preventing it from breaking your internal state because, again, it’s not your error.
Wherever possible, 👍 ! It’s just not always reasonable/possible. Sometimes it’s important that the ordering of the event come before some other code (e.g. |
I would actually abstain of using node stream at all and instead use |
Yeah, you’re probably right for the case of all the stuff here. I think |
Also, since we made the decision of using pull-stream exclusively a while back. Plus, although full featured, pull-stream are still light weight compared to node streams in general. |
@dryajov point taken! The sentiment was to ensure stream errors do not cause the daemon to crash so as long as we're ensuring that I'm happy 😄 |
@fsdiogo during last js-ipfs weekly, there were some question with regards on how to approach this. Do you feel it would be good to sync? |
Well, I was going to follow these steps, if you feel that's the way to go, awesome! If not, we should sync 🙂 |
|
Ok, I see, so that is where we diverge. You should first of all create the tests that cause the crashes. |
You want that approach so we don't start changing the code blindly, right? I get it. I just think that way the process will be a lot slower, no? |
@fsdiogo we are not optimizing for shipping fast here, the target should be correctness and ensuring that the same problems don't come back to haunt us. |
Agree, we should have a test for every failure we run into 👍 |
@fsdiogo a good source of problems to solve is the list of bugs -- https://github.com/ipfs/js-ipfs/labels/bug --. Creating tests that reproduce those will help you nail down where the problems are coming from. @alanshaw, @pgte and @satazor have a vast amount of experience in making large Node.js services robust. They can give a ton of tips on how to move forward the best way possible, reach out! |
Alright, thanks @diasdavid! I'll reach out then! |
No more "already piped" with 0.31 - #1458 |
Duplicate/Same as #1406. Let's continue there |
Today, if there is a error in a stream or any other error within the daemon or the gateway, the entire process crashes. While not a problem for testing, it's a problem when deploying and trying to run a long-term daemon (for example, for the js-ipfs daemons).
I raised this before in #1243
Basically, we should be able to gracefully handle errors down the stack without having the daemon and gateway crash.
Current issue is the following error that happens every ~10 minutes:
I'm not saying that we should (or should not) fix this particular error, but rather when this error (and similar) happens, the daemon and gateway should not die but continue running.
The text was updated successfully, but these errors were encountered: