Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix server crashing on failed publish #201

Merged
merged 7 commits into from
May 2, 2018

Conversation

gkubisa
Copy link
Contributor

@gkubisa gkubisa commented Apr 3, 2018

This PR fixes the issue where the entire server crashes when ShareDB fails to publish a message to its PubSub system.

The easiest way to reproduce the problem is to use Redis for PubSub (with node-redis configured to try to restore the connection indefinitely), stop Redis and then send some operations to ShareDB server. ShareDB will then try to publish a message to Redis and crash the server.

The fix is to simply add an empty callback to the publish function. This works fine as ShareDB can gracefully recover from losing those messages by reading the missing document revisions from the database (eg MongoDB). For the same reason I don't think it is necessary to propagate the publishing error up, although it certainly is a possibility.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.002%) to 96.163% when pulling 6e43456 on Teamwork:fix-crash-on-failed-publish into ea5053b on share:master.

1 similar comment
@coveralls
Copy link

Coverage Status

Coverage increased (+0.002%) to 96.163% when pulling 6e43456 on Teamwork:fix-crash-on-failed-publish into ea5053b on share:master.

@coveralls
Copy link

coveralls commented Apr 3, 2018

Coverage Status

Coverage increased (+0.02%) to 96.18% when pulling 598afdb on Teamwork:fix-crash-on-failed-publish into ea5053b on share:master.

@coveralls
Copy link

Coverage Status

Coverage increased (+0.002%) to 96.163% when pulling 6e43456 on Teamwork:fix-crash-on-failed-publish into ea5053b on share:master.

@curran
Copy link
Contributor

curran commented Apr 11, 2018

@gkubisa The Travis build is failing. Did you run the CI check locally first? You can run it like this:

npm run jshint && npm run test-cover

@gkubisa
Copy link
Contributor Author

gkubisa commented Apr 11, 2018

When I check the build details in Travis, it says the build failed only in "Node.js 0.1" - I don't believe anyone cares about that ancient version anymore. The build succeeds in all the more recent node versions.

@curran
Copy link
Contributor

curran commented Apr 11, 2018

@gkubisa Aha great point. Perhaps remove that old version from the Travis config then? Not sure if @nateps would approve of that, but seems like a reasonable thing to do. Also it's odd that CI passes for the old version in previous commits, but breaks after this change. I wonder what could be the issue...

@gkubisa
Copy link
Contributor Author

gkubisa commented Apr 11, 2018

I first fixed the .travis.yml file - it meant to run tests on node 0.10 not 0.1. However, the tests fail on 0.10 too. It looks like the reason is an update to mingo, which now uses Symbol, which is not supported in earlier versions of nodejs. See https://travis-ci.org/share/sharedb/jobs/365189513#L545-L548.

So, I then updated .travis.yml to run tests on all supported nodejs versions only. The tests pass now.

Copy link
Contributor

@curran curran left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha great! Tricky thing it is with Mingo, totally not from your code.

The changes look good to me. Although, it would be amazing if it could be covered by a test case. Not sure if it's feasible to test this as it involves stopping Redis. Perhaps there's a way to simulate it.

@ericyhwang
Copy link
Contributor

Thanks @gkubisa for the PR and to both of you for updating the Node versions!

Nate's starting to involve me in Derby/Share PRs, so you'll see me pop in on these more and more. Sorry for the delay on this - Rachael and I have time time scheduled with Nate to go over these PRs monthly at the moment, and it happened a bit later than usual this month.

Notes from the discussion:

  • Nate agrees that ShareDB should have some way to not crash the server on publish errors. As you point out, publishing is a non-critical part of the commit process, so it doesn't necessarily need to crash the server.
  • He expressed some concern about hiding errors - if there's a connectivity issue between a specific server and Redis, then a crash would tell you about that issue very quickly.
  • Nate's good with not propagating the publish error to the Share client, as the operation did successfully commit.
  • For his concern about hiding errors... I suggested using Node's EventEmitter for the pub/sub errors, which enables logging of those errors. Nate's ok with that approach, but we ran out of time discussing details.

To keep the ball rolling, I wrote up a short concrete proposal below. I'm not tied to any particulars if someone has better ideas!

  • Everyone: Feedback on the proposal?
  • @nateps: Tagging you here as you requested.
  • @gkubisa: Couple additional questions for you:
    • How are you currently handling logging of Redis connection errors on the server? Will the EventEmitter approach work for you?
    • Once we settle on something, would you like to work on the code changes, assuming it's not too much more than what you've got here?

Proposal for handling pub/sub errors:

  • Make the base PubSub class an EventEmitter, to let you log and handle pub/sub errors. The Backend is already an EventEmitter, but you might want to handle backend DB errors differently. (Thoughts?)
  • Take the current change and replace the util.doNothing callback with a callback that emits the error on the PubSub instances.
  • For "error" events, EventEmitter will exit the process if there's no error handler.
  • Eventually, it'd also be good to document the default (crashing) behavior and how to catch the errors, in something like a Productionization Advice readme.

@curran
Copy link
Contributor

curran commented Apr 24, 2018

@ericyhwang Excellent proposal! Looks good to me.

@gkubisa
Copy link
Contributor Author

gkubisa commented Apr 24, 2018

Thanks for the review @ericyhwang. I mostly agree with this proposal and already made PubSub an EventEmitter and removed util.doNothing.

Regarding emitting the error event, I think a better place to do that would be in sharedb-redis-pubsub. All the methods (_subscribe, _unsubscribe, _publish, close) would emit an error internally but only if no callback is provided. This behavior would be consistent with most node libs out there and it would make it impossible to forget to emit the error in the client code (ShareDB in this case).

To emit errors as above, this PR will need to be merged first and a new version of ShareDB published, so that sharedb-redis-pubsub could depend on the latest version.

@ericyhwang
Copy link
Contributor

Ah, good point about the other methods. Thankfully, it looks like calls to _subscribe and _unsubscribe do always pass callbacks.

One benefit to putting error emitting in the base class: New adapter subclasses won't have to implement error emitting themselves. They just have to pass callbacks through, and it'll work. Though that doesn't necessarily help if a method gets overridden.

Something else to discuss: Should all pub/sub errors get emitted, or only pub/sub errors when there's no callback? Since unhandled "error" type events can cause the process to exit, I'm leaning towards the latter, only emitting unhandled errors.

@curran
Copy link
Contributor

curran commented Apr 25, 2018

Looking forward to see this merged :)

Tangentially related to share/sharedb-redis-pubsub#4 (comment)

@gkubisa
Copy link
Contributor Author

gkubisa commented Apr 25, 2018

@ericyhwang Great point about emitting the errors in the base class - I've just implemented this approach. Since only _subscribe, _publish and _unsubscribe are supposed to be overridden, I think we should be good. I completely agree about emitting only unhandled errors.

I believe this PR is ready for merging now.

Copy link
Contributor

@ericyhwang ericyhwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, and thanks for updating/adding tests!

It looks like GitHub will let me merge this, but I'd still like Nate to at least glance over it first, if possible.

@nateps - The pubsub/index.js change is a short 20-something line diff, can you take a look and approve in the next couple days?

it('emits an error if _publish is unimplemented and callback is not provided', function(done) {
var pubsub = new PubSub();
pubsub.on('error', function(err) {
expect(err).an(Error);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Share doesn't have official code style guidelines that I can find, but everything uses 2-space indents, so let's match that for consistency.

});

it('can emit events', function(done) {
var pubsub = new PubSub();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, 2-space indents here too.

@@ -23,22 +32,29 @@ PubSub.prototype.close = function(callback) {
map[id].destroy();
}
}
if (callback) callback();
if (callback) process.nextTick(callback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@nateps
Copy link
Contributor

nateps commented May 2, 2018

In response to @curran about Travis config ("Perhaps remove that old version from the Travis config then? Not sure if @nateps would approve of that"):

  1. My guidance on node versions to test in CLI is that we should include the current major version of Node.js and all versions under LTS. Happy to accept updates to this as time goes on. Ideally we'd do it as a separate PR in the future, but let's just get it merged this time, since it is already done. Thanks for the help! Glad we were already on the same page about what versions would be useful to test.

  2. @curran would you mind creating a contributing.md file a la https://help.github.com/articles/setting-guidelines-for-repository-contributors/? We can start with the stuff we've discussed before, and let's make it easier on ourselves by starting with something as simple as possible; we can always add more. I'm happy for you to suggest some guidelines based on other open source project guidelines you respect.

For future reference and possibly linking in contributing file, the list of active major Node versions appears to be: https://github.com/nodejs/Release#release-schedule. According to that, it is 4, 6, 8, and 9, and as of just now, 10.

Copy link
Contributor

@nateps nateps left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great PR! Thanks for the rapid turnaround on the comments and thanks for implementing the solution so well.

I had one small style comment, and I'll take care of that update real quick then merge.

@@ -13,8 +16,14 @@ function PubSub(options) {
// isn't complete until the callback returns from Redis
// Maps channel -> true
this.subscribed = {};

var self = this;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a style convention, the use of self as a variable totally works and that is what I use when there is not a better name to use. (You'll see this in sharedb tests even)

However, in sharedb and other repos, we do prefer to use the instance variable name (pubSub would be the instance name of the class PubSub) rather than self. I believe this makes the code and stack traces more specific, easier to understand, and easier to grep / search.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll keep that in mind in my future contributions to the project. 👍

@@ -13,8 +16,14 @@ function PubSub(options) {
// isn't complete until the callback returns from Redis
// Maps channel -> true
this.subscribed = {};

var self = this;
this._defaultCallback = function(err) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm impressed you repeated this pattern from QueryEmitter! Great job. 🎉

For future people reading this PR, the pattern means that we only have to create this callback function and the closure of the pubSub reference one time. The alternative would have been to create an anonymous function every time we publish. This is an ideal application of the same optimization, and it is a good use of consistency in naming and code patterns.

this._unsubscribe(channel, function(err) {
if (err) throw err;
});
this._unsubscribe(channel, this._defaultCallback);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for spotting this inconsistency and cleaning it up!

@nateps nateps merged commit c9df995 into share:master May 2, 2018
@gkubisa gkubisa deleted the fix-crash-on-failed-publish branch May 3, 2018 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants