Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pub/Sub subscribe to +switch-master events. #25

Merged
merged 3 commits into from
Dec 9, 2014

Conversation

ldm5180
Copy link
Contributor

@ldm5180 ldm5180 commented Nov 20, 2014

The master can change even without a disconnect happening to the client. In order to get this change the clients must subscribe to the +switch-master message. This needs to be subscribed before the client connection is established to avoid race conditions.

James Allen and others added 2 commits November 20, 2014 10:56
…eton.

Problem:
    The original design of the sentinel with Pub/Sub to the switch-master
    message broke the last unit test because it was always connected to the
    previously successful servers.

Analysis:
    In order to not have always stay connected to the old servers the
    subscription to the Pub/Sub message needs to only occur when the
    successful connect occurs.
@jamessharp
Copy link
Member

This is great - thanks. I think there's a race condition

  • reconnectAllClients calls connection_gone, which in turn emits 'reconnecting' which causes redis-sentinel to request the master/slave/whatever endpoint again and then node_redis sets a timer to attempt a reconnect.
  • In the case where the sentinel responds with endpoint details before the retry timer is up then all is fine. node_redis just connects to the new endpoints.
  • However I doubt that will happen very often. In the case where the retry timer calls first then node_redis will reconnect back to the old master. Even when the sentinel response comes back we won't do a reconnect so we'll stay connected to the old master.
  • The fix is non-trivial because you can't just say 'when we get a response from sentinel check if the endpoints have changed and if they have reconnect'. This is because in the intervening period (when we've reconnected to the old master before the sentinel response has come back) commands could well have been issued to the old master and any writes to the DB will fail.

The solution is probably along these lines:

  • In refreshEndpoints initially null route the address (i.e. set client.connectionOption.host = '0.0.0.0')
  • Then once resolver returns set the correct address (as today) then call connection_gone again to make sure we do attempt a reconnect.

I'm not sure that setting the null address won't kill node_redis but is worth a try.

…plete.

The resolver needs to run before the place to connect is determined, but
node-redis will start attempting to reconnect immediately. Prevent this by
setting the port and ip to the empty string and letting the resolver
modify it.

A better fix would be to modify node-redis to be able to not retry until
told to do so.
@ldm5180
Copy link
Contributor Author

ldm5180 commented Nov 29, 2014

Yes, there is certainly a race there. I fixed the race with the fix you suggested and tested it using a slow Sentinel. It seems to work fine.

@jamessharp
Copy link
Member

Looks good. Thanks!

jamessharp added a commit that referenced this pull request Dec 9, 2014
Pub/Sub subscribe to +switch-master events.
@jamessharp jamessharp merged commit 1a493ad into ortoo:master Dec 9, 2014
@ldm5180 ldm5180 deleted the pubsub-switchmaster branch December 23, 2014 15:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants