This was discovered during the testing we've been doing as part of #346, looking into failure scenarios of Redis clusters.
Currently, when executing an action against a pool, the pool tries to grab a connection; if none are available, it waits on a notification channel to be told it can try again. However, if the pool is closed between the action asking for a connection and one becoming available (for example, because the node is removed from the cluster), it can get into a state where no connection will ever become available and no notification will ever arrive to wake the function up, so it stays stuck waiting until the context associated with the action reaches its deadline.
I made a simple change to check proc.ClosedCh() in the same select statement that waits for the notification, so the function can exit early when this occurs. With this change our test was able to recover from the loss of a master node after a failover much faster (essentially right after the next periodic resync).
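For illustration, here is a minimal, self-contained sketch of the idea; the type and field names (pool, conn, notifyCh, closedCh) are hypothetical and not the library's actual internals. Without the closed-channel case, the select can only return when a connection is freed or the action's context expires; adding it lets the waiter bail out as soon as the pool is closed.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Minimal stand-in types; the real pool is more involved and these
// names are illustrative only.
type conn struct{ id int }

type pool struct {
	conns    chan *conn    // available connections
	notifyCh chan struct{} // signalled when a connection is returned
	closedCh chan struct{} // closed when the pool itself is closed
}

var errPoolClosed = errors.New("pool is closed")

// get waits for a free connection. Without the closedCh case, the
// blocking select below can sit until ctx expires even though the pool
// was closed and no connection will ever be returned; with it, the
// waiter fails fast.
func (p *pool) get(ctx context.Context) (*conn, error) {
	for {
		// Non-blocking attempt to grab a connection first.
		select {
		case c := <-p.conns:
			return c, nil
		default:
		}
		// Otherwise wait to be woken up, or give up early.
		select {
		case <-p.notifyCh:
			// a connection may be available again; loop and retry
		case <-p.closedCh:
			// the pool was closed while we were waiting; exit early
			return nil, errPoolClosed
		case <-ctx.Done():
			return nil, ctx.Err()
		}
	}
}

func main() {
	p := &pool{
		conns:    make(chan *conn, 1),
		notifyCh: make(chan struct{}),
		closedCh: make(chan struct{}),
	}

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Close the pool shortly after get starts waiting.
	go func() {
		time.Sleep(100 * time.Millisecond)
		close(p.closedCh)
	}()

	_, err := p.get(ctx)
	fmt.Println(err) // prints "pool is closed" almost immediately, not after 5s
}
```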
I have the change on a branch and have run the tests with it, but wanted to present it here in case there's an issue I'm missing: woodsbury@4b9342a
If it all looks good, I can create a PR for it.