Fix race condition causing 'pool not open' errors #5882
We need to remove the host from the map first; otherwise we may attempt to use this host after its pool has already been closed (race condition).
+ CassandraClientPoolingContainer containerToRemove = currentPools.get(removedServerAddress);
+ currentPools.remove(removedServerAddress);
  try {
-     currentPools.get(removedServerAddress).shutdownPooling();
+     containerToRemove.shutdownPooling();
technically doesn't fully solve the issue: what if there are in-flight requests on the pool-to-be-closed?
hmm, yes - ideally we'd have something that will:
a) block new requests from being started
b) wait for existing requests to complete (or time out)
c) clean up by shutting down pooling.
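The three steps above could be sketched roughly as follows. This is a minimal illustration, not the AtlasDB implementation: `DrainingPool`, `run`, and `drainAndShutdown` are hypothetical names, and the `shutDown` flag merely stands in for the real `shutdownPooling()` call.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical wrapper around a pool that supports a graceful drain. */
final class DrainingPool {
    private final AtomicBoolean closing = new AtomicBoolean(false);
    private final AtomicInteger inFlight = new AtomicInteger(0);
    private final Object monitor = new Object();
    private volatile boolean shutDown = false;

    /** Runs a request, or rejects it if the pool has started closing. */
    <T> T run(Supplier<T> request) {
        // Register first, then check the flag, so the drainer either sees
        // this request as in flight or this request sees the closing flag.
        inFlight.incrementAndGet();
        try {
            if (closing.get()) {
                throw new IllegalStateException("pool is closing");
            }
            return request.get();
        } finally {
            if (inFlight.decrementAndGet() == 0) {
                synchronized (monitor) {
                    monitor.notifyAll();
                }
            }
        }
    }

    /** Returns true if all in-flight requests finished before the timeout. */
    boolean drainAndShutdown(long timeout, TimeUnit unit) {
        closing.set(true); // (a) block new requests
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        synchronized (monitor) {
            try {
                // (b) wait for existing requests to complete, or time out
                while (inFlight.get() > 0) {
                    long remainingMillis =
                            TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                    if (remainingMillis <= 0) {
                        break;
                    }
                    monitor.wait(remainingMillis);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        boolean drained = inFlight.get() == 0;
        shutDown = true; // (c) stand-in for shutdownPooling()
        return drained;
    }

    boolean isShutDown() {
        return shutDown;
    }
}
```

A caller would invoke `drainAndShutdown` only after the host has been removed from the map, so that no new lookups can reach the pool while it drains.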
Looks like once we've called borrowObject (in CassandraClientPoolingContainer.runWithGoodResource), we should be good?
I'm not averse to adding something like this if it's proven to be necessary (and I can follow up this PR with logging to help us to determine this), but for now I think in-flight requests should continue to work.
That's pretty nice, actually!
LGTM! Just one nit, but otherwise nice! Also, your PR title has a typo
+ CassandraClientPoolingContainer containerToRemove = currentPools.get(removedServerAddress);
+ currentPools.remove(removedServerAddress);
remove returns the value removed, so we shouldn't need the get as well.
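As a sketch of that suggestion: `java.util.Map.remove(key)` returns the value previously mapped to the key (or `null` if there was none), so the lookup and the removal can be one call. `PoolingContainer` below is a hypothetical stand-in for `CassandraClientPoolingContainer`, and the `String` key type is an assumption made for brevity.

```java
import java.util.Map;

final class RemoveThenShutdown {
    /** Hypothetical stand-in for CassandraClientPoolingContainer. */
    static final class PoolingContainer {
        private volatile boolean open = true;

        void shutdownPooling() {
            open = false;
        }

        boolean isOpen() {
            return open;
        }
    }

    /**
     * Removes the host from the map and then shuts its pool down.
     * Returns the removed container, or null if the host was not present.
     */
    static PoolingContainer removePool(
            Map<String, PoolingContainer> currentPools, String removedServerAddress) {
        // remove() returns the old value, so no separate get() is needed.
        PoolingContainer containerToRemove = currentPools.remove(removedServerAddress);
        if (containerToRemove != null) {
            containerToRemove.shutdownPooling();
        }
        return containerToRemove;
    }
}
```

With a concurrent map, the single `remove` call also avoids the small gap between a separate `get` and `remove` in which another thread could change the mapping.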
Released 0.530.0
Goals (and why):
In PDS-239577, we had a report of various failures, and the 'pool not open' error was a common theme. We know of a race condition where, on removing a host from the client pool, we shut down the pooling container before removing the host from the map. This means that there exists a short time when hosts can be present in the pool, but closed (and thus unable to serve requests).
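To make the window concrete, here is a small single-threaded sketch that checks what an observer would see between the two steps under each ordering. All names (`RemovalOrderingDemo`, `Pool`, `mapHoldsClosedPool`) are hypothetical, the `String` key is an assumption, and the host is assumed to be present in the map.

```java
import java.util.Map;

final class RemovalOrderingDemo {
    /** Hypothetical stand-in for a per-host pooling container. */
    static final class Pool {
        private volatile boolean open = true;

        void shutdownPooling() {
            open = false;
        }

        boolean isOpen() {
            return open;
        }
    }

    /** True if a lookup at this instant could hand out a closed pool. */
    static boolean mapHoldsClosedPool(Map<String, Pool> pools) {
        return pools.values().stream().anyMatch(p -> !p.isOpen());
    }

    /** Old ordering: shut down, then remove. Returns the mid-step observation. */
    static boolean shutdownThenRemove(Map<String, Pool> pools, String host) {
        pools.get(host).shutdownPooling();
        boolean observed = mapHoldsClosedPool(pools); // closed pool is still mapped
        pools.remove(host);
        return observed;
    }

    /** New ordering: remove, then shut down. Returns the mid-step observation. */
    static boolean removeThenShutdown(Map<String, Pool> pools, String host) {
        Pool removed = pools.remove(host);
        boolean observed = mapHoldsClosedPool(pools); // pool is already unmapped
        removed.shutdownPooling();
        return observed;
    }
}
```

Under the old ordering the mid-step check finds a closed pool still in the map; under the new ordering it does not, because the pool is unreachable via the map before it is closed.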
==COMMIT_MSG==
Fixed a race condition where we would potentially attempt to use a Cassandra client pool that we had already closed.
==COMMIT_MSG==
Implementation Description (bullets):
Testing (What was existing testing like? What have you done to improve it?): Tricky to test for this; we would expect to see a reduction in failure frequency once this has rolled out.
Concerns (what feedback would you like?): Is there more we can do to avoid thrashing of client pools?
Where should we start reviewing?: +14/-5
Priority (whenever / two weeks / yesterday): yesterday