This repository has been archived by the owner on Nov 14, 2024. It is now read-only.

Fix race condition causing 'pool not open' errors #5882

Merged: 4 commits, Jan 27, 2022

Conversation

gsheasby
Contributor

Goals (and why):
In PDS-239577, we had a report of various failures, and the 'pool not open' error was a common theme. We know of a race condition where, on removing a host from the client pool, we shut down the pooling container before removing the host from the map. This means that there exists a short time when hosts can be present in the pool, but closed (and thus unable to serve requests).
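To make the window concrete, here is a minimal, deterministic sketch of the two orderings. `ToyContainer`, `ToyPoolMap`, and the `betweenSteps` hook are hypothetical stand-ins written for this illustration, not the real AtlasDB classes; the hook simulates another thread being scheduled between the two steps.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a pooling container; not the real class.
final class ToyContainer {
    private volatile boolean open = true;

    void shutdownPooling() {
        open = false;
    }

    String run(String request) {
        if (!open) {
            throw new IllegalStateException("Pool not open");
        }
        return "ok: " + request;
    }
}

// Hypothetical stand-in for the host-to-container map held by the client pool.
final class ToyPoolMap {
    private final Map<String, ToyContainer> currentPools = new ConcurrentHashMap<>();

    void addHost(String host) {
        currentPools.put(host, new ToyContainer());
    }

    // Old ordering: the container is closed while still visible in the map.
    void removeHostShutdownFirst(String host, Runnable betweenSteps) {
        ToyContainer container = currentPools.get(host);
        if (container != null) {
            container.shutdownPooling();
        }
        betweenSteps.run(); // simulates another thread running in the gap
        currentPools.remove(host);
    }

    // New ordering: the host disappears from the map before its container is closed.
    void removeHostRemoveFirst(String host, Runnable betweenSteps) {
        ToyContainer containerToRemove = currentPools.get(host);
        currentPools.remove(host);
        betweenSteps.run(); // simulates another thread running in the gap
        if (containerToRemove != null) {
            containerToRemove.shutdownPooling();
        }
    }

    // Empty when the host is no longer in the map; throws if its container is closed.
    Optional<String> runOnHost(String host, String request) {
        ToyContainer container = currentPools.get(host);
        return container == null ? Optional.empty() : Optional.of(container.run(request));
    }
}
```

With the old ordering, a caller that runs in the gap looks up the host, finds a closed container, and fails; with the new ordering the same caller simply does not find the host.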

==COMMIT_MSG==
Fixed a race condition where we would potentially attempt to use a Cassandra client pool that we had already closed.
==COMMIT_MSG==

Implementation Description (bullets):

  • Fix race condition
  • Drive-by improvement to logging

Testing (What was existing testing like? What have you done to improve it?): Tricky to test for this; we would expect to see a reduction in failure frequency once this has rolled out.

Concerns (what feedback would you like?): Is there more we can do to avoid thrashing of client pools?

Where should we start reviewing?: +14/-5

Priority (whenever / two weeks / yesterday): yesterday

Need to remove first, otherwise we may attempt to use this host while the pool is already closed. (Race condition)

Comment on lines 402 to 405
+ CassandraClientPoolingContainer containerToRemove = currentPools.get(removedServerAddress);
+ currentPools.remove(removedServerAddress);
  try {
-     currentPools.get(removedServerAddress).shutdownPooling();
+     containerToRemove.shutdownPooling();
Contributor

technically doesn't fully solve the issue: what if there are in-flight requests on the pool-to-be-closed?

Contributor Author

hmm, yes - ideally we'd have something that will:
a) block new requests from being started
b) wait for existing requests to complete (or time out)
c) clean up by shutting down pooling.
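A minimal sketch of steps (a) to (c), assuming a single monitor guards an in-flight counter. `DrainingContainer` and its methods are hypothetical names invented for this illustration and are not part of the PR; the `shutDown` flag stands in for the real `shutdownPooling()` call.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical wrapper illustrating block / drain / shut down; not AtlasDB code.
final class DrainingContainer {
    private final Object lock = new Object();
    private boolean acceptingRequests = true;
    private int inFlight = 0;
    private boolean shutDown = false;

    <T> T run(Supplier<T> request) {
        synchronized (lock) {
            if (!acceptingRequests) {
                throw new IllegalStateException("container is draining");
            }
            inFlight++;
        }
        try {
            return request.get();
        } finally {
            synchronized (lock) {
                inFlight--;
                lock.notifyAll(); // wake a waiting drainAndShutdown call
            }
        }
    }

    // Returns true if every in-flight request finished before the timeout.
    boolean drainAndShutdown(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        synchronized (lock) {
            acceptingRequests = false; // (a) block new requests
            while (inFlight > 0) { // (b) wait for existing requests, or time out
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    break;
                }
                try {
                    TimeUnit.NANOSECONDS.timedWait(lock, remainingNanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            boolean drained = inFlight == 0;
            shutDown = true; // (c) stand-in for shutting down pooling
            return drained;
        }
    }

    boolean isShutDown() {
        synchronized (lock) {
            return shutDown;
        }
    }
}
```

In this sketch the container still shuts down when the timeout elapses; whether to force shutdown or keep waiting is a design choice the thread does not settle.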

Contributor Author

Looks like once we've called borrowObject (in CassandraClientPoolingContainer.runWithGoodResource), we should be good?

I'm not averse to adding something like this if it's proven to be necessary (and I can follow up this PR with logging to help us to determine this), but for now I think in-flight requests should continue to work.

Contributor

That's pretty nice, actually!

Contributor

Jolyon-S left a comment

LGTM! Just one nit, but otherwise nice! Also, your PR title has a typo


Comment on lines 402 to 403
CassandraClientPoolingContainer containerToRemove = currentPools.get(removedServerAddress);
currentPools.remove(removedServerAddress);
Contributor

remove returns the value removed, so we shouldn't need the get as well.
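A small sketch of that suggestion: `Map.remove(key)` returns the previous value (or `null` if the key was absent), so the lookup and the removal can be a single call. The `PoolRemoval.detach` helper is a hypothetical name used only for this illustration.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; shows remove() doing the work of get() plus remove().
final class PoolRemoval {
    private PoolRemoval() {}

    // Detaches the value for the key in one map operation.
    // Empty if the key was not present (for example, already removed by another caller).
    static <K, V> Optional<V> detach(Map<K, V> currentPools, K key) {
        return Optional.ofNullable(currentPools.remove(key));
    }

    public static void main(String[] args) {
        Map<String, String> currentPools = new ConcurrentHashMap<>();
        currentPools.put("host-a", "container-a");

        // The first caller receives the container and is responsible for shutting it down.
        detach(currentPools, "host-a").ifPresent(c -> System.out.println("shutting down " + c));

        // A second caller gets nothing back, so it has nothing to shut down.
        System.out.println(detach(currentPools, "host-a").isPresent());
    }
}
```

On a `ConcurrentHashMap`, a single `remove` also means at most one caller gets the container back, whereas separate `get` and `remove` calls could hand the same container to two concurrent callers.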

gsheasby changed the title from "Fix race condtion causing 'pool not open' errors" to "Fix race condition causing 'pool not open' errors" on Jan 27, 2022
bulldozer-bot merged commit 4ebeb74 into develop on Jan 27, 2022
bulldozer-bot deleted the fix/pool-race branch on January 27, 2022 17:20
@svc-autorelease
Collaborator

Released 0.530.0
