Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][meta] Fixed race condition between ResourceLock update and invalidation #19817

Merged
merged 2 commits into from
Mar 16, 2023

Conversation

merlimat
Copy link
Contributor

Motivation

The test LockManagerTest.updateValueWhenKeyDisappears() can having infrequent failures, especially when running with Etcd backend.

The problem is really in the logic of the ResourceLock itself. After a normal lock-acquire operation, the test is triggering 2 events at the same time:

  1. A deletion of the lock key-value (eg: the ephemeral node might have disappeared)
  2. A lock.update() operation to change the value of the lock.

The race condition is due to the fact that operation (1) will trigger a lock revalidation in background and this now happens at the same time as operation (2).

The result is that it's possible for the ResourceLock to enter into a loop of revalidations and deletions.

eg:

2023-03-14T20:40:40,350 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,376 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,392 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,392 - INFO  - [-8-1:ResourceLockImpl@268] - Successfully re-acquired missing lock at /key-9166073612375/1
2023-03-14T20:40:40,392 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,402 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,407 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,407 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,417 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,422 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,422 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,422 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,433 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,437 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,437 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,437 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,447 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,452 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,452 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,452 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,467 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,472 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,472 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,472 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,482 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
2023-03-14T20:40:40,486 - INFO  - [-8-1:ResourceLockImpl@180] - Acquired resource lock on /key-9166073612375/1
2023-03-14T20:40:40,486 - INFO  - [-8-1:ResourceLockImpl@325] - Successfully re-acquired lock at /key-9166073612375/1
2023-03-14T20:40:40,487 - INFO  - [-8-1:ResourceLockImpl@203] - Successfully revalidated the lock on /key-9166073612375/1
2023-03-14T20:40:40,497 - INFO  - [-8-1:ResourceLockImpl@201] - Lock on resource /key-9166073612375/1 was invalidated
....

Modifications

Similarly as to what we're already doing to prevent concurrent revalidate operations, we need to also wait for any existing operations to complete before attempting to do the update.

The same is true if there is an update() operation in progress: the next revalidate should wait.

We then use a single future to signal pending operation and to piggy back and try when it's the right time.

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@merlimat merlimat added type/bug The PR fixed a bug or issue reported a bug release/2.11.1 ready-to-test labels Mar 15, 2023
@merlimat merlimat added this to the 3.0.0 milestone Mar 15, 2023
@merlimat merlimat self-assigned this Mar 15, 2023
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Mar 15, 2023
@merlimat merlimat changed the title [fix][metadata-store] Fixed race condition between ResourceLock update and invalidation [fix][meta] Fixed race condition between ResourceLock update and invalidation Mar 15, 2023
@Technoboy-
Copy link
Contributor

/pulsarbot run-failure-checks

Copy link
Member

@mattisonchao mattisonchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, and left some comments. :)

pendingOperationFuture = pendingOperationFuture.thenCompose(v -> {
synchronized (ResourceLockImpl.this) {
if (state != State.Valid) {
return CompletableFuture.failedFuture(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have returned a new exception. Do we need to change the relevant exception catch?

return acquire(newValue);
// If there is an operation in progress, we're going to let it complete before attempting to
// update the value
pendingOperationFuture = pendingOperationFuture.thenCompose(v -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pendingOperationFuture may throw exceptions LockBusyException, BadVersionException or other connectivity issues and return errors directly. Do we need to check that the relevant usage logic has handled this exception?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, I think we're missing to reset it once it gets completed once.

@mattisonchao
Copy link
Member

mattisonchao commented Mar 15, 2023

LockManagerTest.updateValueWhenKeyDisappears this test to be unstable. Maybe it was affected by this PR.

image

@merlimat merlimat merged commit 9adec1b into apache:master Mar 16, 2023
@merlimat merlimat deleted the fix-lock-update-race branch March 16, 2023 01:35
revalidateFuture.whenComplete((unused, throwable) -> {
doRevalidate(newValue).thenRun(() -> newFuture.complete(null))
trackFuture = new CompletableFuture<>();
trackFuture.whenComplete((unused, throwable) -> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is wrong... the track future will always be incomplete. :(

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#19844 is trying to perfect it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cherry-picked/branch-2.11 doc-not-needed Your PR changes do not impact docs ready-to-test release/2.11.1 type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants