Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixing Deadlock Bug Between ZK Callback & Event Thread On Acquiring Coordinator Object #964

Conversation

shrinandthakkar
Copy link
Collaborator

@shrinandthakkar shrinandthakkar commented Oct 25, 2023

Summary

Bug In Detail

  • In a rare situation, the runnable of the Coordinator event thread (aka event thread) and the callback thread of the ZK session expiry could become deadlocked when the callback thread attempts to interrupt the event thread by acquiring the lock of the Coordinator object. The event thread on the other side cannot be interrupted because it is blocked on acquiring the Coordinator object's lock to handle an event before breaking out of the runnable loop.

  • The code snippets shown below are pseudocode examples of the actual implementation, used to explain the deadlock scenario. Essentially, if the callback thread of the ZK client enters the stopEventThread() method (and acquires the Coordinator object's lock) at the same time that the event thread is about to enter the handleEvent() method, a deadlock scenario will occur.

    • The Coordinator event thread remains blocked until the lock of the Coordinator's object is released. code link
    • The Session expiry callback thread will not release the Coordinator object's lock until the event thread is interrupted. code link
// Coordinator event thread's runnable

private boolean run() {
  while (isNotIterrupted()) {
    handleEvent(); // this method needs to acquire lock on the Coordinator object.
  }
}
// ZK client's callback thread on session expiry
// "synchronized" method of the Coordinator class.

private synchronized boolean stopEventThread() {
  while (eventThread.isAlive()) {
    eventThread.interrupt(); // continuously attempts to interrupt the event thread.
  }
}

BugFix In Detail

  • An intrinsic Conditional Variable is used to halt threads (such as ZK callback threads and the main server thread) to release the lock on the Coordinator object and before attempting to re-acquire the Coordinator object.
  • In the stopEventThread() method, the calling thread (such as zk callback threads and the main server thread) waits for a maximum of the _heartbeat period to acquire the Coordinator object. This time-bound waiting prevents the calling thread from waiting indefinitely if the event thread has already been shut down.
  • Similarly, in the waitForEventThreadToJoin() method, the calling thread (in this case, the main server thread) must wait for a maximum of the _heartbeat period.
  • This conditional variable is not used to explicitly halt the Coordinator event thread.
  • The Coordinator event thread will use this intrinsic conditional variable to notify other threads that may be waiting to acquire access to the Coordinator object after it releases the lock on the object and before it loops and attempts to reacquire it.

Pseudocode examples of the proposed implementation.

// Coordinator event thread's runnable

private boolean run() {
  while (isNotIterrupted()) {
    handleEvent(); // this method needs to acquire lock on the Coordinator object.
    notifyThreadsWaitingForCoordinatorObjectSynchronization();
  }
}
// ZK client's callback thread on session expiry

private synchronized boolean stopEventThread() {
  while (eventThread.isAlive()) {
    wait(heartbeatPeriod); // released the lock on coordinator object for heartbeat period and re-acquires it. 
    eventThread.interrupt(); // continuously attempts to interrupt the event thread.
  }
}

Tests

  1. Testing the scenario when the session expiry callback thread is attempting to acquire the Coordinator object's lock before the event thread calls for handling an event.
  2. Testing the scenario when the session expiry callback thread is attempting to acquire the Coordinator object's lock after the event thread calls for handling an event.

Important: DO NOT REPORT SECURITY ISSUES DIRECTLY ON GITHUB.
For reporting security issues and contributing security fixes,
please, email [email protected] instead, as described in
the contribution guidelines.

Please, take a minute to review the contribution guidelines at:
https://github.com/linkedin/Brooklin/blob/master/CONTRIBUTING.md

@@ -332,12 +336,19 @@ private synchronized void startEventThread() {
}
}

private synchronized boolean stopEventThread() {
private boolean stopEventThread() {
// interrupt the thread if it's not gracefully shutdown
while (_eventThread.isAlive()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it ok to not synchronize the thread state check here ? What is it that we are actually synchronizing in this method ?

Copy link
Collaborator

@atoomula atoomula Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the synchronized block should be around while loop with cv. Rest sounds good. I think it is ok to not have waitForNotificationFromEventThread() method and have the code inline within this code and at the other place.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean something like this ??

private boolean stopEventThread() {
  synchronized (_conditionalVariableForCoordinatorObjectSynchronization) {
    _conditionalVariableForCoordinatorObjectSynchronization.wait(duration.toMillis());
    while (_eventThread.isAlive()) {
      try {
        synchronized (this) {
          _eventThread.interrupt();
          _eventThread.join(EVENT_THREAD_SHORT_JOIN_TIMEOUT);
        }
      } catch (InterruptedException e) {
        return true;
      }
    }
  }
  return false;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define a lock explicitly (comment above), we can only use one lock for cv and also the object. You will have to replace all synchronize constructs on a method/object with this lock. That would be much cleaner, IMO ? Wdyt ?

Copy link
Collaborator Author

@shrinandthakkar shrinandthakkar Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the synchronized block should be around while loop with cv.

We cannot have the conditional variable's synchronized block around the while loop. The conditional variable block has to release the lock after calling the wait method so that the notify method can acquire the lock on the conditional variable block.

Otherwise it'll be looping around constantly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define a lock explicitly (comment above), we can only use one lock for cv and also the object. You will have to replace all synchronize constructs on a method/object with this lock. That would be much cleaner, IMO ? Wdyt ?

We can define another lock explicitly I guess (instead of sync blocks), but we cannot combine both the locks (conditional variable lock and the coordinator object's lock).

With the conditional variable we are making sure of the following:

  1. The Coordinator Event thread never gets halted on conditional variable.
  2. The "ZK Session Expiry callback thread" or "the main server thread" (trying to stop the coordinator) gets halted on the conditional variable. It waits for about the heartbeat period for "the coordinator event thread" to finish the event handling and then "the coordinator event thread" itself would notify these waiting threads to acquire the coordinator object's lock, before "the coordinator event thread" enter's handleEvent again and acquires the lock again.

I have added two different test scenarios with detailed comments, it might explain it more clearly.

Copy link
Collaborator Author

@shrinandthakkar shrinandthakkar Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked up and there was an intrinsic conditional variable of the class which we could use along with keeping the method definition synchronized. While waiting on that intrinsic CV, the lock will be released and that is how we can let the other thread to be prioritized.

With this change, it made the change very cleaner. Thank you for the insights @atoomula and @hshukla !

@@ -236,6 +236,10 @@ public class Coordinator implements ZkAdapter.ZkAdapterListener, MetricsAware {
private final Lock _throughputViolatingTopicsMapWriteLock = _throughputViolatingTopicsMapReadWriteLock.writeLock();
private final Lock _throughputViolatingTopicsMapReadLock = _throughputViolatingTopicsMapReadWriteLock.readLock();

// This CV helps to halt threads (zk callback threads, main server thread) before attempting to acquire the Coordinator
// object. We never halt the event thread (coordinator thread) explicitly via this CV.
private final Object _conditionalVariableForCoordinatorObjectSynchronization = new Object();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think another way to define condition variable in Java is to explicitly define a lock and create a CV out of it. Instead of using synchronized blocks on the object around the methods, we can use this lock explicitly. https://www.iitk.ac.in/esc101/05Aug/tutorial/essential/threads/explicitlocks.html#:~:text=To%20wait%20on%20an%20explicit,that%20the%20condition%20has%20occurred.

@@ -332,12 +336,19 @@ private synchronized void startEventThread() {
}
}

private synchronized boolean stopEventThread() {
private boolean stopEventThread() {
// interrupt the thread if it's not gracefully shutdown
while (_eventThread.isAlive()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define a lock explicitly (comment above), we can only use one lock for cv and also the object. You will have to replace all synchronize constructs on a method/object with this lock. That would be much cleaner, IMO ? Wdyt ?

@shrinandthakkar shrinandthakkar marked this pull request as ready for review October 26, 2023 19:23
@shrinandthakkar shrinandthakkar force-pushed the fixZKAndCoordinatorThreadDeadlock branch from 8dde23e to d9a457a Compare October 26, 2023 22:06
}
}

private synchronized boolean stopEventThread() {
// interrupt the thread if it's not gracefully shutdown
while (_eventThread.isAlive()) {
CoordinatorEventProcessor eventThread = getEventThread();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what are we not pointing at _eventThread instead of calling getEventThread() ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an API makes it much better to test it with overrides instead of mocking the var, hence I changed the references to use the APIs to fetch the variable value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

atoomula
atoomula previously approved these changes Oct 27, 2023
Copy link
Collaborator

@atoomula atoomula left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Minor comments. Pls feel free to ignore.

}
}

private synchronized boolean stopEventThread() {
// interrupt the thread if it's not gracefully shutdown
while (_eventThread.isAlive()) {
CoordinatorEventProcessor eventThread = getEventThread();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice!

// Waits to acquire the Coordinator object for a maximum of _heartbeat period.
// The time bound waiting prevents the caller thread to not infinitely wait if
// the event thread is already shutdown.
waitForNotificationFromEventThread(_heartbeatPeriod);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we wait for (_heartbeatPeriod - some delta) ? This comment is valid only of this thread is responsible for punching heartbeats.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the coordinator thread is responsible for punching heartbeats. And we are never waiting on the coordinator thread. We are only waiting on any ZK callback threads or the main datastream server thread.
So, this should not be a problem.

} catch (InterruptedException e) {
_log.warn("Exception caught while waiting the event thread to stop", e);
return true;
}
return false;
}

// Waits for a notification for specified duration from the event thread before acquiring the Coordinator object.
private void waitForNotificationFromEventThread(Duration duration) {
try {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there anyway to validate if the object lock is held when we enter this method (as we are calling wait here) ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validate that the wait releases the lock on the "this" object? Since its an intrinsic property of the wait function to release the lock on the object from where it is called, not sure if we need validation in code.

In the tests, it should be validating that. I have tried reproducing the deadlock scenario in one of the tests and ensuring no deadlock would mean that the object lock is released when the zk session expiry thread is waiting.

// be waiting on acquiring access on the Coordinator object.
// We are only calling notify on the synchronized Coordinator Object's ("this") waiting threads.
// Suppressing the Naked_Notify warning on this.
protected synchronized void notifyThreadsWaitingForCoordinatorObjectSynchronization() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we need this helper method?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to use this from couple of tests and hence decided to keep it in a generic function which can be reused.

@@ -29,4 +29,11 @@
<Class name="com.linkedin.datastream.common.zk.ZkClient" />
<Bug pattern="NM_SAME_SIMPLE_NAME_AS_SUPERCLASS" />
</Match>

<!-- Suppress warning about naked notify on the synchronized coordinator object -->
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you are mentioning this.notifyAll() do we still need to suppress this check?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, even with "this.notifyAll()", the warning does show up. It does not recognize the "this" object as a monitor object somehow. 🤷‍♂️

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it safe to not use this. ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The warning says,

A call to notify() or notifyAll() was made without any (apparent) accompanying modification to mutable object state. In general, calling a notify method on a monitor is done because some condition another thread is waiting for has become true. However, for the condition to be meaningful, it must involve a heap object that is visible to both threads.
This bug does not necessarily indicate an error, since the change to mutable object state may have taken place in a method which then called the method containing the notification.

  • The warning indicates that since there are no distinct heap objects and no specific met conditions, it is being thrown out as a warning.
  • We are using wait and notify without any specific conditions. We simply wait for low-priority threads with the coordinator object's monitor lock and notify on the same monitor lock.
  • Given this, since there are no conditions in our case, we may call notify multiple times from the coordinator thread even when no threads are waiting on the coordinator's object monitor. However, if a thread calls notify() when no threads are waiting, it is effectively a no-op and does not result in any changes to the program's state.

Hence we should not have any problem here with this warning.

hshukla
hshukla previously approved these changes Oct 27, 2023
Copy link
Collaborator

@hshukla hshukla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good Work!

@shrinandthakkar shrinandthakkar force-pushed the fixZKAndCoordinatorThreadDeadlock branch from a76ee6c to 87938c2 Compare October 28, 2023 01:01
@shrinandthakkar shrinandthakkar merged commit 55d2b08 into linkedin:master Oct 30, 2023
1 check passed
@1arrow
Copy link

1arrow commented Oct 30, 2023

Any timeline for releasing a new version with this fix?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants