Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] zk EventThread get block after zk leader stop #18269

Closed
2 tasks done
Shawyeok opened this issue Oct 31, 2022 · 5 comments
Closed
2 tasks done

[Bug] zk EventThread get block after zk leader stop #18269

Shawyeok opened this issue Oct 31, 2022 · 5 comments
Labels
type/bug The PR fixed a bug or issue reported a bug

Comments

@Shawyeok
Copy link
Contributor

Search before asking

  • I searched in the issues and found nothing similar.

Version

# pulsar version: 2.8.1

# cat /etc/redhat-release
CentOS Linux release 7.9.2009 (Core)
# java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)
# uname  -r
3.10.0-123.el7.x86_64

Minimal reproduce step

Setup a pulsar cluster and zk cluster, could get a chance to reproduce (not always) after stop zk leader

What did you expect to see?

zk client should reconnect another zk node properly,

What did you see instead?

main-EventThread is blocking on org.apache.pulsar.metadata.impl.ZKSessionWatcher#process method.

"metadata-store-zk-session-watcher-7-1" #16 prio=5 os_prio=0 tid=0x00007fd7ca112000 nid=0x6991 waiting on condition [0x00007fd742afa000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000007953debd8> (a java.util.concurrent.CompletableFuture$Signaller)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1695)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1775)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher.checkConnectionStatus(ZKSessionWatcher.java:104)
        - locked <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher$$Lambda$38/909132503.run(Unknown Source)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

"main-EventThread" #15 daemon prio=5 os_prio=0 tid=0x00007fd7ca108800 nid=0x6990 waiting for monitor entry [0x00007fd768380000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.pulsar.metadata.impl.ZKSessionWatcher.process(ZKSessionWatcher.java:120)
        - waiting to lock <0x00000006c66130e0> (a org.apache.pulsar.metadata.impl.ZKSessionWatcher)
        at org.apache.pulsar.metadata.impl.ZKMetadataStore.lambda$new$0(ZKMetadataStore.java:75)
        at org.apache.pulsar.metadata.impl.ZKMetadataStore$$Lambda$35/1305486145.process(Unknown Source)
        at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.notifyEvent(ZooKeeperWatcherBase.java:180)
        at org.apache.bookkeeper.zookeeper.ZooKeeperWatcherBase.process(ZooKeeperWatcherBase.java:146)
        at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:588)
        at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:563)

main-EventThread log:
image

metadata-store-zk-session-watcher-7-1 log:
image

org.apache.pulsar.metadata.impl.ZKSessionWatcher#currentStatus is SessionLost
image

zk EventThread internal task queue waitingEvents has many events to process
image

The keeperState of current event that zk EventThread processing is SyncConnected
image

It seems that zk exists operation will always timeout cause zk EventThread is blocking and metadata-store-zk-session-watcher thread can always acquire lock (I don't understand why, a jvm bug?).

this.task =
scheduler.scheduleAtFixedRate(catchingAndLoggingThrowables(this::checkConnectionStatus), tickTimeMillis,
tickTimeMillis,
TimeUnit.MILLISECONDS);

zk.exists("/", false, (StatCallback) (rc, path, ctx, stat) -> {
switch (KeeperException.Code.get(rc)) {
case CONNECTIONLOSS:
future.complete(Watcher.Event.KeeperState.Disconnected);
break;
case SESSIONEXPIRED:
future.complete(Watcher.Event.KeeperState.Expired);
break;
case OK:
default:
future.complete(Watcher.Event.KeeperState.SyncConnected);
}
}, null);
Watcher.Event.KeeperState zkClientState;
try {
zkClientState = future.get(tickTimeMillis, TimeUnit.MILLISECONDS);
} catch (TimeoutException e) {
// Consider zk disconnection if zk operation takes more than TICK_TIME
zkClientState = Watcher.Event.KeeperState.Disconnected;
}

Anything else?

No response

Are you willing to submit a PR?

  • I'm willing to submit a PR!
@Shawyeok Shawyeok added the type/bug The PR fixed a bug or issue reported a bug label Oct 31, 2022
@Shawyeok
Copy link
Contributor Author

Similar with #16712

@codelipenghui
Copy link
Contributor

@Shawyeok The issue should be fixed by #17909

@Shawyeok
Copy link
Contributor Author

Update:

metadata-store-zk-session-watcher thread can always acquire lock (I don't understand why, a jvm bug?)

synchronized lock is unfair, a thread has big chance acquire lock if it just release it, see example below:

class Scratch {

    private static final CountDownLatch latch = new CountDownLatch(1);

    public static void main(String[] args) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        Scratch s = new Scratch();
        scheduler.scheduleAtFixedRate(s::checkState, 2, 2, TimeUnit.SECONDS);

        latch.await();
        while (true) {
            s.process();
        }
    }

    private void process() {
        checkState();
    }

    private synchronized void checkState() {
        System.out.println(Thread.currentThread().getName() + " checking state");
        if (latch.getCount() > 0) {
            latch.countDown();
        }
        CompletableFuture<Void> future = new CompletableFuture<>();
        try {
            future.get(2, TimeUnit.SECONDS);
        } catch (TimeoutException ignore) {

        } catch (Exception e) {
            e.printStackTrace();
        }
        System.out.println(Thread.currentThread().getName() + " finished");
    }
}

Output:

pool-1-thread-1 checking state
pool-1-thread-1 finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
main checking state
main finished
...

@tisonkun
Copy link
Member

tisonkun commented Oct 31, 2022

I think this bug should be resolved at #17909 also.

It seems that the reported version is 2.8.1, while the fix is picked to only after 2.9.x. @codelipenghui do we have a released version including the fix that @Shawyeok can try to use?

@Shawyeok
Copy link
Contributor Author

I think this bug should be resolved at #17909 also.

It seems that the reported version is 2.8.1, while the fix is picked to only after 2.9.x. @codelipenghui do we have a released version including the fix that @Shawyeok can try to use?

I've talked with @codelipenghui, I'll pick that PR into our release branch in my org.

Thanks again for point this out @codelipenghui @tisonkun, I'm going close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug The PR fixed a bug or issue reported a bug
Projects
None yet
Development

No branches or pull requests

3 participants