
ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers #1925

Closed
jonmv wants to merge 16 commits from jonmv/ZOOKEEPER-4541

Conversation

@jonmv (Contributor) commented Sep 23, 2022

This fixes two bugs in the shutdown logic of the ZooKeeper server.

  1. The SendAckRequestProcessor may die when attempting to close its Learner owner's socket (to signal that something went wrong) if the learner has already closed the socket because the same thing went wrong (namely, the leader disconnecting). This is fixed by simply checking for null.
  2. ZooKeeperServer.shutdown(boolean) is not overridden in the child classes, so many call sites here fail to properly shut down child resources, such as the SyncRequestProcessor. This is fixed by refactoring shutdown in the child classes.

A unit test is also added that fails when either of the two fixes is absent.
To be precise, once the first fix is applied, it fails only because the SyncRequestProcessor is never shut down (a thread leak); I didn't spend more time looking for other odd failures that may arise from what is obviously a bug anyway.

See ZOOKEEPER-4541 for full details.

@eolivelli changed the title from "Jonmv/zookeeper 4541" to "ZOOKEEPER-4541 Ephemeral znode owned by closed session visible in 1 of 3 servers" on Sep 23, 2022
@jonmv (Contributor, Author) commented Sep 23, 2022

@hanm I believe you reviewed a related PR earlier, so perhaps you're the right reviewer here as well?

@eolivelli (Contributor) left a comment:

LGTM

Great work. Have you tried this fix in a test environment?

@eolivelli (Contributor) commented:

I don't know why I cannot add @hanm as a reviewer; I hope he will receive the notification anyway.

@@ -64,7 +66,8 @@ public void flush() throws IOException {
     } catch (IOException e) {
         LOG.warn("Closing connection to leader, exception during packet send", e);
         try {
-            if (!learner.sock.isClosed()) {
+            Socket socket = learner.sock;
+            if ( socket != null && ! learner.sock.isClosed()) {
A reviewer (Contributor) commented:

Should probably use socket in the second condition too, in case it changes after the first check?

@jonmv (Contributor, Author) replied:
Thanks, that was of course the intention :) Fixed!
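For reference, a minimal sketch of the corrected check (illustrative, not the exact committed change): read the field once and use the local for the null check, the isClosed() check, and the close.

// Hedged sketch: take one snapshot of learner.sock so all three operations
// act on the same reference, even if the field is nulled out concurrently.
Socket socket = learner.sock;
if (socket != null && !socket.isClosed()) {
    socket.close();
}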

A commenter asked:

Hi jonmv, I have read ZOOKEEPER-4541 and I am confused: ZK1 does not send an ACK to the leader, yet ZK1 receives a COMMIT from the leader. That does not seem to conform to the ZAB protocol. Please help me figure this out, thanks.

@jonmv (Contributor, Author) commented Sep 23, 2022

> LGTM
>
> Great work. Have you tried this fix in a test environment?

Not yet, but we may do that before we merge, if you wish. We will find a way to run a patched 3.8.0, probably early next week.

@jonmv (Contributor, Author) commented Sep 23, 2022

Thanks for the quick reply!

@jonmv (Contributor, Author) commented Sep 30, 2022

We've had this running for almost a week now, without any issues, and the data inconsistencies have not been observed. The sample size isn't large enough to draw conclusions yet, though :)

Anyway, we saw some other digest mismatches, and I started digging around for their cause. I found one problem introduced with this commit, fixed in 8121711.
The problem was that a COMMIT between NEWLEADER (which flushes the packetsNotCommitted) and UPTODATE would crash the learner, which would peek at this queue and expect entries in it. This is fixed by not clearing the packetsNotCommitted on NEWLEADER; instead, the already-written entries are simply skipped when updating the log after UPTODATE.

Working on the above, I also found the fix for reconfig between NEWLEADER and UPTODATE, in this commit, to be incomplete: since the packetsNotCommitted is no longer emptied after NEWLEADER, the head doesn't change, and if there are other PROPOSALs between the NEWLEADER and the PROPOSAL that the COMMITANDACTIVATE is meant for, then reconfig still doesn't happen. The unit test added back then was insufficient to catch this. My fix, in f24eb51, is to simply traverse the packetsNotCommitted, looking for a matching zxid. I can't imagine this being a performance issue.
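A minimal sketch of that traversal, assuming Learner's PacketInFlight entries expose their TxnHeader via a field hdr, and with qp standing in for the incoming COMMITANDACTIVATE packet (both names are assumptions here, not the exact patch):

// Hedged sketch: scan the whole queue for the proposal that the
// COMMITANDACTIVATE targets, instead of peeking only at the head.
PacketInFlight match = null;
for (PacketInFlight pif : packetsNotCommitted) {
    if (pif.hdr.getZxid() == qp.getZxid()) {
        match = pif; // found the proposal this reconfig commit refers to
        break;
    }
}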

@sonatype-lift (bot) commented Sep 30, 2022

⚠️ 52 God Classes were detected by Lift in this project. Visit the Lift web console for more details.

@jonmv (Contributor, Author) commented Sep 30, 2022

The symptom of the bug fixed in the first of these two commits is that the learner crashes with an NPE during sync; when it restarts, it typically writes a duplicated series of transactions to its transaction log, complains about that, and may later observe a digest mismatch when replaying that transaction log from file during startup.

@jeffrey-xiao commented Sep 30, 2022

Ah, I think we raced -- I just minted a PR to resolve ZOOKEEPER-4394 which is the problem you described. I'm happy with either solution, but perhaps you can also take a look at my PR?

@jonmv (Contributor, Author) commented Oct 1, 2022

We raced indeed :) Given that this PR addresses some additional concerns, I'd vote for it to be merged.

@jonmv (Contributor, Author) commented Oct 3, 2022

Actually, your version is better; I'm incorporating it here instead, if you don't mind. I still think it's a good idea to ensure the pending writes are actually flushed before ack'ing the NEWLEADER, both because we should ensure they're on persistent storage before ack'ing, and because that reduces the otherwise random order of ACKs the leader would observe when SyncRequestProcessor.run races with Learner.syncWithLeader.

@jonmv force-pushed the jonmv/ZOOKEEPER-4541 branch 3 times, most recently from abaf8af to 4aab1fd, on October 3, 2022 at 10:35
@jonmv (Contributor, Author) commented Oct 3, 2022

Hmm, no, this isn't quite right either, although none of the tests fail.
I'm not sure why the rendezvous-with-sync-thread approach didn't work; it's probably the right way to do this. It could be just insufficient test setup, of course.

@jonmv (Contributor, Author) commented Oct 3, 2022

Hmm, no, the code has a race as it is now. The LearnerHandler expects the first ACK after starting a DIFF sync to be the NEWLEADER ACK, but if there are lots of PROPOSALs in the diff, before the NEWLEADER, these may also cause an ACK to be sent, which will crash Leader.waitForNewLeaderAck.

Ensuring the transactions are indeed flushed (through the usual request processor pipeline) guarantees these ACKs, and thus always crashes the leader. Meh.

@jonmv force-pushed the jonmv/ZOOKEEPER-4541 branch from d30c2f5 to 5e0c559 on October 3, 2022 at 14:18
@jonmv (Contributor, Author) commented Oct 3, 2022

There ... Not pretty, but it seems the only way to make this right is to make it possible to delay the ACKs that would otherwise be sent once the SyncRequestProcessor flushes, so they arrive after the ACK of the NEWLEADER.
This is implemented by adding special requests that can be enqueued by the sync processor: one to rendezvous with it, from the syncing thread at startup, to ensure TXNs are actually flushed; and two to toggle whether to delay forwarding to the SendAckRequestProcessor, used by the FollowerZooKeeperServer.
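A minimal, self-contained sketch of that gating idea; the class and method names here are illustrative, not the actual patch:

import java.util.ArrayDeque;
import java.util.Queue;
import java.util.function.Consumer;

// Hedged sketch: while delaying, ACK requests are queued instead of
// forwarded; open() flushes the queue in order and resumes forwarding.
class AckGate<R> {
    private final Queue<R> delayed = new ArrayDeque<>();
    private boolean delaying = false;

    synchronized void close() {          // start delaying ACKs during sync
        delaying = true;
    }

    synchronized void open(Consumer<R> send) {
        delaying = false;                // stop delaying and flush, in order
        for (R r; (r = delayed.poll()) != null; ) {
            send.accept(r);
        }
    }

    synchronized boolean offer(R r) {    // true if the ACK was held back
        if (delaying) {
            delayed.add(r);
        }
        return delaying;
    }
}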

@jeffrey-xiao commented:
FWIW, your recent commit looks good to me.

Another possible approach is to bypass the request processor pipeline entirely, like what was done in #1848.

if (zk instanceof FollowerZooKeeperServer) {
    FollowerZooKeeperServer fzk = (FollowerZooKeeperServer) zk;
    fzk.syncProcessor.setDelayForwarding(false);
    fzk.syncProcessor.syncFlush();

A reviewer asked:

What's the reason why we need the second syncFlush?

@jonmv (Contributor, Author) replied:

To ensure consistent ordering of the UPTODATE ACK, vs ACKs from PROPOSALs. The real leader doesn't care, but unit tests may.

@jonmv force-pushed the jonmv/ZOOKEEPER-4541 branch from 961aa17 to 84be071 on October 17, 2022 at 17:05
@jonmv (Contributor, Author) commented Oct 19, 2022

So ... you may be right that we don't need to keep all this auxiliary structure during sync, but I believe we do if we want to precisely preserve today's behaviour (except for what's needed to fix those bugs, obviously).
What complicates the sync is that some transactions aren't logged and ack'ed; and possibly also that state is used while syncing. I can't say whether the first behaviour is needed, or whether the second is actually the case, without a much deeper dive into all of this, but I do see tests failing when I change this behaviour.
Perhaps it is a good idea to first fix these bugs, and then look for ways to simplify?

@jonmv (Contributor, Author) commented Jan 3, 2023

Any further thoughts on this @breed, @eolivelli ?

@jonmv (Contributor, Author) commented Jan 23, 2023

I could add that after patching with these commits (not the last one, which should be purely refactoring), we've had zero issues with inconsistent ZK clusters. This is across several hundred thousand rolling cluster restarts. Previously, we typically had one or two broken clusters each week, and had to intervene manually in each case, i.e., a one-in-a-thousand chance of breaking across a restart.

@jeffrey-xiao commented:
As another data point, we're also running multiple clusters with weekly restarts and have not seen issues with inconsistent ZK clusters with this patch. I am very interested in getting this merged and was wondering what's left to push this PR through?

IMO, this is a pretty high-priority bug to fix because it requires manual intervention to recover from; otherwise, the cluster is in a permanently inconsistent state.

@jeffrey-xiao left a comment:

As a meta note, I wonder if it's worth splitting this PR into multiple PRs because it fixes distinct bugs (ZOOKEEPER-4409, ZOOKEEPER-4502, ZOOKEEPER-4394, ZOOKEEPER-4541). Likely not worth the effort, but perhaps it will get through review more easily that way ;)

@jonmv (Contributor, Author) commented Feb 3, 2023

> As a meta note, I wonder if it's worth splitting this PR into multiple PRs because it fixes distinct bugs (ZOOKEEPER-4409, ZOOKEEPER-4502, ZOOKEEPER-4394, ZOOKEEPER-4541). Likely not worth the effort, but perhaps it will get through review more easily that way ;)

Oof, would probably have been a good idea, but one thing led to another, and now it'd be a lot of work to split this 😬 🙂

@jonmv (Contributor, Author) commented Mar 16, 2023

@eolivelli any thoughts on what to do next here? I think it would be good to conclude this work soon. It's getting a bit stale :)

@fanyang89 (Contributor) left a comment:

Hi, jonmv. I'm very interested in your work and have read it carefully.
Some questions follow; maybe we could discuss them.

@@ -106,6 +106,8 @@ protected void setupRequestProcessors() {
         if (syncRequestProcessorEnabled) {
             syncProcessor = new SyncRequestProcessor(this, null);
             syncProcessor.start();
+        } else {
+            syncProcessor = null;
@fanyang89 (Contributor) commented:

syncProcessor, as an ObserverZooKeeperServer field, has a default value of null anyway.
Does setting null here make a difference?

@jonmv (Contributor, Author) replied:

No, I'm just used to always assigning (to final fields). This can be removed again.

@@ -174,6 +224,21 @@ public void run() {
                     break;
                 }

+                if (si == turnForwardingDelayOn) {
+                    nextProcessor.close();
@fanyang89 (Contributor) commented:

In the SyncRequestProcessor constructor, nextProcessor may be null.
Can this be an NPE in an ObserverZooKeeperServer (with syncRequestProcessorEnabled=true)?

@jonmv (Contributor, Author) replied:

Only followers enqueue these special requests, so that can't happen. Observers don't ack txns, as far as I remember?

+                    continue;
+                }
+                if (si == turnForwardingDelayOff) {
+                    nextProcessor.open();
@fanyang89 (Contributor) commented:

The naming here is confusing.
The intention is: on receiving the turn-forwarding-delay-off request, open the gate, then flush all pending requests to the downstream processor.
Does nextProcessor.open() open the gate, or turn the delay on?

@jonmv (Contributor, Author) replied:

open() opens the gate. What about startDelaying() and flushAndStopDelaying()?

while (createZxid1 != follower.fzk.getLastProcessedZxid() && System.currentTimeMillis() < doom) {
    Thread.sleep(1);
}
assertEquals(createZxid1, follower.fzk.getLastProcessedZxid());
@fanyang89 (Contributor) commented:

Running the unit test shows that this assertion is not always true.
After txn(1, counter=3) is flushed, SyncRequestProcessor can take() txn(1, 4) and add it to the toFlush queue without a flush(); then poll() returns null, and the processor flushes. Txn(1, 4) may or may not be flushed; it depends on the order.
A simple workaround is to enable flushDelay (via zookeeper.flushDelay) so that a flush for txn(1, 4) is not triggered by an incoming null request. Maybe add a barrier?
It's likely to happen on earlier JDK versions (e.g., 1.8, 10, 11); it has yet to occur on JDK 18 in my test environment, but why?
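If it helps, the workaround mentioned above could look like this in test setup; the property name is taken from the comment above, and the value is an illustrative guess:

// Hedged sketch: delay flushes so a stray flush triggered by a null poll()
// doesn't race the assertion (value in milliseconds, chosen arbitrarily).
System.setProperty("zookeeper.flushDelay", "100");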

@jonmv (Contributor, Author) replied:

Ah, thanks for spotting this. The intention was to wait for the second create txn ID, not the first.

The variable being read is volatile, so changes should be visible, given enough (sleep) time for the other thread to do its work.
The construct was copied from a different test in the same class (lines 1101–1107), so if this is still unstable, I'd expect that test to also need an update.

@kezhuw (Member) commented May 4, 2023

> Hmm, that other PR doesn't look right to me. It fails to store the TXNs that aren't already committed before ACKing the NEWLEADER, which was what ZOOKEEPER-3911 was all about in the first place. Agree?

// ZOOKEEPER-3911: make sure sync the uncommitted logs before commit them (ACK NEWLEADER).

Given that, I would say #1445 itself is misleading, or we are misunderstanding the JIRA part of ZOOKEEPER-3911, or both. It should sync only txns in the DIFF sync; all proposals in this phase are considered committed by the new leader. I think ZOOKEEPER-4394 already/almost made the point: NEWLEADER is not appended immediately/atomically after the proposals in a DIFF sync. I guess we were not aware of this at the time of #1445.

These ongoing (committed or not) proposals are simply beyond the discussion; they belong to the broadcast phase (see Zab, Zab1.0). I consider them a gap between the paper and the implementation. Steps in the paper look atomic, but implementations are not.

Back to this PR: the good part is that it has been verified in production. The bad part is that it is giant and mixes several issues and areas.

Personally, I would suggest fixing the above issues separately. There are other possible issues in the synchronization phase:

  • ZOOKEEPER-4643: Committed txns may be improperly truncated if a follower crashes right after updating currentEpoch but before persisting txns to disk.
  • NEWLEADER is not sent immediately after a DIFF sync. This diverges from the paper; ZOOKEEPER-4394 pointed this out.
  • It is not guaranteed/asserted in LearnerHandler that the first ACK targets NEWLEADER. ZOOKEEPER-4685 pointed this out.
  • Unmatched commits cause log.warns in the synchronization phase, while they cause a system exit in FollowerZooKeeperServer.commit.

> IMO, this is a pretty high-priority bug to fix because it requires manual intervention to recover from; otherwise, the cluster is in a permanently inconsistent state.

> I think it would be good to conclude this work soon. It's getting a bit stale :)

Hmm, some of my PRs, #1820 (merged), #1859 (approved) and #1847, have all been open for almost a year. I also opened apache/bookkeeper#3041 (data loss, 1 year), apache/pulsar#7490 (data duplication, 3 years) and google/leveldb#375 (data inconsistency, 3 years until google/leveldb#339 got merged). I guess we should believe in time 😮‍💨.

> Oof, would probably have been a good idea, but one thing led to another, and now it'd be a lot of work to split this 😬 🙂

Maybe we can start with fresh fixes? Anyway, it may not be a pleasant process 😨 😵‍💫. I believe that keeping a PR focused helps it get merged. People might fear a giant PR; at least, I was hesitant to get involved (partly because of this) until today.

Could you please take a look at this and the alternatives #1848 and #1993? @eolivelli @breed @cnauroth @hanm @nkalmar @ztzg @anmolnar @tisonkun @li4wang

@@ -155,11 +155,11 @@ protected void unregisterMetrics() {
     }

     @Override
-    public synchronized void shutdown() {
+    public synchronized void shutdown(boolean fullyShutDown) {

A reviewer commented:

I'm a little worried that the modification here affects the invocation chain.

Before the modification: Leader.shutdown(String) -> LeaderZooKeeperServer.shutdown() -> ZooKeeperServer.shutdown()
After the modification: Leader.shutdown() -> ZooKeeperServer.shutdown()

LeaderZooKeeperServer.shutdown is skipped, and the containerManager does not stop.

@jonmv (Contributor, Author) replied:

ZooKeeperServer.shutdown() only calls shutdown(false), which is implemented in LeaderZooKeeperServer, and which stops the containerManager. shutdown() isn't overridden anywhere anymore.
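A minimal sketch of that delegation pattern as described above; the ContainerManager stub, class names, and method bodies are illustrative, not the actual classes:

// Hedged sketch of the refactoring described here, not the actual patch.
interface ContainerManager {
    void stop();
}

class ZooKeeperServerSketch {
    public synchronized void shutdown() {
        shutdown(false);                 // shutdown() only delegates; it is
    }                                    // no longer overridden by subclasses

    public synchronized void shutdown(boolean fullyShutDown) {
        // base-class cleanup ...
    }
}

class LeaderZooKeeperServerSketch extends ZooKeeperServerSketch {
    private ContainerManager containerManager;

    @Override
    public synchronized void shutdown(boolean fullyShutDown) {
        if (containerManager != null) {
            containerManager.stop();     // leader-specific cleanup still runs
        }
        super.shutdown(fullyShutDown);
    }
}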

@tsuna commented Jun 13, 2024

Is this PR officially getting superseded by #2152 and #2154?

@jonmv (Contributor, Author) commented Jun 13, 2024

Not for me to decide, but I'd think so.

@jonmv (Contributor, Author) commented Sep 19, 2024

Superseded by #2111, #2152 and #2154.

@jonmv closed this on Sep 19, 2024
@jonmv deleted the jonmv/ZOOKEEPER-4541 branch on September 19, 2024 at 07:57