Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZOOKEEPER-2678: Discovery and Sync can take a very long time on large DB (3.4) #157

Closed
wants to merge 5 commits into from
Closed
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -507,9 +507,12 @@ public synchronized void shutdown() {
if (firstProcessor != null) {
firstProcessor.shutdown();
}
if (zkDb != null) {
zkDb.clear();
}

// There is no need to clear the database
// * When a new quorum is established we can still apply the diff
// on top of the same zkDb data
// * If we fetch a new snapshot from leader, the zkDb will be
// cleared anyway before loading the snapshot

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one case we may still want to clear db here - when one of the ZooKeeper critical threads (such as * processors, session trackers) fail, ZooKeeper server will shutdown (see runFromConfig) and consequently invoke ZooKeeper#shutdown. In this case, I don't see a particular reason not to clear the db, though not doing it might be fine (as one could argue the server will be dead anyway), but I tend to lean towards the safe side on cleaning the db. One way to conditionally do that is to add a Boolean parameter to ZooKeeper#shutdown so we can have fine grained control over when to clear db in what code path.

unregisterJMX();
}
Expand Down
31 changes: 20 additions & 11 deletions src/java/main/org/apache/zookeeper/server/quorum/Learner.java
Original file line number Diff line number Diff line change
Expand Up @@ -321,13 +321,16 @@ protected void syncWithLeader(long newLeaderZxid) throws IOException, Interrupte
QuorumPacket ack = new QuorumPacket(Leader.ACK, 0, null, null);
QuorumPacket qp = new QuorumPacket();
long newEpoch = ZxidUtils.getEpochFromZxid(newLeaderZxid);

readPacket(qp);
//In the DIFF case we don't need to do a snapshot because the edits will sync on top of any existing snapshot
// For SNAP and TRUNC the snapshot is needed to save that history
boolean snapshotNeeded = true;
readPacket(qp);
LinkedList<Long> packetsCommitted = new LinkedList<Long>();
LinkedList<PacketInFlight> packetsNotCommitted = new LinkedList<PacketInFlight>();
synchronized (zk) {
if (qp.getType() == Leader.DIFF) {
LOG.info("Getting a diff from the leader 0x" + Long.toHexString(qp.getZxid()));
LOG.info("Getting a diff from the leader 0x{}", Long.toHexString(qp.getZxid()));
snapshotNeeded = false;
}
else if (qp.getType() == Leader.SNAP) {
LOG.info("Getting a snapshot from leader");
Expand Down Expand Up @@ -364,10 +367,12 @@ else if (qp.getType() == Leader.SNAP) {

long lastQueued = 0;

// in V1.0 we take a snapshot when we get the NEWLEADER message, but in pre V1.0
// in Zab V1.0 (ZK 3.4+) we might take a snapshot when we get the NEWLEADER message, but in pre V1.0
// we take the snapshot at the UPDATE, since V1.0 also gets the UPDATE (after the NEWLEADER)
// we need to make sure that we don't take the snapshot twice.
boolean snapshotTaken = false;
boolean isPreZAB1_0 = true;
//If we are not going to take the snapshot be sure the edits are not applied in memory
boolean writeToEditLog = !snapshotNeeded;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes here are using edit to refer to txns. I'd rather use txn to be consistent across the project. Specifically here, you're using EditLog to refer to the TxnLog, please change accordingly to have it consistent across the project.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that I am still a bit new at the internal terminology. I will update it.

// we are now going to start getting transactions to apply followed by an UPTODATE
outerLoop:
while (self.isRunning()) {
Expand All @@ -387,7 +392,7 @@ else if (qp.getType() == Leader.SNAP) {
packetsNotCommitted.add(pif);
break;
case Leader.COMMIT:
if (!snapshotTaken) {
if (!writeToEditLog) {
pif = packetsNotCommitted.peekFirst();
if (pif.hdr.getZxid() != qp.getZxid()) {
LOG.warn("Committing " + qp.getZxid() + ", but next proposal is " + pif.hdr.getZxid());
Expand Down Expand Up @@ -415,7 +420,7 @@ else if (qp.getType() == Leader.SNAP) {
+ Long.toHexString(lastQueued + 1));
}
lastQueued = packet.hdr.getZxid();
if (!snapshotTaken) {
if (!writeToEditLog) {
// Apply to db directly if we haven't taken the snapshot
zk.processTxn(packet.hdr, packet.rec);
} else {
Expand All @@ -424,13 +429,14 @@ else if (qp.getType() == Leader.SNAP) {
}
break;
case Leader.UPTODATE:
if (!snapshotTaken) { // true for the pre v1.0 case
if (isPreZAB1_0) {
zk.takeSnapshot();
self.setCurrentEpoch(newEpoch);
}
self.cnxnFactory.setZooKeeperServer(zk);
break outerLoop;
case Leader.NEWLEADER: // it will be NEWLEADER in v1.0
case Leader.NEWLEADER: // Getting NEWLEADER here instead of in discovery
// means this is Zab 1.0
// Create updatingEpoch file and remove it after current
// epoch is set. QuorumPeer.loadDataBase() uses this file to
// detect the case where the server was terminated after
Expand All @@ -441,13 +447,16 @@ else if (qp.getType() == Leader.SNAP) {
throw new IOException("Failed to create " +
updating.toString());
}
zk.takeSnapshot();
if (snapshotNeeded) {
zk.takeSnapshot();
}
self.setCurrentEpoch(newEpoch);
if (!updating.delete()) {
throw new IOException("Failed to delete " +
updating.toString());
}
snapshotTaken = true;
writeToEditLog = true; //Anything after this needs to go to the edit log, not applied directly in memory
isPreZAB1_0 = false;
writePacket(new QuorumPacket(Leader.ACK, newLeaderZxid, null, null), true);
break;
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -839,6 +839,13 @@ public void converseWithFollower(InputArchive ia, OutputArchive oa,
Assert.assertEquals(1, f.self.getAcceptedEpoch());
Assert.assertEquals(1, f.self.getCurrentEpoch());

//Wait for the edits to be written out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think some more whether it makes any sense to add test cases for this. The test cases we already have probably cover this enough given that there is no real change of behavior.

This change here is necessary, though. We don't really care about time in general in our tests because we can never be sure of the timing we will get across runs and with different settings.

long start = System.currentTimeMillis();
while (createSessionZxid != f.fzk.getLastProcessedZxid() && (System.currentTimeMillis() - start) < 50) {
Copy link
Contributor

@afine afine Jan 31, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just an idea, not sure if it is worth the effort and it may be outside the scope of this patch.

we could play with the test infrastructure here a little bit and do some dependency injection in createFollower that can let us track if db clearing and snapshotting occurs when expected.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does seem a bit beyond the scope of this. But if you really want me to I can look into it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it would be nice, a good way of actually validating everything is behaving as expected

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK Will see what I can do

Thread.sleep(1);
}
LOG.info("Took < {}ms to sync all edits", System.currentTimeMillis() - start);

Assert.assertEquals(createSessionZxid, f.fzk.getLastProcessedZxid());

// Make sure the data was recorded in the filesystem ok
Expand Down