Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBASE-28140 AbstractWALProvider may miss the WAL which is under creat… #5455

Merged
merged 1 commit into from
Oct 13, 2023
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,9 @@
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BiPredicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
Expand Down Expand Up @@ -87,6 +89,15 @@ public abstract class AbstractWALProvider implements WALProvider, PeerActionList

private final KeyLocker<String> createLock = new KeyLocker<>();

// in getWALs we can not throw any exceptions out, so we use lock and condition here as it
// supports awaitUninterruptibly which will not throw a InterruptedException
private final Lock numRemoteWALUnderCreationLock = new ReentrantLock();
private final Condition noRemoteWALUnderCreationCond =
numRemoteWALUnderCreationLock.newCondition();
// record the number of remote WALs which are under creation. This is very important to not
// missing a WAL instance in getWALs method. See HBASE-28140 and related issues for more details.
private int numRemoteWALUnderCreation;

// we need to have this because when getting meta wal, there is no peer info provider yet.
private SyncReplicationPeerInfoProvider peerInfoProvider = new SyncReplicationPeerInfoProvider() {

Expand Down Expand Up @@ -150,11 +161,26 @@ private WAL getRemoteWAL(RegionInfo region, String peerId, String remoteWALDir)
WAL wal = createRemoteWAL(region, ReplicationUtils.getRemoteWALFileSystem(conf, remoteWALDir),
ReplicationUtils.getPeerRemoteWALDir(remoteWALDir, peerId), getRemoteWALPrefix(peerId),
ReplicationUtils.SYNC_WAL_SUFFIX);
numRemoteWALUnderCreationLock.lock();
try {
numRemoteWALUnderCreation++;
} finally {
numRemoteWALUnderCreationLock.unlock();
}
initWAL(wal);
peerId2WAL.put(peerId, Optional.of(wal));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These code modifications look good to me, but I am not familiar with AbstractWALProvider. I have a question: What happens if a 'remoteWAL' is being constructed, but getWALs() doesn't return it? (The 'remoteWAL' being constructed has not been put into peerId2WAL yet.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can see the analysis in this PR.

#5443 (comment)

It will cause problem for replication. We rely on whether a WAL file is beingWritten to determine whether we can move on to the next file. If we miss a WAL here but the WAL file has been put into the replication queue, we could get beingWritten == false but actually the file is beingWritten, which causes unexpected behavior...

return wal;
} finally {
lock.unlock();
numRemoteWALUnderCreationLock.lock();
try {
numRemoteWALUnderCreation--;
if (numRemoteWALUnderCreation == 0) {
noRemoteWALUnderCreationCond.signalAll();
}
} finally {
numRemoteWALUnderCreationLock.unlock();
}
}
}

Expand All @@ -179,6 +205,17 @@ public final WAL getWAL(RegionInfo region) throws IOException {

@Override
public final List<WAL> getWALs() {
List<WAL> wals = new ArrayList<WAL>();
numRemoteWALUnderCreationLock.lock();
try {
while (numRemoteWALUnderCreation > 0) {
noRemoteWALUnderCreationCond.awaitUninterruptibly();
}
peerId2WAL.values().stream().filter(Optional::isPresent).map(Optional::get)
.forEach(wals::add);
} finally {
numRemoteWALUnderCreationLock.unlock();
}
return Streams
.concat(peerId2WAL.values().stream().filter(Optional::isPresent).map(Optional::get),
getWALs0().stream())
Expand Down