-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-25932: Ensure replication reads the trailer bytes from WAL. #3332
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 ltgm pending QA.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
Test failures are not related, looped them over 100 times and they passed locally (failed-to-read.. is a known issue). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left one small nit. +1 otherwise
...erver/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestWALEntryStream.java
Show resolved
Hide resolved
Waiter.waitFor(CONF, TEST_TIMEOUT_MS, (Waiter.Predicate<Exception>) () -> (result | ||
= WALEntryStreamWithRetries.super.next()) != null); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
Waiter.waitFor(CONF, TEST_TIMEOUT_MS, () -> (result = WALEntryStreamWithRetries.super.next()) != null);
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Nice find
FSHLog also been closed in async way in Master branch. Pls see FSHLog#doReplaceWriter. No applicable there? I have not seen the original issue description though! |
@anoopsjohn Thanks for pointing it out, I incorrectly assumed that FSHLog behavior stayed the same in branch-2/master. This async closing of FSHLog was not ported to branch-1. On top of that branch-2/master tests have coverage only for AsyncWAL since that is the default and that didn't help. The latest patch fixes this issue for both the implementations along with test coverage. PTAL. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/AbstractFSWAL.java
Show resolved
Hide resolved
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
This bug was exposed by the test from HBASE-25924. Since this wal implementations close the wal asynchronously, replication can potentially miss the trailer bytes. (see jira comment for detailed analysis). While this is not a correctness problem (since trailer does not have any entry data), it erroneously bumps a metric that is used to track skipped bytes in WAL resulting in false alarms which is something we should avoid.
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Added 2 suggestions in test part. Nice work
@@ -530,7 +557,8 @@ public void testReplicationSourceWALReaderWrongPosition() throws Exception { | |||
|
|||
@Override | |||
public boolean evaluate() throws Exception { | |||
return fs.getFileStatus(walPath).getLen() > 0; | |||
return fs.getFileStatus(walPath).getLen() > 0 && | |||
((AbstractFSWAL) log).getInflightWALCloseCount() == 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We wait for 5 sec and check. Should be wait for getInflightWALCloseCount to be zero before coming here? Or will that be a better guarantee for WAL close has happened where we come here for FS.getLen() check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't wait for 5s and check. We wait for the condition to pass within 5s or timeout, this is the javadoc, think the test is doing what you are saying (unless I misunderstood what you are saying)..
/**
* Waits up to the duration equal to the specified timeout multiplied by the
* {@link #getWaitForRatio(Configuration)} for the given {@link Predicate} to become
* <code>true</code>, failing the test if the timeout is reached and the Predicate is still
* <code>false</code>.
* <p/>
* @param conf the configuration
* @param timeout the timeout in milliseconds to wait for the predicate.
* @param predicate the predicate to evaluate.
* @return the effective wait, in milli-seconds until the predicate becomes <code>true</code> or
* wait is interrupted otherwise <code>-1</code> when times out
*/
/** | ||
* Test helper that waits until a non-null entry is available in the stream next or times out. | ||
*/ | ||
private static class WALEntryStreamWithRetries extends WALEntryStream { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can u also pls explain why we use this special subclass here for the tests? (as comments) will be easy to read and digest later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in #3344
Oops, I thought I had a +1 from you @anoopsjohn and I merged the change. Let me open a follow up patch to address your comments. |
…ache#3332) This bug was exposed by the test from HBASE-25924. Since this wal implementations close the wal asynchronously, replication can potentially miss the trailer bytes. (see jira comment for detailed analysis). While this is not a correctness problem (since trailer does not have any entry data), it erroneously bumps a metric that is used to track skipped bytes in WAL resulting in false alarms which is something we should avoid. Reviewed-by: Rushabh Shah <[email protected]> Signed-off-by: Viraj Jasani <[email protected]> Signed-off-by Anoop Sam John <[email protected]> (cherry picked from commit b04c3c7)
…ache#3332) This bug was exposed by the test from HBASE-25924. Since this wal implementations close the wal asynchronously, replication can potentially miss the trailer bytes. (see jira comment for detailed analysis). While this is not a correctness problem (since trailer does not have any entry data), it erroneously bumps a metric that is used to track skipped bytes in WAL resulting in false alarms which is something we should avoid. Reviewed-by: Rushabh Shah <[email protected]> Signed-off-by: Viraj Jasani <[email protected]> Signed-off-by Anoop Sam John <[email protected]> (cherry picked from commit b04c3c7)
) (#3346) This bug was exposed by the test from HBASE-25924. Since this wal implementations close the wal asynchronously, replication can potentially miss the trailer bytes. (see jira comment for detailed analysis). While this is not a correctness problem (since trailer does not have any entry data), it erroneously bumps a metric that is used to track skipped bytes in WAL resulting in false alarms which is something we should avoid. Reviewed-by: Rushabh Shah <[email protected]> Signed-off-by: Viraj Jasani <[email protected]> Signed-off-by Anoop Sam John <[email protected]> (cherry picked from commit b04c3c7)
) (#3345) This bug was exposed by the test from HBASE-25924. Since this wal implementations close the wal asynchronously, replication can potentially miss the trailer bytes. (see jira comment for detailed analysis). While this is not a correctness problem (since trailer does not have any entry data), it erroneously bumps a metric that is used to track skipped bytes in WAL resulting in false alarms which is something we should avoid. Reviewed-by: Rushabh Shah <[email protected]> Signed-off-by: Viraj Jasani <[email protected]> Signed-off-by Anoop Sam John <[email protected]> (cherry picked from commit b04c3c7)
This bug was exposed by the test from HBASE-25924. Since this wal
implementations close the wal asynchronously, replication can potentially
miss the trailer bytes. (see jira comment for detailed analysis).
While this is not a correctness problem (since trailer does not have any entry data),
it erroneously bumps a metric that is used to track skipped bytes in WAL resulting
in false alarms which is something we should avoid.