HBASE-28114 Add more comments to explain why replication log queue could never be empty for normal replication queue #5443
Conversation
I've found a way to deal with the problem without checking whether the replication source is recovered or not. The reason I noticed this is that I could not reproduce the problem simply by writing a UT... To simulate the problem on branch-2+, we need to first let the shipper thread finish calling ... Let me think about how to simulate this in a UT...
I tried to reproduce the problem by matching the executing sequence above, but then I found that the newly added code was not executed... The problem here is that we are only safe to move to the next file while beingWritten == true when we get an EOF_WITH_TRAILER, and we schedule a close-writer task, which writes the trailer, under the rollWriterLock. The trailerPresent flag is set while opening the WAL reader. This means that if we get beingWritten == true, it is impossible for the reader to have trailerPresent == true at the same time, because we can be sure the WAL reader was opened before we wrote the trailer. So we can only get an EOF_AND_RESET, and the next time we hit ...
Let me just add more comments in the code to describe the logic here.
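To make the invariant easier to see, here is a minimal sketch of the decision described above (hypothetical names, not the actual HBase reader code):

```java
class EofStateSketch {
  enum State { EOF_WITH_TRAILER, EOF_AND_RESET }

  static State onEof(boolean beingWritten, boolean trailerPresent) {
    if (beingWritten) {
      // The trailer is only written while closing the writer, which happens
      // under rollWriterLock after the file stops being the current WAL, so
      // a reader opened on a still-being-written file cannot have seen a
      // trailer: trailerPresent == true is impossible on this branch.
      return State.EOF_AND_RESET; // stay on this file and retry later
    }
    // The file is closed: a present trailer means we have read everything
    // and can safely move on to the next file in the queue.
    return trailerPresent ? State.EOF_WITH_TRAILER : State.EOF_AND_RESET;
  }
}
```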
Force-pushed from c702eb5 to 0bbca07.
@sunhelly PTAL. Thanks.
Got this assertion error while running UTs.
Let me think about how this could happen.
OK, I think the problem is:
Since we close the WAL writers in a background thread to speed up log rolling, it is possible that we have written the trailer and closed the WAL writer, but have not yet removed it from the ... Let me think about the best way to handle this here.
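A hedged sketch of that race window, with hypothetical names (the real class and map names differ), just to illustrate why the trailer can exist while the file still looks "being written":

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class BackgroundCloseSketch {
  private final ExecutorService closeExecutor = Executors.newSingleThreadExecutor();
  private final Map<String, AutoCloseable> inflightWALs = new ConcurrentHashMap<>();

  void rollWriter(String oldWal, AutoCloseable oldWriter) {
    closeExecutor.execute(() -> {
      try {
        oldWriter.close(); // writes the trailer and closes the file
      } catch (Exception e) {
        // logged and handled in the real code
      }
      // Race window: the trailer already exists on disk, but until this
      // remove() runs the file is still tracked, so isBeingWritten() keeps
      // returning true for a file that already has a trailer.
      inflightWALs.remove(oldWal);
    });
  }

  boolean isBeingWritten(String wal) {
    return inflightWALs.containsKey(wal);
  }
}
```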
OK, it seems there are still possible ways to have ... Anyway, let me change the code to retry without polling out the last WAL file, so an empty queue is never allowed. That fixes the problem and keeps us safe, but I still cannot figure out how to reproduce this problem in the real world... We can file another issue to track it in the long term... Thanks.
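Sketch of the retry idea, assuming the log queue is a PriorityBlockingQueue as in the replication source (method and class names here are hypothetical, not the actual patch):

```java
import java.util.concurrent.PriorityBlockingQueue;

class RetryOnLastFileSketch {
  /** Never lets a normal source's queue become empty: don't poll the last element. */
  static <T> T moveToNextFile(PriorityBlockingQueue<T> logQueue)
      throws InterruptedException {
    while (logQueue.size() <= 1) {
      // Polling out the only element would leave the queue empty, which the
      // rest of the replication code assumes never happens for a normal
      // source; retry until the log roller enqueues a newer WAL.
      Thread.sleep(100);
    }
    logQueue.poll();        // safe: at least one newer file remains
    return logQueue.peek(); // the new current file
  }
}
```

The point of the design is that the current file is only ever removed once its successor is already visible, so the "non-empty queue" invariant holds even if the underlying race is never fully explained.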
Oh, I wrote a wrong condition... Let's see if the new assertion can pass all the replication-related UTs. But anyway, I no longer think we should add such a strict assertion (we do not even enable assertions in production); instead, we'd better deal with the situation in code and add a WARN log saying that typically this should not happen...
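A small sketch of the defensive style being suggested (hypothetical names; HBase does use slf4j, but this is not the actual patch): Java assertions are skipped unless the JVM runs with -ea, so production would never see the check, while a WARN plus a recovery path keeps the process alive and still surfaces the anomaly.

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

class DefensiveCheckSketch {
  private static final Logger LOG = LoggerFactory.getLogger(DefensiveCheckSketch.class);

  void onUnexpectedState(String walFile) {
    // Instead of: assert beingWritten : "last file must still be written";
    LOG.warn("Unexpected state for the last WAL file {}; typically this "
        + "should not happen, will retry instead of failing", walFile);
    // ...then recover (e.g. retry) rather than abort or silently continue.
  }
}
```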
You can see that the beingWritten flag for a newly created WAL file is false, which is obviously incorrect...
Here we init the WAL first, before putting it into the peerId2WAL map. In WAL.init we call rollWriter, which inserts the new WAL file into the replication queue; so if we test whether that file is beingWritten before the WAL has been put into the peerId2WAL map, we get false, which is incorrect... For normal replication, in AbstractFSWAL, if the wal field is null we hold the createLock before returning, so it is safe. Let me think about how to fix this; it should be another issue...
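A minimal sketch of that initialization ordering problem (hypothetical names): the first WAL file is enqueued during init(), before the WAL becomes visible in peerId2WAL, so a concurrent beingWritten check misses it.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class InitOrderSketch {
  private final Map<String, WalHandle> peerId2WAL = new ConcurrentHashMap<>();

  WalHandle getWAL(String peerId) {
    return peerId2WAL.computeIfAbsent(peerId, p -> {
      WalHandle wal = new WalHandle();
      wal.init(); // rollWriter() runs here and enqueues the first WAL file,
                  // but the handle is not yet visible in peerId2WAL
      return wal;
    });
  }

  boolean isBeingWritten(String peerId, String walFile) {
    WalHandle wal = peerId2WAL.get(peerId);
    // A replication source that picked up the just-enqueued file can run
    // this while getWAL() is still inside init(), find no map entry, and
    // wrongly report false for the current file.
    return wal != null && wal.isCurrentFile(walFile);
  }

  static class WalHandle {
    void init() { /* rollWriter(): create the first file, notify listeners */ }
    boolean isCurrentFile(String walFile) { return true; }
  }
}
```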
The only failed UT is TestZooKeeper, which is not related, so the assertion works fine. Anyway, let me change the assertion to a warning log to make the code more robust.
For ... And for TestSyncReplicationStandbyKillRS, this should be a bug, as we hit an NPE... Let me dig more...
HBASE-28114 Add more comments to explain why replication log queue could never be empty for normal replication queue. Also add a retry logic to make the code more robust
OK, the only failed UT is TestZooKeeper, which is not related. Let me kick off another run to see if the result is stable.
OK, good. All green. Let me merge.
HBASE-28114 Add more comments to explain why replication log queue could never be empty for normal replication queue (#5443). Also add a retry logic to make the code more robust. Signed-off-by: Xiaolin Ha <[email protected]> (cherry picked from commit 4429de4)
HBASE-28114 Add more comments to explain why replication log queue could never be empty for normal replication queue (#5443). Also add a retry logic to make the code more robust. Signed-off-by: Xiaolin Ha <[email protected]> (cherry picked from commit 4429de4)
HBASE-28114 Add more comments to explain why replication log queue could never be empty for normal replication queue (apache#5443). Also add a retry logic to make the code more robust. Signed-off-by: Xiaolin Ha <[email protected]>