-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-28081 Snapshot working dir does not retain ACLs after snapshot commit phase #5437
Conversation
I do not think this is the correct way to fix the problem. We need to know what is the root cause of this problem, before separating the test... |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
@Apache9 the acl permissions at global and namespace level are getting messed up since resources are shared across tests. HBASE-24097 also has similar explanation and I found that we have had multiple flakies on TestSnapshotScannerHDFSAclController in the past. HBASE-24097 also had the same problem that due to sharing of mini cluster and acl among multiple tests, a particular test was failing but when run individually, it passes all the time. I don't think these tests are well written from the viewpoint of resource sharing. We should have had proper acl allocation and de-allocation with |
We can debug a given test better only if the individual test run fails, but that is not the case here at all. |
But reverting HBASE-28042 can solve the problem, so I think we need to find out why HBASE-28042 has this side effect, instead of just saying 'things are messed up'. And on debugging, it is very easy to reproduce in the individual test, just create a snapshot before doing anything, the test will fail. |
It is more like hdfs file atomic rename (instead of copy) has caused this regression because the test is fragile. We used to have atomic rename in hbase 1 already and hbase 2 somehow made it non-atomic, causing many regressions in snapshot creation in our prod env. However this test failure is not what represents issue with atomic rename, it is more about hdfs acl not working fine. So we either make test reliable with separate resource utilization or fix coproc that provides hdfs acl usage, but reverting HBASE-28042 is not solution because it addresses totally different concern. I am just not sure about the reliability of SnapshotScannerHDFSAclController. |
Going through SnapshotScannerHDFSAclController to understand it better might take a bit more time if we are fine to delay release? HBASE-28042 is really important, we should not have made snapshot commit phase as non-atomic file copy operation, we should have maintained snapshot commit phase as atomic file rename operation (at least for hdfs file system). Hence, simply reverting it is also not any better solution IMO. I have added more details on Jira HBASE-28042 with series of steps that can lead to corrupt snapshot metadata. |
This is also another indication that test reliability is questionable. Before merging the PR, i made sure all tests are passing (including testModifyTable1) and now somehow the test is failing on flaky dashboard. I still don't understand how this was the case. |
Let me check SnapshotScannerHDFSAclController |
I did this, however i see that for both behaviors (i.e. with or without HBASE-28042 patch) the test fails anyways. Sample patch to reproduce by reverting HBASE-28042:
Without reverting HBASE-28042:
In both cases, we see same test failure as the one we see on flaky dashboard. But by granting the global user READ permission before creating snapshot, everything works fine (only if we separate the test out). |
After some digging, I found the root cause. Will update here after some time. |
…tModifyTable1" This reverts commit 2beaee7.
Thanks for digging. Hope this time we could find the root cause and make the test reliable. |
This needs source code change because once hdfs rename operation is performed from let's say Let me also update Jira and PR title accordingly, but let me perform some rounds of testing. |
💔 -1 overall
This message was automatically generated. |
Thanks @virajjasani . Let's see if this could fix the flaky dashboard. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
AclStatus snapshotWorkingParentDirStatus; | ||
try { | ||
snapshotWorkingParentDirStatus = | ||
workingDirFS.getAclStatus(workingDir.getParent().getParent()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is workingDir
guaranteed to have a parent of a parent to avoid any possible NullPointerException? Or could workingDir
theoretically be /
or /tmp
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like it can be configured to be different, and in that case even original ACLs would not be applicable so better to have null check and return in such cases:
/**
* Get the general working directory for snapshots - where they are built, where they are
* temporarily copied on export, etc.
* @param rootDir root directory of the HBase installation
* @param conf Configuration of the HBase instance
* @return Path to the snapshot tmp directory, relative to the passed root directory
*/
public static Path getWorkingSnapshotDir(final Path rootDir, final Configuration conf) {
return new Path(
conf.get(SNAPSHOT_WORKING_DIR, getDefaultWorkingSnapshotDir(rootDir).toString()));
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
1 similar comment
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
…commit phase (#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]>
…commit phase (#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]>
…commit phase (#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]>
…commit phase (#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]>
…commit phase (apache#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]>
…commit phase (apache#5437) Signed-off-by: Duo Zhang <[email protected]> Signed-off-by: Andrew Purtell <[email protected]> Signed-off-by: Aman Poonia <[email protected]> (cherry picked from commit 0d4c756) Change-Id: I8003de7b96ad7db6778a2187ad21557661030c9c
No description provided.