HBASE-23202 ExportSnapshot (import) will fail if copying files to root directory takes longer than cleaner TTL #769

Closed
wants to merge 1 commit into from

guangxuCheng
Member

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 0m 36s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
💚 mvninstall 6m 0s master passed
💚 compile 1m 1s master passed
💚 checkstyle 1m 37s master passed
💚 shadedjars 5m 35s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 40s master passed
💙 spotbugs 4m 36s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 4m 34s master passed
_ Patch Compile Tests _
💚 mvninstall 5m 45s the patch passed
💚 compile 0m 57s the patch passed
💚 javac 0m 57s the patch passed
💔 checkstyle 1m 18s hbase-server: The patch generated 2 new + 5 unchanged - 0 fixed = 7 total (was 5)
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 49s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 15m 52s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 36s the patch passed
💚 findbugs 4m 12s the patch passed
_ Other Tests _
💚 unit 160m 36s hbase-server in the patch passed.
💚 asflicense 0m 35s The patch does not generate ASF License warnings.
220m 59s
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/1/artifact/out/Dockerfile
GITHUB PR #769
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux 228abb5f3da1 4.15.0-60-generic #67-Ubuntu SMP Thu Aug 22 16:55:30 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-769/out/precommit/personality/provided.sh
git revision master / 4c75485
Default Java 1.8.0_181
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/1/artifact/out/diff-checkstyle-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/1/testReport/
Max. process+thread count 4611 (vs. ulimit of 10000)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/1/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

Comment on lines 272 to 279
try {
  snapshotInProgress.addAll(fileInspector.filesUnderSnapshot(run.getPath()));
} catch (CorruptedSnapshotException e) {
  // See HBASE-16464
  if (e.getCause() instanceof FileNotFoundException) {
    // If the snapshot is corrupt, we will delete it
    fs.delete(run.getPath(), true);
    LOG.warn("delete the " + run.getPath() + " due to exception:", e.getCause());
Contributor

Will this actually work for the ExportSnapshot case? The snapshot manifest is added to tmp before all the files are present on cluster so it looks like this will delete the snapshot manifest which would mess up the import job.

Member Author

Hmmm, there may be a race condition between ExportSnapshot and SnapshotCleaner.
Copying the snapshot manifest is a fast operation. Maybe we can add a time threshold: when we catch CorruptedSnapshotException, we delete the snapshot folder only if its modification time is older than the threshold; otherwise we skip the cleanup. WDYT?
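
To make the threshold idea concrete, here is a minimal sketch of the age check, written against plain `java.time` rather than the Hadoop `FileSystem` API. `StaleSnapshotPolicy`, `minAge`, and `shouldDelete` are hypothetical names that do not come from the patch.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper for illustration only; not part of HBase or of this patch. */
public final class StaleSnapshotPolicy {
  // Assumed setting: how long a snapshot directory must be untouched before cleanup.
  private final Duration minAge;

  public StaleSnapshotPolicy(Duration minAge) {
    this.minAge = minAge;
  }

  /**
   * Returns true only when the directory has not been modified for at least
   * minAge, i.e. it is unlikely to belong to a copy that is still running.
   */
  public boolean shouldDelete(Instant lastModified, Instant now) {
    return Duration.between(lastModified, now).compareTo(minAge) >= 0;
  }
}
```

A caller would read the directory's modification time, pass it in with the current time, and delete the directory only when `shouldDelete` returns true. Whether any fixed threshold is safe depends on how long the manifest copy can take.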

Contributor

Copying the snapshot manifest is not always fast, since it can be hundreds of MB and the link between clusters can be poor.

Contributor

And when the snapshot contains a large number of files, copying the snapshot can take a long time even when there isn't a lot of data. Also, copying the actual data for a large export can take tens of days.

Member Author

@guangxuCheng guangxuCheng Oct 30, 2019

In fact, when CorruptedSnapshotException is thrown, we can ignore the exception and continue cleaning up HFiles instead of skipping the cleanup.

If CorruptedSnapshotException is thrown, it means that ExportSnapshot has not yet copied the snapshot manifest successfully and has not started copying the snapshot's data files, so letting the SnapshotCleaner continue has no effect on the snapshot.

The main purpose of the logic that deletes the snapshot manifest is to clean up abnormal snapshot manifests. Of course, it is OK to remove that logic.
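
The "ignore and continue" variant described above could look roughly like the sketch below. It is self-contained and does not use HBase classes: the nested `CorruptedSnapshotException` is an unchecked stand-in for the real HBase exception, and `InProgressScanSketch`, `FileInspector`, and `collect` are hypothetical names that model snapshot directories and files as plain strings.

```java
import java.io.FileNotFoundException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative model only; the real cleaner works on Hadoop paths and HBase types. */
public final class InProgressScanSketch {

  /** Unchecked stand-in for HBase's CorruptedSnapshotException (assumption). */
  public static class CorruptedSnapshotException extends RuntimeException {
    public CorruptedSnapshotException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  /** Hypothetical inspector that lists the files referenced by one snapshot directory. */
  public interface FileInspector {
    Set<String> filesUnderSnapshot(String snapshotDir);
  }

  /** Collects referenced files, skipping snapshots whose manifest is still incomplete. */
  public static Set<String> collect(List<String> inProgressDirs, FileInspector inspector) {
    Set<String> referenced = new HashSet<>();
    for (String dir : inProgressDirs) {
      try {
        referenced.addAll(inspector.filesUnderSnapshot(dir));
      } catch (CorruptedSnapshotException e) {
        if (e.getCause() instanceof FileNotFoundException) {
          // Manifest not fully copied yet: ignore this snapshot and keep scanning.
          continue;
        }
        throw e; // Any other corruption is still reported to the caller.
      }
    }
    return referenced;
  }
}
```

In this model nothing is deleted; a half-copied manifest simply contributes no files to the referenced set for that pass.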

Contributor

@eomiks eomiks Mar 27, 2020

Any progress on reviewing this issue? I ran into exactly the same problem and hope it gets resolved.

Contributor

Yeah, if the cleaner reads the snapshot while the manifest files are still being copied, it is OK to remove this snapshot, because copying the HFiles has not started yet. So there is no impact on the logic in snapshotCleaner.

Contributor

The logic of getUnreferencedFiles() is that for an HFile which is not in the cache, it calls refreshCache() to get the latest snapshot HFiles. If an HFile from this ExportSnapshot job is in the list, the manifest files have already been copied over, so refreshCache() will pick up the latest snapshot file list.
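
As a rough model of that refresh-on-miss behaviour, the sketch below keeps a set of known snapshot file names and reloads it once per call when a candidate is not found. It is a simplification for discussion, not the actual HBase implementation: `SnapshotFileCacheSketch` and the `Supplier`-based loader are assumptions, and files are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

/** Simplified, hypothetical model of a snapshot file cache with refresh-on-miss. */
public final class SnapshotFileCacheSketch {
  // Assumed loader that returns every file currently referenced by any snapshot manifest.
  private final Supplier<Set<String>> snapshotFileLoader;
  private Set<String> cache = new HashSet<>();

  public SnapshotFileCacheSketch(Supplier<Set<String>> snapshotFileLoader) {
    this.snapshotFileLoader = snapshotFileLoader;
  }

  /** Returns the candidates that no snapshot references, refreshing at most once per call. */
  public List<String> getUnreferencedFiles(List<String> candidates) {
    List<String> unreferenced = new ArrayList<>();
    boolean refreshed = false;
    for (String file : candidates) {
      if (!refreshed && !cache.contains(file)) {
        // Cache miss: reload so that recently copied manifests are taken into account.
        cache = new HashSet<>(snapshotFileLoader.get());
        refreshed = true;
      }
      if (!cache.contains(file)) {
        unreferenced.add(file);
      }
    }
    return unreferenced;
  }
}
```

In this model, a file whose manifest was copied after the last refresh is still protected, because the first miss triggers a reload before the file is reported as unreferenced.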

Contributor

@busbey @z-york Unless you see something missing, I think this one is good to go, thanks.

Contributor

I rebased the patch and posted a new pull request, #1791.

It is the same as the original one, except for some minor changes (for example, some utilities were moved, so it now uses the new utility class).

@binlijin
Contributor

LGTM

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 6s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
💚 mvninstall 5m 53s master passed
💚 compile 0m 57s master passed
💚 checkstyle 1m 29s master passed
💚 shadedjars 5m 3s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 37s master passed
💙 spotbugs 4m 32s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 4m 28s master passed
_ Patch Compile Tests _
💚 mvninstall 5m 25s the patch passed
💚 compile 0m 59s the patch passed
💚 javac 0m 59s the patch passed
💔 checkstyle 1m 27s hbase-server: The patch generated 3 new + 5 unchanged - 0 fixed = 8 total (was 5)
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 4m 58s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 17m 18s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 36s the patch passed
💚 findbugs 4m 36s the patch passed
_ Other Tests _
💔 unit 31m 18s hbase-server in the patch failed.
💚 asflicense 0m 15s The patch does not generate ASF License warnings.
93m 15s
Reason Tests
Failed junit tests hadoop.hbase.master.snapshot.TestSnapshotHFileCleaner
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/2/artifact/out/Dockerfile
GITHUB PR #769
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux a0bf09f39b92 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-769/out/precommit/personality/provided.sh
git revision master / 2451c2c
Default Java 1.8.0_181
checkstyle https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/2/artifact/out/diff-checkstyle-hbase-server.txt
unit https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/2/artifact/out/patch-unit-hbase-server.txt
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/2/testReport/
Max. process+thread count 672 (vs. ulimit of 10000)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/2/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@Apache-HBase

🎊 +1 overall

Vote Subsystem Runtime Comment
💙 reexec 1m 13s Docker mode activated.
_ Prechecks _
💚 dupname 0m 0s No case conflicting files found.
💚 hbaseanti 0m 0s Patch does not have any anti-patterns.
💚 @author 0m 0s The patch does not contain any @author tags.
💚 test4tests 0m 0s The patch appears to include 2 new or modified test files.
_ master Compile Tests _
💚 mvninstall 5m 55s master passed
💚 compile 0m 58s master passed
💚 checkstyle 1m 28s master passed
💚 shadedjars 5m 0s branch has no errors when building our shaded downstream artifacts.
💚 javadoc 0m 40s master passed
💙 spotbugs 5m 7s Used deprecated FindBugs config; considering switching to SpotBugs.
💚 findbugs 5m 4s master passed
_ Patch Compile Tests _
💚 mvninstall 6m 22s the patch passed
💚 compile 1m 7s the patch passed
💚 javac 1m 7s the patch passed
💚 checkstyle 1m 30s the patch passed
💚 whitespace 0m 0s The patch has no whitespace issues.
💚 shadedjars 5m 14s patch has no errors when building our shaded downstream artifacts.
💚 hadoopcheck 17m 37s Patch does not cause any errors with Hadoop 2.8.5 2.9.2 or 3.1.2.
💚 javadoc 0m 34s the patch passed
💚 findbugs 4m 36s the patch passed
_ Other Tests _
💚 unit 227m 43s hbase-server in the patch passed.
💚 asflicense 0m 26s The patch does not generate ASF License warnings.
292m 20s
Subsystem Report/Notes
Docker Client=19.03.4 Server=19.03.4 base: https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/3/artifact/out/Dockerfile
GITHUB PR #769
Optional Tests dupname asflicense javac javadoc unit spotbugs findbugs shadedjars hadoopcheck hbaseanti checkstyle compile
uname Linux b545485db603 4.15.0-66-generic #75-Ubuntu SMP Tue Oct 1 05:24:09 UTC 2019 x86_64 GNU/Linux
Build tool maven
Personality /home/jenkins/jenkins-slave/workspace/HBase-PreCommit-GitHub-PR_PR-769/out/precommit/personality/provided.sh
git revision master / 2451c2c
Default Java 1.8.0_181
Test Results https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/3/testReport/
Max. process+thread count 4407 (vs. ulimit of 10000)
modules C: hbase-server U: hbase-server
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/3/console
versions git=2.11.0 maven=2018-06-17T18:33:14Z) findbugs=3.1.11
Powered by Apache Yetus 0.11.0 https://yetus.apache.org

This message was automatically generated.

@guangxuCheng guangxuCheng requested review from busbey and z-york November 7, 2019 02:44
@ferhui

ferhui commented May 11, 2020

Facing the same problem. Any progress on this issue? @guangxuCheng @binlijin

@Apache-HBase

💔 -1 overall

Vote Subsystem Runtime Comment
+0 🆗 reexec 0m 0s Docker mode activated.
-1 ❌ patch 0m 4s #769 does not apply to master. Rebase required? Wrong Branch? See https://yetus.apache.org/documentation/in-progress/precommit-patchnames for help.
Subsystem Report/Notes
GITHUB PR #769
Console output https://builds.apache.org/job/HBase-PreCommit-GitHub-PR/job/PR-769/1/console
versions git=2.17.1
Powered by Apache Yetus 0.11.1 https://yetus.apache.org

This message was automatically generated.

@huaxiangsun
Contributor

We ran into this issue when running ExportSnapshot with large HFiles; I will spend some time reviewing.

Contributor

@huaxiangsun huaxiangsun left a comment

Looks good to me; I will try to rebase and run the tests locally.

9 participants