-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-23202 ExportSnapshot (import) will fail if copying files to root directory takes longer than cleaner TTL #769
Conversation
💔 -1 overall
This message was automatically generated. |
try { | ||
snapshotInProgress.addAll(fileInspector.filesUnderSnapshot(run.getPath())); | ||
} catch (CorruptedSnapshotException e) { | ||
// See HBASE-16464 | ||
if (e.getCause() instanceof FileNotFoundException) { | ||
// If the snapshot is corrupt, we will delete it | ||
fs.delete(run.getPath(), true); | ||
LOG.warn("delete the " + run.getPath() + " due to exception:", e.getCause()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will this actually work for the ExportSnapshot case? The snapshot manifest is added to tmp before all the files are present on cluster so it looks like this will delete the snapshot manifest which would mess up the import job.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, there maybe race condition between ExportSnapshot and SnapshotCleaner.
Copying Snapshot Manifest is a fast operation. Maybe we can add a time threshold. When we catch CorruptedSnapshotException, if the modification time of the snapshot folder exceeds a certain time threshold, we will delete it, otherwise we will ignore this cleanup operation. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
copying the snapshot manifest is not always fast since it can be hundreds of MB and the link between clusters can be poor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and when the snapshot contains a large number of files, copying the snapshot can take a long time even when there isn't a lot of data. Also copying the actual data for a large export can take tens-of-days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, when CorruptedSnapshotException is thrown, we can ignore the exception and continue to clean up HFile instead of skip.
If the CorruptedSnapshotException is thrown, which means that the ExportSnapshot has not copy the snapshot manifest successfully, and the data file of the snapshot has not yet started to copy, so it will have no effect on the snapshot if the snapshotCleaner continues.
The main purpose of adding a delete snapshot manifest logic is to clean up the abnormal snapshot manifest. Of course, it is OK to remove the logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any progress on this issue review?? I faced exactly same problem, and hope it to be resolved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, if it reads into the middle of copying manifest files, it is ok to remove this snapshot as copying HFiles has not started yet. So there is no impact for the logic in snapshotCleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The logic of getUnreferencedFiles() is that for an HFile which is not in cache, it will refreshCache to get the latest snapshot hfiles. If one hfile from this exortSnapshot job is in the list, this means that manifest files have been copied over, so refreshCache() will get the latest snapshot file list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I rebased the patch and posted a new pull request,
#1791
It is same as the original one, except some minor changes (like some of utilities are moved, change to use new utility class).
LGTM |
263f2f1
to
ae40ed4
Compare
💔 -1 overall
This message was automatically generated. |
…t directory takes longer than cleaner TTL
ae40ed4
to
37d1e5b
Compare
🎊 +1 overall
This message was automatically generated. |
Face the same problem. Any progress on this issue? @guangxuCheng @binlijin |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
We run into this issue when exportSnapshot with large size hfiles, will spend some time on reviewing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, will try to rebase and run test locally.
Detail describe: https://issues.apache.org/jira/browse/HBASE-23202
Revert HBASE-21511