HBASE-28680 BackupLogCleaner should clean up archived HMaster logs #6006
Conversation
String fn = file.getPath().getName();
if (fn.startsWith(WALProcedureStore.LOG_PREFIX)) {
Logic was already in place for marking fresh HMaster WALs as deletable — I believe we just forgot to consider archived ones too
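For what it's worth, a minimal sketch of the direction I have in mind, keying purely on the file name so fresh and archived HMaster WALs are treated identically. The isHMasterWAL name and the class wrapper are illustrative, not the exact patch; only the LOG_PREFIX check and the $masterlocalwal$ suffix (visible in the test paths further down) come from this PR:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore;

// Illustrative sketch only: recognize an HMaster WAL by its file name, regardless of
// whether it sits under an active WAL directory or an oldWALs archive directory.
final class HMasterWalSketch {
  static boolean isHMasterWAL(Path path) {
    String fn = path.getName();
    // Procedure-store logs carry WALProcedureStore.LOG_PREFIX; master-local WALs carry the
    // "$masterlocalwal$" suffix seen in the test paths below. Neither holds table data
    // relevant to backups, so both can be reported as deletable.
    return fn.startsWith(WALProcedureStore.LOG_PREFIX) || fn.endsWith("$masterlocalwal$");
  }
}
```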
try {
  Address walServerAddress =
    Address.fromString(BackupUtils.parseHostNameFromLogFile(file.getPath()));
  long walTimestamp = AbstractFSWALProvider.getTimestamp(file.getPath().getName());

  if (
    !addressToLastBackupMap.containsKey(walServerAddress)
      || addressToLastBackupMap.get(walServerAddress) >= walTimestamp
  ) {
    filteredFiles.add(file);
  }
} catch (Exception ex) {
  LOG.warn(
    "Error occurred while filtering file: {} with error: {}. Ignoring cleanup of this log",
    file.getPath(), ex.getMessage());
}
I moved this logic to a separate method to improve testability
try {
  String hostname = BackupUtils.parseHostNameFromLogFile(path);
  if (hostname == null) {
    LOG.warn(
      "Cannot parse hostname from RegionServer WAL file: {}. Ignoring cleanup of this log",
      path);
    return false;
  }
  Address walServerAddress = Address.fromString(hostname);
  long walTimestamp = AbstractFSWALProvider.getTimestamp(path.getName());

  if (
    !addressToLastBackupMap.containsKey(walServerAddress)
      || addressToLastBackupMap.get(walServerAddress) >= walTimestamp
  ) {
    return true;
  }
} catch (Exception ex) {
  LOG.warn("Error occurred while filtering file: {}. Ignoring cleanup of this log", path, ex);
  return false;
}
This logic is largely the same as what was removed above, with two key differences:
- Of course, it will now return archived HMaster WALs as deletable too
- Better handling of a possible null from BackupUtils#parseHostNameFromLogFile. That failure mode was difficult to reason about without a patch in our fork because, from the user's perspective, we were previously just hitting an NPE when constructing an Address
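For context, this is roughly how I picture the caller using the extracted helper. The filterDeletable name and surrounding body are hypothetical and would live inside BackupLogCleaner; only the canDeleteFile(Map<Address, Long>, Path) shape comes from the diff and tests in this PR:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.hbase.net.Address;

// Hypothetical caller: keep only the files the static helper reports as deletable.
private Iterable<FileStatus> filterDeletable(Map<Address, Long> addressToLastBackupMap,
  Iterable<FileStatus> files) {
  List<FileStatus> filteredFiles = new ArrayList<>();
  for (FileStatus file : files) {
    if (canDeleteFile(addressToLastBackupMap, file.getPath())) {
      filteredFiles.add(file);
    }
  }
  return filteredFiles;
}
```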
Path normalPath =
  new Path("/hbase/oldWALs/hmaster%2C60000%2C1716224062663.1716247552189$masterlocalwal$");
assertTrue(BackupLogCleaner.canDeleteFile(Collections.emptyMap(), normalPath));

Path masterPath = new Path(
  "/hbase/MasterData/oldWALs/hmaster%2C60000%2C1716224062663.1716247552189$masterlocalwal$");
assertTrue(BackupLogCleaner.canDeleteFile(Collections.emptyMap(), masterPath));
I believe it is necessary to check both paths for backwards compatibility — I'm not totally sure when the directory changed, but the book tells me that it was likely around 2.3. Anyway, there's no additional business logic necessary to flag both paths, but I thought it was worth explicitly testing both
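As a quick illustration of why no extra branching is needed (assuming a name-based check like the hypothetical isHMasterWAL sketched earlier): the two layouts differ only in the parent directory, and the decision keys on the final path component.

```java
// Both layouts carry the same file name; only the parent directory differs, so a
// name-based check covers them without extra logic. (Path is org.apache.hadoop.fs.Path,
// assertEquals is JUnit's.)
Path layoutA =
  new Path("/hbase/oldWALs/hmaster%2C60000%2C1716224062663.1716247552189$masterlocalwal$");
Path layoutB = new Path(
  "/hbase/MasterData/oldWALs/hmaster%2C60000%2C1716224062663.1716247552189$masterlocalwal$");
assertEquals(layoutA.getName(), layoutB.getName());
```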
💔 -1 overall
This message was automatically generated.
Pretty sure the build failure is just noise. I'm going to force push to rebuild.
See https://issues.apache.org/jira/browse/HBASE-28680
We have been trying to set up daily incremental backups for hundreds of clusters at my day job. Recently we discovered that old WALs were piling up across many clusters, coinciding with when we began running incremental backups.
This led to the realization that the BackupLogCleaner will always skip archived HMaster WALs. This is a problem because, if a cleaner is skipping a given file, then the CleanerChore will never delete it.
At first this seemed like a misunderstanding of what it means to "skip" a WAL in a BaseLogCleanerDelegate, but after working through the code I now believe it is more of an ordinary bug: we simply forgot to take archived HMaster WALs into account.
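To make the "skip" point concrete, here is a minimal, self-contained sketch of the intersection semantics as I understand them; the interface and names below are illustrative and not HBase's actual cleaner API. A file is only deleted when every delegate returns it as deletable, so a delegate that silently drops HMaster WALs from its result pins them forever.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model of the cleaner chore's decision, not HBase code.
interface CleanerDelegateSketch {
  Set<String> getDeletableFiles(Set<String> candidates);
}

final class ChoreSketch {
  static Set<String> filesToDelete(Set<String> candidates, List<CleanerDelegateSketch> delegates) {
    Set<String> deletable = new HashSet<>(candidates);
    for (CleanerDelegateSketch delegate : delegates) {
      // Intersection: a file survives the cut only if every delegate agrees it is deletable,
      // so "skipping" a file in any one delegate means it is never deleted.
      deletable.retainAll(delegate.getDeletableFiles(deletable));
    }
    return deletable;
  }
}
```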
@ndimiduk @charlesconnell @hgromer