Description
Databricks provides an operation called FSCK REPAIR TABLE, which removes from the transaction log any active files that can no longer be found in the underlying file system.
Use Case
Due to a hardware issue, some Parquet files were corrupted when written. This data is non-critical, so I would simply like to delete the files from the underlying storage and then use this operation to reconcile the log.
This operation also supports a dry run, which can be used to check whether files are missing due to an external issue.
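
To make the dry-run behaviour concrete, here is a minimal sketch using an in-memory model. The function name `fsck`, the `exists` callback, and the set standing in for the log's active files are all hypothetical and chosen for illustration; this is not the Databricks or delta-rs API.

```python
from typing import Callable, List, Set


def fsck(active_files: Set[str], exists: Callable[[str], bool], dry_run: bool = False) -> List[str]:
    """Return the active files that are missing from storage.

    With dry_run=True the set of active files is left untouched, so the
    result is only a report. Otherwise the missing files are dropped from
    active_files, standing in for remove actions committed to the log.
    """
    missing = sorted(path for path in active_files if not exists(path))
    if not dry_run:
        active_files.difference_update(missing)
    return missing
```

Calling `fsck(log, storage.__contains__, dry_run=True)` first and inspecting the result before running it again without `dry_run` mirrors the check-then-repair workflow described above.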
Related Issue(s)
# Description
Implementation of the filesystem check operation.
The implementation is fairly straightforward: a HEAD call is made for
each active file to check whether it exists.
A remove action is then created for each orphaned file.
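
The per-file check can be sketched roughly as follows. This is an illustration only, not the code in this PR: the local file system stands in for the object store, `head` and `remove_actions_for_orphans` are hypothetical names, and the `dataChange` value is an assumption. The shape of the remove action follows the Delta protocol's `remove` entry.

```python
import os
import time
from typing import Dict, Iterable, List


def head(table_root: str, relative_path: str) -> bool:
    """Stand-in for an object-store HEAD request: does the file exist?"""
    return os.path.isfile(os.path.join(table_root, relative_path))


def remove_actions_for_orphans(table_root: str, active_files: Iterable[str]) -> List[Dict]:
    """Build one remove action per active file whose existence check fails."""
    now_ms = int(time.time() * 1000)
    return [
        # dataChange=True is an assumption made for this sketch.
        {"remove": {"path": path, "deletionTimestamp": now_ms, "dataChange": True}}
        for path in active_files
        if not head(table_root, path)
    ]
```

The cost of this approach is one request per active file, regardless of how many files are actually missing.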
An alternative is to maintain a hash set of all active files and then
recursively list all files in storage. Each listed file is removed from
the set; any files remaining in the set are then considered orphaned.
I'm looking for feedback; if the second approach is preferred, I can
make the changes.
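
For comparison, the listing-based alternative might look like the sketch below. Again the local file system stands in for the object store via `os.walk`, and `orphans_by_listing` is a hypothetical name used only for illustration.

```python
import os
from typing import Iterable, Set


def orphans_by_listing(table_root: str, active_files: Iterable[str]) -> Set[str]:
    """Start from all active files and discard every file found by listing.

    Whatever remains in the set was never seen in storage and is
    therefore considered orphaned.
    """
    remaining = set(active_files)
    for dirpath, _dirnames, filenames in os.walk(table_root):
        for name in filenames:
            relative = os.path.relpath(os.path.join(dirpath, name), table_root)
            # Log paths use forward slashes, so normalise the OS separator.
            remaining.discard(relative.replace(os.sep, "/"))
    return remaining
```

This trades one request per active file for a recursive listing of the table directory, which also visits files the log does not reference. Which approach is cheaper likely depends on how the number of active files compares to the total number of files in storage.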
# Related Issue(s)
- closes #1092
---------
Co-authored-by: Will Jones <[email protected]>