(unified) GC: Improve resiliance for unexpected files under _lakefs #8518

Open
yonipeleg33 opened this issue Jan 20, 2025 · 0 comments
Currently, if users upload files to the _lakefs prefix, the GC fails to run and reports an unhelpful error:

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 88 in stage 7.0 failed 4 times, most recent failure: Lost task 88.3 in stage 7.0 (TID 42180) ([2a05:d018:179b:7f01:d8e3:50aa:4d29:3899] executor 121): io.treeverse.jpebble.BadFileFormatException: Bad magic 37 66 30 31 22 0a 7d 0a: wrong bytes
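
As a reading aid, the eight "bad magic" bytes in the error above can be decoded as ASCII. This is only a small sketch for interpreting the log line. The variable names are illustrative and nothing here is part of the GC code.

```python
# Decode the "Bad magic" bytes quoted in the error message.
bad_magic_hex = "37 66 30 31 22 0a 7d 0a"

raw = bytes.fromhex(bad_magic_hex)  # fromhex ignores the spaces between byte pairs
decoded = raw.decode("ascii")

print(repr(decoded))  # prints '7f01"\n}\n'
```

The bytes are printable text ending in a quote, a newline, and a closing brace, which is consistent with the reader having hit a JSON-like text file that a user uploaded rather than a metadata file.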

For context, when we added the dummy file under _lakefs, we had to explicitly ignore it in the GC.
At the time, we considered whitelisting only metadata files, but decided against it.
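
The whitelisting idea could look roughly like the sketch below. It is a minimal illustration, not the GC's actual code: the key pattern `METADATA_KEY` and the function `select_metadata_keys` are hypothetical, and the real naming scheme of metadata files under _lakefs would have to be substituted.

```python
import re

# HYPOTHETICAL pattern, chosen only for illustration: keys directly under
# "_lakefs/" whose name is a 64-character lowercase hex string. This is an
# assumption and is not taken from the real metadata layout.
METADATA_KEY = re.compile(r"_lakefs/[0-9a-f]{64}")


def select_metadata_keys(keys):
    """Keep only keys that match the allowed metadata pattern."""
    return [k for k in keys if METADATA_KEY.fullmatch(k)]


listing = [
    "_lakefs/" + "a" * 64,   # matches the hypothetical pattern
    "_lakefs/notes.json",    # user-uploaded file, filtered out
    "_lakefs/dummy",         # placeholder file, filtered out
]
print(select_metadata_keys(listing))
```

With a whitelist, the dummy file would no longer need its own explicit exception, but any new kind of metadata file would have to be added to the pattern before the GC could see it.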

We need to revisit this decision, or come up with other ideas to improve the GC's resilience to such cases.
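
One possible "other idea" is to tolerate unreadable files: skip them, record them, and keep going instead of aborting the whole job. The sketch below shows that pattern in isolation. `BadFileFormatError`, `read_all_tolerant`, and `fake_reader` are all hypothetical stand-ins; the actual GC reads files through Spark and io.treeverse.jpebble, which this does not model.

```python
import logging

logger = logging.getLogger("gc-sketch")


class BadFileFormatError(Exception):
    """Hypothetical stand-in for the reader's bad-format exception."""


def read_all_tolerant(paths, read_one):
    """Read every path with read_one, skipping files that fail to parse.

    Returns (records, skipped_paths) so the caller can decide whether the
    skipped files are acceptable or should fail the run after all.
    """
    records, skipped = [], []
    for path in paths:
        try:
            records.extend(read_one(path))
        except BadFileFormatError as err:
            logger.warning("skipping unreadable file %s: %s", path, err)
            skipped.append(path)
    return records, skipped


def fake_reader(path):
    # Illustrative reader: treats ".json" files as unparseable.
    if path.endswith(".json"):
        raise BadFileFormatError("bad magic")
    return [path + ":record"]


records, skipped = read_all_tolerant(["_lakefs/meta1", "_lakefs/user.json"], fake_reader)
print(records, skipped)
```

Skipping is only safe if the skipped file is not real metadata. If a genuine but corrupt metadata file were silently ignored, the GC might treat objects that are still referenced as garbage, so returning the skipped list and failing on anything unexpected may be the more cautious design.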

@arielshaqed changed the title from "(unified) GC: Improve resistancy for unexpected files under _lakefs" to "(unified) GC: Improve resiliance for unexpected files under _lakefs" on Jan 22, 2025