[extension/filestorage] Compaction fails if the target directory is on a different filesystem #13449
Comments
FYI @djaglowski I'm planning to submit a PR to fix this; I opened the issue so I don't forget. I think the fix is simply to fall back to a Truncate + Copy if the Rename fails, but I'm open to alternatives.
I agree the current behavior is incorrect. I wonder if a more robust solution would be to use something like a blue/green pattern, where the client alternates between two file names. This would mean we do not need to depend on any renaming, but also that we are not leaking files. Very roughly in pseudocode:
That'd be easier to get right, but it has the downside of requiring both files to be on the same filesystem. One situation where compaction is useful is when we run out of space on the device: compacting to a different device and moving the file back then allows us to actually reclaim the space. With this proposal, we'd simply be unable to compact at all. On the other hand, it's tricky to get the current method working correctly so that it always leaves a working database in case of failure. Moving files between devices is non-atomic, unlike os.Rename, so we can get stuck with a corrupt DB if we get killed at an unlucky time.
Good points. I think you are more tuned into the compaction use case, so I'm happy to review the implementation you think is most appropriate here.
Fixed in #13730
Describe the bug
File storage compaction uses an intermediate directory to store the compacted DB before moving it back to the storage path. This move is implemented as an os.Rename (opentelemetry-collector-contrib/extension/storage/filestorage/client.go, line 220 in 0d628ed), which fails when the two directories are on different filesystems.
Steps to reproduce
Set the storage directory and the compaction directory on different filesystems, and trigger a compaction.
What did you expect to see?
I expected the compaction to succeed.
What did you see instead?
I got an "invalid cross-device link" error message, and then the storage got into a state where nothing could be written to it, because the DB wasn't open after the error.
What version did you use?
v0.54.0 and v0.58.0.
What config did you use?
Cut down to the important bits:
Environment
Originally saw this in Kubernetes (AWS EKS 1.21, to be exact), with the storage and compaction directories placed on an EBS volume and a local tmpDir volume, respectively. It can be reproduced on any two different filesystems, though, including tmpfs on Linux.
Additional context
By itself, this problem should simply cause the compaction to fail and the storage to continue functioning normally, but due to us unnecessarily removing the original DB file here, it actually stops the storage client from working completely. Not sure if that's worth opening a separate issue, WDYT @djaglowski ?