dvc status: 'up-to-date' but cache is corrupted #9641
Comments
Enabling the `verify` option on the remote will make DVC re-hash files when downloading them. There is currently no way to force this kind of check for the local cache, but we can consider adding something like this in the future.
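For illustration, enabling the remote-level `verify` option looks roughly like this; the remote name `myremote` is a placeholder for whatever the project's remote is actually called.

```sh
# Re-hash objects on download from this remote so corrupted transfers are
# detected at pull time. "myremote" is a placeholder remote name.
dvc remote modify myremote verify true

# Subsequent pulls will then recompute hashes for the downloaded files.
dvc pull
```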
Also, just note that
We have been previously working on the assumption that
@skshetry I'm not following, s3fs uploads should still be atomic, since incomplete s3 uploads don't result in visible objects. Fetch atomicity should not depend on the underlying fs at all, since we handle downloading to a temporary path and then moving the resulting file into place on a successful download ourselves in dvc-objects.
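(To illustrate the pattern being described: download to a temporary path first, then rename into place only on success. This is a generic sketch of that approach, not the actual dvc-objects code; the URL and file paths are placeholders.)

```sh
# Generic sketch of an atomic download: write to a temporary path first, then
# rename into place only if the transfer succeeded. mv/rename is atomic on the
# same filesystem, so a reader never sees a partially written target file.
tmp="data/file.wav.partial"   # placeholder paths
dst="data/file.wav"
if curl -fsS -o "$tmp" "https://example.com/file.wav"; then
  mv "$tmp" "$dst"
else
  rm -f "$tmp"                # a failed transfer leaves no half-written target
fi
```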
Ahh, sorry. I see that we are using
I thought it was DVC that was corrupting cache due to connection issues. But yeah, reading closely, this is a feature request. Sorry for the noise. :)
Thanks @pmrowla and @skshetry for taking a look at this report. We will move forward with the `verify` option.
@bardsleypt I see that you are using an s3 remote. s3 remotes are extremely reliable in our experience and we do our best to do everything (semi-)atomically, so hardlink seems like a much more likely cause of the cache corruption. Could you talk a bit more about your typical setup? Shared dev machines, shared dvc cache? Anything else notable?
@bardsleypt Btw, happy to jump on a call too, to get to meet each other, learn about your use case, and see if we can help identify where it might've broken. Let me know if you are up for that and we'll find some time next week. 🙂
@efiop to address some of your questions:
So far we have not encountered the problem after switching to the hardlink cache type. For this project, the hardlink cache type does seem sufficient as we don't expect to be changing/versioning data much (if at all), basically just large blob storage. That is to say, unprotecting/protecting are not that onerous for our use case. I'd be happy to hear if you have other thoughts/suggestions on this issue. Otherwise, at the moment, I think our needs are met. I'll be sure to reach out if we do encounter other problems now that we are verifying the cache on pull.
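(For reference, a rough sketch of the commands involved in the hardlink setup and the protect/unprotect workflow described above; the file path is a placeholder.)

```sh
# Link workspace files to the cache with hardlinks instead of copies.
dvc config cache.type hardlink

# Re-link files that are already checked out so they use the new link type.
dvc checkout --relink

# Hardlinked files are made read-only ("protected"); unprotect one before
# editing it, then track the change as usual. The path is a placeholder.
dvc unprotect data/file.wav
```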
@bardsleypt Thanks for the info! The fact of cache corruption is very worrying to me; it should not happen, especially not during `dvc pull`. Have you inspected the files closely? What kind of corruption are we talking about? Just an incomplete file?
@efiop no problem. Looking at the URL for our remote, it is actually s3-compatible, not plain aws S3. It is https://***minio.ad.***.com, so minio.

The corruption (at least the only one we've encountered so far) manifests as the .wav file seemingly missing a sample or samples, causing the channels to permute. The content may be correct but ends up in the wrong place when the file is opened in any waveform analyzer. In any case, it definitely results in a .wav file with incorrect channel information and incorrect content. I can drill into the specifics if it is helpful.
@bardsleypt Oh, so it is minio after all! We did get similar reports in the past: #5502, but were not able to reproduce them ourselves.
That would be really appreciated 🙏 Maybe we could get a better feel for how it is corrupted, so it is easier to identify where the corruption happened. My offer about the call still holds if you are interested 🙂
@efiop Sure, we can have a call. I'm a bit tied up early this week; perhaps Thursday 6/29 or Friday 6/30? I'm free in the middle of the day, 10am-2pm (MDT), both days. My colleague did look into the corruption further and found the following:
I have some screenshots I can share with you on the call to clarify, but maybe this information is helpful in the meantime. Let me know what time works for you for a call and we can go from there.
@bardsleypt Thank you for the info! The corruption does not look trivial (like a truncated file or something). I'm out of good ideas right now, but will give it additional thought. I sent an invite for Friday to your gmail address (the only one I found; feel free to swap it for your work one, you should have permissions to edit that invite). Looking forward to it 🙂
For the record: had a meeting with @bardsleypt today. I got a pretty good idea of the layout/size of data that we are dealing with here and I'll try to reproduce with minio + windows, just to see if it is feasible. |
Bug Report
dvc status: 'up-to-date' but cache is corrupted
Description
Context: on a Windows machine (git-bash), a `dvc pull` from an AWS-S3 bucket occasionally fails due to various network connection problems. This seems to be the origin of the following problem:
Problem: after an interrupted pull, `dvc status` reports everything as up to date even though the cached file is corrupted.
Reproduce
Expected
Environment information
Output of `dvc doctor`:

Additional information
Unfortunately I cannot generate a situation that reliably re-creates this corruption issue; it seems to arise from connectivity/network issues within our organization. I also have not encountered it on Mac/Linux (I'm submitting this on behalf of a colleague using Windows), so perhaps it is OS-specific. This may not be a true DVC bug, but we are seeking some guidance on how to avoid/detect such corruption and have not found a good resource. We have found the `verify` setting for verification of the remote data, which as I understand it will force a re-hash locally and likely detect our problem, but it does seem that a manual alternative such as `dvc status --verify` should be available?
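In the meantime, a manual check along those lines is possible outside of DVC, since cache objects are content-addressed by their MD5: re-hash each cache file and compare the digest against the name it is stored under. A rough sketch, assuming the DVC 3.x cache layout under `.dvc/cache/files/md5` (older versions keep objects directly under two-character directories in `.dvc/cache`):

```sh
# Rough local-cache integrity check: recompute each object's MD5 and compare it
# with the hash encoded in its path. Assumes the DVC 3.x cache layout
# (.dvc/cache/files/md5/xx/yyyy...); directory objects carry a ".dir" suffix.
find .dvc/cache/files/md5 -type f | while read -r obj; do
  expected="$(basename "$(dirname "$obj")")$(basename "$obj")"
  expected="${expected%.dir}"
  actual="$(md5sum "$obj" | cut -d' ' -f1)"
  [ "$expected" = "$actual" ] || echo "possible corruption: $obj"
done
```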