
dvc status: 'up-to-date' but cache is corrupted #9641

Closed
bardsleypt opened this issue Jun 21, 2023 · 15 comments
Labels
A: data-management (Related to dvc add/checkout/commit/move/remove), feature request (Requesting a new feature)

Comments

@bardsleypt

Bug Report

dvc status: 'up-to-date' but cache is corrupted

Description

Context: on a Windows machine (Git Bash), a dvc pull from an AWS S3 bucket occasionally fails due to various network connection problems. This seems to be the origin of the following problem:

Problem:

  • dvc pull (possibly only a partial pull that fails, with subsequent retries until 'everything is up to date')
  • dvc status outputs : 'data and pipelines are up to date'
  • one file within a dvc-added directory is obviously corrupted (in this case a wav-file with missing channel data)
  • tracking the associated .dir file through the dvc-cache, we pull the .json listing the md5sums of the associated files
  • we find the md5sum of the corrupted file
  • we manually recompute the md5sum of the corrupted file within the cache and, as expected given the corruption, it differs from its filename (which should itself be that md5sum); a sketch of this check is below
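
For concreteness, here is roughly the check we did by hand (a minimal sketch, not anything from DVC itself; the .dir hash is a placeholder and it assumes the DVC 2.x local cache layout, where objects live under .dvc/cache/<first two hex chars>/<rest of the hash>):

```python
# Minimal sketch of the manual cache check described above (placeholder hash).
import hashlib
import json
import os

CACHE = os.path.join(".dvc", "cache")
DIR_HASH = "0123456789abcdef0123456789abcdef.dir"  # placeholder, from the .dvc file

def cache_path(md5: str) -> str:
    # Cache objects are stored as <first two hex chars>/<rest of the hash>.
    return os.path.join(CACHE, md5[:2], md5[2:])

def file_md5(path: str) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

# The .dir object is a JSON list of {"md5": ..., "relpath": ...} entries.
with open(cache_path(DIR_HASH)) as f:
    entries = json.load(f)

for entry in entries:
    expected = entry["md5"]
    actual = file_md5(cache_path(expected))  # raw-byte hash; fine for binary files like .wav
    if actual != expected:
        print(f"corrupted: {entry['relpath']} ({expected} != {actual})")
```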

Reproduce

  1. dvc pull
  2. (possibly) dvc pull if previous pull only partially succeeded
  3. dvc status (everything is up to date)
  4. Workflow identifies a corrupted file
  5. Manual re-hash of file within dvc-cache does not match its associated filename

Expected

  1. dvc status warns the user that the dvc-tracked .dir has changed and that another dvc pull is required
  2. and/or md5sums of files within the dvc cache match their filenames (at least checked once on the initial pull of the data)
  3. feature request: a flag to force a recheck of hashes for the dvc status command

Environment information

  • (mini)conda version: 4.11.0
  • python version: 3.10.9
  • dvc version: 2.43.4
  • Windows 10

Output of dvc doctor:

$ dvc doctor

DVC version: 2.43.4 (pip)
-------------------------
Platform: Python 3.10.9 on Windows-10-10.0.19044-SP0
Subprojects:
        dvc_data = 0.37.8
        dvc_objects = 0.19.3
        dvc_render = 0.1.0
        dvc_task = 0.1.11
        dvclive = 2.0.2
        scmrepo = 0.1.7
Supports:
        http (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.4, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2023.6.0, boto3 = 1.26.76)
Cache types: hardlink
Cache directory: NTFS on D:\
Caches: local
Remotes: s3
Workspace directory: NTFS on D:\
Repo: dvc, git

Additional information

Unfortunately I cannot generate a situation that reliably re-creates this corruption issue; it seems to arise from connectivity/network issues within our organization. I also have not encountered it on Mac/Linux (I'm submitting this on behalf of a colleague using Windows), so perhaps it is OS-specific. This may not be a true DVC bug, but we are seeking some guidance on how to avoid/detect such corruption and have not found a good resource. We have found the setting for verification of the remote data, i.e.:

dvc remote modify myrepo verify true

which, as I understand it, will force a local re-hash and likely detect our problem, but it does seem that a manual alternative such as dvc status --verify should be available?

@pmrowla
Contributor

pmrowla commented Jun 22, 2023

Enabling the verify option for the remote is the intended solution here.

a manual alternative such as dvc status --verify should be available?

There is currently no way to force this kind of check for the local cache, but we can consider adding something like this in the future.

@pmrowla added the feature request and A: data-management labels on Jun 22, 2023
@pmrowla
Contributor

pmrowla commented Jun 22, 2023

Also, just note that md5sum output will not always match the DVC md5, depending on the file you are tracking. In DVC 2.x, DVC tries to determine whether or not the file contains text content, and in the event DVC thinks it is a text file it does a dos2unix conversion on the data before computing the MD5 hash (see #4658 for more details)
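
To illustrate (a rough sketch of the 2.x behavior, not the actual implementation):

```python
# Rough illustration of why plain `md5sum` can differ from the hash DVC 2.x
# stores for files it classifies as text (binary files hash the raw bytes).
import hashlib

def plain_md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

def dvc2x_style_md5(data: bytes, looks_like_text: bool) -> str:
    if looks_like_text:
        # dos2unix-style newline normalization before hashing (see #4658)
        data = data.replace(b"\r\n", b"\n")
    return hashlib.md5(data).hexdigest()

sample = b"line one\r\nline two\r\n"
print(plain_md5(sample))              # hash of the raw bytes (what md5sum reports)
print(dvc2x_style_md5(sample, True))  # hash DVC 2.x would store for a text file
```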

@skshetry
Member

We had previously been working on the assumption that botocore provides atomicity for us.
Now that we use s3fs, that is no longer true.
@pmrowla, I'd classify this as a bug; fetch/push should be atomic.

@pmrowla
Contributor

pmrowla commented Jun 22, 2023

@skshetry I'm not following; s3fs uploads should still be atomic, since an incomplete s3 multipart_upload does not actually replace the final object with a partial upload. The final object in the bucket is not created until the complete_multipart_upload request is sent. Is the concern here that you could end up with a corrupted or partial chunk in the multipart request?

fetch atomicity should not depend on the underlying fs at all, since we handle downloading to a temporary path and then moving the resulting file into place on successful download ourselves in dvc-objects.
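
For anyone following along, the pattern in question is roughly this (a minimal sketch, not the actual dvc-objects code; `fs.get_file` is a stand-in for whatever filesystem abstraction performs the transfer):

```python
# Minimal sketch of the download-to-temp-then-rename pattern described above.
import os
import uuid

def atomic_download(fs, remote_path: str, local_path: str) -> None:
    # Download into a unique temporary file next to the destination so the
    # final rename stays on the same filesystem.
    tmp_path = f"{local_path}.{uuid.uuid4().hex}.tmp"
    try:
        fs.get_file(remote_path, tmp_path)  # stand-in for the real transfer call
        os.replace(tmp_path, local_path)    # (semi-)atomic move into place
    finally:
        # If the transfer failed partway, don't leave a partial file behind.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
```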

@skshetry
Member

Ahh, sorry. I see that we are using as_atomic there (I thought we had moved it back to respective filesystems).

the dvc-pull fails occasionally due to various network connection problems

I thought it was DVC that was corrupting the cache due to connection issues. But yeah, reading closely, this is a feature request. Sorry for the noise. :)

@bardsleypt
Author

Thanks @pmrowla and @skshetry for taking a look at this report. We will move forward with the verify option. Since there seems to be no bug with the atomicity, I'll close this bug report. If we have further issues, I'll reach back out, but thank you for clarifying this. I'll submit a feature request as well; I do think it would be useful from time to time to force a re-hash of the local workspace + cache (other colleagues have mentioned the same thing).

@efiop
Contributor

efiop commented Jun 24, 2023

@bardsleypt I see that you are using the hardlink link type, and my guess is that someone corrupted the cache through that. With hardlink or symlink, your workspace files are directly linked to your cache files, which means that if you are not careful you might corrupt cache files. We usually protect those links with read-only permissions, but they are not a 100% defense and can be overcome.

s3 remotes are extremely reliable in our experience and we do our best to do everything (semi-)atomically, so hardlink seems like a much more likely cause of the cache corruption.
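
To illustrate what hardlinking means in practice (a self-contained sketch with throwaway files, not your actual cache):

```python
# Self-contained illustration of why editing a hardlinked checkout corrupts
# the cache object it points at: both names refer to the same bytes on disk.
import hashlib
import os
import tempfile

workdir = tempfile.mkdtemp()
cache_file = os.path.join(workdir, "cache_object")  # stands in for .dvc/cache/xx/yyy...
workspace_file = os.path.join(workdir, "data.wav")  # stands in for the checked-out file

with open(cache_file, "wb") as f:
    f.write(b"original content")
os.link(cache_file, workspace_file)  # what cache.type=hardlink does on checkout

with open(workspace_file, "ab") as f:  # an "accidental" edit in the workspace
    f.write(b" ... modified")

# The cache object now hashes differently than the name it was stored under.
with open(cache_file, "rb") as f:
    print(hashlib.md5(f.read()).hexdigest())
```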

Could you talk a bit more about your typical setup? Shared dev machines, shared dvc cache? Anything else notable?

@efiop
Contributor

efiop commented Jun 24, 2023

@bardsleypt Btw, happy to jump on a call too to get to meet each other and learn about your use case and see if we can help identify where it might've broken. Let me know if you are up for that and we'll find some time next week. 🙂

@bardsleypt
Author

@efiop to address some of your questions:

  • We weren't using hardlink until recently and had actually encountered the corruption issue prior to changing our cache type from the default (copy, I believe, on Windows)
  • We switched to hardlink to save some disk space
  • We have roughly 10 or so developers who primarily use dvc as simple large-data storage, basically a replacement for git-lfs, since we aren't using pipelines at the moment
  • Most developers are on Windows 10 machines using Anaconda + Git-Bash for their access to python + dvc + environments, except myself and one other who use MacOS and Linux (Redhat)
  • We've only ever encountered the corruption on the Windows machines (more support for my anti-Windows bias :-) )
  • The corruption is not 100% reproducible on our end, some faulty network connection it seems, but it typically has followed these steps:
    1. dvc pull fails to pull down all data (typically a single file fails)
    2. Subsequent dvc pull won't execute due to the partial file download, and asks for the --force flag to be used
    3. dvc pull --force succeeds, and dvc status says everything is okay
    4. Manual inspection of data (.wav file) shows corrupted channel information
    • a naive comparison of md5sum output against the hashes listed in the dvc .dir file shows the file is indeed faulty and incorrect

So far we have not encountered the problem after switching to the hardlink cache type. For this project, the hardlink cache type does seem sufficient, as we don't expect to be changing/versioning data much (if at all); it is basically just large blob storage. That is to say, unprotecting/protecting is not that onerous for our use case.

I'd be happy to hear if you have other thoughts/suggestions on this issue. Otherwise, at the moment, I think our needs are met. I'll be sure to reach out if we do encounter other problems now that we are verifying the cache on pull.

@efiop
Contributor

efiop commented Jun 26, 2023

@bardsleypt Thanks for the info!

The fact of cache corruption is very worrying to me; it should not happen, especially not during dvc pull. We download all files into a temporary location and then rename them into place (semi)atomically, to ensure no partial files end up in your cache. And AWS S3 remotes (I suppose it is plain AWS S3, not minio or something s3-compatible, right?) are probably the ones that are tested the most, and I don't remember seeing them corrupt files at any point. Though, admittedly, Windows is definitely much less tested than Linux or macOS. Since we can't reliably reproduce it, I guess we'll just have to give it a thought.

Have you inspected the files closely? What kind of corruption are we talking about? Just an incomplete file?

@bardsleypt
Author

bardsleypt commented Jun 26, 2023

@efiop no problem.

Looking at the URL for our remote, it is actually s3-compatible, not plain AWS S3. It is https://***minio.ad.***.com, so minio; perhaps that is part of the issue?

The corruption (at least the only one we've encountered so far) manifests as the .wav file seemingly missing a sample (or samples), causing the channels to permute. The content may be correct but appears in the wrong place when the file is opened in any waveform analyzer. In any case, it definitely results in a .wav file with incorrect channel information and incorrect content. I can drill into the specifics if it is helpful.

@efiop
Contributor

efiop commented Jun 27, 2023

@bardsleypt Oh, so it is minio after all! We did get similar reports in the past: #5502, but were not able to reproduce them ourselves.

I can drill into the specifics if it is helpful.

That would be really appreciated 🙏 Maybe we could get a better feeling of how it is corrupted so it is easier to identify where it was corrupted.

My offer about the call still holds if you are interested 🙂

@bardsleypt
Author

@efiop Sure, we can have a call. I'm a bit tied up early on this week, perhaps Thursday 6/29 or Friday 6/30? I'm free in the middle of the day 10am - 2pm (MDT) both days.

My colleague did look into the corruption further and found the following (a rough sketch of the per-channel comparison follows the list):

  • 4 channel .wav file
  • Part way through the file (in time, call it t=t*) the corruption occurs
  • After t* the channels permute by 1 and restart from t=0
    • Channels (1, 2, 3, 4) -> (2, 3, 4, 1)
    • E.g., info in channel 2 in the interval [0, t*] becomes the info in channel 1 in the interval [t*, end of file]
  • File size itself is larger but the length of the 'audio data' is the same (the wave header seems to come through fine)
    • Likely all channel information is intact within the file, but the wav file ordering/compression is corrupted
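
The comparison was roughly the following (a minimal sketch, not our exact script; file names are placeholders and it assumes a known-good copy is available for reference):

```python
# Rough sketch of the per-channel comparison (placeholder file names).
import hashlib
import wave

def channels(path):
    """De-interleave a PCM .wav file into one bytes blob per channel."""
    with wave.open(path, "rb") as w:
        n_ch = w.getnchannels()
        width = w.getsampwidth()
        frames = w.readframes(w.getnframes())
    # Samples are interleaved on disk: ch1, ch2, ..., chN, ch1, ch2, ...
    return [
        b"".join(
            frames[i : i + width]
            for i in range(ch * width, len(frames), n_ch * width)
        )
        for ch in range(n_ch)
    ]

good = channels("reference_copy.wav")     # placeholder path
pulled = channels("pulled_from_dvc.wav")  # placeholder path

for idx, (g, p) in enumerate(zip(good, pulled), start=1):
    match = hashlib.md5(g).hexdigest() == hashlib.md5(p).hexdigest()
    print(f"channel {idx}: {'OK' if match else 'MISMATCH'}")
```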

I have some screenshots I can share with you when we have a call to clarify, but maybe this information is helpful in the meantime. Let me know what time works for you for a call and we can go from there.

@efiop
Contributor

efiop commented Jun 27, 2023

@bardsleypt Thank you for the info! The corruption does not look trivial (like a truncated file or something). I'm out of good ideas right now, but will give it additional thought.

Sent an invite for Friday to your gmail email (the only one I found; feel free to swap it for your work one, as you should have permissions to edit that invite). Looking forward to it 🙂

@efiop
Contributor

efiop commented Jun 30, 2023

For the record: had a meeting with @bardsleypt today. I got a pretty good idea of the layout/size of the data we are dealing with here, and I'll try to reproduce with minio + Windows, just to see if it is feasible.
