DVC pulls corrupted file from Minio (S3) without recalculating hash #5502

Closed
maxim1317 opened this issue Feb 20, 2021 · 12 comments

@maxim1317

Bug Report

dvc pull: DVC pulls corrupted file from Minio (S3) without recalculating hash

Description

I had a connection issue during dvc push and the file was corrupted.
Then, after pulling, I got this corrupted file and couldn't re-upload it because the MD5 was calculated for the correct file.

Reproduce

Example:

  1. dvc init
  2. dvc add 1.txt
  3. dvc push
  4. modify 1.txt in minio
  5. dvc pull

Expected

I expect DVC to check the file's MD5 on pull, or to let me re-upload the correct file.

Environment information

Ubuntu 20.04.2 LTS

Output of dvc version:

$ dvc version

1.11.16
@efiop
Contributor

efiop commented Feb 20, 2021

Hi @maxim1317 .

By default we trust S3 remotes when downloading objects from them, but you can make your DVC repo instance not trust the remote by setting the dvc remote modify mys3 verify true config option, which makes DVC re-calculate hashes locally.
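
For readers wondering what re-calculating hashes locally amounts to, here is a minimal sketch; it is not DVC's actual implementation. It computes a plain MD5 over a downloaded file's raw bytes and compares it with the hash the repo expects. The helper names file_md5 and matches_expected are made up for illustration, and DVC's own hashing may differ in details.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, reading it in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file's MD5 equals the expected hash (case-insensitive)."""
    return file_md5(path) == expected_md5.lower()
```

A truncated or otherwise modified file fails this comparison, which is roughly the kind of check that verify true turns on for downloads.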

Could you share more details on how the upload got corrupted? Was it really during the upload, or did the corruption occur afterwards, in minio storage? Also, do you use https://dvc.org/doc/user-guide/managing-external-data ?

efiop added the "awaiting response" label on Feb 20, 2021
@maxim1317
Author

Oh, I'm sorry, I didn't know dvc remote modify mys3 verify true existed.

About the corruption: I'm not really sure whether it was due to an aborted push or a connection error, but I'm pretty sure that the corruption didn't occur in minio - only this machine has rights for the dvc bucket (for now)

@efiop
Contributor

efiop commented Feb 20, 2021

@maxim1317 Could you share more details on the circumstances? E.g. were you on a bad, unstable connection, or did something else go wrong?

Then after pulling got this corrupted file and couldn't reupload it because MD5 was calculated for correct file.

How did you detect the corruption error?

@maxim1317
Author

@efiop on pulling, actually - the correct file is 20 MB, but the one I pulled was 12 MB.

@efiop
Contributor

efiop commented Mar 1, 2021

@maxim1317 And dvc didn't complain? Could you try with verify true and see if that catches it?

@maxim1317
Author

@efiop I'll try to reproduce it with verify true
There is another question - we've tried to fix the error by removing the file from minio and pushing it again, but it doesn't seem to work. dvc status -c shows that the file was deleted from the remote, but dvc push does nothing

@efiop
Contributor

efiop commented Mar 2, 2021

@maxim1317 Is that file part of a dataset? (i.e. part of a directory that you've dvc added as a whole)

@maxim1317
Author

@efiop yeah, it is part of a directory that was added as a whole

@efiop
Contributor

efiop commented Mar 3, 2021

@maxim1317 If so, you also need to delete the corresponding .dir on the remote (e.g. 12345.dir), as dvc trusts that if the .dir exists, all files in it also exist on the remote.
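
To illustrate why a stale .dir entry can hide a missing file, here is a small sketch under stated assumptions; it is not DVC's code. It assumes the .dir object is a JSON list of entries with md5 and relpath fields, and that objects are stored under a key made of the first two hash characters, a slash, and the rest (the layout used by DVC 1.x, as far as I know). object_key and missing_entries are hypothetical helper names.

```python
import json


def object_key(md5, prefix=""):
    """Remote key for a hash, ASSUMING a two-character fan-out layout."""
    key = f"{md5[:2]}/{md5[2:]}"
    return f"{prefix.rstrip('/')}/{key}" if prefix else key


def missing_entries(dir_manifest_json, remote_keys, prefix=""):
    """List relpaths from a .dir manifest whose objects are absent from remote_keys.

    dir_manifest_json: JSON text, assumed to be a list of {"md5", "relpath"} dicts.
    remote_keys: a set of object keys known to exist on the remote.
    """
    entries = json.loads(dir_manifest_json)
    return [
        entry["relpath"]
        for entry in entries
        if object_key(entry["md5"], prefix) not in remote_keys
    ]
```

An explicit per-file check like this would notice the deleted object. As described above, dvc skips it when the .dir is present on the remote, which is why removing the .dir is needed before dvc push uploads the file again.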

@maxim1317
Author

@efiop oh, didn't know that
Thank you!

@efiop
Contributor

efiop commented Jun 27, 2023

@maxim1317 Have you been running into this problem again? We got a similar report in #9641 so I've created iterative/dvc-s3#45

@maxim1317
Author

@efiop To my knowledge, no. We have enabled verify true, but we also became more cautious with hard stopping of uploads/downloads.
