docs: differentiate types of file hashes #1469

Closed
3 tasks done
jorgeorpinel opened this issue Jun 21, 2020 · 19 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) status: stale You've been groomed! type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@jorgeorpinel
Contributor

jorgeorpinel commented Jun 21, 2020

Extracted from #1448 (comment) and #1494 (review):

For HTTP(s)/S3/azure, we have etag, for ssh/local/GS we have md5 and for hdfs, we use checksum key.

("key" referring to the actual field name under which the values are saved in dvc.lock and .dvc files)
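
As a rough illustration of the mapping quoted above, here is a minimal Python sketch. The table and the `hash_key_for` helper are hypothetical names made up for this example and are not part of DVC's API; the scheme strings (for instance `azure` and an empty scheme for local paths) are assumptions, and only the remote types named in the quote are covered.

```python
from urllib.parse import urlparse

# Hypothetical lookup table, built only from the mapping quoted in this issue.
# An empty scheme is assumed to mean a plain local path.
HASH_KEY_BY_SCHEME = {
    "http": "etag",
    "https": "etag",
    "s3": "etag",
    "azure": "etag",
    "ssh": "md5",
    "gs": "md5",
    "": "md5",
    "hdfs": "checksum",
}


def hash_key_for(url: str) -> str:
    """Return the field name expected to hold the hash for this location."""
    scheme = urlparse(url).scheme
    if scheme not in HASH_KEY_BY_SCHEME:
        raise ValueError(f"no hash key known for scheme {scheme!r}")
    return HASH_KEY_BY_SCHEME[scheme]
```

For example, `hash_key_for("s3://bucket/data.csv")` gives `"etag"` and `hash_key_for("data/raw.csv")` gives `"md5"`. Windows drive paths such as `C:\data` would parse as scheme `c` and raise here, so a real implementation would need extra handling.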

Related: #68


I believe this affects the following docs:

  • import-url or import?
  • .dvc file spec (both deps and outs)
  • dvc.lock file spec
  • External Dependencies guide
  • External Data (Outputs) guide
  • run -d (external dep) ?
  • add (external outs) ?

some of which may need more examples that feature etag and checksum fields in dvc.lock.

@jorgeorpinel jorgeorpinel added type: enhancement Something is not clear, small updates, improvement suggestions good first issue Good for newcomers A: docs Area: user documentation (gatsby-theme-iterative) labels Jun 21, 2020
@jorgeorpinel jorgeorpinel removed the good first issue Good for newcomers label Jun 21, 2020
@shcheklein
Member

I would suggest generalizing it somehow instead of trying to mention every possible field everywhere. E.g. outs can have a hash (e.g. md5 or etag).
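
One way to picture the generalized "hash" idea is a small Python sketch that reads whichever hash field an entry happens to carry. The `get_hash` helper and the sample entries in the usage note are hypothetical; only the field names `md5`, `etag`, and `checksum` come from this thread, and the sketch assumes an entry has already been loaded into a dict.

```python
from typing import Dict, Tuple

# Field names mentioned in this thread; any one of them may hold the hash.
HASH_KEYS = ("md5", "etag", "checksum")


def get_hash(entry: Dict[str, str]) -> Tuple[str, str]:
    """Return (field_name, value) for the single hash field in an entry."""
    found = [(key, entry[key]) for key in HASH_KEYS if key in entry]
    if len(found) != 1:
        raise ValueError(f"expected exactly one hash field, found {len(found)}")
    return found[0]
```

With a made-up entry such as `{"path": "s3://bucket/data.csv", "etag": "<etag value>"}`, `get_hash` returns `("etag", "<etag value>")`, so calling code does not need to know which kind of hash the location uses.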

@jorgeorpinel
Contributor Author

Agree but we still need to update all those docs with the more general concept of "file hash" or "hash value" — which fortunately we've already been doing in the past 🙂

@utkarshsingh99
Contributor

utkarshsingh99 commented Jun 24, 2020

I think this issue is more closely linked to #1448 than it looks.
I didn't find any page that lists all the file hashes together, and I think we might need one every time we try to generalize.
I'm not sure where exactly we would need to update the contents in import and import-url. Everything looks quite specific to me there.
I believe once #1448 is merged, we can add the link to dvc-file-and-directories in #1494 too.
Thoughts on this?

@jorgeorpinel
Contributor Author

I'm not sure where exactly we would need to update the contents in import and import-url

We'd say which kind of file hash is used for each type of external dependency, same as in add and run. Or maybe put that in the External Data guide and link to it from all those refs.

once #1448 is merged, we can add the link to

Yes, maybe.

@jorgeorpinel
Contributor Author

For HTTP(s)/S3/azure, we have etag, for ssh/local/GS we have md5 and for hdfs, we use checksum key.

What about OSS and Google Drive @skshetry ? Please lmk and I'll update in #1527

@jorgeorpinel
Contributor Author

Also @skshetry this all applies to .dvc files as well, right? Not just dvc.lock

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jul 4, 2020

Oh and what about external outputs? Or do outs entries always use md5 no matter what? CC @efiop you probably also know 🙂

@efiop
Contributor

efiop commented Jul 6, 2020

@jorgeorpinel They might be md5(local, ssh, gs)/etag(s3) or checksum(hdfs)

@skshetry
Member

skshetry commented Jul 6, 2020

what about external outputs?

@jorgeorpinel, this is about external dependencies and outputs. So, it applies to dvc.lock and .dvc files too.

What about OSS and Google Drive?

OSS uses etag. Google Drive does not support external deps/outs, so it does not have any.

@efiop
Contributor

efiop commented Jul 6, 2020

oss also doesn't support external deps/outs.

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jul 7, 2020

On OSS, please clarify:

OSS uses etag

vs.

oss also doesn't support external deps/outs.

Thanks

@jorgeorpinel
Contributor Author

jorgeorpinel commented Jul 7, 2020

Wait no, sorry. Saugat said MD5 for GS, never mind. eTag is for HTTP(s), S3, and Azure.

@jorgeorpinel
Contributor Author

p.s. you can just review #1527 instead.

@MetalBlueberry
Contributor

Hi all! I'm working on a tool to upload data to a DVC remote without actually using dvc, and I noticed that the md5 of a file is not calculated as expected if the file uses Windows line endings. It looks like CRLF is replaced by LF before the calculation (iterative/dvc#775). This is a minor issue, because it only means that files differing solely in line endings are treated as equal.

The point is, I first thought that for files added with dvc add the hash would always be the md5 checksum of the content, but it looks like it is not that simple.
Is there any clear documentation on how the file hash is calculated, to help me implement a compatible uploader?

@jorgeorpinel
Contributor Author

I noticed that the md5 of a file is not calculated as expected if the file uses Windows line endings. It looks like CRLF is replaced by LF before the calculation.

Could it be Git doing that though (depending on the repo's config)? Cc @efiop on this Q anyway.

is there any clear documentation in how the file hash is calculated

Not much @MetalBlueberry, which is why this issue and #68 exist.

The one place where we've already put some of this info is in our DVC Metafiles guide: https://dvc.org/doc/user-guide/dvc-files-and-directories (please find the md5, etag, checksum, and hash terms).

Feel free to ask any questions about this or anything else directly in our http://dvc.org/chat !

@shcheklein
Member

@MetalBlueberry if we're talking about a regular dvc add of some local artifact (not dvc add --external), you got it right: it applies dos2unix, and that's pretty much it. Curious about your use case though. Why don't you want to use DVC for this? You could at least reuse some of its API, e.g. to calculate the hash.
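
For anyone trying to reproduce this outside DVC, a minimal Python sketch of the behavior described here (convert CRLF to LF, then take the md5) could look like the following. `md5_with_lf_endings` is a hypothetical helper, not DVC's own function, and it assumes the whole input is text; this thread does not say how DVC decides which files get the conversion, so check the actual DVC source before relying on it for an uploader.

```python
import hashlib


def md5_with_lf_endings(data: bytes) -> str:
    """md5 of the content after a dos2unix-style CRLF -> LF conversion."""
    # Assumption: the content is text, so every CRLF pair is a line ending.
    normalized = data.replace(b"\r\n", b"\n")
    return hashlib.md5(normalized).hexdigest()
```

Under this sketch, `b"a\r\nb\r\n"` and `b"a\nb\n"` produce the same digest, which matches the observation that files differing only in line endings end up with the same hash.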

@MetalBlueberry
Contributor

I've created an issue for the CRLF problem iterative/dvc#4658.

@jorgeorpinel
Contributor Author

I think this is probably addressed now ( https://dvc.org/doc/user-guide/project-structure/pipelines-files#dvclock-file ) but I'll double check the list in the issue description.
