add note to the introduction tutorial to highlight fallback to copying files in the DVC cache can be expected behaviour #579

Closed
doc-E-brown opened this issue Aug 22, 2019 · 11 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@doc-E-brown

Continuing from issue #139.

I recently completed the introduction tutorial and was surprised by the amount of disk space occupied by the data on my system. Based on the Data File Internals section of https://dvc.org/doc/tutorial/define-ml-pipeline, I expected du -sh . to report 41M, but it reported 81M instead. After speaking with @MrOutis on Discord, he kindly pointed out that my file system does not support reflinks, so DVC fell back to copying, and that this is expected behaviour. Even after re-reading the documentation and repeating the tutorial a number of times, I found this expected behaviour unclear. I would like to propose adding a note that symlinks and hardlinks can be used, if supported by the file system, via:

$ dvc config cache.type hardlink,symlink
$ dvc config cache.protected true

and that copying will be used if the file system supports neither hardlinks nor symlinks. I will link to https://dvc.org/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache, in particular the Configuring DVC cache file link type section. I am happy to update the docs and submit a PR against this issue.
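To illustrate why the reported size roughly doubles in copy mode, here is a minimal Python sketch. It is not DVC code: the `cache` and `workspace` directories and the `simulate` and `unique_bytes` helpers are hypothetical names invented for this example. It places one file in a stand-in cache, exposes it in a stand-in workspace either as a copy or as a hardlink, and then sums file sizes while counting each inode only once, which approximates how du treats hardlinked files.

```python
import os
import shutil
import tempfile


def unique_bytes(root):
    """Sum file sizes under root, counting each inode only once."""
    seen = set()
    total = 0
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key not in seen:
                seen.add(key)
                total += st.st_size
    return total


def simulate(link_type, size=1024 * 1024):
    """Put one file in a stand-in cache, expose it in a stand-in workspace,
    and return the number of bytes the two directories occupy together."""
    with tempfile.TemporaryDirectory() as root:
        cache = os.path.join(root, "cache")          # hypothetical layout
        workspace = os.path.join(root, "workspace")  # hypothetical layout
        os.mkdir(cache)
        os.mkdir(workspace)

        cached = os.path.join(cache, "data.bin")
        with open(cached, "wb") as f:
            f.write(b"\0" * size)

        target = os.path.join(workspace, "data.bin")
        if link_type == "copy":
            shutil.copyfile(cached, target)  # second inode, second copy of the data
        elif link_type == "hardlink":
            os.link(cached, target)          # second name for the same inode
        else:
            raise ValueError(f"unsupported link type: {link_type}")

        return unique_bytes(root)


if __name__ == "__main__":
    print("copy:    ", simulate("copy"), "bytes")
    print("hardlink:", simulate("hardlink"), "bytes")
```

With a copy, the data exists twice; with a hardlink, both paths point at the same inode, so the data is counted once. The sketch uses apparent file sizes rather than allocated blocks, so it only approximates what du reports.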

Thanks!!

@ghost ghost added A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions tutorial labels Aug 22, 2019
@ghost

ghost commented Aug 22, 2019

Let's wait for @shcheklein / @jorgeorpinel , maybe they have something to add (as they are more familiar with the docs) 🙂

@shcheklein
Member

@doc-E-brown sure! I would be happy to merge your updates. The documentation is indeed a little outdated here. Initially DVC was designed to fall back to hardlinks or symlinks automatically, but since they require protected mode to be enabled to prevent cache corruption, we decided to make them opt-in instead.

Since you did the tutorial just recently, I think you still have a fresh perspective on where the ideal place to put these notes is.

Thanks! :)

@shcheklein shcheklein changed the title from "Add note to the introduction tutorial to highlight fallback to copying files in the DVC cache can be expected behaviour" to "add note to the introduction tutorial to highlight fallback to copying files in the DVC cache can be expected behaviour" Aug 22, 2019
@dashohoxha
Contributor

I think that the tutorial (/doc/tutorial) is outdated and also has some overlap with get-started and examples. So, just making a small fix to it is not going to improve it. It needs to be completely rewritten or revised.

About the topic of cache types, I think that we should first explain to the users that they need to use a partition that supports reflinks, like XFS, Btrfs, ZFS, etc. in order to have the best performance. They can format a second partition if needed, or they can use an external drive (even a USB drive will do just for testing). They should fall back to using hardlinks/symlinks only if everything else has failed.

In other words, we should not promote the usage of hardlinks/symlinks, since they make the workflow more complex and non-intuitive. For testing or small projects, copy mode should be OK. Big projects or big companies should know how to use more efficient filesystems for their data and projects. Hardlinks/symlinks should then almost never be used, except in very rare cases.
So, maybe we can have a special tutorial that explains them, but not explain them in the main tutorials.

@ghost

ghost commented Aug 22, 2019

@dashohoxha , just take into account that Windows doesn't support those filesystems; as far as I know, there's no alternative to use reflinks on Windows.

hardlinks and symlinks aren't that bad if you set dvc config cache.protected true 😅 .

More context about the decision of using reflinks / copy: iterative/dvc#1599

@dashohoxha
Contributor

I am not familiar with Windows, but I think that it supports deduplication of files (aka CoW).

I am OK with the decision for using reflinks,copy by default (avoiding hardlinks and symlinks as much as possible).

@shcheklein
Member

@dashohoxha there is already a ticket to consolidate get-started, examples, and tutorial: #564.

We already have a section in the user guide that explains the different strategies, their pros and cons, and when they should be used (feel free to suggest any improvements to the section) - https://dvc.org/doc/user-guide/large-dataset-optimization

Current default policy is reflinks, copy, but we don't want to deemphasize hardlinks or symlinks in any way. We want to mention that there is this option available if you deal with a lot of data, we want to mention it everywhere. It's an extremely important optimization for ML practitioners.

So, I would update the existing tutorial to include a note like the one I believe we already have in some other place, with a link to Large Datasets Optimization. When it comes to consolidation, we'll most likely reuse some parts of this note in one way or another.
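The "reflink, then copy" default order described above can be pictured as a simple fallback chain. The following Python sketch is an illustration only, not DVC's implementation: `link_with_fallback`, `LINKERS`, and `_reflink` are hypothetical names, and the reflink attempt assumes Linux, where the `FICLONE` ioctl is available. On any other platform, or on a file system without reflink support, that step fails and the next link type is tried.

```python
import os
import shutil
import tempfile

# Linux ioctl request number for cloning a file (reflink).
# Assumption: this sketch only attempts reflinks on Linux.
FICLONE = 0x40049409


def _reflink(src, dst):
    """Try to clone src into dst; raises OSError if cloning is unsupported."""
    import fcntl  # Unix only; ImportError elsewhere is handled by the caller

    with open(src, "rb") as s, open(dst, "wb") as d:
        fcntl.ioctl(d.fileno(), FICLONE, s.fileno())


# Hypothetical mapping from link type names to the functions that create them.
LINKERS = {
    "reflink": _reflink,
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_with_fallback(src, dst, order=("reflink", "copy")):
    """Try each link type in order and return the name of the first that works.

    Assumes dst does not exist before the call.
    """
    for name in order:
        try:
            LINKERS[name](src, dst)
            return name
        except (OSError, ImportError):
            # Remove any partial file before trying the next link type.
            if os.path.lexists(dst):
                os.remove(dst)
    raise OSError(f"none of {order} worked for {dst}")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "cached.bin")
        dst = os.path.join(tmp, "workspace.bin")
        with open(src, "wb") as f:
            f.write(b"example data")
        print("linked using:", link_with_fallback(src, dst))
```

Passing `order=("hardlink", "symlink", "copy")` mirrors the opt-in setting discussed earlier in this thread. The sketch leaves out the protected mode that the thread says those link types require.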

@jorgeorpinel
Contributor

jorgeorpinel commented Aug 22, 2019

How about this note I added to a feature branch in 0d7682c?

When reflinks are not supported, DVC defaults to copying files to avoid problems with other file link types, but these can be enabled easily. See File link types for more information.

Would that have been informative enough for you, @doc-E-brown?

@doc-E-brown
Author

Hi @jorgeorpinel, that would be great! A very small note pointing out that the du -sh . command in the tutorial only reports 41M when reflinks are supported, and that it will not report 41M if DVC falls back to copying files, would be the cherry on top. On reading, it seemed that this section of the tutorial was emphasizing the space savings from reflinks, and it was surprising not to see them occur. Thanks!!!

@shcheklein
Member

@jorgeorpinel looks good. I would change "avoid problems with other file link types" a bit, though, to something less scary. "Problems" is probably not the best term here. The idea is that we want to keep the workflow simple by default. It's not as if the more advanced workflow has any "problems".

@jorgeorpinel
Contributor

Good points. Updated in 4384bfa

@jorgeorpinel
Contributor

jorgeorpinel commented Aug 24, 2019

4 participants