add note to the introduction tutorial to highlight fallback to copying files in the DVC cache can be expected behaviour #579
Comments
Let's wait for @shcheklein / @jorgeorpinel , maybe they have something to add (as they are more familiar with the docs) 🙂 |
@doc-E-brown sure! I would be happy to merge your updates. The documentation is indeed a little outdated here. Initially DVC was designed to fall back to hardlinks or symlinks automatically, but since they require protected mode to be enabled to prevent cache corruption, we decided to make it opt-in instead. Since you have done the tutorial just recently, I think you still have a fresh perspective on the ideal place to put these notes. Thanks! :) |
I think that the tutorial (/doc/tutorial) is outdated and also has some overlap with get-started and examples. So, just making a small fix to it is not going to improve it. It needs to be completely rewritten or revised. On the topic of cache types, I think that we should first explain to users that they need a partition that supports reflinks, like XFS, Btrfs, ZFS, etc., in order to get the best performance. They can format a second partition if needed, or they can use an external drive (even a USB drive will do just for testing). They should fall back to hardlinks/symlinks only if everything else has failed. In other words, we should not promote the use of hardlinks/symlinks, since they make the workflow more complex and non-intuitive. For testing or small projects, copy mode should be OK. For big projects or big companies, users should know how to use more efficient filesystems for their data and projects. Hardlinks/symlinks should then almost never be used, except in very rare cases. |
@dashohoxha , just take into account that Windows doesn't support those filesystems; as far as I know, there's no alternative to use reflinks on Windows.
More context about the decision of using |
I am not familiar with Windows, but I think that it supports deduplication of files (aka CoW). I am OK with the decision for using |
@dashohoxha there is already a ticket to consolidate get-started, examples, and tutorial: #564 . We already have a section in the user guide that explains the different strategies, their pros/cons, and when they should be used (feel free to suggest any improvements to the section) - https://dvc.org/doc/user-guide/large-dataset-optimization Current default policy is So, I would update the existing tutorial to include a note like the one we have in some other place, I believe, with a link to Large Datasets Optimization. When it comes to consolidation, most likely we'll reuse some parts of this note in one way or another. |
How about this note I added to a feature branch in 0d7682c?
Would that have been informative enough for you, @doc-E-brown? |
Hi @jorgeorpinel that would be great! A very small note pointing out that the |
@jorgeorpinel looks good, |
Good points. Updated in 4384bfa |
Continuing from issue #139.
I recently completed the introduction tutorial and was surprised by the amount of disk space occupied by the data on my system. I followed the Data File Internals section of https://dvc.org/doc/tutorial/define-ml-pipeline. After running
du -sh .
I expected a size of 41M but instead got 81M. After speaking with @MrOutis on Discord, he kindly pointed out that my file system did not support reflinks, that the system had fallen back to copying, and that this was expected behaviour. Upon re-reading the documentation and repeating the tutorial a number of times, I found this expected behaviour to be unclear. I would like to propose adding a note that symlinks and hardlinks can be used, if supported by the file system, via:
$ dvc config cache.type hardlink,symlink
$ dvc config cache.protected true
and that copying will be used if the file system supports neither hardlinks nor symlinks. I will link to the page https://dvc.org/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache, in particular the Configuring DVC cache file link type section. I am happy to update the docs and submit a PR against this issue.
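To illustrate why the reported size roughly doubles when copying is used, here is a minimal Python sketch of a "try each link type, otherwise copy" strategy. It is not DVC's implementation: the function names `link_with_fallback` and `apparent_usage` and the file names are hypothetical, and reflinks are left out because the standard library has no portable call for them.

```python
import os
import shutil
import tempfile


def link_with_fallback(src, dst, preferred=("hardlink", "symlink")):
    """Try each link type in order; copy the file if none of them works.

    Hypothetical helper for illustration only, not DVC's own logic.
    """
    for kind in preferred:
        try:
            if kind == "hardlink":
                os.link(src, dst)
            elif kind == "symlink":
                os.symlink(os.path.abspath(src), dst)
            else:
                continue  # unknown link type: skip it
            return kind
        except (OSError, NotImplementedError):
            continue  # not supported here: try the next type
    shutil.copyfile(src, dst)
    return "copy"


def apparent_usage(paths):
    """Sum file sizes, counting each (device, inode) pair only once.

    Symlinks are not followed, so they contribute only the length of
    the path they point to.
    """
    seen = set()
    total = 0
    for path in paths:
        st = os.lstat(path)
        key = (st.st_dev, st.st_ino)
        if key not in seen:
            seen.add(key)
            total += st.st_size
    return total


with tempfile.TemporaryDirectory() as tmp:
    cached = os.path.join(tmp, "cache_file")
    with open(cached, "wb") as f:
        f.write(b"x" * 1024)

    linked = os.path.join(tmp, "workspace_link")
    copied = os.path.join(tmp, "workspace_copy")

    # Preferred link types first, as with cache.type hardlink,symlink.
    kind = link_with_fallback(cached, linked)
    # An empty preference list forces the copy fallback.
    forced = link_with_fallback(cached, copied, preferred=())

    print(kind, apparent_usage([cached, linked]))
    print(forced, apparent_usage([cached, copied]))
```

When a hardlink is created, the cache file and the workspace file share one inode and the data is counted once. With the copy fallback there are two independent files, so the total is twice the file size, which matches the 41M versus 81M difference described above.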
Thanks!!