-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Regular updates (Apr 6) #1110
Regular updates (Apr 6) #1110
Conversation
related to #1096 (review)
…nd "shared" cache
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, so per https://discordapp.com/channels/485586884165107732/563406153334128681/696834803831406614 I reviewed a use case and a user guide to try clarifying external vs shared cache, but found some more questions, asked here. Please check my changes and questions 🙂
Now, ensure that the cached directories and files have appropriate permissions, | ||
so that they can be accessed by your colleagues (assuming their users are | ||
members of the same group): | ||
|
||
```dvc | ||
$ sudo find /path/to/dvc-cache -type f -exec chmod 0664 {} \; | ||
$ sudo find /path/to/dvc-cache -type d -exec chmod 0775 {} \; | ||
$ sudo chown -R myuser:ourgroup /path/to/dvc-cache/ | ||
$ sudo find /home/shared/dvc-cache -type d -exec chmod 0775 {} \; | ||
$ sudo find /home/shared/dvc-cache -type f -exec chmod 0664 {} \; | ||
$ sudo chown -R myuser:ourgroup /home/shared/dvc-cache/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure these commands will always work. What if the shared cache dir has some files your user can't access (if someone didn't do this process before, for example)? Wouldn't the command fail?
## Configure the external shared cache | ||
|
||
Tell DVC to use the directory we've set up above as an shared cache location by | ||
running: | ||
Tell DVC to use the directory we've set up above as the <abbr>cache</abbr> for | ||
your <abbr>project</abbr>: | ||
|
||
```dvc | ||
$ dvc config cache.dir /path/to/dvc-cache | ||
$ dvc config cache.dir /home/shared/dvc-cache | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have all these exact and reproducible command guides in a use case? This section seems more like a user guide to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I agree ... most of it should be part of the shared cache
article in the UG. And use case should be probably more generic "Optimizing data management" which can cover this one (multiple people working on a single box), k8s scenario - multiple machines + NAS, etc. We are long overdue on this. There were contributors that made an attempt, but didn't get it right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so, these two mostly about improving the existing case - but they are relevant, up to you to consolidate or not ... there should be tickets related to this from other angles - shared cache on NAS/NFS for example
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some other related links:
https://discuss.dvc.org/t/setup-dvc-to-work-with-shared-data-on-nas-server/180/4
#732
#455
#563
and others ... someone should a first step systematize and suggest the way to structure this properly
> Avoid using the same remote location that you are using for `dvc push`, | ||
> `dvc pull`, `dvc fetch` as external cache for your external outputs, because | ||
> it may cause possible file hash overlaps: The hash value of a data file in | ||
> external storage could collide with that generated locally for another file. | ||
> Avoid using the same location of the | ||
> [remote storage](/doc/command-reference/remote) that you have for `dvc push` | ||
> and `dvc pull` for external outputs or as external cache, because it may cause | ||
> file hash overlaps: The hash value of a data file in external storage could | ||
> collide with the one generated locally for another file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved this note but it's still pretty strange. Difficult to understand what it is about. The ambiguity of the term "remote" is giving us problems here. Not sure how to address, maybe just remove the note?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, it's a very complicated to simplify it :) the whole topic is very advanced and very technical. Let's keep it for now with your revisions and get back to it when we have more time to spend on all this data management problems.
@jorgeorpinel just minor comments, overall looks good to merge. |
per #1110 (review) and others
remote add -d