-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
External Data and Remotes (user-guide/external-data) #807
Conversation
@dashohoxha good stuff! Will keep reviewing and asking questions. First is - what is the difference between "Create Remotes" and "DVC storage" sections? |
@dashohoxha I also not sure how we went from solving #520 to actually including regular remotes stuff? #520 is about very specific advanced thing - managing external data, while remotes is a very basic functionality. It deserves it's own very details section that explains them very well. |
"Creates Remotes" explains all the details and options about a certain remote type (which are set with "DVC storage" emphasizes that we need to use the option I admit that in some cases there is not much difference between them and they look very similar.
External data and remotes seem to me closely related because:
So, I thought that it is best to explain them together. |
let's remove and simplify
introducing basic and advanced concepts together sends a wrong signal. We should not emphasize external deps/outs as a regular basic workflow. Also, there should be a separate motivation behind them. |
I moved the sections about DVC Storages as collapsed sections under "Remotes".
Are you suggesting to split Remotes from External Data, like this:
|
I just don't get the purpose of those sections. Can you elaborate or explain it in some other way? Why do we need both - remotes and dvc storage?
Yes. I'm not sure about exact place and structure. But external data management should be separate for sure. Remotes - they can be part of the Basic Concepts, they can be part of the Managing Data section. External Data Management should be part of the Advanced section of the User Guide. ^^ this is not a precise instruction, it's just a way for me to communicate how far I think those two should be split. |
I removed the sections about "DVC storage". I also split the "Remotes" from "External Data":
The first one ("Remotes") is basically a consolidation of remote types on:
I seems like the command The second section ("External Data") is an attempt to consolidate and improve these pages: |
I am not sure that I understand this note from managing-external-data:
The first part seems reasonable: "Avoid using the same remote location". The explanation seems nonsense. If there is some possibility for checksum collision, this may happen for the cached files as well (two different cached files happen to have the same checksum). If those external data files that we are tracking with DVC were instead local files, we would store them in cache with all the other project files. So, what difference does it make if they are tracked remotely? Maybe some other explanation is needed or maybe we should remove this note. |
] | ||
}, | ||
{ | ||
"label": "External Data", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we keep the Managing External Data label and name? Not much difference but redirect won't be required.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can't we keep the Managing External Data label and name? Not much difference but redirect won't be required.
Redirect would still be required because of "external-dependencies".
On the other hand "External Data" is more generic than "Managing External Data". For example in the case of HTTP (and other cases) we are just accessing or downloading external data, "managing" doesn't seem suitable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, please, don't resolve. I have to repeat myself and create duplicates.
Don't see any reasons to rename and complicate this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't see any reasons to rename and complicate this.
I don't understand, what is the complication that comes from renaming? In the first comment you say "not much difference", and I agree with this. And redirect is still required even if we don't change the name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one extra redirect to support + updating links in the docs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have already updated the links on the docs, and also have fixed the redirect.
Changing the name back would be extra work, and unjustified, unless the old name is better than the new one.
src/Documentation/sidebar.json
Outdated
@@ -120,6 +120,80 @@ | |||
"label": "Managing External Data", | |||
"slug": "managing-external-data" | |||
}, | |||
{ | |||
"label": "Remotes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come up with a better name - Setting up storage?
Move the section up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Come up with a better name - Setting up storage?
I am renaming to "DVC Remotes". I am using "DVC Storage" for the mirror of .dvc/cache/
. A "DVC Remote" is analogous to a Git Remote, and is broader than a "DVC Storage" (which is just a mirror/backup of the local cache). This terminology is just a convention (for example instead of "DVC Storage" we could have used "DVC Cache Backup".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please don't resolve comments that are not created by you. As I mentioned already, it's extremely hard to follow and track changes this way.
I would prefer to have something human readable . Otherwise we are defining remote through remotes. Remote is remote ...
kind-a of thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to have something human readable . Otherwise we are defining remote through remotes. Remote is remote ... kind-a of thing.
I don't understand this. We are talking about the remotes that are managed with the command dvc remote
. What could be a better name than "Remote" or "DVC Remote"?
"source": "external-data/index.md", | ||
"children": [ | ||
{ | ||
"label": "Local Filesystem", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Local should go very last in all cases. SSH, S3, Google Cloud, Azure, HTTP, HDFS - something like this is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Local supports more cases than the other types, and it also can be used for the mounted data storages (NAS etc.) So, it seems like an important one. Why should it go last?
$ dvc import-url s3://mybucket/file.csv | ||
``` | ||
|
||
Or, using a remote: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote -> remote notation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or put remote
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's kinda not only remote as usual it's also a protocol, so some explanation is needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote -> remote notation?
First we create a remote (with dvc remote add
), so "remote" is ok.
I think that there is no need to introduce the term "remote notation", because it is kind of obvious.
|
||
## External Data and Outputs | ||
|
||
For cached external outputs (specified using `-o`) we need to setup an external |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dvc add
creates an output as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended.
@@ -0,0 +1,89 @@ | |||
# External Data on HDFS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's remove all external data section for HDFS. We don't support it actually and it should be remove from DVC for now. cc @efiop can confirm that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed, we were not able to find a suitable workaround for external outputs. 🙁 So let's hide hdfs external deps and outs from docs for now. We'll probably remove them from dvc too very soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So let's hide hdfs external deps and outs from docs for now.
I am removing the section "External Data and Outputs". But it seems like "External Dependencies" should work. Please correct me if I am wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dashohoxha even external dependencies do not work reliable unfortunately as far as I remember
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
@@ -0,0 +1,66 @@ | |||
# External Data on HTTP/HTTPS | |||
|
|||
**❗ Note:** HTTP/HTTPS remotes support only download operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it means probably that we support only dependencies, right? use the same terminology ... not download operations
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it means probably that we support only dependencies, right? use the same terminology ... not download operations
I think it is better to remove this note, since it says the obvious (there is only the section for dependencies, no section for outputs).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've seen people asking us over and over again - thus this note. It was not cleat either docs are out of date (always) or what :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it was useful when HTTP was in the same page with the other remotes, but in this page it does not make sense.
Probably people are confused by Git remotes, which can be used both to pull and push.
@@ -0,0 +1,22 @@ | |||
# External Data | |||
|
|||
External data are located outside the project directory, either on the local |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we had better explanation before why and when it should be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean this explanation:
There are cases when data is so large, or its processing is organized in a way that you would like to avoid moving it out of its external/remote location. For example from a network attached storage (NAS) drive, processing data on HDFS, running Dask via SSH, or having a script that streams data from S3 to process it. A mechanism for external outputs and external dependencies provides a way for DVC to control data externally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
# External Data | ||
|
||
External data are located outside the project directory, either on the local | ||
filesystem or on a remote host. They may be used in order to: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote host sounds like a machine while it can be actually a cloud
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remote host sounds like a machine while it can be actually a cloud
"remote host" (like a machine) is one the cases. I am also adding "cloud data storage", like this:
External data are located outside the project directory, either on the local
filesystem, or on a remote host, or on a cloud data storage.
8cf733d
to
2f2f9cf
Compare
Again, too complicated, too many changes at once, no luck of iterating on this not willingness to address even small changes. |
Close #520, Close #566, Close #237, Related #108
Consolidate (and deprecate):
Maybe the examples for each remote type (those that are collapsed) should be removed as well from these man pages: