Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

External Data and Remotes (user-guide/external-data) #807

Closed
wants to merge 28 commits into from

Conversation

dashohoxha
Copy link
Contributor

@dashohoxha dashohoxha commented Nov 20, 2019

Close #520, Close #566, Close #237, Related #108

Consolidate (and deprecate):

Maybe the examples for each remote type (those that are collapsed) should be removed as well from these man pages:

@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 20, 2019 15:19 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 22, 2019 05:26 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 22, 2019 05:49 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 22, 2019 14:56 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 22, 2019 21:36 Inactive
@dashohoxha dashohoxha changed the title [WIP] External Data and Remotes (user-guide/external-data) External Data and Remotes (user-guide/external-data) Nov 22, 2019
@shcheklein
Copy link
Member

@dashohoxha good stuff! Will keep reviewing and asking questions.

First is - what is the difference between "Create Remotes" and "DVC storage" sections?

@shcheklein
Copy link
Member

@dashohoxha I also not sure how we went from solving #520 to actually including regular remotes stuff? #520 is about very specific advanced thing - managing external data, while remotes is a very basic functionality. It deserves it's own very details section that explains them very well.

@dashohoxha
Copy link
Contributor Author

what is the difference between "Create Remotes" and "DVC storage" sections

"Creates Remotes" explains all the details and options about a certain remote type (which are set with dvc remote modify).

"DVC storage" emphasizes that we need to use the option -d, --default so that the commands dvc push and dvc pull can use that remote as a "mirror" of .dvc/cache (aka a DVC Storage).

I admit that in some cases there is not much difference between them and they look very similar.

I also not sure how we went from solving #520 to actually including regular remotes stuff? #520 is about very specific advanced thing - managing external data, while remotes is a very basic functionality. It deserves it's own very details section that explains them very well.

External data and remotes seem to me closely related because:

  1. We can use remotes to access and manage external data.
  2. External data management is specific for each remote type. For example some remote types can be used for external outputs, some others cannot.

So, I thought that it is best to explain them together.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 24, 2019 07:43 Inactive
@shcheklein
Copy link
Member

I admit that in some cases there is not much difference between them and they look very similar.

let's remove and simplify

External data and remotes seem to me closely related because:

introducing basic and advanced concepts together sends a wrong signal. We should not emphasize external deps/outs as a regular basic workflow. Also, there should be a separate motivation behind them.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 27, 2019 15:11 Inactive
@dashohoxha
Copy link
Contributor Author

I admit that in some cases there is not much difference between them and they look very similar.

let's remove and simplify

I moved the sections about DVC Storages as collapsed sections under "Remotes".

introducing basic and advanced concepts together sends a wrong signal. We should not emphasize external deps/outs as a regular basic workflow. Also, there should be a separate motivation behind them.

Are you suggesting to split Remotes from External Data, like this:

  • user-guide/remotes/
    • local
    • http
    • ssh
    • amazon
    • s3
    • azure
    • gs
    • hdfs
    • oss
  • user-guide/external-data/
    • local
    • http
    • ssh
    • amazon
    • gs
    • hdfs

@shcheklein
Copy link
Member

I moved the sections about DVC Storages as collapsed sections under "Remotes".

I just don't get the purpose of those sections. Can you elaborate or explain it in some other way? Why do we need both - remotes and dvc storage?

Are you suggesting to split Remotes from External Data, like this:

Yes. I'm not sure about exact place and structure. But external data management should be separate for sure. Remotes - they can be part of the Basic Concepts, they can be part of the Managing Data section. External Data Management should be part of the Advanced section of the User Guide.

^^ this is not a precise instruction, it's just a way for me to communicate how far I think those two should be split.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 November 28, 2019 20:29 Inactive
@dashohoxha
Copy link
Contributor Author

let's remove and simplify

I removed the sections about "DVC storage".

I also split the "Remotes" from "External Data":

The first one ("Remotes") is basically a consolidation of remote types on:

I seems like the command dvc remote only takes care of the configuration file .dvc/config (or .dvc/config.local etc.) One could modify these configuration files manually (using an editor, without using dvc remote) and everything would be fine. So, it seems more appropriate to explain different remote types on the User Guide.

The second section ("External Data") is an attempt to consolidate and improve these pages:

@dashohoxha
Copy link
Contributor Author

I am not sure that I understand this note from managing-external-data:

Avoid using the same remote location that you are using for dvc push, dvc pull, dvc fetch as external cache for your external outputs, because it may cause possible checksum overlaps. Checksum for some data file on an external storage can potentially collide with checksum generated locally for a different file, with a different content.

The first part seems reasonable: "Avoid using the same remote location".

The explanation seems nonsense. If there is some possibility for checksum collision, this may happen for the cached files as well (two different cached files happen to have the same checksum). If those external data files that we are tracking with DVC were instead local files, we would store them in cache with all the other project files. So, what difference does it make if they are tracked remotely?

Maybe some other explanation is needed or maybe we should remove this note.

]
},
{
"label": "External Data",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we keep the Managing External Data label and name? Not much difference but redirect won't be required.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can't we keep the Managing External Data label and name? Not much difference but redirect won't be required.

Redirect would still be required because of "external-dependencies".
On the other hand "External Data" is more generic than "Managing External Data". For example in the case of HTTP (and other cases) we are just accessing or downloading external data, "managing" doesn't seem suitable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, please, don't resolve. I have to repeat myself and create duplicates.

Don't see any reasons to rename and complicate this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't see any reasons to rename and complicate this.

I don't understand, what is the complication that comes from renaming? In the first comment you say "not much difference", and I agree with this. And redirect is still required even if we don't change the name.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one extra redirect to support + updating links in the docs

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have already updated the links on the docs, and also have fixed the redirect.
Changing the name back would be extra work, and unjustified, unless the old name is better than the new one.

@@ -120,6 +120,80 @@
"label": "Managing External Data",
"slug": "managing-external-data"
},
{
"label": "Remotes",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come up with a better name - Setting up storage?

Move the section up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Come up with a better name - Setting up storage?

I am renaming to "DVC Remotes". I am using "DVC Storage" for the mirror of .dvc/cache/. A "DVC Remote" is analogous to a Git Remote, and is broader than a "DVC Storage" (which is just a mirror/backup of the local cache). This terminology is just a convention (for example instead of "DVC Storage" we could have used "DVC Cache Backup".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't resolve comments that are not created by you. As I mentioned already, it's extremely hard to follow and track changes this way.

I would prefer to have something human readable . Otherwise we are defining remote through remotes. Remote is remote ... kind-a of thing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to have something human readable . Otherwise we are defining remote through remotes. Remote is remote ... kind-a of thing.

I don't understand this. We are talking about the remotes that are managed with the command dvc remote. What could be a better name than "Remote" or "DVC Remote"?

"source": "external-data/index.md",
"children": [
{
"label": "Local Filesystem",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local should go very last in all cases. SSH, S3, Google Cloud, Azure, HTTP, HDFS - something like this is better

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Local supports more cases than the other types, and it also can be used for the mounted data storages (NAS etc.) So, it seems like an important one. Why should it go last?

$ dvc import-url s3://mybucket/file.csv
```

Or, using a remote:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote -> remote notation?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or put remote

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's kinda not only remote as usual it's also a protocol, so some explanation is needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote -> remote notation?

First we create a remote (with dvc remote add), so "remote" is ok.
I think that there is no need to introduce the term "remote notation", because it is kind of obvious.


## External Data and Outputs

For cached external outputs (specified using `-o`) we need to setup an external
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dvc add creates an output as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amended.

@@ -0,0 +1,89 @@
# External Data on HDFS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's remove all external data section for HDFS. We don't support it actually and it should be remove from DVC for now. cc @efiop can confirm that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed, we were not able to find a suitable workaround for external outputs. 🙁 So let's hide hdfs external deps and outs from docs for now. We'll probably remove them from dvc too very soon.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So let's hide hdfs external deps and outs from docs for now.

I am removing the section "External Data and Outputs". But it seems like "External Dependencies" should work. Please correct me if I am wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dashohoxha even external dependencies do not work reliable unfortunately as far as I remember

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed.

@@ -0,0 +1,66 @@
# External Data on HTTP/HTTPS

**❗ Note:** HTTP/HTTPS remotes support only download operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means probably that we support only dependencies, right? use the same terminology ... not download operations

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it means probably that we support only dependencies, right? use the same terminology ... not download operations

I think it is better to remove this note, since it says the obvious (there is only the section for dependencies, no section for outputs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've seen people asking us over and over again - thus this note. It was not cleat either docs are out of date (always) or what :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it was useful when HTTP was in the same page with the other remotes, but in this page it does not make sense.
Probably people are confused by Git remotes, which can be used both to pull and push.

@@ -0,0 +1,22 @@
# External Data

External data are located outside the project directory, either on the local
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we had better explanation before why and when it should be used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean this explanation:

There are cases when data is so large, or its processing is organized in a way that you would like to avoid moving it out of its external/remote location. For example from a network attached storage (NAS) drive, processing data on HDFS, running Dask via SSH, or having a script that streams data from S3 to process it. A mechanism for external outputs and external dependencies provides a way for DVC to control data externally.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

# External Data

External data are located outside the project directory, either on the local
filesystem or on a remote host. They may be used in order to:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote host sounds like a machine while it can be actually a cloud

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remote host sounds like a machine while it can be actually a cloud

"remote host" (like a machine) is one the cases. I am also adding "cloud data storage", like this:

External data are located outside the project directory, either on the local
filesystem, or on a remote host, or on a cloud data storage.

@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 December 1, 2019 00:10 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 December 1, 2019 00:13 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 December 1, 2019 01:05 Inactive
@dashohoxha dashohoxha requested a review from shcheklein December 1, 2019 01:27
@shcheklein shcheklein temporarily deployed to dvc-org-pr-807 December 9, 2019 17:03 Inactive
@shcheklein
Copy link
Member

Again, too complicated, too many changes at once, no luck of iterating on this not willingness to address even small changes.

@shcheklein shcheklein closed this Dec 14, 2019
@efiop efiop deleted the external-data branch December 15, 2019 14:21
@efiop efiop restored the external-data branch December 15, 2019 14:21
@jorgeorpinel jorgeorpinel deleted the external-data branch January 23, 2020 17:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants