Update dvc install and dvc import documentation #260

Merged
merged 6 commits into from
Apr 25, 2019

Conversation

robogeek
Contributor

This commit is only the dvc install documentation. The dvc import documentation is still to come.

`gs` | Google Storage | `gs://mybucket/data.csv`
`ssh` | SSH server | `ssh://[email protected]:/path/to/data.csv`
`hdfs` | HDFS | `hdfs://[email protected]/path/to/data.csv`
`http` | HTTP to file with _strong ETag_ | `https://example.com/path/to/data.csv`
Member

There is actually one more: `remote://`. See ticket #108. It would be great to add it here and propagate the explanation to the external dependencies section, and to `dvc run` if necessary.

Contributor Author

The remote URL needs to be documented in the dvc remote command documentation. It should then be enough to reference that documentation from here.

Member

I would still explain it briefly here - just give an example of the transformation.
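
To illustrate the kind of transformation discussed in this thread, here is a minimal sketch of how a `remote://` URL might expand into a full URL. It is an assumption about the general idea, not DVC's actual implementation: the function `resolve_remote_url`, the remote name `myremote`, and the base URL `gs://mybucket/datasets` are all hypothetical.

```python
def resolve_remote_url(url, remotes):
    """Expand remote://<name>/<path> using a mapping of remote names to base URLs.

    `remotes` is a hypothetical dict standing in for configured remotes.
    """
    prefix = "remote://"
    if not url.startswith(prefix):
        return url  # not a remote:// URL, so leave it untouched

    # Split "<name>/<path>" into the remote name and the path below it.
    name, _, path = url[len(prefix):].partition("/")
    if name not in remotes:
        raise KeyError(f"unknown remote: {name}")

    base = remotes[name].rstrip("/")
    return f"{base}/{path}" if path else base


# Hypothetical remote definition, for illustration only.
remotes = {"myremote": "gs://mybucket/datasets"}
print(resolve_remote_url("remote://myremote/data.csv", remotes))
# prints gs://mybucket/datasets/data.csv
```

A one-line example of this form in the docs would show readers what the short URL stands for without repeating the full `dvc remote` documentation.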

Member

@shcheklein left a comment

Looks great! Really good stuff.

I left lots of small comments. Probably the biggest one concerns the second example in `dvc import`. Let's simplify it and use a local "remote" instead. That way we can make it reproducible.

It would be great to add a sentence like: one of the use cases for `dvc import` is tracking inputs to an ETL pipeline. Imagine you use cron to run `dvc repro`, which checks some external file and, if it has changed, rebuilds a model.
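
The change check behind that use case can be sketched as follows. This is only an illustration under stated assumptions, not how DVC itself detects changes (its checks may differ by remote type, for example ETags for HTTP). The helper names `file_md5` and `needs_rebuild` are hypothetical, and the sketch assumes the external file is reachable as a local path.

```python
import hashlib


def file_md5(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_rebuild(path, recorded_md5):
    """True if the file's current checksum differs from the recorded one."""
    return file_md5(path) != recorded_md5
```

A scheduled job would record the checksum after each successful build and rerun the pipeline only when `needs_rebuild` returns `True`.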

@@ -51,6 +51,10 @@
The output of `dvc checkout` does not list which data files were restored. It
does report removed files and files that DVC was unable to restore due to it
missing from the cache.

This command will fail to checkout files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
checked out without error will be restored.
Member

I would say "any files that are found in the cache" instead of "without error". It's usually not an error; it's a warning, as you mentioned above.

* `ssh` - URL to a file on another machine with SSH access
* `hdfs` - URL to a file on HDFS
* `http` - URL to a file with a _strong ETag_ served with HTTP or HTTPS
Import file from any supported URL or local directory to local workspace and track changes in remote file.
Member

This line is probably not wrapped at 80 characters.

@shcheklein merged commit f18a49d into master on Apr 25, 2019
@shcheklein deleted the import-install branch on April 30, 2019 20:00