get-started: full refactoring #1074

Closed
wants to merge 56 commits into from
56 commits
9ddaf88
tutorials: consolidating agenda, initialize, and configure into get-s…
jorgeorpinel Mar 19, 2020
fff66f3
Allow to set empty slug in sidebar.json
psdcoder Mar 19, 2020
ef4ece6
home: put it in /doc directly
jorgeorpinel Mar 19, 2020
4f0a1ad
Merge branch 'refactor/get-started' into refactor/get-started-2
jorgeorpinel Mar 19, 2020
263838a
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Mar 21, 2020
61dd537
[WIP] get-started: combine chapters into index + 3 pages
jorgeorpinel Mar 21, 2020
e0a424d
tutorials: finish get-started index page (part 0)
jorgeorpinel Mar 23, 2020
8c11f74
tutorials: a few more impovements to get-started index
jorgeorpinel Mar 23, 2020
3f13e2b
tutorials: finish get-started/versioning-basics page (page 1/3)
jorgeorpinel Mar 24, 2020
dc7ffae
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 8, 2020
ed44410
redirect: fix bad redirect from recent merge
jorgeorpinel Apr 8, 2020
482f3dc
tutorials: get-started index updates
jorgeorpinel Apr 8, 2020
1a41422
home: shorten What is DVC? section, move some to use cases index
jorgeorpinel Apr 8, 2020
fa29455
term: review use of "section" in Get Started tut
jorgeorpinel Apr 8, 2020
b2600a0
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 8, 2020
d890531
tutorials: review H2 sections and links around the get started
jorgeorpinel Apr 9, 2020
e634ed0
tutorial: reformat GS versioning page
jorgeorpinel Apr 9, 2020
4adb159
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 12, 2020
441aafc
tutorials: new get started up to stages section
jorgeorpinel Apr 14, 2020
25e6154
tutorials: review GS up to data-pipelines page
jorgeorpinel Apr 14, 2020
686eb3b
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 14, 2020
46ff5c6
sidebar: roll back empty element change
jorgeorpinel Apr 15, 2020
f491fc3
tutorials: rewrite GS upto experiment-management#experiments
jorgeorpinel Apr 15, 2020
67ca828
tutorials: add link to API ref from versioning-basics#import-data
jorgeorpinel Apr 15, 2020
ded65e5
tutorials: updates into experiment-management#experiments
jorgeorpinel Apr 15, 2020
5eaf397
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 17, 2020
1a4ebcb
windows: review some notes about Windows, esp regarding wget
jorgeorpinel Apr 17, 2020
45d497b
home: make main topics into basic HTML tiles
jorgeorpinel Apr 17, 2020
a7f7d9e
get started: push and fetch/pull in a single section
jorgeorpinel Apr 18, 2020
b03c1a9
get started: another round of improvements for intro and
jorgeorpinel Apr 19, 2020
10cbcd9
get started: another round of ipros up to data-versioning
jorgeorpinel Apr 19, 2020
b1dd35b
get started: list layers in index, rename Experiments page, et al.
jorgeorpinel Apr 19, 2020
e24883c
get started: remove 2 extra visualize outputs
jorgeorpinel Apr 20, 2020
73e703c
get started: improvements to data-pipelines part, et al.
jorgeorpinel Apr 20, 2020
ca6ab67
get started: copy edits per Ivan't feedback
jorgeorpinel Apr 21, 2020
64b77e3
get started: fix command story up to data-pipelines#stages
jorgeorpinel Apr 22, 2020
8b8dd9e
Merge branch 'master' into refactor/get-started-2
jorgeorpinel Apr 22, 2020
c8b0081
get started: expand on params, move section to data-pipelines page
jorgeorpinel Apr 22, 2020
23a0a13
get started: complete params section back in experiments page
jorgeorpinel Apr 22, 2020
3b48857
get started: misc. small fixes
jorgeorpinel Apr 23, 2020
9297a41
get started: simplify index, et al.
jorgeorpinel Apr 23, 2020
f250180
get started: simplify through data-versioning#configure-remote-storage
jorgeorpinel Apr 23, 2020
8d23b5d
get started: simplify complete data-versioning page
jorgeorpinel Apr 24, 2020
f2c76bc
get started: simplify complete data-versioning page (2)
jorgeorpinel Apr 24, 2020
ac4bd48
get started: update metrics examples to use JSON
jorgeorpinel Apr 24, 2020
e0e573c
get started: more simplification and improvements up to data-pipeline…
jorgeorpinel Apr 25, 2020
11c068c
get started: chnges to GS intro per private feedback
jorgeorpinel Apr 26, 2020
5bba6ee
get started: jump right into action in data-versioning, et al.
jorgeorpinel Apr 26, 2020
a37e8e5
get started: yet another round of feedback on data-versioning
jorgeorpinel Apr 29, 2020
15d5139
Merge branch 'master' into refactor/get-started-2
jorgeorpinel May 14, 2020
e84d319
Merge branch 'master' into refactor/get-started-2
jorgeorpinel May 19, 2020
39ac903
Merge branch 'master' into refactor/get-started-2
jorgeorpinel May 20, 2020
24e5edf
Merge branch 'master' into refactor/get-started-2
jorgeorpinel May 21, 2020
a0c7c75
Merge branch 'master' into refactor/get-started-2
jorgeorpinel May 25, 2020
b3c4057
Merge branch 'master' into refactor/get-started-2 and
jorgeorpinel May 27, 2020
0e01aec
get-started: simplifications to fata-pipelines
jorgeorpinel May 27, 2020
8 changes: 4 additions & 4 deletions content/docs/api-reference/open.md
@@ -108,10 +108,10 @@ with dvc.api.open(

Notice that we use a [SAX](http://www.saxproject.org/) XML parser here because
`dvc.api.open()` is able to stream the data from
[remote storage](/doc/command-reference/remote/add#supported-storage-types).
(The `mySAXHandler` object should handle the event-driven parsing of the
document in this case.) This increases the performance of the code (minimizing
memory usage), and is typically faster than loading the whole data into memory.
[remote storage](/doc/command-reference/remote/add). (The `mySAXHandler` object
should handle the event-driven parsing of the document in this case.) This
increases the performance of the code (minimizing memory usage), and is
typically faster than loading the whole data into memory.
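
As a minimal sketch of the streaming approach this paragraph describes, the snippet below feeds a file-like object to a SAX handler so the document is never loaded whole. The `RowCounter` class and `count_rows` function are hypothetical names for illustration, and an in-memory `io.BytesIO` stands in for the object that `dvc.api.open()` would yield, so the example runs without DVC installed.

```python
import io
import xml.sax


class RowCounter(xml.sax.ContentHandler):
    """Hypothetical handler: counts <row> elements as they stream by."""

    def __init__(self):
        super().__init__()
        self.rows = 0

    def startElement(self, name, attrs):
        # Called once per opening tag; nothing is accumulated in memory.
        if name == "row":
            self.rows += 1


def count_rows(stream):
    """Parse a binary file-like object incrementally and count its rows."""
    handler = RowCounter()
    xml.sax.parse(stream, handler)
    return handler.rows


# Stand-in for a stream opened from remote storage (an assumption made
# so the sketch is self-contained).
fake_stream = io.BytesIO(b"<posts><row Id='1'/><row Id='2'/></posts>")
print(count_rows(fake_stream))
```

Because the handler only reacts to events, memory use stays roughly constant regardless of file size.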

> If you just needed to load the complete file contents into memory, you can use
> `dvc.api.read()` instead:
6 changes: 3 additions & 3 deletions content/docs/command-reference/cache/index.md
@@ -15,9 +15,9 @@ positional arguments:

## Description

At DVC initialization, a new `.dvc/` directory will be created for internal
configuration and cache
[files and directories](/doc/user-guide/dvc-files-and-directories) that are
At DVC initialization, a new `.dvc/` directory is created for internal
configuration and <abbr>cache</abbr>
[files and directories](/doc/user-guide/dvc-files-and-directories), that are
hidden from the user.

The cache is where your data files, models, etc. (anything you want to version
3 changes: 2 additions & 1 deletion content/docs/command-reference/diff.md
@@ -107,7 +107,8 @@ $ dvc diff

Let's checkout the
[3-add-file](https://github.com/iterative/example-get-started/releases/tag/3-add-file)
tag, corresponding to the [Add Files](/doc/tutorials/get-started/add-files) _Get
tag, corresponding to the
[tracking data](/doc/tutorials/get-started/data-versioning#tracking-data) _Get
Started_ chapter, right after we added `data.xml` file with DVC:

```dvc
20 changes: 10 additions & 10 deletions content/docs/command-reference/fetch.md
@@ -47,27 +47,27 @@ in the project's <abbr>cache</abbr>. (Refer to `dvc remote` for more information
on DVC remotes.) These necessary data or model files are listed as
<abbr>dependencies</abbr> or <abbr>outputs</abbr> in a DVC-file (target
[stage](/doc/command-reference/run)) so they are required to
[reproduce](/doc/tutorials/get-started/reproduce) the corresponding
[pipeline](/doc/command-reference/pipeline). (See
[reproduce](/doc/tutorials/get-started/data-pipelines#reproduce) the
corresponding [pipeline](/doc/command-reference/pipeline). (See
[DVC-File Format](/doc/user-guide/dvc-file-format) for more information on
dependencies and outputs.)

`dvc fetch` ensures that the files needed for a DVC-file to be
[reproduced](/doc/tutorials/get-started/reproduce) exist in cache. If no
`targets` are specified, the set of data files to fetch is determined by
analyzing all DVC-files in the current branch, unless `--all-branches` or
`--all-tags` is specified.
[reproduced](/doc/tutorials/get-started/data-pipelines#reproduce) exist in
cache. If no `targets` are specified, the set of data files to fetch is
determined by analyzing all DVC-files in the current branch, unless
`--all-branches` or `--all-tags` is specified.

The default remote is used (see `dvc config core.remote`) unless the `--remote`
option is used.

`dvc fetch`, `dvc pull`, and `dvc push` are related in that these 3 commands
perform data synchronization among local and remote storage. The specific way in
which the set of files to push/fetch/pull is determined begins with calculating
file hashes when these are [added](/doc/tutorials/get-started/add-files) with
DVC. File hashes are stored in the corresponding DVC-files (typically versioned
with Git). Only the hashes specified in DVC-files currently in the workspace are
considered by `dvc fetch` (unless the `-a` or `-T` options are used).
file hashes when these are [added](/doc/command-reference/add) with DVC. File
hashes are stored in the corresponding DVC-files (typically versioned with Git).
Only the hashes specified in DVC-files currently in the workspace are considered
by `dvc fetch` (unless the `-a` or `-T` options are used).
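
To make the idea of a file hash concrete, here is a rough sketch of computing an MD5 digest for a file in fixed-size chunks. This is an illustration only: the `file_md5` helper and the chunk size are assumptions, and DVC's own hashing may differ in its details (for example for directories).

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it chunk by chunk."""
    digest = hashlib.md5()
    with open(path, "rb") as fobj:
        # iter() with a sentinel stops when read() returns empty bytes.
        for chunk in iter(lambda: fobj.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage: hash a small temporary file (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "data.xml")
    with open(sample, "wb") as fobj:
        fobj.write(b"<posts/>")
    print(file_md5(sample))
```

Two files with identical contents produce the same digest, which is what lets a hash recorded in a DVC-file identify the data to push, fetch, or pull.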

## Options

2 changes: 1 addition & 1 deletion content/docs/command-reference/get-url.md
@@ -4,7 +4,7 @@ Download a file or directory from a supported URL (for example `s3://`,
`ssh://`, and other protocols) into the local file system.

> See `dvc get` to download data/model files or directories from other <abbr>DVC
> repositories</abbr> (e.g. hosted on GitHub).
> repositories</abbr> (e.g. hosted on Github).

## Synopsis

17 changes: 9 additions & 8 deletions content/docs/command-reference/get.md
@@ -148,8 +148,8 @@ https://remote.dvc.org/get-started/66/2eb7f64216d9c2c1088d0a5e2c6951
location to place the target data within the workspace. Combining these two
options allows us to do something we can't achieve with the regular
`git checkout` + `dvc checkout` process – see for example the
[Get Older Data Version](/doc/tutorials/get-started/older-versions) chapter of
our _Get Started_.
[Get Older Data Version](/doc/tutorials/get-started/data-versioning#navigate-versions)
chapter of our _Get Started_.

Let's use the
[get started example repo](https://github.com/iterative/example-get-started)
@@ -161,12 +161,13 @@ $ git clone https://github.com/iterative/example-get-started
$ cd example-get-started
```

If you are familiar with our [Get Started](/doc/tutorials/get-started) project
(used in these examples), you may remember that the chapter where we train a
first version of the model corresponds to the the `baseline-experiment` tag in
the repo. Similarly `bigrams-experiment` points to an improved model (trained
using bigrams). What if we wanted to have both versions of the model "checked
out" at the same time? `dvc get` provides an easy way to do this:
If you are familiar with the project in our
[Get Started](/doc/tutorials/get-started) (used in these examples), you may
remember that the chapter where we train a first version of the model
corresponds to the the `baseline-experiment` tag in the repo. Similarly
`bigrams-experiment` points to an improved model (trained using bigrams). What
if we wanted to have both versions of the model "checked out" at the same time?
`dvc get` provides an easy way to do this:

```dvc
$ dvc get . model.pkl --rev baseline-experiment
21 changes: 11 additions & 10 deletions content/docs/command-reference/import-url.md
@@ -5,7 +5,7 @@ Download a file or directory from a supported URL (for example `s3://`,
changes in the remote data source. Creates a DVC-file.

> See `dvc import` to download and track data/model files or directories from
> other <abbr>DVC repositories</abbr> (e.g. hosted on GitHub).
> other <abbr>DVC repositories</abbr> (e.g. hosted on Github).

## Synopsis

@@ -136,8 +136,8 @@ in the [Get Started](/doc/tutorials/get-started).
Start by cloning our example repo if you don't already have it. Then move into
the repo and checkout the
[2-remote](https://github.com/iterative/example-get-started/releases/tag/2-remote)
tag, corresponding to the [Configure](/doc/tutorials/get-started/configure) _Get
Started_ chapter:
tag, corresponding to the [Configure](/doc/tutorials/get-started#configure)
section of the _Get Started_:

```dvc
$ git clone https://github.com/iterative/example-get-started
@@ -146,15 +146,16 @@ $ git checkout 2-remote
$ mkdir data
```

You should now have a blank <abbr>workspace</abbr>, just before the
[Add Files](/doc/tutorials/get-started/add-files) chapter.
You should now have a blank <abbr>workspace</abbr>, just before
[Versioning Basics](/doc/tutorials/get-started/data-versioning).

</details>

## Example: Tracking a remote file

An advanced alternate to [Add Files](/doc/tutorials/get-started/add-files)
chapter of the _Get Started_ is to use `dvc import-url`:
An advanced alternate to the intro of the
[Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get
Started_ is to use `dvc import-url`:

```dvc
$ dvc import-url https://data.dvc.org/get-started/data.xml \
@@ -246,9 +247,9 @@ directory we created previously. (Its `path` has the URL for the data store.)
And instead of an `etag` we have an `md5` hash value. We did this so it's easy to
edit the data file.

Let's now manually reproduce a
[processing chapter](/doc/tutorials/get-started/connect-code-and-data) from the
_Get Started_ project. Download the example source code archive and unzip it:
Let's now manually reproduce the
[data processing part](/doc/tutorials/get-started/data-pipelines) of the _Get
Started_ project. Download the example source code archive and unzip it:

```dvc
$ wget https://code.dvc.org/get-started/code.zip
2 changes: 1 addition & 1 deletion content/docs/command-reference/import.md
@@ -181,7 +181,7 @@ $ dvc update --rev cats-dogs-v2
## Example: Data registry

If you take a look at our
[dataset-registry](https://github.com/iterative/dataset-registry)
[dataset registry](https://github.com/iterative/dataset-registry)
<abbr>project</abbr>, you'll see that it's organized into different directories
such as `tutorial/ver` and `use-cases/`, and these contain
[DVC-files](/doc/user-guide/dvc-file-format) that track different datasets.
6 changes: 3 additions & 3 deletions content/docs/command-reference/init.md
@@ -10,9 +10,9 @@ usage: dvc init [-h] [-q | -v] [--no-scm] [-f] [--subdir]

## Description

DVC works on top of a Git repository by default. This enables all features,
providing the most value. It means that `dvc init` (without flags) expects to
run in a Git repository root (a `.git/` directory should be present).
DVC works best in a Git repository. This enables all features, providing the
most value. For this reason, `dvc init` (without flags) expects to run in a Git
repository root (a `.git/` directory should be present).

The command [options](#options) can be used to start an alternative workflow for
advanced scenarios:
3 changes: 2 additions & 1 deletion content/docs/command-reference/metrics/show.md
@@ -103,6 +103,7 @@ Examples in [add](/doc/command-reference/metrics/add),
[remove](/doc/command-reference/metrics/remove) cover most of the basic cases
for the `dvc metrics show`.

The [Compare Experiments](/doc/tutorials/get-started/compare-experiments)
The
[Compare Experiments](/doc/tutorials/get-started/experiments#compare-experiments)
chapter of our _Get Started_ covers the `-a` option to collect and print a
metric file value across all Git branches.
16 changes: 9 additions & 7 deletions content/docs/command-reference/pipeline/index.md
@@ -1,6 +1,7 @@
# pipeline

A set of commands to manage [pipelines](/doc/tutorials/get-started/pipeline):
A set of commands to manage
[pipelines](/doc/tutorials/get-started/data-pipelines):
[show](/doc/command-reference/pipeline/show) and
[list](/doc/command-reference/pipeline/list).

@@ -17,12 +18,13 @@ positional arguments:

## Description

A data pipeline, in general, is a series of data processes (for example console
commands that take an input and produce an <abbr>output</abbr>). A pipeline may
produce intermediate data, and has a final result. Machine Learning (ML)
pipelines typically start a with large raw datasets, include intermediate
featurization and training stages, and produce a final model, as well as
accuracy [metrics](/doc/command-reference/metrics).
A data pipeline, in general, is a series of data processing
[stages](/doc/command-reference/run) (for example console commands that take an
input and produce an <abbr>output</abbr>). A pipeline may produce intermediate
data, and has a final result. Machine learning (ML) pipelines typically start a
with large raw datasets, include intermediate featurization and training stages,
and produce a final model, as well as accuracy
[metrics](/doc/command-reference/metrics).

In DVC, pipeline stages and commands, their data I/O, interdependencies, and
results (intermediate or final) are specified with `dvc add` and `dvc run`,
7 changes: 3 additions & 4 deletions content/docs/command-reference/remote/add.md
@@ -355,10 +355,9 @@ $ dvc remote add myremote https://example.com/path/to/dir
A "local remote" is a directory in the machine's file system.

> While the term may seem contradictory, it doesn't have to be. The "local" part
> refers to the machine where the <abbr>project</abbr> is stored, so it can be
> any directory accessible to the same system. The "remote" part refers
> specifically to the project/repository itself. Read "local, but external"
> storage.
> refers to the type of location where the storage is: another directory in the
> same file system. "Remote" is how we call storage for <abbr>DVC
> projects</abbr>. It's essentially a local backup for data tracked by DVC.

Using an absolute path (recommended):

8 changes: 4 additions & 4 deletions content/docs/command-reference/remote/index.md
@@ -26,7 +26,7 @@ positional arguments:

What is data remote?

The same way as GitHub provides storage hosting for Git repositories, DVC
The same way as Github provides storage hosting for Git repositories, DVC
remotes provide a central place to keep and share data and model files. With
this remote storage, you can pull models and data files created by colleagues
without spending time and resources to build or process them locally. It also
@@ -76,9 +76,9 @@ For the typical process to share the <abbr>project</abbr> via remote, see
### What is a "local remote" ?

While the term may seem contradictory, it doesn't have to be. The "local" part
refers to the machine where the <abbr>project</abbr> is stored, so it can be any
directory accessible to the same system. The "remote" part refers specifically
to the project/repository itself. Read "local, but external" storage.
refers to the type of location where the storage is: another directory in the
same file system. "Remote" is how we call storage for <abbr>DVC projects</abbr>.
It's essentially a local backup for data tracked by DVC.

</details>

8 changes: 4 additions & 4 deletions content/docs/command-reference/remote/list.md
@@ -39,16 +39,16 @@ including names and URLs.

## Examples

Let's for simplicity add a _default_ local remote:
For simplicity, let's add a default local remote:

<details>

### What is a "local remote" ?

While the term may seem contradictory, it doesn't have to be. The "local" part
refers to the machine where the <abbr>project</abbr> is stored, so it can be any
directory accessible to the same system. The "remote" part refers specifically
to the project/repository itself. Read "local, but external" storage.
refers to the type of location where the storage is: another directory in the
same file system. "Remote" is how we call storage for <abbr>DVC projects</abbr>.
It's essentially a local backup for data tracked by DVC.

</details>

4 changes: 2 additions & 2 deletions content/docs/command-reference/remote/modify.md
@@ -184,10 +184,10 @@ these settings, you could use the following options:
$ dvc remote modify myremote grant_full_control id=aws-canonical-user-id,id=another-aws-canonical-user-id
```

> \* - `grant_read`, `grant_read_acp`, `grant_write_acp` and
> \* `grant_read`, `grant_read_acp`, `grant_write_acp` and
> `grant_full_control` params are mutually exclusive with `acl`.
>
> \*\* - default ACL grantees are overwritten. Grantees are AWS accounts
> \*\* default ACL grantees are overwritten. Grantees are AWS accounts
> identifiable by `id` (AWS Canonical User ID), `emailAddress` or `uri`
> (predefined group).

38 changes: 15 additions & 23 deletions content/docs/index.md
@@ -1,34 +1,26 @@
# DVC Documentation

Welcome! In here you may find all the guiding material and technical documents
needed to learn about DVC: how to use it, how it works, and where to go for
additional resources.
Data Version Control, or DVC, is a data and ML experiments management tool that
takes advantage of the existing engineering toolset that you're already familiar
with (Git, CI/CD, etc.)

<cards>

<card href="/doc/tutorials/get-started" heading="Get Started">
<card href="/doc/tutorials/get-started" heading="Get Started">
A step-by-step introduction into basic DVC features
</card>

A step-by-step introduction into basic DVC features
<card href="/doc/user-guide" heading="User Guide">
Study the detailed inner-workings of DVC in its user guide.
</card>

</card>
<card href="/doc/use-cases" heading="Use Cases">
Non-exhaustive list of scenarios DVC can help with
</card>

<card href="/doc/user-guide" heading="User Guide">

Study the detailed inner-workings of DVC in its user guide.

</card>

<card href="/doc/use-cases" heading="Use Cases">

Non-exhaustive list of scenarios DVC can help with

</card>

<card href="/doc/command-reference" heading="Command Reference">

See all of DVC's commands.

</card>
<card href="/doc/command-reference" heading="Command Reference">
See all of DVC's commands.
</card>

</cards>

2 changes: 1 addition & 1 deletion content/docs/install/macos.md
@@ -14,7 +14,7 @@ $ brew install dvc
## Install from package

Get the PKG (binary) from the big "Download" button on the [home page](/), or
from the [release page](https://github.com/iterative/dvc/releases/) on GitHub.
from the [release page](https://github.com/iterative/dvc/releases/) on Github.

> Note that currently, in order to open the PKG file, you must go to the
> Downloads directory in Finder and use
2 changes: 1 addition & 1 deletion content/docs/install/pre-release.md
@@ -1,7 +1,7 @@
# Install Pre-release Version

If you want to test the latest stable version of DVC, ahead of official
releases, you can install it from our code repository GitHub.
releases, you can install it from our code repository Github.

> We **strongly** recommend creating a
> [virtual environment](https://packaging.python.org/tutorials/installing-packages/#creating-virtual-environments)