Commit

Merge branch 'master' into jorgeorpinel
jorgeorpinel committed Jan 8, 2020
2 parents 5963f6c + 7c180ee commit a89d268
Showing 18 changed files with 204 additions and 183 deletions.
3 changes: 0 additions & 3 deletions .gitignore
@@ -17,9 +17,6 @@ lib-cov
# Coverage directory used by tools like istanbul
coverage

# nyc test coverage
.nyc_output

# Grunt intermediate storage (http://gruntjs.com/creating-plugins#storing-task-files)
.grunt

2 changes: 1 addition & 1 deletion package.json
@@ -5,7 +5,7 @@
"main": "index.js",
"scripts": {
"dev": "node server.js",
"dev:debug": "node --inspect server.js",
"debug": "node --inspect-brk server.js",
"build": "next build",
"test": "jest",
"start": "NODE_ENV=production node server.js",
4 changes: 2 additions & 2 deletions pages/doc.js
@@ -60,7 +60,7 @@ export default function Documentation({ item, headings, markdown, errorCode }) {
apiKey: '755929839e113a981f481601c4f52082',
indexName: 'dvc',
inputSelector: '#doc-search',
debug: false // Set debug to true if you want to inspect the dropdown
debug: false // Set to `true` if you want to inspect the dropdown
})
}
} catch (ReferenceError) {
@@ -81,7 +81,7 @@ export default function Documentation({ item, headings, markdown, errorCode }) {
return () => Router.events.off('routeChangeComplete', handleRouteChange)
}, [])

const githubLink = `https://github.com/iterative/dvc.org/blob/master${source}`
const githubLink = `https://github.com/iterative/dvc.org/blob/master/public${source}`

return (
<Page stickHeader={true}>
100 changes: 49 additions & 51 deletions public/static/docs/command-reference/checkout.md
@@ -6,7 +6,7 @@ DVC-files.
## Synopsis

```usage
usage: dvc checkout [-h] [-q | -v] [-d] [-f] [-R]
usage: dvc checkout [-h] [-q | -v] [-d] [-R] [-f] [--relink]
[targets [targets ...]]
positional arguments:
@@ -16,60 +16,49 @@ positional arguments:

## Description

[DVC-files](/doc/user-guide/dvc-file-format) in a <abbr>project</abbr> specify
which instance of each data file or directory should be used, with the checksums
saved in the `outs` field. The `dvc checkout` command updates the workspace data
to match with the <abbr>cached</abbr> files that correspond to those checksums.

Using an SCM like Git, the DVC-files are kept under version control. At a given
branch or tag of the SCM repository, the DVC-files will contain checksums for
the corresponding data files kept in the cache. After an SCM command like
`git checkout` is run, the DVC-files will change to the state at the specified
branch or commit or tag. Afterwards, the `dvc checkout` command is required in
order to synchronize the data files with the currently checked out DVC-files.

This command must be executed after `git checkout` since Git doesn't track files
that are under DVC control. For convenience a Git hook is available, simply by
running `dvc install`, that will automate running `dvc checkout` after
`git checkout`. See `dvc install` for more information.

The execution of `dvc checkout` does:

- Scan the `outs` entries in DVC-files to compare with the currently checked out
data files. The scanned DVC-files is limited by the listed `targets` (if any)
on the command line. And if the `--with-deps` option is specified, it scans
backward from the given `targets` in the corresponding
[pipeline](/doc/command-reference/pipeline).

- For any data files where the checksum doesn't match their DVC-file entry, the
data file is restored from the cache. The link strategy used (`reflink`,
`hardlink`, `symlink`, or `copy`) depends on the OS and the configured value
for `cache.type` – See `dvc config cache`.

Note that this command by default tries NOT to copy files between the cache and
the workspace, using reflinks instead when supported by the file system. (Refer
to
[DVC-files](/doc/user-guide/dvc-file-format) act as pointers to specific versions
of data files or directories under DVC control. This command synchronizes the
workspace data with the versions specified in the current DVC-files.

`dvc checkout` is useful, for example, when using Git in the
<abbr>project</abbr>, after `git clone`, `git checkout`, or any other operation
that changes the DVC-files in the workspace.

💡 For convenience, a Git hook is available to automate running `dvc checkout`
after `git checkout`. Use `dvc install` to install it.

The execution of `dvc checkout` does the following:

- Scans the DVC-files to compare against the data files or directories in the
<abbr>workspace</abbr>. DVC knows which data (<abbr>outputs</abbr>) match
because their checksums are saved in the `outs` fields inside the DVC-files.
Scanning is limited to the given `targets` (if any). See also options
`--with-deps` and `--recursive` below.

- Missing data files or directories, or those that don't match with any
DVC-file, are restored from the <abbr>cache</abbr>. See options `--force` and
`--relink`.

By default, this command tries not to copy files between the cache and the
workspace, using reflinks instead, when supported by the file system. (Refer to
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache).)
The next linking strategy default value is `copy` though, so unless other file
link types are manually configured in `cache.type` (using `dvc config`), files
will be copied. Keep in mind that having file copies doesn't present much of a
negative impact unless the project uses very large data (several GBs or more).
But leveraging file links is crucial for large files where checking out a 50Gb
by copying file might take a few minutes for example, whereas with links,
But leveraging file links is crucial with large files: checking out a 50 GB file
by copying might take a few minutes, whereas with links,
restoring any file size will be almost instantaneous.

> When linking files takes longer than expected (10 seconds for any one file)
> and `cache.type` is not set, a warning will be displayed reminding users about
> the faster link types available. These warnings can be turned off setting the
> `cache.slow_link_warning` config option to `false` with `dvc config cache`.
The output of `dvc checkout` does not list which data files were restored. It
does report removed files and files that DVC was unable to restore because
they're missing from the <abbr>cache</abbr>.

This command will fail to check out files that are missing from the cache. In
such a case, `dvc checkout` prints a warning message. Any files that can be
checked out without error will be restored.
such a case, `dvc checkout` prints a warning message. It also lists removed
files. Any files that can be checked out without error will be restored without
being reported individually.

There are two methods to restore a file missing from the cache, depending on the
situation. In some cases a pipeline must be reproduced (using `dvc repro`) to
@@ -94,6 +83,12 @@ be pulled from remote storage using `dvc pull`.
remove files that don't match those DVC-file references or are missing from
cache. (They are not "committed", in DVC terms.)

- `--relink` - ensures the file linking strategy (`reflink`, `hardlink`,
`symlink`, or `copy`) for all data in the workspace is consistent with the
project's [`cache.type`](/doc/command-reference/config#cache). This is
achieved by restoring **all data files or directories** referenced in the
current DVC-files (regardless of whether they already match a current DVC-file).

- `-h`, `--help` - shows the help message and exit.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
@@ -210,18 +205,21 @@ do `dvc fetch` + `dvc checkout`.

## Automating `dvc checkout`

We have the data files (managed by DVC) lined up with the other files (managed
by Git). This required us to remember to run `dvc checkout`, and of course we
won't always remember to do so. Wouldn't it be nice to automate this?
We want the data files or directories (managed by DVC) to match the other
files (managed by Git, e.g. source code). This requires us to remember to run
`dvc checkout` when needed, and of course we won't always remember to do so.
Wouldn't it be nice to automate this?

Let's run this command:
Let's try this:

```dvc
$ dvc install
```

This installs Git hooks to automate running `dvc checkout` (or `dvc status`)
when needed. Then we can checkout the master branch again:
`dvc install` installs Git hooks to automate common operations, including
running `dvc checkout` when needed.

We can then check out the master branch again:

```dvc
$ git checkout bigrams
@@ -233,6 +231,6 @@ $ md5 model.pkl
MD5 (model.pkl) = 3863d0e317dee0a55c4e59d2ec0eef33
```

Previously this took two steps, `git checkout` followed by `dvc checkout`. We
can now skip the second one, which is automatically executed for us. The
workspace is automatically synchronized accordingly.
Previously this took two commands, `git checkout` followed by `dvc checkout`. We
can now skip the second one, which is automatically run for us. The workspace is
automatically synchronized accordingly.
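
To make the automation less opaque, here is a minimal sketch of a hand-written Git `post-checkout` hook that runs `dvc checkout` after every `git checkout`. This is only an illustration of the mechanism: it is not necessarily what `dvc install` writes, and the helper name `install_post_checkout_hook` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DVC): writes a Git post-checkout hook
# that hands control to `dvc checkout`. Git runs this hook after
# `git checkout` finishes updating the working tree.
install_post_checkout_hook() {
  local repo="${1:-.}"                 # path to the Git repository root
  local hook_dir="$repo/.git/hooks"
  local hook="$hook_dir/post-checkout"

  mkdir -p "$hook_dir"
  # The hook body is an assumption of this sketch.
  printf '#!/bin/sh\nexec dvc checkout\n' > "$hook"
  chmod +x "$hook"                     # Git only runs executable hooks
}
```

Calling `install_post_checkout_hook .` from the project root would overwrite any existing `post-checkout` hook, so in practice `dvc install` is the safer choice.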
4 changes: 4 additions & 0 deletions public/static/docs/command-reference/config.md
@@ -139,6 +139,10 @@ for more details.) This section contains the following options:
[File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
for a full explanation of each one.

To apply changes to this option in the workspace, by restoring all file
links/copies from cache, please use `dvc checkout --relink`. See
[checkout options](/doc/command-reference/checkout#options) for more details.

- `cache.slow_link_warning` - used to turn off the warnings about having a slow
cache link type. These warnings are thrown by `dvc pull` and `dvc checkout`
when linking files takes longer than usual, to remind them that there are
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/remote/index.md
@@ -95,13 +95,13 @@ url = /path/to/remote
remote = myremote
```

## Example: Add Amazon S3 remote and modify its region
## Example: Add a default Amazon S3 remote and modify its region

> 💡 Before adding an S3 remote, be sure to
> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html).
```dvc
$ dvc remote add mynewremote s3://mybucket/myproject
$ dvc remote add -d mynewremote s3://mybucket/myproject
$ dvc remote modify mynewremote region us-east-2
```

32 changes: 32 additions & 0 deletions public/static/docs/command-reference/repro.md
@@ -45,6 +45,38 @@ files, intermediate or final results. It saves all the data files, intermediate
or final results into the <abbr>DVC cache</abbr> (unless `--no-commit` option is
specified), and updates stage files with the new checksum information.

### Parallel stage execution

Currently, `dvc repro` is not able to parallelize stage execution automatically.
If you need to do this, you can launch `dvc repro` multiple times manually. For
example, let's say a <abbr>pipeline</abbr> graph looks something like this:

```
$ dvc pipeline show --ascii result.py
+--------+          +--------+
| A1.dvc |          | B1.dvc |
+--------+          +--------+
     *                   *
     *                   *
     *                   *
+--------+          +--------+
| A2.dvc |          | B2.dvc |
+--------+          +--------+
      *                 *
       **             **
         *           *
        +------------+
        | result.dvc |
        +------------+
```

This pipeline consists of two parallel branches (`A` and `B`), and the final
"result" stage, where the branches merge. To reproduce both branches at the same
time, you could run `dvc repro A2.dvc` and `dvc repro B2.dvc` at the same time
(e.g. in separate terminals). After both finish successfully, you can then run
`dvc repro result.dvc`: DVC will know that both branches are already up-to-date
and only execute the final stage.
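
Instead of separate terminals, the same idea can be sketched in a single Bash script using background jobs. The function name `repro_parallel` and the `DVC_BIN` variable are assumptions made for this illustration and are not part of DVC. The final stage runs only if every branch succeeds.

```shell
#!/usr/bin/env bash
# DVC_BIN is a convenience of this sketch so the command can be swapped out.
DVC_BIN="${DVC_BIN:-dvc}"

# Usage: repro_parallel <final-stage> <branch-stage>...
repro_parallel() {
  local final="$1"; shift
  local pids=() target pid failed=0

  # Launch one `dvc repro` per independent branch, in the background.
  for target in "$@"; do
    "$DVC_BIN" repro "$target" &
    pids+=("$!")
  done

  # Wait for every branch and remember whether any of them failed.
  for pid in "${pids[@]}"; do
    wait "$pid" || failed=1
  done

  if [ "$failed" -ne 0 ]; then
    echo "a branch failed; skipping $final" >&2
    return 1
  fi

  # All branches are up to date, so only the final stage should execute.
  "$DVC_BIN" repro "$final"
}
```

For the graph above, this would be called as `repro_parallel result.dvc A2.dvc B2.dvc`.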

## Options

- `-f`, `--force` - reproduce a pipeline, regenerating its results, even if no
5 changes: 2 additions & 3 deletions public/static/docs/sidebar.json
@@ -30,7 +30,6 @@
]
},
{
"label": "Install",
"slug": "install",
"source": "install/index.md",
"children": [
@@ -140,7 +139,7 @@
{
"label": "Contributing",
"slug": "contributing",
"source": "contributing/index.md",
"source": false,
"children": [
{
"label": "DVC Core Project",
@@ -356,8 +355,8 @@
]
},
{
"label": "Understanding DVC",
"slug": "understanding-dvc",
"label": "Understanding DVC",
"source": false,
"children": [
"collaboration-issues",
8 changes: 4 additions & 4 deletions public/static/docs/understanding-dvc/related-technologies.md
@@ -100,11 +100,11 @@ http://studio.ml/
- Git-annex is a datafile-centric system whereas DVC is focused on providing a
workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via `git clone`, data files won't be copied to
the local machine as file contents are stored in separate
the local machine, as file contents are stored in separate
[remotes](/doc/command-reference/remote). With DVC,
[DVC-files](/doc/user-guide/dvc-file-format) (that provide the reproducible
workflow) are always included in the Git repository and hence can be recreated
locally with minimal effort.
[DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible
workflow, are always included in the Git repository. Hence, they can be
executed locally with minimal effort.

- DVC is not fundamentally bound to Git, and users have the option of changing
the repository format.
6 changes: 3 additions & 3 deletions public/static/docs/user-guide/contributing/docs.md
@@ -88,8 +88,8 @@ documentation files automatically.

### Debugging

The `yarn dev:debug` script runs the local development server with Node's
[`--inspect` option](https://nodejs.org/en/docs/guides/debugging-getting-started/#command-line-options)
The `yarn debug` script runs the local development server with `node`'s
[`--inspect-brk` option](https://nodejs.org/en/docs/guides/debugging-getting-started/#command-line-options)
in order for debuggers to connect to it (on the default port, 9229).

> For example, use this launch configuration in **Visual Studio Code**:
@@ -100,7 +100,7 @@ in order for debuggers to connect to it (on the default port, 9229).
> "request": "launch",
> "name": "Launch via Yarn",
> "runtimeExecutable": "yarn",
> "runtimeArgs": ["dev:debug"],
> "runtimeArgs": ["debug"],
> "port": 9229
> }
> ```
35 changes: 0 additions & 35 deletions public/static/docs/user-guide/contributing/index.md

This file was deleted.

6 changes: 6 additions & 0 deletions public/static/docs/user-guide/dvc-files-and-directories.md
@@ -41,6 +41,12 @@ operation:

- `.dvc/lock`: Lock file for the entire DVC project

- `.dvc/tmp`: Directory for miscellaneous temporary files

- `.dvc/tmp/rwlock`: JSON file that contains read and write locks for specific
dependencies and outputs, to allow safely running multiple DVC commands in
parallel.

## Structure of cache directory

There are two ways in which the data is stored in <abbr>cache</abbr>: As a