user guide: DVC Files and Directories - 1.0 updates #1370

Merged
merged 42 commits into from
Jun 15, 2020
44b9b17
get-started: update index for DVC 1.0
jorgeorpinel May 28, 2020
cd800fa
user-guide: DVC File guide update (1)
jorgeorpinel May 28, 2020
92b3bf3
user-guide: change DVC Metafile Formats guide link URL
jorgeorpinel May 28, 2020
1008321
glossary: update DVC-file terminology for tooltips
jorgeorpinel May 28, 2020
bf050eb
user-guide: full draft of new DVC Metafiles Format doc
jorgeorpinel May 28, 2020
85ceb01
term: don't use "meta"
jorgeorpinel Jun 2, 2020
5e79613
Merge branch 'master' into refactor/get-started-1.0
jorgeorpinel Jun 2, 2020
7f9ed2c
user-guide: merge files and dirs + metafiles guides
jorgeorpinel Jun 2, 2020
bf44d1f
install: remove InteliJ plugin info
jorgeorpinel Jun 2, 2020
f928526
term: avoid "metafile"
jorgeorpinel Jun 3, 2020
04cf1e6
get-started: DVC-file -> .dvc file/dvc.yaml (incomplete)
jorgeorpinel Jun 3, 2020
5f6113b
cmd ref: include note about targets not being previously tracked in add
jorgeorpinel Jun 7, 2020
a7ef618
Merge branch 'master' into refactor/get-started-1.0 and
jorgeorpinel Jun 7, 2020
5a605d4
get-started: remove unnecessary unlock (unfreeze) cmd from ex.
jorgeorpinel Jun 7, 2020
3185804
Merge branch 'master' into refactor/get-started-1.0 and
jorgeorpinel Jun 9, 2020
a011e23
user guide: review some links to DVC files & dirs guide
jorgeorpinel Jun 9, 2020
d7ceede
user guide: review links to files&dirs guide
jorgeorpinel Jun 9, 2020
2b08a54
cmd ref: update destroy desc. and link to DVC files/dirs guide
jorgeorpinel Jun 9, 2020
8fbf129
user guide: term locked->frozen in files&dirs guide
jorgeorpinel Jun 9, 2020
bdab8ae
user guide: update and reorg all YAML structure fields
jorgeorpinel Jun 9, 2020
ead834a
user guide: reinstate info about meta fields and
jorgeorpinel Jun 10, 2020
5a97a2d
user guide: term metric->metrics + add plots field
jorgeorpinel Jun 10, 2020
e957993
user guide: mark optional fields in dvc.yaml and .dvc files
jorgeorpinel Jun 11, 2020
f32ffd5
user guide: metrics are now a regular outs field in dvc.yaml and
jorgeorpinel Jun 11, 2020
910e6db
user guide: meta fields always preserved in dvc.yaml
jorgeorpinel Jun 11, 2020
92572d4
Merge branch 'master' into refactor/get-started-1.0 +
jorgeorpinel Jun 12, 2020
3986283
user guide: bring more links from old guide
jorgeorpinel Jun 12, 2020
caa5d84
server:add dvc files & dirs page redirect
jorgeorpinel Jun 12, 2020
4e8705d
Merge branch 'master' into refactor/get-started-1.0
jorgeorpinel Jun 12, 2020
72d09bf
user guide: fix more links to new files&dirs and details around them
jorgeorpinel Jun 12, 2020
2462c83
Update content/docs/user-guide/basic-concepts/external-dependency.md
jorgeorpinel Jun 13, 2020
c66d4eb
user guide: imrovements to files & dirs guide
jorgeorpinel Jun 13, 2020
7a24add
Merge branch 'refactor/get-started-1.0' of github.com:iterative/dvc.o…
jorgeorpinel Jun 13, 2020
de1f12c
user guide: md5 is optional et al.
jorgeorpinel Jun 14, 2020
b805512
user guide: update fields in dvc.yaml
jorgeorpinel Jun 14, 2020
137f044
user guide: rename dvc.yaml file section
jorgeorpinel Jun 14, 2020
ed786b0
user guide: remove "optional" from files&dirs guide
jorgeorpinel Jun 14, 2020
c69e008
user guide: make meta and ocmment notes part of prev p
jorgeorpinel Jun 14, 2020
e1db7e1
user guide: remove some periods
jorgeorpinel Jun 14, 2020
d4705d6
user guide: don't use term "output" so much
jorgeorpinel Jun 14, 2020
ac556c3
user guide: few more improvements for iterative/dvc.org/pull/1370
jorgeorpinel Jun 14, 2020
cd7bca0
user guide: more unnecessary periods removed
jorgeorpinel Jun 14, 2020
3 changes: 0 additions & 3 deletions config/prismjs/dvc-commands.js
Original file line number Diff line number Diff line change
@@ -29,10 +29,7 @@ module.exports = [
'pipeline',
'move',
'metrics show',
'metrics remove',
'metrics modify',
'metrics diff',
'metrics add',
'metrics',
'params diff',
'params',
6 changes: 4 additions & 2 deletions content/authors/dmitry_petrov.md
Original file line number Diff line number Diff line change
@@ -1,9 +1,11 @@
---
name: Dmitry Petrov
avatar: dmitry_petrov.png
link: https://twitter.com/fullstackml
links:
- https://twitter.com/fullstackml
- https://www.linkedin.com/in/dmitryleopetrov
---

Creator of [http://dvc.org](http://dvc.org) — Git for ML. Ex-Data Scientist
[http://twitter.com/Microsoft](@Microsoft). PhD in CS. Making jokes with a
[@Microsoft](http://twitter.com/Microsoft). PhD in CS. Making jokes with a
serious face.
3 changes: 2 additions & 1 deletion content/authors/elle_obrien.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: Elle O'Brien
avatar: elle_obrien.jpg
link: https://twitter.com/andronovhopf
links:
- https://twitter.com/andronovhopf
---

Data scientist at [http://dvc.org](http://dvc.org)
3 changes: 2 additions & 1 deletion content/authors/george_vyshnya.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: George Vyshnya
avatar: george_vyshnya.jpeg
link: https://www.linkedin.com/in/gvyshnya
links:
- https://www.linkedin.com/in/gvyshnya
---

Seasoned Data Scientist / Software Developer with blended experience in software
5 changes: 3 additions & 2 deletions content/authors/jorge_orpinel.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: Jorge Orpinel Pérez
avatar: jorge.jpg
link: https://www.linkedin.com/in/jorgeorpinel
links:
- https://www.linkedin.com/in/jorgeorpinel
---

Technical writer and developer at [http://dvc.org](http://dvc.org)
Technical writer and developer at [dvc.org](http://dvc.org/)
3 changes: 2 additions & 1 deletion content/authors/marcel_rd.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: Marcel Ribeiro-Dantas
avatar: marcel.jpg
link: https://twitter.com/mribeirodantas
links:
- https://twitter.com/mribeirodantas
---

Early Stage Researcher at [Institut Curie](https://intstitut-curie.org) with
3 changes: 2 additions & 1 deletion content/authors/marija_ilic.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: Marija Ilić
avatar: marija_ilic.png
link: https://www.linkedin.com/in/marija-ili%C4%87-65b8a53
links:
- https://www.linkedin.com/in/marija-ili%C4%87-65b8a53
---

Data scientist at Njuškalo, Croatia.
3 changes: 2 additions & 1 deletion content/authors/svetlana_grinchenko.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,8 @@
---
name: Svetlana Grinchenko
avatar: svetlana_grinchenko.jpeg
link: https://twitter.com/a142hr
links:
- https://twitter.com/a142hr
---

Head of developer relations at [http://dvc.org](http://dvc.org)
133 changes: 133 additions & 0 deletions content/blog/2020-05-26-may-20-community-gems.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,133 @@
---
title: May '20 Community Gems
date: 2020-05-26
description: |
A roundup of technical Q&A's from the DVC community. This month, we discuss
development best practices, sharing models and data across projects,
and using DVC with teams.
descriptionLong: |
A roundup of technical Q&A's from the DVC community. This month, we discuss
development best practices, sharing models and data across projects,
and using DVC with teams.
picture: 2020-05-26/May_20_Gems_Header.png
author: elle_obrien
commentsUrl: https://discuss.dvc.org/t/may-20-community-gems/398
tags:
- Discord
- Gems
- Cache
- Google Cloud Storage
- Import
---

## Discord gems

Here are some Q&A's from our Discord channel that we think are worth sharing.

### Q: [How do I completely delete a file from DVC?](https://discord.com/channels/485586884165107732/563406153334128681/710546561498873886)

To stop tracking a file with DVC, you can simply delete the file and its
corresponding `.dvc` file (if there is one) from your project. But, what if you
want to entirely erase a file from DVC?

After deleting the `.dvc` file, you'll usually want to
[clear your DVC cache](https://dvc.org/doc/command-reference/gc#gc). Ordinarily,
that's done with `dvc gc`. However, if there's any chance the file you wish to
remove might be referenced by another commit (even under a different name), be
sure to use the right flag: `dvc gc --all-commits`.

If you want to remove a single `.dvc` file without doing a cache cleanup, look
into the `.dvc` file and note the `md5` field inside. Then use this value to
identify the corresponding file in your `.dvc/cache` and delete it. For example:
if your target file has `md5`: 123456, the corresponding file in your cache will
be `.dvc/cache/12/3456`.
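
The mapping from an `md5` value to its cache location can be sketched in a few lines of Python. This is only an illustration, assuming the default `.dvc/cache` location and the two-character prefix layout described above; `cache_path_for` is a hypothetical helper name, not part of DVC.

```python
from pathlib import PurePosixPath


def cache_path_for(md5: str, cache_dir: str = ".dvc/cache") -> PurePosixPath:
    """Return the expected cache path for a hash value.

    The first two characters of the hash name the subdirectory and the
    remaining characters name the file (layout assumed from the text above).
    """
    if len(md5) < 3:
        raise ValueError("hash value is too short to split into dir/file")
    return PurePosixPath(cache_dir) / md5[:2] / md5[2:]


# The shortened example hash used in the text above
print(cache_path_for("123456"))  # .dvc/cache/12/3456
```

Printing the path before deleting anything makes it easy to double-check that it points at the file you intend to remove.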

There's one last case worth mentioning: what if you're deleting a file inside a
DVC-tracked folder? For example, say you've previously run

```dvc
dvc add data_dir
```

and now want to remove a single file (say, `image_1.png`) from `data_dir`. When
DVC starts tracking a directory, it creates a corresponding `.dir` file inside
`.dvc/cache` that lists every file and subfolder, as well as an `md5` for each,
in a JSON format. You'll want to locate this `.dir` file in the cache, and then
find the entry corresponding to `image_1.png`. It'll give the `md5` for
`image_1.png`. Finally, go back to `.dvc/cache`, identify the file corresponding
to that `md5`, and delete it. For detailed instructions about `.dir` files,
where to find them and how they're used,
[see our docs about the structure of the cache](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).
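
As a rough sketch of the lookup described above, the following Python snippet searches the contents of a `.dir` file for one entry. It assumes the listing is a JSON array of objects with `md5` and `relpath` keys; those key names, the helper name `find_md5_in_dir_listing`, and the sample hashes are assumptions for illustration, so check them against your own `.dir` file.

```python
import json
from typing import Optional


def find_md5_in_dir_listing(dir_file_text: str, relpath: str) -> Optional[str]:
    """Return the md5 recorded for `relpath` in a .dir listing, or None.

    Assumes the listing is a JSON array of {"md5": ..., "relpath": ...}
    objects (an assumption, not confirmed by the text above).
    """
    for entry in json.loads(dir_file_text):
        if entry.get("relpath") == relpath:
            return entry.get("md5")
    return None


# Made-up listing, for illustration only
example = (
    '[{"md5": "aaa111", "relpath": "image_1.png"},'
    ' {"md5": "bbb222", "relpath": "image_2.png"}]'
)
print(find_md5_in_dir_listing(example, "image_1.png"))  # aaa111
```

The returned value is the hash to look up in `.dvc/cache` in the final step.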

Having said all this... please know that in the future, we plan to support a
function like `git rm` that will allow easier deletes from DVC!

### Q: [Is it safe to add a custom file to my DVC remote?](https://discord.com/channels/485586884165107732/563406153334128681/707551737745244230)

Definitely. Some people add additional files to their DVC remote, like a README
to explain to teammates what the folder is being used for. Having an additional
file in the remote that isn't part of DVC tracking won't pose any issues. You
would only encounter problems if you were manually modifying or deleting
contents of the remote managed by DVC.

### Q: [Are there limits to how many files DVC can handle? My dataset contains ~100,000 files.](https://discord.com/channels/485586884165107732/563406153334128681/706538115048669274)

We ourselves have stored datasets containing up to 2 million files, so 100,000
is certainly feasible. Of course, the larger your dataset, the more time data
transfer operations will take. Luckily,
[DVC 1.0 contains several data transfer optimizations](https://dvc.org/blog/dvc-3-years-and-1-0-release#data-transfer-optimizations)
that substantially reduce the time needed to `dvc pull / push / status -c / gc -c`
for very large datasets.

### Q: [Two developers on my team are doing `dvc push` to the same remote. Should they `dvc pull` first?](https://discord.com/channels/485586884165107732/563406153334128681/704211629075857468)

It's safe to push simultaneously, no `dvc pull` needed. While some teams might
be in the habit of frequently pulling, like in Git flow, there are fewer risks of
"merge conflicts" in DVC. That's because DVC remotes store files indexed by
`md5`s, so there's usually a very low probability of a collision (if two
developers have two different versions of a file, they'll be stored as two
separate files in the DVC remote, so no merge conflicts).
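
To see why hash-indexed storage sidesteps merge conflicts, here is a toy model in Python. It is not DVC's implementation: the `remote` dictionary and the `push` function are stand-ins invented for this sketch, and the only idea borrowed from the text is that content is stored under its `md5`.

```python
import hashlib
from typing import Dict

# Toy stand-in for a remote: maps an md5 hex digest to file content
remote: Dict[str, bytes] = {}


def push(content: bytes) -> str:
    """Store content under its md5 digest and return that digest."""
    key = hashlib.md5(content).hexdigest()
    remote[key] = content  # pushing identical content again is a no-op
    return key


key_a = push(b"model weights from developer A")
key_b = push(b"model weights from developer B")
key_a_again = push(b"model weights from developer A")

# Different versions get different keys; identical content reuses its key
print(key_a != key_b, key_a == key_a_again, len(remote))  # True True 2
```

Because neither push overwrites the other developer's entry, the order in which the two pushes arrive does not matter.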

### Q: [What are `*.tmp` files in my DVC remote?](https://discord.com/channels/485586884165107732/563406153334128681/698163554095857745)

Inside your DVC remote, you might see `.tmp` files from incomplete uploads. This
can happen if a user killed a process like `dvc push`. You can safely remove
them; for example, if you're using an S3 bucket, `aws s3 rm ... *.tmp` will do
the trick.

One caveat: before you delete, make sure no one is actively running `dvc push`.
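
For a remote that is just a directory on a local or mounted file system, the cleanup could look like the sketch below. It assumes leftover uploads can be recognized by the `.tmp` suffix alone; `find_tmp_files` and `remove_tmp_files` are hypothetical helper names, and the default is a dry run so that nothing is deleted until you have reviewed the list.

```python
from pathlib import Path
from typing import List


def find_tmp_files(remote_dir: str) -> List[Path]:
    """List every *.tmp file found anywhere under remote_dir."""
    return sorted(p for p in Path(remote_dir).rglob("*.tmp") if p.is_file())


def remove_tmp_files(remote_dir: str, dry_run: bool = True) -> List[Path]:
    """Return the leftover *.tmp files, deleting them unless dry_run is set."""
    leftovers = find_tmp_files(remote_dir)
    if not dry_run:
        for path in leftovers:
            path.unlink()
    return leftovers
```

Call `remove_tmp_files(path)` first to review what would be removed, then repeat with `dry_run=False` once you are sure no `dvc push` is in progress.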

### Q: [I'm using a Google Cloud Platform (GCP) bucket as a DVC remote and getting an error. Any ideas?](https://discord.com/channels/485586884165107732/485596304961962003/705131622537756702)

If you're getting the error,

```
ERROR: unexpected error - ('invalid_grant: Bad Request', '{\n "error": "invalid_grant",\n "error_description": "Bad Request"\n}')
```

something is going wrong with your GCP authentication! A few things to check:
first,
[check out our docs](https://dvc.org/doc/command-reference/remote/add#supported-storage-types)
to `dvc remote add` a Google Cloud bucket as your remote. Note that before DVC
can use this type of remote, you have to configure your credentials through the
GCP CLI
([see docs here](https://dvc.org/doc/command-reference/remote/add#supported-storage-types)).

If you're still getting an error, DVC probably can't find the `.json`
credentials file for your GCP bucket. Try authenticating using
`gcloud beta auth application-default login`. This command obtains your access
credentials and places them in a `.json` in your local workspace.

### Q: [I'm working on several projects that all involve the same saved model. One project trains a model and pushes it to cloud storage with `dvc push`, and another takes the model out of cloud storage for use. What's the best practice for doing this with DVC?](https://discord.com/channels/485586884165107732/485596304961962003/708318821253120040)

One of DVC's goals is to make it easy to move models and datasets in and out of
cloud storage. We had this in mind when we designed the function `dvc import` -
it lets you reuse artifacts from one project to another. And you can quickly
synchronize an artifact, like a model or dataset, with its latest version using
`dvc update`. Check out our
[docs about `import`](https://dvc.org/doc/command-reference/import), and also
our [data registry use case](https://dvc.org/doc/use-cases/data-registries) for
an example of sharing artifacts across projects.

![](/static/uploads/images/2020-05-26/data-registry.png) _Using DVC for sharing
artifacts like datasets and models across projects and teammates._
4 changes: 2 additions & 2 deletions content/docs/api-reference/get_url.md
Original file line number Diff line number Diff line change
@@ -30,8 +30,8 @@ specified by its `path` in a `repo` (<abbr>DVC project</abbr>), is stored.

The URL is formed by reading the project's
[remote configuration](/doc/command-reference/config#remote) and the
[DVC-file](/doc/user-guide/dvc-file-format) where the given `path` is found
(`outs` field). The URL schema returned depends on the
[DVC-file](/doc/user-guide/dvc-files-and-directories) where the given `path` is
found (`outs` field). The URL schema returned depends on the
[type](/doc/command-reference/remote/add#supported-storage-types) of the
`remote` used (see the [Parameters](#parameters) section).

34 changes: 14 additions & 20 deletions content/docs/command-reference/add.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
# add

Track data files or directories with DVC, by creating a corresponding
[DVC-file](/doc/user-guide/dvc-file-format).
[DVC-file](/doc/user-guide/dvc-files-and-directories).

## Synopsis

@@ -17,7 +17,7 @@ positional arguments:

The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
the target data, as a first step to version it. It creates a
[DVC-file](/doc/user-guide/dvc-file-format) to track the added data.
[DVC-file](/doc/user-guide/dvc-files-and-directories) to track the added data.

The `targets` are files or directories to add with this command, that are turned
into <abbr>data artifacts</abbr> of the <abbr>project</abbr>. By default, these
@@ -48,8 +48,8 @@ Under the hood, a few actions are taken for each file (or directory) in
appropriate.

Summarizing, the result is that the target data is replaced by small DVC-files that can
be tracked with Git. See [DVC-File Format](/doc/user-guide/dvc-file-format) for
more details.
be tracked with Git. See
[DVC-File Format](/doc/user-guide/dvc-files-and-directories) for more details.

> Note that DVC-files created by this command are considered _orphan stage
> files_ because they have no _dependencies_, only outputs. These are always
@@ -125,8 +125,8 @@ To track the changes with git run:
git add .gitignore data.xml.dvc
```

As shown above, a [DVC-file](/doc/user-guide/dvc-file-format) has been created
for `data.xml`. Let's explore the result:
As shown above, a [DVC-file](/doc/user-guide/dvc-files-and-directories) has been
created for `data.xml`. Let's explore the result:

```dvc
$ tree
@@ -138,22 +138,16 @@ $ tree
Let's check the `data.xml.dvc` file inside:

```yaml
md5: aae37d74224b05178153acd94e15956b
outs:
- cache: true
md5: d8acabbfd4ee51c95da5d7628c7ef74b
metric: false
path: data.xml
meta: # Special field to contain arbitrary user data
name: John
email: [email protected]
md5: d8acabbfd4ee51c95da5d7628c7ef74b # file hash value
path: data.xml # file name
```

This is a standard DVC-file with only one output (in the `outs` field). The hash
value should correspond to a file path in the <abbr>cache</abbr>.

> Note that the `meta` values above were entered manually for this example. Meta
> values and `#` comments are not preserved when a DVC-file is overwritten with
> Note that `#` comments are not preserved when a DVC-file is overwritten with
> the `dvc add`, `dvc run`, `dvc import`, or `dvc import-url` commands.

```dvc
@@ -194,11 +188,11 @@ Saving information to 'pics.dvc'.
...
```

There are no [DVC-files](/doc/user-guide/dvc-file-format) generated within this
directory structure, but the images are all added to the <abbr>cache</abbr>. DVC
prints a message mentioning that MD5 hash values are computed for each file. A
single `pics.dvc` DVC-file is generated for the top-level directory, and it
contains:
There are no [DVC-files](/doc/user-guide/dvc-files-and-directories) generated
within this directory structure, but the images are all added to the
<abbr>cache</abbr>. DVC prints a message mentioning that MD5 hash values are
computed for each file. A single `pics.dvc` DVC-file is generated for the
top-level directory, and it contains:

```yaml
md5: df06d8d51e6483ed5a74d3979f8fe42e
4 changes: 2 additions & 2 deletions content/docs/command-reference/cache/index.md
Original file line number Diff line number Diff line change
@@ -17,8 +17,8 @@ positional arguments:

At DVC initialization, a new `.dvc/` directory is created for internal
configuration and <abbr>cache</abbr>
[files and directories](/doc/user-guide/dvc-files-and-directories), that are
hidden from the user.
[files and directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files),
that are hidden from the user.

The cache is where your data files, models, etc. (anything you want to version
with DVC) are actually stored. The corresponding files you see in the
7 changes: 4 additions & 3 deletions content/docs/command-reference/checkout.md
Original file line number Diff line number Diff line change
@@ -16,9 +16,10 @@ positional arguments:

## Description

[DVC-files](/doc/user-guide/dvc-file-format) act as pointers to specific versions
of data files or directories tracked by DVC. This command synchronizes the
workspace data with the versions specified in the current DVC-files.
[DVC-files](/doc/user-guide/dvc-files-and-directories) act as pointers to
specific versions of data files or directories tracked by DVC. This command
synchronizes the workspace data with the versions specified in the current
DVC-files.

`dvc checkout` is useful, for example, when using Git in the
<abbr>project</abbr>, after `git clone`, `git checkout`, or any other operation
10 changes: 5 additions & 5 deletions content/docs/command-reference/commit.md
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
# commit

Record changes to DVC-tracked files in the <abbr>project</abbr>, by updating
[DVC-files](/doc/user-guide/dvc-file-format) and saving <abbr>outputs</abbr> to
the <abbr>cache</abbr>.
[DVC-files](/doc/user-guide/dvc-files-and-directories) and saving
<abbr>outputs</abbr> to the <abbr>cache</abbr>.

## Synopsis

@@ -66,8 +66,8 @@ cache. This is where the `dvc commit` command comes into play. It performs that
last step (saving the data in cache).

Note that it's best to avoid the last two scenarios. They essentially
force-update the [DVC-files](/doc/user-guide/dvc-file-format) and save data to
cache. They are still useful, but keep in mind that DVC can't guarantee
force-update the [DVC-files](/doc/user-guide/dvc-files-and-directories) and save
data to cache. They are still useful, but keep in mind that DVC can't guarantee
reproducibility in those cases.

## Options
@@ -226,7 +226,7 @@ the new instance of `model.pkl` is there.
It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in a
[DVC-file](/doc/user-guide/dvc-file-format). For example:
[DVC-file](/doc/user-guide/dvc-files-and-directories). For example:

```dvc
$ python src/featurization.py data/prepared data/features
5 changes: 3 additions & 2 deletions content/docs/command-reference/config.md
Original file line number Diff line number Diff line change
@@ -179,8 +179,9 @@ for more details.) This section contains the following options:

### state

See [DVC Files and Directories](/doc/user-guide/dvc-files-and-directories) to
learn more about the state file (database) that is used for optimization.
See
[DVC Files and Directories](/doc/user-guide/dvc-files-and-directories#internal-directories-and-files)
to learn more about the state file (database) that is used for optimization.

- `state.row_limit` - maximum number of entries in the state database, which
affects the physical size of the state file itself, as well as the performance