Merge branch 'master' into v1
jorgeorpinel committed Jan 31, 2021
2 parents 37142d6 + ec0a158 commit 184b60d
Showing 96 changed files with 2,335 additions and 806 deletions.
Original file line number Diff line number Diff line change
@@ -81,10 +81,10 @@ there is need for improving the overall structure and making some parts more
friendly from a new user perspective. We have mostly complete
[reference documentation](/doc/commands-reference) for each command, although
some functions are missing good actionable examples. We also have a
-[User Guide](/doc/user-guide/dvc-files-and-directories), however it is not in
-very good shape. We strive for making our documentation clear and comprehensive
-for users of various backgrounds and proficiency levels and this is where we do
-need some fresh perspective.
+[User Guide](/doc/user-guide), however it is not in very good shape. We strive
+for making our documentation clear and comprehensive for users of various
+backgrounds and proficiency levels and this is where we do need some fresh
+perspective.

### How DVC documentation is built

8 changes: 4 additions & 4 deletions content/blog/2019-11-05-october-19-dvc-heartbeat.md
@@ -193,9 +193,9 @@ move should be instant. Please, find more information
### Q: My repo’s DVC is “busy and locked” and I’m not sure how it got that way and how to remove/diagnose the lock. [Any suggestions?](https://discordapp.com/channels/485586884165107732/485596304961962003/608392956679815168)

DVC uses a lock file to prevent running two commands at the same time. The lock
-[file](https://dvc.org/doc/user-guide/dvc-files-and-directories#dvc-files-and-directories)
-is under the `.dvc` directory. If no DVC commands running and you are still
-getting this error it’s safe to remove this file manually to resolve the issue.
+[file](https://dvc.org/doc/user-guide/dvc-internals) is under the `.dvc`
+directory. If no DVC commands running and you are still getting this error it’s
+safe to remove this file manually to resolve the issue.

### Q: [I’m trying to understand how does DVC remote add work in case of a local folder and what is the best workflow when data is outside of your project root?](https://discordapp.com/channels/485586884165107732/485596304961962003/611209851757920266)

@@ -250,7 +250,7 @@ advanced[ DVC setup with symlinks and hardlinks](https://dvc.org/doc/user-guide/
(`cache.type` config option is not default). If `dvc gc` behavior is not
granular enough you can manually find the by its cache from the DVC-file in
`.dvc/cache` and remote storage. Learn
-[here](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+[here](https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory)
how they are organized.

### Q: [I’m trying to understand if DVC is an appropriate solution for storing data under GDPR requirements.](https://discordapp.com/channels/485586884165107732/485596304961962003/621057268145848340) That means that permanent deletion of files with sensitive data needs to be fully supported.
2 changes: 1 addition & 1 deletion content/blog/2020-01-20-january-20-community-gems.md
@@ -114,7 +114,7 @@ as we'll be eager to pass on any insights to the community.

### Q: Say I have a Git repository with multiple projets inside (one classification, one object detection, etc.). [Is it possible to tell DVC to just pull data for one particular project?](https://discordapp.com/channels/485586884165107732/563406153334128681/646760832616890408)

-Absolutely, DVC supports pulling data from different DVC-files. An example would
+Absolutely, DVC supports pulling data from different DVC files. An example would
be having two project subdirectories in your Git repo, `classification` and
`detection`. You could use `dvc pull -R classification` to only pull files in
that project to your workspace.
8 changes: 4 additions & 4 deletions content/blog/2020-04-06-april-20-dvc-heartbeat.md
@@ -14,10 +14,10 @@ descriptionLong: |
projects by our users and big ideas about best practices in ML and data
science.
picture: 2020-04-06/april_header.png
-pictureComment:
-A view from [Barrancas del
-Cobre](https://en.wikipedia.org/wiki/Copper_Canyon), shot by Jorge Orpinel
-Pérez. Jorge has mastered the art of working on DVC remotely.
+pictureComment: |
+A view from
+[Barrancas del Cobre](https://en.wikipedia.org/wiki/Copper_Canyon), shot by
+Jorge Orpinel Pérez. Jorge has mastered the art of working on DVC remotely.
author: elle_obrien
commentsUrl: https://discuss.dvc.org/t/april-20-heartbeat/347
tags:
16 changes: 8 additions & 8 deletions content/blog/2020-04-16-april-20-community-gems.md
@@ -119,7 +119,7 @@ all. That's okay, easy to fix. Simply remove the `.dvc` file like any other-
`rm <file>.dvc`. DVC will then stop tracking the file, and the associated target
file will still be in your local workspace. Note that the file will still be in
your
-[DVC cache](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
+[DVC cache](https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory)
unless you clear it with `dvc gc`.

### Q: [I'm trying to move a stage file with `dvc move`, but I'm getting an error. What's going on?](https://discordapp.com/channels/485586884165107732/563406153334128681/685125650901630996)
@@ -129,12 +129,12 @@ modify its corresponding DVC file. It's handy so you don't rename a file in your
local workspace that's under DVC tracking without updating DVC to the change
(see an [example here](https://dvc.org/doc/command-reference/move#description)).
The function doesn't work on
-[stage files](https://dvc.org/doc/tutorials/pipelines#define-stages) from DVC
-pipelines. There's not currently an easy way to safely move stage files, and
-it's an
+["stage files"](https://dvc.org/doc/tutorials/pipelines#define-stages) from DVC
+pipelines. There's not currently an easy way to safely move `dvc.yaml` files,
+and it's an
[open issue we're working on](https://github.com/iterative/dvc/issues/1489).
-Until then, you can manually update the stage file, or make a new one in the
-desired location.
+Until then, you can manually update `dvc.yaml`, or make a new one in the desired
+location.

### Q: [I just starting using DVC and noticed that when I `dvc push` files to remote cloud storage, the directory in my remote looks like my DVC cache, not my local workspace directory. Is this right?](https://discordapp.com/channels/485586884165107732/485596304961962003/693740598498426930)

@@ -148,5 +148,5 @@ look like hashes (because, well, they are). Luckily, DVC handles all the
conversions between the filenames in your local workspace and these hashes.

To get some more intuition about this, check out some of our
-[docs](https://dvc.org/doc/user-guide/dvc-files-and-directories) about how DVC
-organizes files.
+[docs](https://dvc.org/doc/user-guide/dvc-internals) about how DVC organizes
+files.
2 changes: 1 addition & 1 deletion content/blog/2020-04-30-gsod-ideas-2020.md
@@ -168,7 +168,7 @@ technical writer, [Jorge](https://github.com/jorgeorpinel).
- A
[multi-stage _pipelines file_](https://github.com/iterative/dvc/issues/1871)
that partially substitutes
-[DVC-files](https://dvc.org/doc/user-guide/dvc-file-format)
+[DVC files](https://dvc.org/doc/user-guide/dvc-files)
- Separation between
[scalar vs. continuous metrics](https://github.com/iterative/dvc/issues/3409),
and new commands to visualize them, such as `dvc plots`
2 changes: 1 addition & 1 deletion content/blog/2020-05-26-may-20-community-gems.md
@@ -57,7 +57,7 @@ find the entry corresponding to `image_1.png`. It'll give the `md5` for
`image_1.png`. Finally, go back to `.dvc/cache`, identify the file corresponding
to that `md5`, and delete it. For detailed instructions about `.dir` files,
where to find them and how they're used,
-[see our docs about the structure of the cache](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).
+[see our docs about the structure of the cache](https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory).

Having said all this... please know that in the future, we plan to support a
function like `git rm` that will allow easier deletes from DVC!
2 changes: 1 addition & 1 deletion content/blog/2020-06-22-dvc-1-0-release.md
@@ -65,7 +65,7 @@ pipelines with data processing steps. People need to change the commands of the
pipeline often and it was not easy to do this with the old DVC-files.

In DVC 1.0, the DVC metafile format was changed in three big ways. First,
-instead of multiple DVC stage files (`*.dvc`), each project has a single
+instead of multiple DVC "stage files" (`*.dvc`), each project has a single
`dvc.yaml` file. By default, all stages go in this single YAML file.

Second, we made clear connections between the `dvc run` command (a helper to
6 changes: 3 additions & 3 deletions content/blog/2020-06-29-june-20-community-gems.md
@@ -32,7 +32,7 @@ delete the old `.dvc` files. We also have a
available, although we can't provide long-term support for it.

Learn more about the `dvc.yaml` format in our
-[brand new docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#dvcyaml-file)!
+[brand new docs](https://dvc.org/doc/user-guide/dvc-files#dvcyaml-file)!

https://media.giphy.com/media/JYpTAnhT0EI2Q/giphy.gif

@@ -46,7 +46,7 @@ possible- these file names are how DVC deduplicates data (to avoid keeping
multiple copies of the same file version) and ensures that each unique version
of a file is immutable. If you manually overwrote those filenames you would risk
breaking Git version control. You can
-[read more about how DVC uses this file format in our docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).
+[read more about how DVC uses this file format in our docs](https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory).

It sounds like you're looking for ways to interact with DVC-tracked objects at a
high level of abstraction, meaning that you want to interface with the original
@@ -69,7 +69,7 @@ secure and recommended ways to do this:
If the directory you're adding is logically one unit (for example, it is the
whole dataset in your project), we recommend using `dvc add` at the directory
level. Otherwise, add files one-by-one. You can
-[read more about how DVC versions directories in our docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).
+[read more about how DVC versions directories in our docs](https://dvc.org/doc/user-guide/dvc-internals#structure-of-cache-directory).

### Q: [Do you have any examples of using DVC with MinIO?](https://discord.com/channels/485586884165107732/563406153334128681/722780202844815362)

2 changes: 1 addition & 1 deletion content/blog/2020-07-22-july-20-community-gems.md
@@ -53,7 +53,7 @@ If for some reason this won't work for your team, you can either downgrade to a
previous version, or use a workaround:

```dvc
-$ dvc repro <.dvc stage file>
+$ dvc repro <.dvc file>
```

substituting the appropriate `.dvc` file for your pipeline. DVC 1.0 is backwards
4 changes: 2 additions & 2 deletions content/blog/2020-11-25-november-20-community-gems.md
@@ -81,7 +81,7 @@ https://<path-to-your-remote-storage>/ab/cd123
To better understand how DVC uses
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
in your remote,
-[read up in our docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
+[read up in our docs](https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory).

### [Q: Can I have more than one `dvc.yaml` file in my project?](https://discord.com/channels/485586884165107732/563406153334128681/777946398250893333)

@@ -110,7 +110,7 @@ Alternatively, you can manually find and delete your files:
2. Look in your remote storage and remove the file matching the hash.
3. Look in `.dvc/cache` and remove the file as well. If you'd like to better
understand how your cache is organized,
-[we have docs for that](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-the-cache-directory).
+[we have docs for that](https://dvc.org/doc/user-guide/dvc-internals#structure-of-the-cache-directory).

Your DVC remote storage and cache are simply storage locations, so once your
file is gone from there it's gone for good.
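
To make the manual lookup more concrete, here is a minimal sketch of how a hash could map to a location in the cache or remote, assuming the `ab/cd123` layout shown in the example URL above (first two hex characters as a directory, the rest as the file name). The helper names are hypothetical and this is not DVC's own code; the value recorded in the `md5` field is the authority, and DVC may compute it differently from a plain MD5 of the file.

```python
import hashlib
from pathlib import PurePosixPath


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of some content (illustration only)."""
    return hashlib.md5(data).hexdigest()


def cache_relpath(md5_hex: str) -> PurePosixPath:
    """Split a hash into '<first 2 characters>/<remaining characters>'."""
    return PurePosixPath(md5_hex[:2]) / md5_hex[2:]


# Hypothetical content; in practice you would copy the hash from the md5 field.
digest = md5_of_bytes(b"hello\n")
print(PurePosixPath(".dvc/cache") / cache_relpath(digest))
```

The same relative path would apply under the remote storage root, which is why the remote looks like the cache rather than like your workspace.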
219 changes: 219 additions & 0 deletions content/blog/2020-12-30-december-20-community-gems.md
@@ -0,0 +1,219 @@
---
title: December '20 Community Gems
date: 2020-12-30
description: |
  A roundup of technical Q&A's from the DVC community.
  This month, read about custom DVC plots, teamwork
  with DVC, CML without Docker, and maintaining
  several pipelines in parallel!
descriptionLong: |
  A roundup of technical Q&A's from the DVC community.
  This month, read about custom DVC plots, teamwork
  with DVC, CML without Docker, and maintaining
  several pipelines in parallel!
picture: 2020-12-30/cover.png
author: elle_obrien
commentsUrl: https://discuss.dvc.org/t/december-20-gems/606
tags:
  - Discord
  - Gems
  - CML
  - Plots
  - Pipelines
  - Docker
---

## DVC questions

### [Q: Is there a way to plot all columns in a `.csv` file on a single graph using `dvc plot`?](https://discord.com/channels/485586884165107732/563406153334128681/768689062314770442)

By default, `dvc plots` graphs one or two columns from the metric file of your
choice (use the `-x` and `-y` flags to specify which columns).

However, there's nothing special about the way DVC makes plots. The plot
function is a wrapper for the [Vega-Lite](https://vega.github.io/vega-lite-v1/)
grammar, which can make pretty much any kind of plot you can imagine. If you
check inside `.dvc/plots/`, you'll see a few Vega-Lite template files- that's
where the plotting instructions are stored!

You can create your own, or modify the existing templates, by
[following the instructions in our docs](https://dvc.org/doc/command-reference/plots#plot-templates).
In short, you'll create a new template and then run
`dvc plots show -t <name-of-template>` to use it!

Vega-Lite has an
[interactive template editor online](https://vega.github.io/editor/#/), which
might help you test out ideas. Happy creating, and if you come up with a
template you'd like to share with the DVC community,
[consider opening a pull request!](https://github.com/iterative/dvc)
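
As a rough illustration of what a custom template might contain, the sketch below builds a Vega-Lite spec that folds several CSV columns into one multi-line chart and prints it as JSON. The column names (`step`, `loss`, `accuracy`) and the helper function are hypothetical, and the `<DVC_METRIC_DATA>` and `<DVC_METRIC_TITLE>` placeholders are assumed to match the ones used by the default templates in `.dvc/plots/`, so compare against your own copies before relying on them.

```python
import json


def make_multi_column_template(x_column, y_columns):
    """Build a Vega-Lite spec that draws each column in y_columns as its own line."""
    return {
        "$schema": "https://vega.github.io/schema/vega-lite/v4.json",
        # Assumed DVC placeholders, expected to be replaced when the plot is rendered.
        "data": {"values": "<DVC_METRIC_DATA>"},
        "title": "<DVC_METRIC_TITLE>",
        # "fold" reshapes wide data (one column per metric) into long form.
        "transform": [{"fold": list(y_columns), "as": ["metric", "value"]}],
        "mark": "line",
        "encoding": {
            "x": {"field": x_column, "type": "quantitative"},
            "y": {"field": "value", "type": "quantitative"},
            "color": {"field": "metric", "type": "nominal"},
        },
    }


# Hypothetical column names; replace them with the headers of your own CSV.
template = make_multi_column_template("step", ["loss", "accuracy"])
template_json = json.dumps(template, indent=2)
print(template_json)
```

Saving the printed JSON as a template file in `.dvc/plots/` should then let you select it by name with the `-t` flag described above.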

### [Q: My teammate and I are having some issues keeping our workplaces synced. We're tracking some folders with DVC, and he recently added a new file to each of these folders. How does he update the tracked folder and push the new contents so I can access them, too?](https://discord.com/channels/485586884165107732/563406153334128681/785965719367843860)

Your partner should first run

```dvc
$ dvc add <folder>
$ dvc push
```

to update DVC about the new file and then push its contents to remote storage.
Next, they'll run:

```dvc
$ git commit <folder>.dvc
$ git push
```

to update your shared Git repository. Then you can do a `git pull` and
`dvc pull` to sync the changes with your local workspace!

### [Q: I forgot to declare a metric output in my `dvc.yaml` file, so one of my metrics is currently untracked. How can I fix this without rerunning the stage? It takes a long time to run.](https://discord.com/channels/485586884165107732/485596304961962003/781643749050155009)

No problem- what you'll want to do is edit your `dvc.yaml` file and then run
`dvc commit dvc.yaml` to store the change.

`dvc commit` is a helpful function that updates your `dvc.lock` file and `.dvc`
files as needed, which forces DVC to accept any modifications to tracked data
currently in your workspace. That should cover the case where you have a metric
file from your last pipeline run in your workspace, but forgot to add it to the
`dvc.yaml` as an output!

[Check out the docs](https://dvc.org/doc/command-reference/commit#commit) for
more about `dvc commit` and how it can help you edit pipeline dependencies as
you work.

### [Q: Can I have multiple `dvc.yaml` files?](https://discord.com/channels/485586884165107732/485596304961962003/784083794583486496)

If you have multiple independent pipelines (for example, `main-data-pipeline`
and `secondary-data-pipeline`), you can have a `dvc.yaml` for each. The catch
is, they have to be in separate directories. Here are two approaches we
recommend:

#### Option 1

```
.
├── main_data_pipeline
│   └── dvc.yaml
└── secondary_data_pipeline
    └── dvc.yaml
```

#### Option 2

```
.
└── main_data_pipeline
    ├── dvc.yaml
    └── secondary_data_pipeline
        └── dvc.yaml
```

### [Q: I want to work on my DVC pipeline on a different computer than usual. For the stage I'm developing, I don't need access to all the data dependencies of the earlier stages- is there a way to download only what I need?](https://discord.com/channels/485586884165107732/563406153334128681/788068487246512158)

Say for example that you have a pipeline like this:

```
+----------+
| data.dvc |
+----------+
      *
      *
      *
   +----+
   | s1 |
   +----+
      *
      *
      *
   +----+
   | s2 |
   +----+
      *
      *
      *
   +----+
   | s3 |
   +----+
```

where stage `s2` is frozen (meaning, its dependencies will not change and we can
be reasonably sure the outputs of `s2` are static).

To work on stage `s3` in a new workspace, you could run:

```dvc
$ dvc pull s2
$ dvc repro s3
```

This set of commands will pull only the targeted stage (not the data
corresponding to `data.dvc`), and then execute the final stage of your pipeline
only.
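
The reasoning can be sketched with a toy dependency graph: to work on a stage you need the outputs of its direct parents, not everything further upstream. The dictionary below mirrors the example pipeline; it is only an illustration with hypothetical helper names, not how DVC represents pipelines internally.

```python
# Each node maps to the nodes it depends on directly.
PIPELINE = {
    "data.dvc": [],
    "s1": ["data.dvc"],
    "s2": ["s1"],
    "s3": ["s2"],
}


def direct_parents(stage):
    """Nodes whose outputs `stage` reads directly."""
    return PIPELINE[stage]


def all_ancestors(stage):
    """Every node upstream of `stage`, nearest first."""
    seen = []
    for parent in PIPELINE[stage]:
        for node in [parent] + all_ancestors(parent):
            if node not in seen:
                seen.append(node)
    return seen


print(direct_parents("s3"))  # what s3 reads directly
print(all_ancestors("s3"))   # everything upstream of s3
```

Because `s2` is frozen, its outputs stand in for the whole upstream chain, which is why pulling `s2` alone is enough here.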

## CML questions

### [Q: Why do you need Docker to run CML?](https://www.youtube.com/watch?v=rVq-SCNyxVc&lc=UgzohiMVxO1GKB30bad4AaABAg)

Even though we use Docker in many of our tutorials, you technically _don't_ need
it at all! Here's what's going on:

We use a custom Docker container that comes with the CML functions installed (as
well as some useful data science tools like Python, Vega-Lite, and CUDA
drivers). If you want to use your own Docker container, that's fine too- just
make sure you install the CML library of functions on your runner.

To install CML as an `npm` package on your runner, we recommend:

```dvc
npm i -g @dvcorg/cml
```

Once this is done, you should be able to execute functions like `cml-publish`
and `cml-send-comment` on your runner.

For more tips about using CML without Docker,
[see our docs](https://github.com/iterative/cml#install-cml-as-a-package).

### [Q: I'm using CML to print a `dvc metrics diff` to my pull request in GitHub, but I'm getting an error: `token not found`. What does that mean?](https://discord.com/channels/485586884165107732/728693131557732403/786382971706933258)

Generally, `token` refers to an authorization token that grants your runner
certain permissions with the GitHub API- such as the ability to post a comment
on your pull request. If you're working in GitHub, you don't have to follow any
manual steps to create a token. But you _do_ need to make sure your
environment variables in the workflow are named properly.

Make sure you've specified the following field in your workflow file:

```yaml
env:
  repo_token: ${{ secrets.GITHUB_TOKEN }}
```

The variable must be called `repo_token` for CML to recognize it!

A few other pointers:

- In GitLab, you have to set a variable in your repository called `repo_token`
whose value is a Personal Access Token. We have
[step-by-step instructions in our docs](https://github.com/iterative/cml/wiki/CML-with-GitLab#variables).
Forgetting to set this is the #1 issue we see with first-time GitLab CI users!
- In BitBucket Cloud, you need to set a variable in your repository called
`repo_token` whose value is your API credentials. We have
[detailed docs for creating this token](https://github.com/iterative/cml/wiki/CML-with-Bitbucket-Cloud#repository-variables),
too.
- Need to see more sample workflows to get a feel for it? We have plenty
[of case studies](https://dvc.org/doc/cml#case-studies) to examine.
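
If the error persists, a quick diagnostic along these lines can confirm whether the variable is actually visible to your job. This is a hypothetical helper, not part of CML; the only name taken from the text above is `repo_token`.

```python
import os


def check_repo_token(env=os.environ):
    """Return a short message saying whether `repo_token` is set and non-empty."""
    token = env.get("repo_token", "")
    if not token:
        return "repo_token is missing or empty"
    # Never print the token itself; report only that it exists.
    return f"repo_token is set ({len(token)} characters)"


print(check_repo_token())
```

Running it as an early step in the workflow would show whether the problem is the variable name or something further downstream.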

### [Q: Is there any reason why an experimental DVC feature wouldn't work on the CML Docker container?](https://discord.com/channels/485586884165107732/728693131557732403/788512890394247178)

Generally, no- the container `dvcorg/cml:latest` should have the latest DVC
release and the latest CML release (you can see where DVC and CML are installed
from in our
[Dockerfile](https://github.com/iterative/cml/blob/master/docker/Dockerfile)).
So besides the time it takes for releases to be published on various package
managers, there shouldn't be any lag. That means experimental features are ready
to play on your runner!

Note that you can also install pre-release versions of DVC- check out our
[docs about installing the latest stable version ahead of official releases](https://dvc.org/doc/install/pre-release).