June gems #1510

Merged
merged 5 commits into from
Jun 30, 2020
Changes from 2 commits
144 changes: 144 additions & 0 deletions content/blog/2020-06-29-june-20-community-gems copy.md
@@ -0,0 +1,144 @@
---
title: June '20 Community Gems
date: 2020-06-29
description: |
A roundup of technical Q&A's from the DVC community. This month, we discuss
migrating to DVC 1.0, the new pipeline format, and our Python API.
descriptionLong: |
A roundup of technical Q&A's from the DVC community. This month, we discuss
migrating to DVC 1.0, the new pipeline format, and our Python API.
picture: 2020-06-29/Gems_June_20.png
author: elle_obrien
commentsUrl: https://discuss.dvc.org/t/june-20-community-gems/426
tags:
- Discord
- Gems
- MinIO
- Pipeline
- Python API
- Optimization
---

## Highlights from Discord

Here are some Q&A's from our Discord channel that we think are worth sharing.

### Q: I just upgraded to DVC 1.0. I've got some pipeline stages currently saved as `.dvc` files. [Is there an easy way to convert the old `.dvc` format to the new `dvc.yaml` standard?](https://discord.com/channels/485586884165107732/563406153334128681/725019219930120232)

Yes! You can easily transfer the stages by hand: `dvc.yaml` is designed for
manual edits in any text editor, so you can type your old stages in and then
delete the old `.dvc` files. We also have a
[migration script](https://gist.github.com/skshetry/07a3e26e6b06783e1ad7a4b6db6479da)
available, although we can't provide long-term support for it.

Learn more about the `dvc.yaml` format in our
[brand new docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#dvcyaml-file)!

https://media.giphy.com/media/JYpTAnhT0EI2Q/giphy.gif

_Just like this but with technical documentation._

### Q: After I pushed my local data to remote S3 storage, I noticed the file names are different in S3: they're hash values. [Can I make them more meaningful names?](https://discord.com/channels/485586884165107732/563406153334128681/717737163122540585)

Member: No need to mention S3; we can generalize it.

Member: Also, it would be great to briefly provide motivation (e.g. deduplication, security since files are immutable, GitFlow, etc.). In addition to `dvc list`, mention the data registry article and/or other commands: `dvc get`, `dvc import`, and the Python `dvc.api`. Together they provide a holistic data access layer for DVC-tracked objects (files, ML models, directories) that can usually be used as a drop-in replacement for regular data access libraries (e.g. AWS boto or the AWS CLI, in the case of S3).

Contributor Author: OK, I have developed this answer more in the next version. Let me know what you think.

Unfortunately, no. What you're seeing are cached files, and they're stored in a
special format that makes DVC versioning and addressing possible. You can
[read more about the format in our docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).

If you want to see a human-readable list of files that are currently tracked by
DVC, we recommend the `dvc list` command;
[read up on it here](https://dvc.org/doc/command-reference/list).
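
To make the hash-based names less mysterious, here is a minimal sketch of content addressing. It assumes the layout described in the cache docs linked above (an MD5 checksum of the file contents, with the first two characters used as a subdirectory) and hashes the raw bytes as-is; `cache_relpath` is a hypothetical helper written for illustration, not part of DVC.

```python
import hashlib


def cache_relpath(content: bytes) -> str:
    """Return a content-addressed relative path for `content`.

    Hypothetical helper: mimics an MD5-based layout where the first two hex
    characters of the checksum become a directory name.
    """
    checksum = hashlib.md5(content).hexdigest()
    return f"{checksum[:2]}/{checksum[2:]}"


# Identical contents always map to the same path, whatever the file is called.
print(cache_relpath(b"hello"))  # 5d/41402abc4b2a76b9719d911017c592
```

Because the name is derived from the contents, renaming a file changes nothing in storage, while changing a single byte produces a different object.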

### Q: [Is it better practice to `dvc add` data files individually, or to add a directory containing multiple data files?](https://discord.com/channels/485586884165107732/563406153334128681/722141190312689675)

If the directory you're adding is logically one unit (for example, it is the
whole dataset in your project), we recommend using `dvc add` at the directory
level. Otherwise, add files one-by-one. You can
[read more about how DVC versions directories in our docs](https://dvc.org/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory).

### Q: [Do you have any examples of using DVC with MinIO?](https://discord.com/channels/485586884165107732/563406153334128681/722780202844815362)

We don't have any tutorials for this use case exactly, but it's a very
straightforward modification from
[our basic use cases](https://dvc.org/doc/use-cases). The key difference when
using MinIO or a similar S3-compatible API (like DigitalOcean Spaces or IBM Cloud Object
Storage) is that in addition to setting remote data storage, you must set the
`endpointurl` too. For example:

```dvc
$ dvc remote add -d myremote s3://mybucket/path/to/dir
$ dvc remote modify myremote endpointurl https://object-storage.example.com
```

Read up on configuring supported storage
[in our docs](https://dvc.org/doc/command-reference/remote/add#supported-storage-types).

### Q: [If I have a folder containing many data files, is there any advantage to zipping the folder and DVC tracking the `.zip`?](https://discord.com/channels/485586884165107732/563406153334128681/714922184455225445)

There are a few things to consider:

- **CPU time.** Even though it can be faster to pull a single file than a
directory (though not in all cases, since we can parallelize directory
downloads), the tradeoff is the time needed to unzip your data. Depending on
your constraints, this can be expensive and undesirable.

- **Deduplication.** DVC deduplicates on the file level. So if you add one
single file to a directory, DVC will save only that file, not the whole
dataset again. If you use a zipped directory you won't get this benefit. In
the long run, this could be more expensive in terms of storage space for your
DVC cache and remote if the contents of your dataset change frequently.

Generally, we would recommend first trying a plain unzipped directory. DVC is
designed to work with large numbers of files (on the order of millions), and
the latest release (DVC 1.0) has
[optimizations built for this purpose exactly](https://dvc.org/blog/dvc-1-0-release#data-transfer-optimizations).
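
The deduplication point can be illustrated with a small sketch. It is a toy model under stated assumptions, not DVC's implementation: objects are keyed by an MD5 checksum of their contents, the small directory manifest DVC also stores is ignored, and `as_archive` is a hypothetical stand-in for zipping that simply packs every name and file into one blob.

```python
import hashlib


def checksum(content: bytes) -> str:
    return hashlib.md5(content).hexdigest()


def new_objects(old: dict, new: dict) -> set:
    """Checksums needed for `new` that were not already stored for `old`."""
    return {checksum(c) for c in new.values()} - {checksum(c) for c in old.values()}


def as_archive(files: dict) -> bytes:
    """Stand-in for zipping: pack all names and contents into a single blob."""
    return b"".join(name.encode() + b"\0" + files[name] for name in sorted(files))


v1 = {"a.csv": b"1,2,3\n", "b.csv": b"4,5,6\n"}
v2 = {**v1, "c.csv": b"7,8,9\n"}  # the same dataset with one file added

# Tracked as a directory: only the new file's contents need to be stored.
added_as_files = new_objects(v1, v2)

# Tracked as one archive: the blob's checksum changes, so all of it is stored again.
added_as_archive = new_objects(
    {"data.zip": as_archive(v1)}, {"data.zip": as_archive(v2)}
)
```

In the directory case the only new object is `c.csv`; in the archive case the new object is the entire repacked dataset.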

### [Q: Can I execute a `dvc push` with the DVC Python API inside a Python script?](https://discord.com/channels/485586884165107732/485596304961962003/718419219288686664)

Currently, our [Python API](https://dvc.org/doc/api-reference#python-api)
doesn't support commands like `dvc push`, `dvc pull`, or `dvc status`. It is
designed for interfacing with objects tracked by DVC. That said, CLI commands
are basically calling `dvc.repo.Repo` object methods. So if you want to use
commands from within Python code, you could try creating a `Repo` object with
`r = Repo(root_dir)` and then `r.push()`. Please note that we don't officially
support this use case yet.

Of course, you can also run DVC commands from a Python script using
`subprocess` or a similar library for issuing system commands.
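
Both routes can be sketched in a few lines. This is an illustration under assumptions rather than a supported recipe: it uses the standard-library `subprocess` module to call the CLI, assumes the `dvc` executable is on your `PATH`, and the helper names (`dvc_command`, `run_dvc`, `push_with_repo`) are hypothetical. The `Repo` route relies on internals that, as noted above, are not officially supported.

```python
import subprocess


def dvc_command(*args: str) -> list:
    """Build the argument list for a DVC CLI call, e.g. ["dvc", "push"]."""
    return ["dvc", *args]


def run_dvc(*args: str) -> subprocess.CompletedProcess:
    """Run a DVC CLI command, raising CalledProcessError on a non-zero exit."""
    return subprocess.run(
        dvc_command(*args), check=True, capture_output=True, text=True
    )


def push_with_repo(root_dir: str = "."):
    """Unsupported route from the text: call the internal Repo object directly."""
    from dvc.repo import Repo  # imported lazily so the CLI helpers work without it

    return Repo(root_dir).push()
```

From a script inside a DVC project, `run_dvc("push")` would then behave like typing `dvc push` in the terminal.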

### [Q: Does the `dvc pipeline` command for visualizing pipelines still work in DVC 1.0?](https://discord.com/channels/485586884165107732/485596304961962003/717682556203565127)

Most of the `dvc pipeline` functionality (like `dvc pipeline show --ascii`,
which prints an ASCII diagram of your pipeline) has been migrated to a new
command, `dvc dag`. This command is written for our new pipeline format. Check out
[our new docs](https://dvc.org/doc/command-reference/dag#dag) for an example.

### [Q: Is there a way to create a DVC pipeline stage without running the commands in that stage?](https://discord.com/channels/485586884165107732/485596304961962003/715271980978405447)

Yes. Say you have a Python script, `train.py`, that takes in a dataset `data`
and outputs a model `model.pkl`. You could create a DVC pipeline stage
corresponding to this process like this:

```dvc
dvc run -f train.dvc \
        -d train.py -d data \
        -o model.pkl \
        python train.py
```

However, this would automatically rerun the command `python train.py`, which is
not necessarily desirable if you have recently run it, the process is
time-consuming, and the dependencies and outputs haven't changed. You can use the
`--no-exec` flag to get around this:

```
dvc run --no-exec \
        -f train.dvc \
        -d train.py -d data \
        -o model.pkl \
        python train.py
```

Member: add `dvc`

Member: add `$` before the command, here and in other places

Member: there are some bugs like this in the previous Gems, btw

Contributor Author: Good catch. It might be good to revise the previous Gems too, then.

This flag can also be useful when you want to define the pipeline on your local
machine but plan to run it later on a different machine (perhaps an instance in
the cloud).
[Read more about the `--no-exec` flag in our docs.](https://dvc.org/doc/command-reference/run)