start: Data Access and Data Versioning to mention Model in titles (#2096) #2214

Merged
merged 29 commits into from
Mar 29, 2021
Changes from 20 commits
4742d9c
guide: disclaim x data (impro #2104)
jorgeorpinel Feb 3, 2021
8dea963
Added changes from PR #2188 and modified paths & titles
iesahin Feb 18, 2021
45ba851
Update redirects-list.json with fixed subsection redirects.
iesahin Feb 20, 2021
09dc8ca
Fixed incomplete looking sentence
Feb 20, 2021
3ed7627
Merge branch 'iesahin/issue2096-take-2' of github.com:iterative/dvc.o…
iesahin Feb 20, 2021
a3b15ba
merged into a single paragraph
iesahin Feb 20, 2021
7731587
Divided models sentence and added "large files" phrase.
iesahin Feb 20, 2021
bb84a99
Adds new paths to sidebar
iesahin Feb 20, 2021
9ef97c6
Updated links to data-access and data-versioning cmd ref
iesahin Feb 20, 2021
2593bb7
updated links to data-access and data-versioning in blog
iesahin Feb 20, 2021
9ed0867
Updated links to data-access and data-versioning in UC
iesahin Feb 20, 2021
3d7d61d
Updated links to data-access and data-versioning in UG
iesahin Feb 20, 2021
f44e92e
Merge branch 'master' of https://github.com/iterative/dvc.org into ie…
iesahin Feb 22, 2021
b65de40
updated yarn.lock
iesahin Feb 22, 2021
3555c5e
Update content/docs/start/data-and-model-versioning.md
iesahin Feb 23, 2021
19a0859
Merge branch 'master' into iesahin/issue2096-take-2
iesahin Feb 24, 2021
f3b0631
Merge branch 'iesahin/issue2096-take-2' of origin into iesahin/issue2…
iesahin Feb 24, 2021
b83d00d
Restyled by prettier
restyled-commits Feb 24, 2021
6166ed0
Merge pull request #2231 from iterative/restyled/iesahin/issue2096-ta…
iesahin Feb 24, 2021
e6d6bf7
fixes hardcoded links to data-and-model-access in the blog
iesahin Feb 24, 2021
b46cca3
minor fixes
iesahin Feb 24, 2021
4210bbf
Merge branch 'master' into guide/external-disclaimer
jorgeorpinel Mar 1, 2021
06cf1da
Merge branch 'master' into guide/external-disclaimer
jorgeorpinel Mar 14, 2021
49eefb0
guide: revert Exp Outs guide rename
jorgeorpinel Mar 14, 2021
a4252f6
Merge branch 'guide/external-disclaimer'
jorgeorpinel Mar 28, 2021
8f3de6e
Merge branch 'master' of github.com:iterative/dvc.org
jorgeorpinel Mar 28, 2021
a4ed206
start: emphasize models are files (assumption)
jorgeorpinel Mar 29, 2021
c143342
start: roll back unnecessary changes
jorgeorpinel Mar 29, 2021
0c94a8e
Merge branch 'master' into iesahin/issue2096-take-2 +
jorgeorpinel Mar 29, 2021
2 changes: 1 addition & 1 deletion content/blog/2020-10-12-october-20-dvc-heartbeat.md
@@ -107,7 +107,7 @@ few weeks, so stay tuned. Another big initative is adding videos to our docs:
since video seems like a popular format for a lot of learners, we're working to
supplement our official docs with embedded videos. Check out our first
installment on the
[Getting Started with Data Versioning](https://dvc.org/doc/start/data-versioning).
[Getting Started with Data Versioning](/doc/start/data-and-model-versioning).

https://youtu.be/kLKBcPonMYw


2 changes: 1 addition & 1 deletion content/blog/2020-11-11-november-20-dvc-heartbeat.md
@@ -64,7 +64,7 @@ welcome referrals if you know a good candidate)!

We're continuing to develop our video docs, and now half of our "Getting
Started" section has video accompaniments. Check out our latest release on
[data access with DVC](https://dvc.org/doc/start/data-access):
[data access with DVC](/doc/start/data-and-model-access):

https://youtu.be/EE7Gk84OZY8


14 changes: 7 additions & 7 deletions content/blog/2020-12-18-december-20-dvc-heartbeat.md
@@ -53,17 +53,17 @@ As you may have heard
on adding complete video docs to the "Getting Started" section of the DVC site.
We now have 100% coverage! We have videos that mirror the tutorials for:

- [Data versioning](https://dvc.org/doc/start/data-versioning) - how to use Git
and DVC together to track different versions of a dataset
- [Data versioning](/doc/start/data-and-model-versioning) - how to use Git and
DVC together to track different versions of a dataset

- [Data access](https://dvc.org/doc/start/data-access) - how to share models and
- [Data access](/doc/start/data-and-model-access) - how to share models and
datasets across projects and environments

- [Pipelines](https://dvc.org/doc/start/data-pipelines) - how to create
reproducible pipelines to transform datasets to features to models
- [Pipelines](/doc/start/data-pipelines) - how to create reproducible pipelines
to transform datasets to features to models

- [Experiments](https://dvc.org/doc/start/experiments) - how to do a `git diff`
for models that compares and visualizes metrics
- [Experiments](/doc/start/experiments) - how to do a `git diff` for models that
compares and visualizes metrics

https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif


5 changes: 3 additions & 2 deletions content/docs/command-reference/diff.md
@@ -123,8 +123,9 @@ $ dvc diff

Let's checkout the
[2-track-data](https://github.com/iterative/example-get-started/releases/tag/2-track-data)
tag, corresponding to the [Data Versioning](/doc/start/data-versioning) _Get
Started_ chapter, right after we added `data.xml` file with DVC:
tag, corresponding to the
[Data Versioning](/doc/start/data-and-model-versioning) _Get Started_ chapter,
right after we added `data.xml` file with DVC:

```dvc
$ git checkout 2-track-data
```

2 changes: 1 addition & 1 deletion content/docs/command-reference/get.md
@@ -151,7 +151,7 @@ file or directory from. It also has the `--out` option to specify the location
to place the target data within the workspace. Combining these two options
allows us to do something we can't achieve with the regular `git checkout` +
`dvc checkout` process – see for example the
[Get Older Data Version](/doc/tutorials/get-started/data-versioning#navigate-versions)
[Get Older Data Version](/doc/start/data-and-model-versioning#switching-between-versions)
chapter of our _Get Started_.

Let's use the

2 changes: 1 addition & 1 deletion content/docs/command-reference/import-url.md
@@ -187,7 +187,7 @@ $ git checkout 3-config-remote
## Example: Tracking a file from the web

An advanced alternate to the intro of the
[Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get
[Versioning Basics](/doc/start/data-and-model-versioning) part of the _Get
Started_ is to use `dvc import-url`:

```dvc
```

4 changes: 2 additions & 2 deletions content/docs/command-reference/import.md
@@ -67,8 +67,8 @@ data `path`, and the `outs` field contains the corresponding local path in the
<abbr>workspace</abbr>. It records enough metadata about the imported data to
enable DVC efficiently determining whether the local copy is out of date.

To actually [version the data](/doc/tutorials/get-started/data-versioning),
`git add` (and `git commit`) the import `.dvc` file.
To actually [version the data](/doc/start/data-and-model-versioning), `git add`
(and `git commit`) the import `.dvc` file.

Note that `dvc repro` doesn't check or update import `.dvc` files (see
`dvc freeze`), use `dvc update` to bring the import up to date from the data

4 changes: 2 additions & 2 deletions content/docs/sidebar.json
@@ -35,13 +35,13 @@
},
"children": [
{
"slug": "data-versioning",
"slug": "data-and-model-versioning",
"tutorials": {
"katacoda": "https://katacoda.com/dvc/courses/get-started/versioning"
}
},
{
"slug": "data-access",
"slug": "data-and-model-access",
"tutorials": {
"katacoda": "https://katacoda.com/dvc/courses/get-started/accessing"
}
@@ -1,13 +1,15 @@
---
title: 'Get Started: Data Access'
title: 'Get Started: Data and Model Access'
---

# Get Started: Data Access
# Get Started: Data and Model Access

Okay, now that we've learned how to _track_ data and models with DVC and how to
version them with Git, next question is how can we _use_ these artifacts outside
of the project? How do I download a model to deploy it? How do I download a
specific version of a model? How do I reuse datasets across different projects?
We've learned how to _track_ data files in DVC and how to commit their versions
to Git. Machine learning models are typically large files written and read by
programs. DVC can track and version model files similar to data files. The next
questions are: How can we _use_ these artifacts outside of the project? How do I
download a model to deploy it? How do I download a specific version of a model?
How do I reuse datasets across different projects?

Review comment (Contributor):

This change seems to break the semantics of the paragraph. TBH, I don't think it's necessary; we already state "data and models".

I do like the correction to the first sentence, "... , and how to commit their versions to Git" but the next 2 sentences are out of context here.

jorgeorpinel marked this conversation as resolved.

> These questions tend to come up when you browse the files that DVC saves to
> remote storage, e.g.
@@ -1,6 +1,6 @@
---
title: 'Get Started: Data Versioning'
description: 'Get started with data versioning in DVC. Learn how to use a
title: 'Get Started: Data and Model Versioning'
description: 'Get started with data and model versioning in DVC. Learn how to use a
regular Git workflow for datasets and ML models, without storing large files in
Git.'
---
@@ -247,6 +247,16 @@ defines data file versions. Git itself provides the version control. DVC in turn
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
the <abbr>workspace</abbr> efficiently to match them.

## Model versioning

DVC helps you to handle model files as well. Models in a project usually change
more frequently than data files and they need to be kept in sync with changes in
other elements of a project. Model files are no different than data files when
it comes to tracking their versions. DVC also provides means to track minor
changes in model files without fully checking in to Git. In later sections of
this series, you'll see how DVC enables to track changes to synchronize multiple
model and data files.

Review comment (Contributor):


Continuing #2214 (review)

I'm still not convinced we need this new section. We already say "data and models" in every section (except in Retrieving — let's fix that though), so if the gist here is that models are also tracked as any file normally, I think that's already implied in every other section.

Also, it can probably be summarized a bit (see feedback below) and then it's too short for a whole section anyway (could be moved to right before Storing and sharing if anything).

  • "usually change more frequently than data files" contradicts "are no different than data files" (at first sight)
  • "provides means to track minor changes in model files" - vague, what do we mean? run-cache, parameters etc. are not specifically for models.
  • The last sentence seems unnecessary.


jorgeorpinel marked this conversation as resolved.
## Large datasets versioning

In cases where you process very large datasets, you need an efficient mechanism

2 changes: 1 addition & 1 deletion content/docs/start/data-pipelines.md
@@ -143,7 +143,7 @@ stages:
There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared`
in this case); `dvc run` already took care of this. You only need to run
`dvc push` if you want to save them to
[remote storage](/doc/tutorials/get-started/data-versioning#storing-and-sharing),
[remote storage](/doc/start/data-and-model-versioning#storing-and-sharing),
(usually along with `git commit` to version `dvc.yaml` itself).

## Dependency graphs (DAGs)

19 changes: 10 additions & 9 deletions content/docs/start/index.md
@@ -53,15 +53,16 @@ Now you're ready to DVC!
DVC's features can be grouped into functional components. We'll explore them one
by one in the next few pages:

- [**Data versioning**](/doc/start/data-versioning) (try this next) is the base
layer of DVC for large files, datasets, and machine learning models. Use a
regular Git workflow, but without storing large files in the repo (think "Git
for data"). Data is stored separately, which allows for efficient sharing.

- [**Data access**](/doc/start/data-access) shows how to use data artifacts from
outside of the project and how to import data artifacts from another DVC
project. This can help to download a specific version of an ML model to a
deployment server or import a model to another project.
- [**Data and model versioning**](/doc/start/data-and-model-versioning) (try
this next) is the base layer of DVC for large files, datasets, and machine
learning models. Use a regular Git workflow, but without storing large files
in the repo (think "Git for data"). Data is stored separately, which allows
for efficient sharing.

- [**Data and model access**](/doc/start/data-and-model-access) shows how to use
data artifacts from outside of the project and how to import data artifacts
from another DVC project. This can help to download a specific version of an
ML model to a deployment server or import a model to another project.

- [**Data pipelines**](/doc/start/data-pipelines) describe how models and other
data artifacts are built, and provide an efficient way to reproduce them.

8 changes: 4 additions & 4 deletions content/docs/use-cases/data-registries.md
@@ -2,10 +2,10 @@

One of the main uses of <abbr>DVC repositories</abbr> is the
[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning).
DVC also enables cross-project [reusability](/doc/start/data-access) of these
<abbr>data artifacts</abbr>. This means that your projects can depend on data
from other DVC repositories — like a **package management system for data
science**.
DVC also enables cross-project [reusability](/doc/start/data-and-model-access)
of these <abbr>data artifacts</abbr>. This means that your projects can depend
on data from other DVC repositories — like a **package management system for
data science**.

![](/img/data-registry.png) _Data management middleware_


@@ -65,7 +65,7 @@ Benefits of our approach include:
- **Collaboration**: Easily distribute your project development and share its
data [internally](/doc/use-cases/shared-development-server) and
[remotely](/doc/use-cases/sharing-data-and-model-files), or
[reuse](/doc/start/data-access) it in other places.
[reuse](/doc/start/data-and-model-access) it in other places.

- **Data compliance**: Review data modification attempts as Git
[pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/).

4 changes: 2 additions & 2 deletions content/docs/user-guide/project-structure/dvc-files.md
@@ -6,8 +6,8 @@ You can use `dvc add` to track data files or directories located in your current
`dvc import` and `dvc import-url` let you bring data from external locations to
your project, and start tracking it locally.

> See [Data Versioning](/doc/start/data-versioning) and
> [Data Access](/doc/start/data-access) for more info.
> See [Data Versioning](/doc/start/data-and-model-versioning) and
> [Data Access](/doc/start/data-and-model-access) for more info.

Files ending with the `.dvc` extension ("dot DVC file") are created by these
commands as data placeholders that can be versioned with Git. They contain the
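
For anyone auditing the remaining docs for stale links, the slug renames applied across the files above can be sketched as a small script. This is an illustration only and not part of the change set: `RENAMED`, `LINK`, and `update_links` are hypothetical names, and the pattern covers only the two renamed pages. Anchors are passed through unchanged, so a renamed anchor (such as `#navigate-versions` in `get.md`) would still need a manual fix.

```python
import re

# Old page slug -> new page slug, as renamed in this PR.
RENAMED = {
    "data-versioning": "data-and-model-versioning",
    "data-access": "data-and-model-access",
}

# Matches the target of a Markdown link that points at one of the old pages,
# either as an absolute dvc.org URL or as a site-relative path, under the
# current /doc/start/ prefix or the older /doc/tutorials/get-started/ prefix.
LINK = re.compile(
    r"\]\((?:https://dvc\.org)?/doc/(?:start|tutorials/get-started)/"
    r"(data-versioning|data-access)(#[^)]*)?\)"
)


def update_links(markdown: str) -> str:
    """Rewrite old Get Started links to relative links with the new slugs."""

    def repl(match: re.Match) -> str:
        anchor = match.group(2) or ""  # keep any #anchor as written
        return f"](/doc/start/{RENAMED[match.group(1)]}{anchor})"

    return LINK.sub(repl, markdown)
```

Links that already use the new slugs do not match the pattern, so running the function twice leaves the text unchanged.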

2 changes: 2 additions & 0 deletions redirects-list.json
@@ -22,6 +22,8 @@
"^/(?:docs|documentation)(/.*)?$ /doc$1",

"^/doc/get-started(/.*)?$ /doc/start",
"^/doc/start/data-versioning$ /doc/start/data-and-model-versioning",
"^/doc/start/data-access$ /doc/start/data-and-model-access",
"^/doc/tutorial(/.*)?$ /doc/start",
"^/doc/tutorials/get-started(/.*)?$ /doc/start",
"^/doc/tutorials/versioning(/.*)?$ /doc/use-cases/versioning-data-and-model-files/tutorial",
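
To sanity-check the two new rules, here is a minimal Python sketch of how entries in this `"pattern replacement"` format could be evaluated. It assumes first-match-wins ordering and JavaScript-style `$1` backreferences. The actual redirect handling of the site is not shown in this diff, and `RULES` and `resolve` are hypothetical names.

```python
import re

# Rules copied from the hunk above; each string is "<regex> <replacement>".
RULES = [
    "^/(?:docs|documentation)(/.*)?$ /doc$1",
    "^/doc/get-started(/.*)?$ /doc/start",
    "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning",
    "^/doc/start/data-access$ /doc/start/data-and-model-access",
    "^/doc/tutorial(/.*)?$ /doc/start",
    "^/doc/tutorials/get-started(/.*)?$ /doc/start",
]


def resolve(path: str, rules=RULES) -> str:
    """Return the redirect target for path, or path itself if no rule matches."""
    for rule in rules:
        pattern, target = rule.split(" ", 1)
        match = re.search(pattern, path)
        if match:
            # Assumption: "$1" means capture group 1, as in JavaScript.
            # Translate it to Python's "\g<1>" before expanding.
            template = re.sub(r"\$(\d+)", r"\\g<\1>", target)
            return match.expand(template)
    return path
```

Under these assumptions, `resolve("/doc/start/data-versioning")` yields `/doc/start/data-and-model-versioning`, while a path that no rule covers, such as `/doc/start/data-pipelines`, is returned as-is.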