This repository has been archived by the owner on Jul 5, 2022. It is now read-only.

Commit

Corrections in #27 applied. Titles changed back to "Step n"
iesahin committed Mar 8, 2021
1 parent 38d398a commit dbe14ca
Showing 7 changed files with 91 additions and 119 deletions.
13 changes: 7 additions & 6 deletions get-started/versioning/01-track-a-file-or-directory.md
@@ -1,3 +1,5 @@
# Track a File or Directory

Let's get a data file from the [Get
Started](https://dvc.org/doc/start/data-and-model-versioning) example project:

@@ -7,18 +9,17 @@ dvc get \
get-started/data.xml -o data/data.xml
```{{execute}}
> The command [`dvc get`][cmdget] is like streamlined and smart `wget`
> that can be used to retrieve artifacts from DVC projects hosted on Git
> repositories. You don't even need an _initialized_ DVC repository to use
> `get`
> The command [`dvc get`][cmdget] is like a smart `wget` that can be used to
> retrieve artifacts from DVC repositories. You don't even need an
> _initialized_ DVC repository to use `dvc get`.
`ls -lh data/`{{execute}}
To track a large file, ML model or a whole directory with DVC we use `dvc add`:
To track a large file, ML model, or a whole directory with DVC, we use `dvc add`:
`dvc add data/data.xml`{{execute}}
DVC has listed `data.xml` on `.gitignore` to make sure that we don't commit it
DVC has listed `data.xml` in `.gitignore` to make sure that we don't commit it
to Git.
`cat data/.gitignore`{{execute}}
39 changes: 14 additions & 25 deletions get-started/versioning/02-how-does-it-work.md
@@ -1,43 +1,32 @@
Let's take a look at inside the `.dvc` file to learn how it keeps track of our
data file:
# How does it work?

For a data file named `data.xml`, DVC keeps the tracking information in
`data.xml.dvc`.
Let's take a look inside the `.dvc` file to learn how it keeps track of our
data:

`cat data/data.xml.dvc`{{execute}}

Note that this is valid YAML 1.2 file. There is a field named `md5` in the
file.

This field is used to keep track the changes in `data.xml`. Let's check whether
the hash calculated by `md5sum` is identical with the hash calculated by DVC.
> This is a valid YAML 1.2 file. The `md5` field is used to keep track of
> changes in `data.xml`. Let's check whether the hash calculated by `md5sum` is
> identical to the hash calculated by DVC.
```
md5sum data/data.xml
grep 'md5' data/data.xml.dvc
```{{execute}}
DVC uses `md5` field to actually address the file in its cache. The hash
is a pointer to the content in DVC cache.
If the content of `data.xml` is changed, its MD5 hash will change. DVC is able
to detect the change by comparing the new MD5 hash to the MD5 hash kept in
DVC uses the hash to address the file in the cache. DVC detects
the change by comparing the new MD5 hash to the hash kept in
`data.xml.dvc`.
Now let's see how cache is organized:
Let's see how the cache is organized:
`tree .dvc/cache`{{execute}}
DVC saved a copy of `data/data.xml` to `.dvc/cache/` using its MD5 hash as a
file name. The first two characters of the hash is used as a directory name and
the rest as a file name.
When a data or model file's content changes, its address in the cache changes
too. This allows to keep track history of large data files.
DVC saved a copy of `data/data.xml` to `.dvc/cache/` using the first two
characters as directory and the rest as file name.
> The default setting for DVC is to copy content both in the cache and the
> workspace. To save storage space and time when dealing with large files, DVC
> can use `symlinks`, `hardlinks`, or `reflinks`, depending on the file system.
> Read more in the user guide for [Large Dataset
> To save storage space and time when dealing with large files, DVC can use
> `symlinks`, `hardlinks`, or `reflinks` depending on the file system. Read
> more in the user guide for [Large Dataset
> Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization).
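
The cache layout described in `02-how-does-it-work.md` can be sketched in a few lines of Python. This is an illustration only, not DVC's implementation: the helper names `md5_of_file` and `cache_path` are hypothetical, and the sketch assumes the simple case from this step, where a single file is hashed with plain MD5 (the same value `md5sum` prints).

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_dir=".dvc/cache"):
    """Hypothetical helper: the first two hash characters form the
    directory name, the remaining characters form the file name."""
    return f"{cache_dir}/{md5_hex[:2]}/{md5_hex[2:]}"


if __name__ == "__main__":
    # Write a small throwaway file and show where its content would live.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"hello")
    try:
        print(cache_path(md5_of_file(tmp.name)))
    finally:
        os.remove(tmp.name)
```

Because the path is derived only from the content, changing the file changes its address in the cache, which is what lets older versions stay available side by side.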
36 changes: 15 additions & 21 deletions get-started/versioning/03-data-remotes.md
@@ -1,27 +1,25 @@
Most of the times it's necessary to share data files with team members. DVC
allows to set up a central location to share such content safely.
[Remotes][bcremote] that are accessible by other systems or team members can be
set up and versioned data and model files can be shared by uploading to and
downloading from them.
# Data Remotes

DVC allows you to set up [remotes][bcremote] that are accessible by other
systems or team members.

[bcremote]: https://dvc.org/doc/user-guide/basic-concepts/remote

To set up a data remote, you would typically use Amazon S3, Google Cloud, or
another supported storage type like this:
[another supported storage][cmdremote] type like:

```
dvc remote add --default \
my-storage s3://my-bucket/dvc-storage
```

> Note that DVC remotes are unlike Git remotes. DVC remotes are cache and content
> Note that DVC remotes are _unlike_ Git remotes. DVC remotes are data
> storage locations _without_ a commit history. History of data and model files
> are tracked in `*.dvc` text files and these are typically shared with other
> team members as if they are code/text files.
> are tracked in `*.dvc` text files, and these are typically shared with other
> team members as if they were code/text files.
It's possible to use another directory in the same disk as a _remote_ also.
This allows fast backups and we'll use this feature to keep this scenario
simple.
It's possible to use a _local_ directory in the same file system as a _remote_
as well. This allows fast backups. We'll use this feature to keep this
scenario simple.

```
dvc remote add --default \
@@ -32,20 +30,16 @@ You can get a list of configured remotes using:
`dvc remote list`{{execute}}
These configurations are typically stored in Git:
The configuration is typically stored in Git:
`git diff .dvc/config`{{execute}}
Let's commit the configuration to Git:
Let's commit the changes to Git:
```
git commit .dvc/config \
-m "Configure data storage"
```{{execute}}
> DVC supports the following remote storage types: Google Drive, Amazon S3,
> Azure Blob Storage, Google Cloud Storage, Aliyun OSS, SSH, HDFS, and HTTP.
> Please refer to
> [dvc remote add](https://dvc.org/doc/command-reference/remote/add) for more
> details and examples.
[cmdremote]: https://dvc.org/doc/command-reference/remote/add
[bcremote]: https://dvc.org/doc/user-guide/basic-concepts/remote
55 changes: 26 additions & 29 deletions get-started/versioning/04-saving-and-retrieving-data.md
@@ -1,17 +1,17 @@
In order to see the status of tracked data and model files and if they are
stored in remotes, we can use
# Saving and Retrieving Data

`dvc status --cloud`{{execute}}
The command `dvc status` can compare the data in the workspace with the data saved
in the cache.

to compare files between the data stored on cache and the default remote.
`dvc status --cloud`{{execute}}

Without the `-c, --cloud` option

`dvc status`{{execute}}

checks whether workspace files are up to date with _local cache._
checks whether workspace files are up to date with the _local cache._

As `dvc status --cloud` shows we have a missing file in `mystorage`.
As `dvc status --cloud` shows, we have a missing file in `mystorage`.

We manage
data and model files in remotes with `dvc push` and `dvc pull`. These are used
@@ -24,21 +24,20 @@ previous step:
`dvc push`{{execute}}

> Note that `dvc push` uses _default remote_. There can be multiple remotes for
> a project and if you want to push to a particular one, you can use `--remote`
> a project, and if you want to push to a particular one, you can use the `--remote`
> option for `dvc push`.
After pushing let's check the status again:
Let's check the status again:

`dvc status --cloud`{{execute}}

Now, let's take a look at the content of `/tmp/data-storage` to see what
happens in when we push a file to the cache.
Now, let's take a look at the content of `/tmp/data-storage` (the location of our
remote) to see what happens when we push a file to the remote.

`tree /tmp/data-storage/`{{execute}}

As you can see the structure of `.dvc/cache` and `/tmp/data-storage` is
similar. Both of them contains the same files addressed by the same hash
values.
The structure of `.dvc/cache` and `/tmp/data-storage` is identical. Both of them
contain the same files addressed by the same hash values.

`tree .dvc/cache`{{execute}}

@@ -48,46 +47,44 @@ Suppose we deleted `data.xml` from the workspace _and_ from the cache.

`rm -f data/data.xml`{{execute}}

We deleted all _local_ versions of `data.xml`. Neither `.dvc/cache` nor `data/` contains our data file.
We deleted all _local_ versions of `data.xml`. Neither `.dvc/cache` nor `data/`
contains our data file.

`tree -a -I .git`
`ls -R | grep 'data.xml'`{{execute}}

This is also reported by
This is also reported by:

`dvc status`{{execute}}

In order to retrieve missing data and model files from remotes we can use:
In order to retrieve missing data and model files from remotes, we can use:

`dvc fetch`{{execute}}

This command copies files from remote to _local_ cache but it doesn't update
the workspace with the data file.

Let's see the status of workspace now:
This command copies files from remote to _local_ cache, but it doesn't update
the workspace with the data file.

`dvc status`{{execute}}

Note that `data.xml` is missing from the workspace but we can get it from local
cache:
Note that `data.xml` is still missing from the workspace, but we can get it
from the local cache:

`dvc checkout`{{execute}}

`checkout` command copies (or creates link to) local cache files to the
workspace. Let's check the status again:
`dvc checkout` links (or copies) files from the local cache to the workspace.

`dvc status`{{execute}}

DVC has a single command that both copies to local cache and the workspace. Instead of two step operation with `fetch` and `checkout`, we can use `pull`. Let's delete the data file and the cache again.
Instead of a two-step operation with `fetch` and `checkout`, we can use `pull`. Let's delete the data file and the cache again.

```
rm -rf .dvc/cache
rm -f data/data.xml```{{execute}}
rm -f data/data.xml
```{{execute}}
We can retrieve the file from remote with a single command:
`dvc pull`{{execute}}
`dvc pull` is usually used after `git clone`, `git pull`, or `git checkout` to
synchronize the data with the code. Along with `dvc push`, it provides a basic
collaboration workflow, similar to `git push` and `git pull`, that facilitates
sharing of data.
collaboration workflow, similar to `git push` and `git pull`.
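
The relationship between `fetch`, `checkout`, and `pull` described in `04-saving-and-retrieving-data.md` can be pictured with a toy model. This is a sketch under loud assumptions, not how DVC is implemented: the remote, the local cache, and the workspace are plain dictionaries, `tracked` stands in for the hashes recorded in `.dvc` files, and all function names are hypothetical.

```python
def fetch(remote, cache, hashes):
    """Toy model: copy missing objects from the remote into the local cache."""
    for md5_hex in hashes:
        if md5_hex not in cache and md5_hex in remote:
            cache[md5_hex] = remote[md5_hex]


def checkout(cache, workspace, tracked):
    """Toy model: place cached content at the workspace paths that track it."""
    for path, md5_hex in tracked.items():
        if md5_hex in cache:
            workspace[path] = cache[md5_hex]


def pull(remote, cache, workspace, tracked):
    """Toy model: a pull is a fetch followed by a checkout."""
    fetch(remote, cache, tracked.values())
    checkout(cache, workspace, tracked)


if __name__ == "__main__":
    # Hypothetical state after deleting both the cache and the data file.
    remote = {"5d41402abc4b2a76b9719d911017c592": b"hello"}
    tracked = {"data/data.xml": "5d41402abc4b2a76b9719d911017c592"}
    cache, workspace = {}, {}

    fetch(remote, cache, tracked.values())
    print(sorted(workspace))  # still empty: fetch only fills the cache
    checkout(cache, workspace, tracked)
    print(sorted(workspace))  # the tracked path is restored
```

In this model, calling `pull` on an empty cache and workspace leaves them in the same state as the two-step sequence, which mirrors why the tutorial treats `dvc pull` as the shortcut.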
23 changes: 11 additions & 12 deletions get-started/versioning/05-making-changes.md
@@ -1,6 +1,8 @@
Let's make the dataset smaller. It's not what would you usually do (usually we
want more data), but since there are some RAM limitations to this online
platform we have to do this to being able to run training later:
# Making Changes

Let's make the dataset smaller. It's not what we'd usually do. In general, we
want more data, but since there are limitations in this platform, we need to do
this to run the training later:

```
head -n 12000 data/data.xml > data/data.xml.1
@@ -11,24 +13,21 @@ We check the status of the project:
`dvc status`{{execute}}
When the file `data/data.xml` is changed, DVC can detect it by analyzing the
corresponding `.dvc` file. Run `dvc add` to save its new version to cache and
update `data/data.xml.dvc` to match the new hash of the data file:
Run `dvc add` to save its new version to cache and update `data/data.xml.dvc`.
`dvc add data/data.xml`{{execute}}
It updates the MD5 hash value in the corresponding `dvc` file:
It updates the hash value in the corresponding `.dvc` file:
`git diff`{{execute}}
We can commit the changed hash value to track the version of the file:
Commit the changed hash value to track the version of the file:
`git commit -a -m "Dataset updates"`{{execute}}
We push the new version of file to the default remote:
Push the new version of the file to the default remote:
`dvc push`{{execute}}
So, each version of `data/data.xml.dvc` in Git history is linked to a version
of the data file in the DVC cache. This connection is done through
the MD5 hash of the cached file that is also its name on the cache.
So, each version of `data/data.xml.dvc` in the Git history is linked to a
version of the data file in the DVC cache.
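
The change detection used in `05-making-changes.md` (comparing the current hash of the data with the hash kept in `data.xml.dvc`) can be sketched as follows. This is a minimal illustration, not DVC's code: `recorded_md5` and `is_modified` are hypothetical names, the `.dvc` text in the example is an assumed, simplified layout, and a regular expression stands in for a real YAML parser.

```python
import hashlib
import re


def recorded_md5(dvc_text):
    """Return the first md5 value found in the text of a .dvc file, or None."""
    match = re.search(
        r"^\s*-?\s*md5:\s*([0-9a-f]{32})\s*$", dvc_text, re.MULTILINE
    )
    return match.group(1) if match else None


def is_modified(data, dvc_text):
    """True when the MD5 of the current data differs from the recorded one.

    A missing md5 field also counts as modified in this sketch.
    """
    return hashlib.md5(data).hexdigest() != recorded_md5(dvc_text)


if __name__ == "__main__":
    # Assumed, simplified .dvc content for illustration only.
    dvc_text = (
        "outs:\n"
        "- md5: 5d41402abc4b2a76b9719d911017c592\n"
        "  path: data.xml\n"
    )
    print(is_modified(b"hello", dvc_text))        # content matches the record
    print(is_modified(b"hello world", dvc_text))  # content has changed
```

Running `dvc add` again corresponds to rewriting the recorded value so that the two hashes agree once more.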
32 changes: 12 additions & 20 deletions get-started/versioning/06-switching-between-versions.md
@@ -1,41 +1,33 @@
Technically speaking DVC is not a version control system.
# Switching between Versions

Although DVC can work without a VCS, history of `.dvc` files is better tracked
with Git. Otherwise we may end up with files in the cache whose paths in the
workspace are missing.
Technically speaking, DVC is not a version control system.

Git serves as the version control system for text and code files. DVC in turn
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in
the workspace efficiently to match them.
Git serves as the VCS for text and code files, although DVC can work without it.

Let's get the previous version of the dataset `data/data.xml`:
Let's get the previous version of `data/data.xml`:

`git checkout HEAD^1 data/data.xml.dvc`{{execute}}
`git checkout HEAD~ data/data.xml.dvc`{{execute}}

When we check what changed using:

`git diff --staged`{{execute}}

we see that `data.xml.dvc` now contains the previous hash value of `data.xml`.

We can see that current hash and the hash value in `data.xml.dvc` is different.
We see that `data.xml.dvc` now contains the previous hash value of `data.xml`.

`md5sum data/data.xml`{{execute}}

To synchronize `data.xml` with the version addressed in `data.xml.dvc` we use
To synchronize `data.xml` with the version addressed in `data.xml.dvc`:

`dvc checkout`{{execute}}

and this copies the previous version of `data.xml` from from local cache. `dvc
checkout` command synchronizes data files in the workspace to match the `.dvc`
files content. Now we can see that the value in `data.xml.dvc` and the hash
value of `data.xml` are identical.
Now we can see that the value in `data.xml.dvc` and the hash value of
`data.xml` are identical.

```
grep 'md5' data/data.xml.dvc
md5sum data/data.xml
```{{execute}}
Instead of `checkout` we can use `pull` again. The difference is that `dvc
pull` also downloads missing data into cache, while `dvc checkout` only can
restore data that already in cache.
Instead of `dvc checkout`, we could use `dvc pull` again. Pulling also
downloads missing data from the remote storage, whereas `dvc checkout` can only
restore data that's already in the local cache.
12 changes: 6 additions & 6 deletions get-started/versioning/index.json
@@ -6,27 +6,27 @@
"details": {
"steps": [
{
"title": "Track a file or directory",
"title": "Step 1",
"text": "01-track-a-file-or-directory.md"
},
{
"title": "How does it work?",
"title": "Step 2",
"text": "02-how-does-it-work.md"
},
{
"title": "Data remotes",
"title": "Step 3",
"text": "03-data-remotes.md"
},
{
"title": "Saving and retrieving data",
"title": "Step 4",
"text": "04-saving-and-retrieving-data.md"
},
{
"title": "Making changes",
"title": "Step 5",
"text": "05-making-changes.md"
},
{
"title": "Switching between versions",
"title": "Step 6",
"text": "06-switching-between-versions.md"
}
],