This repository has been archived by the owner on Jul 5, 2022. It is now read-only.
Corrections in #27 applied. Titles changed back to "Step n"
Showing 7 changed files with 91 additions and 119 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,43 +1,32 @@ | ||
Let's take a look inside the `.dvc` file to learn how it keeps track of our | |
data file: | ||
# How does it work? | ||
|
||
For a data file named `data.xml`, DVC keeps the tracking information in | ||
`data.xml.dvc`. | ||
Let's take a look inside the `.dvc` file to learn how it keeps track of our | |
data: | ||
|
||
`cat data/data.xml.dvc`{{execute}} | ||
|
||
Note that this is a valid YAML 1.2 file. There is a field named `md5` in the | |
file. | ||
|
||
This field is used to keep track of the changes in `data.xml`. Let's check whether | |
the hash calculated by `md5sum` is identical to the hash calculated by DVC. | |
> This is a valid YAML 1.2 file. The `md5` field is used to keep track of changes in | |
> `data.xml`. Let's check whether the hash calculated by `md5sum` is identical | |
> to the hash calculated by DVC. | |
``` | ||
md5sum data/data.xml | ||
grep 'md5' data/data.xml.dvc | ||
```{{execute}} | ||
DVC uses the `md5` field to address the file in its cache. The hash | |
is a pointer to the content in the DVC cache. | |
If the content of `data.xml` is changed, its MD5 hash will change. DVC is able | ||
to detect the change by comparing the new MD5 hash to the MD5 hash kept in | ||
DVC uses the hash to address the file in the cache. DVC detects | |
changes to `data.xml` by comparing its new MD5 hash to the hash kept in | |
`data.xml.dvc`. | ||
Now let's see how the cache is organized: | |
Let's see how the cache is organized: | ||
`tree .dvc/cache`{{execute}} | ||
DVC saved a copy of `data/data.xml` to `.dvc/cache/` using its MD5 hash as a | ||
file name. The first two characters of the hash are used as a directory name and | |
the rest as a file name. | ||
When a data or model file's content changes, its address in the cache changes | ||
too. This allows DVC to keep track of the history of large data files. | |
DVC saved a copy of `data/data.xml` to `.dvc/cache/` using the first two | |
characters of its MD5 hash as the directory name and the rest as the file name. | |
> The default setting for DVC is to copy content both in the cache and the | ||
> workspace. To save storage space and time when dealing with large files, DVC | ||
> can use `symlinks`, `hardlinks`, or `reflinks`, depending on the file system. | ||
> Read more in the user guide for [Large Dataset | ||
> To save storage space and time when dealing with large files, DVC can use | ||
> `symlinks`, `hardlinks`, or `reflinks` depending on the file system. Read | ||
> more in the user guide for [Large Dataset | ||
> Optimization](https://dvc.org/doc/user-guide/large-dataset-optimization). |
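
The cache layout described in this step can be sketched in shell. This is only an illustration, not DVC's own code: `cache_path` is a hypothetical helper, and it assumes the `.dvc/cache/<first two characters>/<rest>` layout stated above, which may differ in other DVC versions.

```shell
# Hypothetical helper (not part of DVC): print the cache path that the layout
# described above would give a file, assuming .dvc/cache/<first 2 chars>/<rest>.
cache_path() {
  # md5sum prints "<hash>  <file>"; keep only the hash.
  hash=$(md5sum "$1" | cut -d ' ' -f 1)
  # The first two characters become the directory name, the rest the file name.
  dir=$(printf '%s' "$hash" | cut -c 1-2)
  file=$(printf '%s' "$hash" | cut -c 3-)
  printf '.dvc/cache/%s/%s\n' "$dir" "$file"
}
```

Running `cache_path data/data.xml` in the tutorial workspace should print a path that can be compared against the output of `tree .dvc/cache`.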
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,41 +1,33 @@ | ||
Technically speaking DVC is not a version control system. | ||
# Switching between Versions | ||
|
||
Although DVC can work without a VCS, history of `.dvc` files is better tracked | ||
with Git. Otherwise we may end up with files in the cache whose paths in the | ||
workspace are missing. | ||
Technically speaking, DVC is not a version control system. | ||
|
||
Git serves as the version control system for text and code files. DVC in turn | ||
creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in | ||
the workspace efficiently to match them. | ||
Git serves as the VCS for text and code files, although DVC can work without it. | ||
|
||
Let's get the previous version of the dataset `data/data.xml`: | ||
Let's get the previous version of `data/data.xml`: | ||
|
||
`git checkout HEAD^1 data/data.xml.dvc`{{execute}} | ||
`git checkout HEAD~ data/data.xml.dvc`{{execute}} | ||
|
||
When we check what changed using: | ||
|
||
`git diff --staged`{{execute}} | ||
|
||
we see that `data.xml.dvc` now contains the previous hash value of `data.xml`. | ||
|
||
We can see that the current hash and the hash value in `data.xml.dvc` are different. | |
We see that `data.xml.dvc` now contains the previous hash value of `data.xml`. | ||
|
||
`md5sum data/data.xml`{{execute}} | ||
|
||
To synchronize `data.xml` with the version addressed in `data.xml.dvc` we use | ||
To synchronize `data.xml` with the version addressed in `data.xml.dvc`: | ||
|
||
`dvc checkout`{{execute}} | ||
|
||
and this copies the previous version of `data.xml` from the local cache. The `dvc | |
checkout` command synchronizes data files in the workspace to match the content of the `.dvc` | |
files. Now we can see that the value in `data.xml.dvc` and the hash | |
value of `data.xml` are identical. | ||
Now we can see that the value in `data.xml.dvc` and the hash value of | ||
`data.xml` are identical. | |
|
||
``` | ||
grep 'md5' data/data.xml.dvc | ||
md5sum data/data.xml | ||
```{{execute}} | ||
Instead of `checkout` we can use `pull` again. The difference is that `dvc | ||
pull` also downloads missing data into the cache, while `dvc checkout` can only | |
restore data that is already in the cache. | |
Instead of `dvc checkout`, we could use `dvc pull` again. Pulling also | ||
downloads missing data from the remote storage, whereas `dvc checkout` can only | ||
restore data that's already in the local cache. |
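
The manual `grep` and `md5sum` comparison used in this step could be wrapped in a small shell function. This is a rough sketch under stated assumptions: `dvc_in_sync` is a hypothetical helper, not a DVC command, and it assumes the `.dvc` file holds a single `md5: <hash>` entry. If the file contains more than one `md5` field, only the first is read.

```shell
# Hypothetical helper (not a DVC command): succeed when the md5 recorded in a
# .dvc file equals the md5sum of the data file.
# Assumes the .dvc file has a single "md5: <hash>" entry.
dvc_in_sync() {
  data_file=$1
  dvc_file=$2
  # Take the first line mentioning md5 and strip everything up to the hash.
  recorded=$(grep 'md5' "$dvc_file" | head -n 1 | sed 's/.*md5: *//')
  # md5sum prints "<hash>  <file>"; keep only the hash.
  actual=$(md5sum "$data_file" | cut -d ' ' -f 1)
  [ "$recorded" = "$actual" ]
}
```

For example, `dvc_in_sync data/data.xml data/data.xml.dvc && echo "in sync"` would be expected to print nothing right after the `git checkout` above and `in sync` once `dvc checkout` has run.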