Commit cmdref updates per 1.x #1626

Merged 22 commits on Aug 18, 2020
Changes from 9 commits
6 changes: 3 additions & 3 deletions content/docs/command-reference/checkout.md
@@ -130,7 +130,7 @@ below.

The workspace looks like this:

````dvc
```dvc
.
├── README.md
├── data
@@ -145,8 +145,8 @@ The workspace looks like this:
├── featurization.py
├── prepare.py
├── requirements.txt
└── train.py```
````
└── train.py
```
jorgeorpinel marked this conversation as resolved.

This repository includes the following tags that represent different variants
of the resulting model:
136 changes: 69 additions & 67 deletions content/docs/command-reference/commit.md
@@ -1,7 +1,7 @@
# commit

Record changes to DVC-tracked files in the <abbr>project</abbr>, by updating
[DVC-files](/doc/user-guide/dvc-files-and-directories) and saving
Record changes to DVC-tracked files in the <abbr>project</abbr>, by updating the
[stages](/doc/command-reference/run) (or `.dvc` files) and saving
Contributor

It doesn't change stages, can you check what it does and try again? 🙂
And no need for parentheses around "and .dvc files".

Contributor Author

I think commit only saves data to cache, it doesn't update dvc.lock or .dvc files. But the previous definition had the updating part so I didn't change it.

Contributor

Hey sorry, I was out a few days. I'm back, let me review this again.

Contributor

Continued this conversation in #1626 (review).

<abbr>outputs</abbr> to the <abbr>cache</abbr>.

## Synopsis
@@ -21,9 +21,9 @@ positional arguments:
The `dvc commit` command is useful for several scenarios, when data already
tracked by DVC changes: when a [stage](/doc/command-reference/run) or
[pipeline](/doc/command-reference/pipeline) is in development/experimentation;
when manually editing or generating DVC <abbr>outputs</abbr>; or to force
DVC-file updates without reproducing stages or pipelines. These scenarios are
further detailed below.
when manually editing or generating DVC <abbr>outputs</abbr>; or to force update
the stages (or `.dvc` files) without reproducing the stages or pipelines. These
jorgeorpinel marked this conversation as resolved.
scenarios are further detailed below.

- Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the `--no-commit` option of
@@ -36,23 +36,26 @@ further detailed below.
- It's always possible to manually execute the source code used in a stage
without DVC (outputs should be unprotected or removed first in certain cases,
see `dvc unprotect`). Once a desirable result is reached, use `dvc add` or
`dvc commit` as appropriate to update DVC-files and store changed data to the
cache.
`dvc commit` as appropriate to update the corresponding stage (or `.dvc` file)
and store changed data to the cache.
jorgeorpinel marked this conversation as resolved.

- Sometimes we want to edit source code, config, or data files in a way that
doesn't cause changes in the results of their data pipeline. We might add
code comments, change indentation, remove some debugging printouts, or make any
other change that doesn't alter the stage outputs. However, DVC will
notice that some <abbr>dependencies</abbr> have changed, and expect you to
reproduce the whole pipeline. If you're sure no pipeline results would change,
just use `dvc commit` to force update the related DVC-files and cache.
just use `dvc commit` to force update the related stage (or `.dvc` file) and
cache.
Contributor

Same

Contributor Author

Changed this to save updated DVC-tracked files to the cache.

Contributor

@jorgeorpinel jorgeorpinel Aug 4, 2020

Same. Continued in #1626 (review)


Let's take a closer look at what happens in the first scenario. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating a DVC-file. What _commit_ means is that DVC:
<abbr>cache</abbr> after creating or updating the `dvc.yaml` (or `.dvc` file).
What _commit_ means is that DVC:
Contributor

OK, this is closer but still not quite right. Is dvc.yaml the relevant special file to mention?

Contributor Author

@imhardikj imhardikj Jul 30, 2020

Actually dvc.lock and .dvc will be updated if we run repro after making changes in some files (deps).
And dvc.yaml will be updated in case we run a stage for first time.

Contributor Author

Here I think instead of mentioning all three(which I did currently), it can be updating the stage or .dvc file😅

Contributor

"updating the stage" would be incorrect. Again, commit does not change the stage at all. A stage = a command, its dependencies (paths), and outputs (paths).

Contributor

Continued in #1626 (review).


- Computes a hash for the file/directory.
- Enters the hash value and file name into the DVC-file.
- Enters the hash value and file name in the corresponding stage (or `.dvc`
file).
jorgeorpinel marked this conversation as resolved.
- Tells Git to ignore the file/directory (adding them to `.gitignore`). (Note
that if the <abbr>project</abbr> was initialized with no Git support
(`dvc init --no-scm`), this does not happen.)
@@ -62,26 +65,26 @@ There are many cases where the last step is not desirable (for example rapid
iterations on an experiment). The `--no-commit` option prevents the last step
from occurring (on the commands where it's available), saving time and space by
not storing unwanted <abbr>data artifacts</abbr>. The file hash is still
computed and added to the DVC-file, but the actual data file is not saved in the
cache. This is where the `dvc commit` command comes into play. It performs that
last step (saving the data in cache).
computed and added to the particular stage (or `.dvc` file), but the actual data
file is not saved in the cache. This is where the `dvc commit` command comes
into play. It performs that last step (saving the data in cache).
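
The split between recording a hash and storing the data can be pictured with a small toy model. This is only a sketch under assumptions, not DVC's implementation: `file_md5`, `record_hash` and `commit` are hypothetical helper names, the `lock` dict merely stands in for the hash entries DVC writes to its files, and the cache layout follows the "first two characters as subdirectory" rule described in the example later in this document.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_hash(path: Path, lock: dict) -> str:
    """Toy version of what still happens with --no-commit:
    the hash is computed and written down, but no data is stored."""
    lock[str(path)] = file_md5(path)
    return lock[str(path)]


def commit(path: Path, lock: dict, cache_dir: Path) -> Path:
    """Toy version of the deferred last step:
    copy the data into the cache under the recorded hash."""
    md5 = lock[str(path)]
    target = cache_dir / md5[:2] / md5[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(path, target)
    return target
```

In this model, calling `record_hash` repeatedly during experimentation costs no cache space; only the final `commit` call copies anything.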

Note that it's best to avoid the last two scenarios. They essentially
force-update the [DVC-files](/doc/user-guide/dvc-files-and-directories) and save
data to cache. They are still useful, but keep in mind that DVC can't guarantee
reproducibility in those cases.
force-update the related stage (or `.dvc` file) and save data to cache. They are
still useful, but keep in mind that DVC can't guarantee reproducibility in those
cases.

## Options

- `-d`, `--with-deps` - determines files to commit by tracking dependencies to
the target DVC-files (stages). If no `targets` are provided, this option is
ignored. By traversing all stage dependencies, DVC searches backward from the
target stages in the corresponding pipelines. This means DVC will not commit
files referenced in later stages than the `targets`.
the target stages (or `.dvc` files). If no `targets` are provided, this option
jorgeorpinel marked this conversation as resolved.
is ignored. By traversing all stage dependencies, DVC searches backward from
the target stages in the corresponding pipelines. This means DVC will not
commit files referenced in later stages than the `targets`.

- `-R`, `--recursive` - determines the files to commit by searching each target
directory and its subdirectories for DVC-files to inspect. If there are no
directories among the `targets`, this option is ignored.
directory and its subdirectories for stages (or `.dvc` files) to inspect. If
jorgeorpinel marked this conversation as resolved.
there are no directories among the `targets`, this option is ignored.

- `-f`, `--force` - commit data even if hash values for dependencies or outputs
did not change.
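
One way to picture the backward search that `--with-deps` is described as doing is a plain graph walk from the targets toward their upstream stages. The following is a minimal sketch, not DVC's code: `stages_to_commit` is a hypothetical function, and the `UPSTREAM` map is an assumed pipeline shape that borrows the stage names used in the examples of this document.

```python
def stages_to_commit(targets, upstream):
    """Collect the targets plus every stage they depend on,
    directly or indirectly. Later (downstream) stages are never visited."""
    selected = set()
    pending = list(targets)
    while pending:
        stage = pending.pop()
        if stage in selected:
            continue
        selected.add(stage)
        # Walk backward: queue the stages whose outputs this stage consumes.
        pending.extend(upstream.get(stage, []))
    return sorted(selected)


# Assumed pipeline for illustration: prepare -> featurize -> train -> evaluate
UPSTREAM = {
    "featurize": ["prepare"],
    "train": ["featurize"],
    "evaluate": ["featurize", "train"],
}

print(stages_to_commit(["train"], UPSTREAM))
```

With `train` as the target, the walk reaches `featurize` and `prepare` but never `evaluate`, which matches the statement that files referenced in stages later than the `targets` are not committed.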
@@ -140,83 +143,83 @@ stage with `dvc run --no-commit`, or reproduce an entire pipeline using
development of the stage is finished, `dvc commit` can be used to store data
files in the cache.

In the `featurize.dvc` stage, `src/featurize.py` is executed. A useful change to
make is adjusting a parameter to `CountVectorizer` in that script. Namely,
adjusting the `max_features` value in the line below changes the resulting
model:
In the `featurize` stage, `src/featurization.py` is executed. A useful change to
make is adjusting the `max_features` parameter to `CountVectorizer` in that
script. The parameters are defined in the `params.yaml` file. Updating the value of
`max_features` to 6000 in `params.yaml` changes the resulting model:
jorgeorpinel marked this conversation as resolved.

```python
bag_of_words = CountVectorizer(stop_words='english',
max_features=6000, ngram_range=(1, 2))
```
jorgeorpinel marked this conversation as resolved.
featurize:
max_features: 6000
ngrams: 2
```

This edit introduces a change that would cause the `featurize.dvc`, `train.dvc`
and `evaluate.dvc` stages to execute if we ran `dvc repro`. But if we want to
try several values for `max_features` and save only the best result to the
cache, we can run it like this:
This edit introduces a change that would cause the `featurize`, `train` and
`evaluate` stages to execute if we ran `dvc repro`. But if we want to try
several values for `max_features` and save only the best result to the cache, we
can run it like this:

```dvc
$ dvc repro --no-commit evaluate.dvc
$ dvc repro --no-commit evaluate
```
jorgeorpinel marked this conversation as resolved.

We can run this command as many times as we like, editing `featurize.py` any way
we like, and so long as we use `--no-commit`, the data does not get saved to the
cache. Let's verify that's the case:
We can run this command as many times as we like, editing `featurization.py` any
jorgeorpinel marked this conversation as resolved.
way we like, and so long as we use `--no-commit`, the data does not get saved to
the cache. Let's verify that's the case:

First verification:

```dvc
$ dvc status

evaluate.dvc:
changed deps:
modified: data/features
modified: model.pkl
train.dvc:
changed outs:
not in cache: model.pkl
featurize:
changed outs:
not in cache: data/features
train:
changed outs:
not in cache: model.pkl
```
jorgeorpinel marked this conversation as resolved.

Now we can look in the cache directory to see if the new version of `model.pkl`
is indeed _not in cache_ as claimed. Look at `train.dvc` first:
is indeed _not in cache_ as claimed. Let's look at the latest state of `train`
in `dvc.lock` first:
jorgeorpinel marked this conversation as resolved.

```yaml
cmd: python src/train.py data/features model.pkl
deps:
- md5: d05e0201a3fb47c878defea65bd85e4d
path: src/train.py
- md5: b7a357ba7fa6b726e615dd62b34190b4.dir
path: data/features
md5: b91b22bfd8d9e5af13e8f48523e80250
outs:
- cache: true
md5: 70599f166c2098d7ffca91a369a78b0d
metric: false
path: model.pkl
persist: false
wdir: .
train:
cmd: python src/train.py data/features model.pkl
deps:
- path: data/features
md5: de03a7e34e003e54dde0d40582c6acf4.dir
- path: src/train.py
md5: ad8e71b2cca4334a7d3bb6495645068c
params:
params.yaml:
train.n_estimators: 100
train.seed: 20170428
outs:
- path: model.pkl
md5: 9aba000ba83b341a423a81eed8ff9238
```

To verify this instance of `model.pkl` is not in the cache, we must know the
path to the cached file. In the cache directory, the first two characters of the
hash value are used as a subdirectory name, and the remaining characters are the
file name. Therefore, had the file been committed to the cache, it would appear
in the directory `.dvc/cache/70`. Let's check:
in the directory `.dvc/cache/9a`. Let's check:

```dvc
$ ls .dvc/cache/70
ls: .dvc/cache/70: No such file or directory
$ ls .dvc/cache/9a
ls: .dvc/cache/9a: No such file or directory
```
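
The path derivation described above (first two hash characters as the subdirectory, the rest as the file name) can also be written down as a short helper. This is a sketch of the layout as this document describes it; `cache_path` and `is_cached` are hypothetical names, and the default `.dvc/cache` location is taken from the example.

```python
from pathlib import Path


def cache_path(md5: str, cache_dir: str = ".dvc/cache") -> Path:
    """Map a hash value to the location its data would have in the cache."""
    return Path(cache_dir) / md5[:2] / md5[2:]


def is_cached(md5: str, cache_dir: str = ".dvc/cache") -> bool:
    """Rough equivalent of the `ls` check: is there a file at that path?"""
    return cache_path(md5, cache_dir).is_file()


# Hash of model.pkl taken from the dvc.lock excerpt in this example.
print(cache_path("9aba000ba83b341a423a81eed8ff9238").as_posix())
```

For the `model.pkl` hash in the `dvc.lock` excerpt, the helper points at `.dvc/cache/9a/ba000ba83b341a423a81eed8ff9238`, the same file name that `ls` lists once the data has been committed.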

If we've determined the changes to `featurize.py` were successful, we can
If we've determined the changes to `featurization.py` were successful, we can
jorgeorpinel marked this conversation as resolved.
execute this set of commands:

```dvc
$ dvc commit
$ dvc status
Data and pipelines are up to date.
$ ls .dvc/cache/70
599f166c2098d7ffca91a369a78b0d
ba000ba83b341a423a81eed8ff9238
```

We've verified that `dvc commit` has saved the changes into the cache, and that
@@ -226,8 +229,7 @@ the new instance of `model.pkl` is there.

It is also possible to execute the commands that are executed by `dvc repro` by
hand. You won't have DVC helping you, but you have the freedom to run any
command you like, even ones not defined in a
[DVC-file](/doc/user-guide/dvc-files-and-directories). For example:
command you like, even ones not defined in `dvc.yaml`. For example:
jorgeorpinel marked this conversation as resolved.

```dvc
$ python src/featurization.py data/prepared data/features
2 changes: 1 addition & 1 deletion content/docs/command-reference/remove.md
@@ -81,7 +81,7 @@ Let's imagine we have a stage named `train` in our `dvc.yaml` file, and
corresponding files in the <abbr>workspace</abbr>:

```yaml
test:
train:
cmd: python train.py data.py
jorgeorpinel marked this conversation as resolved.
deps:
- data.csv