Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp save: initial implementation #8599

Merged
merged 4 commits into from
Nov 28, 2022
Merged

exp save: initial implementation #8599

merged 4 commits into from
Nov 28, 2022

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Nov 21, 2022

copy of #8408

Makes it possible to save changes in the current workspace as a dvc experiment.

closes #6746

@codecov
Copy link

codecov bot commented Nov 21, 2022

Codecov Report

Base: 93.89% // Head: 94.11% // Increases project coverage by +0.22% 🎉

Coverage data is based on head (4fb79dd) compared to base (8124396).
Patch coverage: 95.07% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8599      +/-   ##
==========================================
+ Coverage   93.89%   94.11%   +0.22%     
==========================================
  Files         432      435       +3     
  Lines       33159    33298     +139     
  Branches     4663     4677      +14     
==========================================
+ Hits        31133    31340     +207     
+ Misses       1582     1528      -54     
+ Partials      444      430      -14     
Impacted Files Coverage Δ
dvc/commands/experiments/__init__.py 100.00% <ø> (ø)
dvc/commands/experiments/save.py 82.14% <82.14%> (ø)
dvc/repo/experiments/executor/local.py 90.79% <94.11%> (+2.24%) ⬆️
dvc/repo/experiments/save.py 100.00% <100.00%> (ø)
tests/func/experiments/test_experiments.py 99.72% <100.00%> (ø)
tests/func/experiments/test_save.py 100.00% <100.00%> (ø)
tests/unit/command/test_experiments.py 99.59% <100.00%> (+0.01%) ⬆️
tests/remotes/git_server.py 33.33% <0.00%> (-9.10%) ⬇️
dvc/testing/fixtures.py 89.07% <0.00%> (-6.73%) ⬇️
dvc/analytics.py 91.04% <0.00%> (ø)
... and 18 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@daavoo daavoo force-pushed the 8408 branch 4 times, most recently from 4d6520e to a7dc375 Compare November 24, 2022 12:16
@daavoo daavoo marked this pull request as ready for review November 24, 2022 12:17
@daavoo daavoo requested a review from dberenbaum November 24, 2022 12:17
@daavoo daavoo self-assigned this Nov 24, 2022
@daavoo daavoo added A: experiments Related to dvc exp feature is a feature labels Nov 24, 2022
@dberenbaum
Copy link
Collaborator

Thanks @daavoo!

I'm still seeing issues with files not being saved by the experiment:

# Add stage to dvc.yaml
$ dvc stage add -q -n exp_save -M metrics.yaml 'echo "foo: 1" > metrics.yaml'
# Add random file
$ echo misc > misc
# Generate metrics
$ dvc repro -q
# Track everything with git
$ git add .
# Save experiment
$ dvc exp save -q
# Stash everything from the workspace
$ git stash -u
# Apply the experiment to recover saved files
$ dvc exp apply exp-48145
# Metrics aren't saved
$ dvc exp show
 ────────────────────────────────────
  Experiment                Created
 ────────────────────────────────────
  workspace                 -
  main                      09:14 AM
  └── d453430 [exp-48145]   09:14 AM
 ────────────────────────────────────
# No files were recovered during apply
$ git status
On branch main
nothing to commit, working tree clean

There are 2 issues:

  1. dvc exp run saves metrics.yaml, dvc.yaml, and dvc.lock regardless of whether they are added to git, but dvc exp save loses all of these. It seems like this is needed for it to be useful the first time the experiment is saved.
  2. Neither dvc exp run or dvc exp save saves arbitrary new files, even if they are added to git. This is related to the idea from the previous PR that it should include all untracked files. It might not be a blocker for now if solving the first issue is enough for dvclive, but it seems like there needs to be some mechanism to save new files to an experiment (either all untracked files or at least those that have been added to git).

repo: "Repo",
name: Optional[str] = None,
force: bool = False,
include_untracked: Optional[List[str]] = None,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2. Neither dvc exp run or dvc exp save saves arbitrary new files, even if they are added to git. This is related to the idea from the previous PR that it should include all untracked files. It might not be a blocker for now if solving the first issue is enough for dvclive, but it seems like there needs to be some mechanism to save new files to an experiment (either all untracked files or at least those that have been added to git).

@dberenbaum To cover the DVCLive scenario I included this option (which is not exposed in the CLI but I can expose it).

This allows to pass pattern(s) of (potentially) untracked files to run git add internally.

@daavoo
Copy link
Contributor Author

daavoo commented Nov 24, 2022

1. `dvc exp run` saves `metrics.yaml`, `dvc.yaml`, and `dvc.lock` regardless of whether they are added to git, but `dvc exp save` loses all of these. It seems like this is needed for it to be useful the first time the experiment is saved.

dvc exp run will not save dvc.lock in the experiment ref if you run the same steps, replacing exp save with exp run. After exp apply you will only see dvc.yaml and metrics.yaml.

It will include in the ref the dvc.yaml because it is needed for executing the pipeline and metrics.yaml because it's produced by the execution pipeline. dvc.lock will be present in the workspace as an untracked file as a result of the execution, but it will not be the dvc.lock produced by repro -q, that one is lost.

I don't mean to dismiss the point, but in exp save there is no execution at all, so the differences in behavior kind of make sense to me. If we want to match exp run, I guess we could default to use include_untracked with dvc.lock, dvc.yaml and all outputs in the pipeline?

In the DVCLive scenario, DVCLive will take care of including the untracked files by #8599 (comment)

In the DVC scenario, I can expose the same option to the CLI, and/or we can be very explicit in docs about the behavior of staged and/or untracked files.

I think the UI is already clear. These are the current warnings:

# staged
$ dvc exp save
WARNING: Your workspace contains staged Git changes which will be unstaged before saving this experiment.
WARNING: The following untracked files were present in the workspace before saving but will not be included in the experiment commit:
	metrics.yaml, dvc.lock, dvc.yaml
# untracked
$ dvc exp save
WARNING: The following untracked files were present in the workspace before saving but will not be included in the experiment commit:
	metrics.yaml, dvc.lock, dvc.yaml

@daavoo daavoo force-pushed the 8408 branch 3 times, most recently from e33034e to bcf3f66 Compare November 25, 2022 11:01
Comment on lines +71 to +78
save_parser.add_argument(
"-I",
"--include-untracked",
action="append",
default=[],
help="List of untracked paths to include in the experiment.",
metavar="<path>",
)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dberenbaum

$ dvc stage add -q -n exp_save -M metrics.yaml 'echo "foo: 1" > metrics.yaml'
$ echo misc > misc    
$ dvc repro -q
$ dvc exp save -I "misc" -I "metrics.yaml" -I "dvc.lock" -I "dvc.yaml" 
$ git stash -u
$ dvc exp apply exp-48145  
$ git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
	dvc.lock
	dvc.yaml
	metrics.yaml
	misc
$ dvc exp show
dvc exp show
 ────────────────────────────────────────── 
  Experiment                Created    foo  
 ────────────────────────────────────────── 
  workspace                 -            1  
  master                    11:57 AM     -  
  └── 71d7388 [exp-48145]   11:58 AM     1  
 ────────────────────────────────────────── 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In DVCLive I am hardcoding the option to include_untracked=self.dir

Copy link
Collaborator

@dberenbaum dberenbaum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo I'll open a follow-up issue to discuss how to handle untracked files. For now, as long as it works with DVCLive, let's merge it. Having a public CLI command isn't even a requirement right now.

@daavoo daavoo enabled auto-merge (rebase) November 28, 2022 12:59
dtrifiro and others added 3 commits November 28, 2022 14:17
Remove metrics.

Move tests to separate file.

Fix `experiments.get_exact_name` usage.

Remove unused checkpoint logic.
Staged changes were causing a merge error.
Warn and unstage instead of erroring, to match exp run behavior.
@daavoo daavoo force-pushed the 8408 branch 2 times, most recently from 6d61790 to 1144cfa Compare November 28, 2022 13:22
Allow to include a list of potentially untracked files.
Covers the DVCLive use case where in the first run the generated files won't be tracked in Git yet.
)
ref: Optional[str] = dvc.scm.get_ref(EXEC_BRANCH, follow=False)
exp_ref = ExpRefInfo.from_ref(ref) if ref else None
untracked = dvc.scm.untracked_files()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note: this check might slow down exp save considerably when large untracked directories are present, but I guess it's fine for now

queue = repo.experiments.workspace_queue
logger.debug("Saving workspace in %s", os.getcwd())

staged, _, _ = repo.scm.status(untracked_files="no")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we're checking for untracked files above, how about:

Suggested change
staged, _, _ = repo.scm.status(untracked_files="no")
staged, unstaged,untracked = repo.scm.status(untracked_files="no")
if untracked:
logger.warning(
"The following untracked files were present in "
"the workspace before saving but "
"will not be included in the experiment commit:\n"
"\t%s",
", ".join(untracked),
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this change and remove the other call, but we need to handle include_untracked

@daavoo daavoo merged commit a036eaf into main Nov 28, 2022
@daavoo daavoo deleted the 8408 branch November 28, 2022 14:49
@dberenbaum
Copy link
Collaborator

@daavoo @dtrifiro Can you open a docs ticket for this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp feature is a feature
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

exp save: save workspace state to an experiment ref
3 participants