exp run: checks out data dependencies (destructive) [QA] #5593
This is a side effect of experiments being identified by an internal hash of the pipeline. We don't store these hashes for the parent commits (`HEAD`), so the first run will be treated as a unique experiment.
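For reference, a minimal sequence that shows the effect (a sketch, assuming a committed DVC pipeline and a clean workspace):

```
git status        # clean: workspace matches HEAD
dvc exp run       # still recorded as a new experiment (issue 1)
dvc exp show      # the new exp row shows the same params/metrics as HEAD
```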
edit: as noted in #5595, for …
This is needed for temp directory runs, since we have to …
The duplicate empty experiment issue (1) will be fixed in #5600.
On 2.
OK cool, but it can still be destructive. Should we at least warn users when there are changes in the workspace? Cc @dberenbaum
Good point, agreed. An alternative to keep both kinds of runs consistent would be to copy the workspace data to the tmp dirs... But having exps based on `HEAD` …
This seems reasonable, or suggesting users either …
I don't think data dependencies need to be based on `HEAD`.
The decision to not directly copy data here was for supporting CI use cases and eventual remote (machine) execution. The expected workflow there will be that you need to `dvc pull` the data. We also currently avoid duplicating data for the local tempdir runs (assuming the user is using any non-`copy` cache link type).
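For example, a rough sketch of the CI-style workflow this enables (assuming a configured DVC remote), plus the link-type config that avoids copies for local tempdir runs:

```
dvc pull       # in CI, materialize tracked data from the DVC remote first
dvc exp run    # then execute the experiment

# local tempdir runs avoid duplicating data when the cache uses links:
dvc config cache.type "reflink,symlink,hardlink,copy"
```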
Hey guys, just FYI I don't have any more points here. Can we decide what's the ideal solution, if any, to 2. and 3. above? (3 is new, but connected to 2.)
Basically the only possible problem of keeping this behavior, even with the warning, is that …
Thanks @jorgeorpinel. It's been a while since I thought about this issue. Coming back to it now, it's unclear to me if …
I see what you mean, @dberenbaum, but …
@pmrowla did explain how we got here ☝️ but why is that needed? It probably makes sense for queued runs …
I agree with Peter that behavior shouldn't differ between workspace vs temp runs. If we `dvc commit` whatever is in the workspace, it can be checked out in both workspace and temp runs without changing anything in the workspace.
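A minimal sketch of that idea:

```
dvc commit     # snapshot whatever is currently in the workspace into the cache
dvc checkout   # restoring from cache is now a no-op in the workspace,
               # and reproduces the same data in a temp dir
```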
Hmmm workspace runs vs queued, and esp. temp runs are already inherently different anyway. Is there a reason to seek that exact consistency? Because it seems we need to pick between that OR consistency with `repro`. Idk about …
It sounds like …
We would only be …
I think from the user's perspective, workspace or temp runs should not be inherently different. This can even be extended to remote execution - if I run …
@jorgeorpinel I'm not seeing the inconsistency with `repro` …
Oh OK, I see.
@pmrowla What I mean is that it's evident that queued runs may need to change the data in the workspace first, to reflect what was present when they were queued (which we agree is an intuitive user expectation). What I don't see is a need for regular runs to do the same (I find that unintuitive). And yes, a …
@dberenbaum …
p.s. AFAIK committing data is meant as part of the tracking mechanism (at the end of `repro`) …
The important thing here is that we are committing two different states though.
The "temp data" here is really "modified dependencies", and will end up being committed into cache anyways in the normal repro use case. Also, by definition, for experiments everything has to be either git-committed or saved to cache in order for experiments to actually work. If we allowed the user to force behavior where we do not cache or git-commit certain objects within an experiment, the user would not be able to actually restore/apply that entire experiment state later on. I think we are really just debating semantics here, and it seems like everyone agrees that for queued/temp dir runs, we should be doing:
For workspace runs, the issue is currently that we only do …
From an architecture perspective, IMO the actual execution (the steps taken to actually execute the user's pipeline) should be identical in all cases, whether it's in the user's workspace, on a completely different remote machine, or in a temporary directory. The only differences should be in setting up the environment for that execution, meaning that in this case the only extra step should be creating the temp directory.

Basically, what it comes down to is that the way experiments were designed (with remote executors in mind) is this: at the time we queue an experiment, we don't actually care whether we are going to run the experiment in the workspace or in a temp directory (or on some other remote machine). All we do is shove the requested experiment into our queue, to be executed at some point in the future (on any arbitrary machine/environment). Even for workspace runs, we do actually "queue" the experiment, meaning that this step is intended to be consistent regardless of how (or more specifically where) we end up running that experiment. So in order to keep this consistency, we should just do …

At actual execution time, our executor does the following: …
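In shell terms, a simplified sketch of that queue-then-execute flow (a paraphrase, not the literal step list):

```
dvc exp run --queue     # queue time: snapshot the requested experiment state
                        # (identical for workspace, temp, and remote runs)
dvc exp run --run-all   # execution time: each executor checks out the queued
                        # state, runs the pipeline, and commits the results
```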
This behavior is intended to be consistent everywhere, and it's important to remember that the …
True.
I didn't necessarily agree that we should commit data in either case. Originally this was about checking out deps to match some lockfile (needed due to the implementation for queued runs), which could be unexpected/destructive on regular exp runs. But thinking about it, we do say that exps are based on the last Git commit, so exp runs are different to `repro` …

My only other idea is to consider running all exps as …

Up to you guys, this is probably getting overcomplicated. But thanks for all the feedback!
If your concern is that the user's `dvc.lock` will be modified: this is the same kind of change that is made when you use `dvc commit`.

For queued runs, the user will never see any modified `dvc.lock` … For workspace runs, the only thing they will see is the same modified `dvc.lock` …

Again, I think we are confusing what will be an internal implementation detail with behavior that would affect the user. Doing this internal commit …
One thing that should be documented as a difference between workspace and temp runs: … This also needs to be reflected in the CLI; currently supplying the …
@jorgeorpinel Do you agree that dvc should keep whatever changes are in the working tree at the time of `exp run`? This is the intended behavior for both workspace and temp runs. You pointed out that it's not the current behavior because dvc-tracked deps are checked out to match the lockfile in the last commit, and I think we all agree that this is unexpected/destructive. The suggested implementation by @pmrowla is a way to achieve keeping all changes in the working tree and avoiding the unexpected behavior you noted.
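In other words, the intended behavior would look like this (file name hypothetical):

```
echo "tweak" >> data/train.csv   # hand-modify a dvc-tracked dependency
dvc exp run                      # intended: the run uses and keeps this change,
                                 # rather than checking the dep back out from cache
```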
Makes sense. I keep getting confused about that 🙂 Also means (internally) that the exp commit will have …
TBH I'm still not seeing how `--temp` …
I don't really see a reason why it should be hidden? Being able to do a run in the background while still being able to continue modifying whatever I want in my workspace while that job is running seems useful to me.
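For example (one possible way to run it detached; adjust to taste):

```
nohup dvc exp run --temp > exp.log 2>&1 &   # run in a temp dir, backgrounded
# keep editing the workspace freely; the temp-dir run is unaffected
```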
I see. For very long experiments that makes sense indeed! OK, I should document the use case then... ⌛
UPDATE: Mentioned the `--temp` use case in the docs:
explain --temp use case per iterative/dvc#5593 (comment)
The issue with creating a new `dvc.yaml` … It looks like this is not currently documented, see: #5029
For workspace runs, since the files are present in the workspace itself, this issue doesn't show up; but for temp directory runs, we only populate the directory with files which are tracked in either git or DVC, and untracked files are ignored. The reason for this is that the alternative would be for us to explicitly include all of the untracked files in our experiment git commits. Doing so would likely result in us git-committing objects which do not belong in git (like large binary files, or things like venv dirs or …)
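One possible workaround sketch for a file an experiment needs but git doesn't track, assuming the file is small and safe to put under git control (file name hypothetical):

```
git add params_local.yaml   # put the untracked file under git's control
dvc exp run --temp          # now it is included in the temp-dir checkout
```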
OK let me look into that ⌛
Agree that this seems like a slightly different issue.

For 2 (data dependencies), it seems that the focus is on dependencies already tracked by dvc, where their hashes and other metadata are already tracked by Git in `dvc.lock` …

For 3 (code/yaml changes), it seems that the focus is on untracked files. Files not tracked by Git will only be reflected in workspace experiments and not in temp experiments. Workspace experiments will use whatever is in the workspace, but temp experiments copy files to the temp dir by checking them out with Git. In this case, I think it's unclear what the expected behavior is, and @pmrowla noted some of the challenges: …
Just a few days ago, someone on Discord was running into this problem when trying an experiment that relied on credentials in ignored files.
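A quick way to check what would (and would not) make it into a temp-dir run:

```
git ls-files            # git-tracked files: these reach temp-dir runs
git status --short      # untracked files ("??") will not
git status --ignored    # ignored files (e.g. credentials) will not either
```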
OK, actually I'm going to move this new conversation to a separate issue: #5801
What's the status on the remaining part of this issue (checking out data dependencies)? @jorgeorpinel do you still have questions, and are you okay with the proposed solution?
No more Qs and yes! I agree with the proposed solution 👍
1. It produces a first unchanged exp

UPDATE: Addressed in #5600

This works even when there are no changes to the committed project version (`HEAD`). Below we can see there are no differences in metrics or params: …

Is there a use case for this? Otherwise I'd vote to block it.
2. It checks out data dependencies (destructive)
Even when I changed a dependency, which could be the basis for my experiment, all the pipeline data was checked out again (undoing my manual changes), so this `exp` is the same as the previous one (1).

BTW if this was the first time I `exp run`, it would be easy to miss that the data I changed was restored silently in the process. I'd just see that the exp results are the same as in `HEAD`, which would be misleading.
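A sketch of that sequence (file name hypothetical):

```
echo "manual tweak" >> data/train.csv   # hand-edit a dvc-tracked dependency
dvc exp run                             # observed: the dep is checked out from
                                        # cache first, silently undoing the tweak
dvc exp show                            # exp looks identical to HEAD (misleading)
```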
2.2 Does it really behave exactly like `repro`?

https://dvc.org/doc/command-reference/exp/run reads: …
We also say: …
But when I change an `add`ed data file and try `exp run`, it undoes those changes and reports that no stage has changed. `repro` re-adds it instead, and then runs any stage downstream.

3. Not all changes to "code" can be queued

In the docs we say "Before running an experiment, you'll probably want to make modifications such as data and code updates, or hyperparameter tuning." Is this the intended behavior?
Because, if we see dvc.yaml as code — and I think we do, as one of DVC's principles is pipeline codification — then this statement isn't completely true when it comes to queueing experiments. It works with regular experiments though, which use the workspace files (not a tmp copy), which makes me think this may be unintended (as we want all kinds of experiments to be consistent in behavior AFAIK).
Specifically, if you create (or modify) dvc.yaml between queued runs and then try to `--run-all`, you get errors. Example: …
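Roughly the sequence that triggers it (stage details omitted):

```
dvc exp run --queue     # queue a run against the current dvc.yaml
# ...now create or modify dvc.yaml (e.g. add a new stage)...
dvc exp run --queue     # queue another run that expects the new stage
dvc exp run --run-all   # errors out, per this report
```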