
ML experiments and hyperparameters tuning #2799

Closed
dmpetrov opened this issue Nov 16, 2019 · 68 comments
Labels: feature request (Requesting a new feature)

@dmpetrov
Member

dmpetrov commented Nov 16, 2019

UPDATE: Skip to #2799 (comment) for a summary and updated requirements, and #2799 (comment) for the beginning of the implementation discussion.

Problem

There are a lot of discussions on how to manage ML experiments with DVC. Today's DVC design supports ML experiments through Git-based primitives such as commits and branches. This works nicely for large ML experiments, when code writing and testing are required. However, this model is too heavy for the hyperparameter tuning stage, when the user makes dozens of small, one-line changes in config or code. Users don't want to have dozens of Git commits or branches.

Requirements

A lightweight abstraction needs to be created in DVC to support hyperparameter-like tiny experiments without Git commits. The hyperparameter tuning stage can be considered a separate user activity outside of the Git workflow, but the result of this activity still needs to be managed by Git, preferably in a single commit.

High-level requirements for the hyperparameter tuning stage:

  1. Run. Run dozens of experiments without committing any results into Git while keeping track of all the experiments. Each experiment includes a small config or code change (usually 1-2 lines).
  2. Compare. A user should be able to compare two experiments: see diffs for code (and probably metrics).
  3. Visualize. A user should be able to see all the experiments' results: the metrics that were generated. It might be a table with metrics or a graph. A CSV table needs to be supported for custom visualization.
  4. Propagate. Choose "the best" experiment (not necessarily the one with the highest metrics) and propagate it to the workspace (bring in all the config and code changes; important: without retraining). Then it can be committed to Git. This is the final result of the current hyperparameter tuning stage. After that, the user can continue to work with the project in a regular Git workflow.
  5. Store. Some (or all) of the experiments might still be useful (in addition to "the best" one). A user should be able to commit them to Git as well, preferably in a single commit to keep the Git history clean.
  6. Clean. Experiments that are not useful should be removed along with all the code and data artifacts that were created. A special subcommand of dvc gc might be needed.
  7. [*] Parallel. In some cases, the experiments can be run in parallel, which aligns with DVC's parallel-execution plans: Running DVC in production #2212, repro: add scheduler for parallelising execution jobs #755. This might not be implemented now (in the 1st version of this feature), but it is important that this new lightweight abstraction support parallel execution.
  8. Group. Iterations of hyperparameter tuning might not be related to each other and may need to be managed and visualized separately. Experiments need to be grouped somehow.
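As an illustration of the Visualize requirement, collating per-experiment metrics into a single CSV table is straightforward; a minimal sketch (the experiment names and metric values are made up):

```python
import csv
import io

def collate_metrics(experiments):
    """Collate per-experiment metric dicts into a single CSV table.

    `experiments` maps an experiment name to its metrics dict, e.g. the
    parsed contents of a metrics file produced by each run.
    """
    # Take the union of metric names so the table has a stable header.
    columns = sorted({k for metrics in experiments.values() for k in metrics})
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["experiment"] + columns)
    for name, metrics in sorted(experiments.items()):
        # Missing metrics show up as empty cells rather than errors.
        writer.writerow([name] + [metrics.get(c, "") for c in columns])
    return buf.getvalue()

# Hypothetical metric values, for illustration only.
print(collate_metrics({
    "tune_lr_v1": {"auc": 0.91, "loss": 0.34},
    "tune_lr_v2": {"auc": 0.93, "loss": 0.29},
}))
```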

What should NOT be covered by this feature?

This feature is NOT about hyperparameter grid search. In most cases, hyperparameter tuning is done by users manually, using "smart" assumptions and hypotheses about the hyperparameter space. Grid search can be implemented on top of this feature/command using bash, for example.

  1. The ability to run the experiments from bash might also be a requirement for this feature request.
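A driver for grid search on top of this feature might look like the following sketch. The config.json format and exp_NNN naming are illustrative, and the actual run command is shown only as a comment:

```python
import itertools
import json
import os
import tempfile

def prepare_grid(root, grid):
    """Create one experiment dir per combination in `grid` (a dict of
    param name -> list of values) and drop a config.json into each."""
    names = sorted(grid)
    created = []
    for i, values in enumerate(itertools.product(*(grid[n] for n in names))):
        exp_dir = os.path.join(root, f"exp_{i:03d}")
        os.makedirs(exp_dir)
        with open(os.path.join(exp_dir, "config.json"), "w") as f:
            json.dump(dict(zip(names, values)), f)
        # In a real setup, each prepared dir would then be executed, e.g.:
        #   subprocess.run(["dvc", "repro"], cwd=exp_dir, check=True)
        created.append(exp_dir)
    return created

root = tempfile.mkdtemp()
dirs = prepare_grid(root, {"lr": [0.1, 0.01], "layers": [2, 4]})
print(len(dirs))  # 2 x 2 parameter combinations
```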

Possible implementations

This is an open question but many data scientists create directories for each of the experiments. In some cases, people create directories for a group of experiments and then experiments inside. We can use some of these ideas/practices to better align with users' experience and intuition.

Actions

This is a high-level feature request (epic). The requirements and an initial design need to be discussed and more feature requests need to be created. @iterative/engineering please share your feedback. Is something missing here?

EDITED:

Related issues

#2379
#2532
#1018 can be relevant (?)
Discussion

@dmpetrov dmpetrov added the feature request Requesting a new feature label Nov 16, 2019
@casperdcl
Contributor

casperdcl commented Nov 16, 2019

I think I almost-but-not-quite understand the aim here. I feel like I'm missing some key concept.

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Each of the experiments includes a small config change or code change (usually, 1-2 lines).

This could be satisfied, for example, by a bash script looping through param choices with @nteract/papermill for notebook users. I think it would be quite hard to write a tool that does this in a language/platform-agnostic way. It's hard enough with papermill, which is pretty niche.

To be all-encompassing we'd have to wind up supporting multiple ways of passing in params: env vars, CLI args, sed -r 's/<search>/<repl>/g', and (nightmare) language-specific ways.
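For instance, the sed-style substitution amounts to something like this: a toy helper that only understands `name = value` lines, which is exactly why the language-specific cases get nightmarish:

```python
import re

def substitute_param(source_text, name, value):
    """Replace `name = <anything>` with `name = <value>`.

    Assumes a Python-ish `name = value` syntax; a real tool would need
    per-language rules, which is the concern raised above.
    """
    pattern = rf"^{re.escape(name)}\s*=.*$"
    return re.sub(pattern, f"{name} = {value}", source_text, flags=re.M)

code = "lr = 0.1\nepochs = 10\n"
print(substitute_param(code, "lr", 0.01))
```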

see diffs for code (and probably metrics)

Again a papermill-like approach (bash script spawning multiple notebooks and kernels, each with different params, each outputting a dvc metrics-like file) could do this

some table with metrics or a graph. CSV table needs to be supported for custom visualization.

Would need to create a formal metrics specification, or at least be very intelligent about automatically interpreting and visualising whatever the end-users throw at us.

Choose "the best" experiment (not necessarily the highest metrics) and propagate it to the workspace

Not sure how "best" can be automated with "not necessarily the highest metrics"

  5. Store. / 6. Clean. / 7. [*] Parallel.

All could be handled by the bash script.

Experiments need to be grouped somehow.

Probably part of any potential formal metrics spec.

This feature is NOT about the hyperparameter grid-search

and

create directories for each of the experiments [...] directories for a group of experiments

Really seems like end-users writing bash/batch scripts would solve this.


Overall I feel like this has two requirements:

  1. implement (or create) a formal metrics spec (which we can then use for visualisations etc)
  2. document/add a tutorial for writing scripts to manage multiple experiments

I'd be against designing (1) from scratch owing to:

Also vaguely related maybe worth considering org-wide project boards (https://github.com/orgs/iterative/projects) for managing epics as well as cross-repo issues (e.g. iterative/dvc.org#765 and iterative/example-versioning#5)

@dmpetrov
Member Author

@casperdcl good questions but let's start with the major one:

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Let's imagine you are jumping into the hyperparameter tuning stage. You need to run a few experiments. You don't know in advance how many experiments are needed. Usually it takes 10-20, but it might easily take 50-100.

Questions:

  1. What abstraction would you choose? Commits to master? A new branch with commits in it? Is it okay for you to have 50 commits in a row?
  2. You end up having 50 commits. How do you get all the results and compare them to find the best one?
  3. If Git abstractions work and new standards are not needed, why does a big portion of data scientists (including ex-developers) not use them, preferring to create 50 dirs instead of 50 commits?

@casperdcl
Contributor

casperdcl commented Nov 17, 2019

Run dozens of experiments without committing any results into Git while keeping track of all the experiments.

This seems to be almost a contradiction - the most robust way to "keep track" is to commit separately.

Ah I think we were both not using accurate language :) You do indeed want to commit results in some form (metrics for each experiment/summary of metrics/metadata to allow easy reproduction of experiments - which could just be the looping script). You don't necessarily want to commit runs (saved models, generated tweaked source code).

And when I said commit separately I should've just said commit. (Separately implies multiple commits, which isn't necessary unless you want to save each model and its outputs... which may actually still be useful. 1. Run multiple experiments 2. Save each in a separate branch commit 3. Collate metrics and use them to delete most branches. No clear advantage of this over multiple dirs. Maybe if you want to save the 2 best models on two different branches which will then fork?)

I think the rest of my comment dealt with the multi-dir, single-commit approach anyway (which as I understand is what you also intended).

@dmpetrov
Member Author

Yeah :) Sorry, I put the description in a very abstract form so as not to push toward any particular solution. This abstract form leaves a lot of room for different interpretations, which is probably the root cause of the misunderstanding. To be clear, I don't see any other solution besides dirs yet, but it would be great if we could consider other options.

I definitely want to give the ability to commit the results (both metrics and runs), but not necessarily all the results (it's up to the user).

I think the rest of my comment dealt with the multi-dir, single-commit approach anyway (which as I understand is what you also intended).

👍

@Suor
Contributor

Suor commented Nov 18, 2019

Preferably in a single commit to keep the Git history clean.

Doesn't sound clean to me. It would be a very messy commit, and if the experiments involve code changes it would be far easier to have a commit for each; that way you can git checkout it.

Additionally, if we have a Git commit for each experiment we want to save, then it would be very easy to save the associated artifacts too.

Not useful experiments should be removed with all the code and data artifacts that were created

We might simply dvc run --no-commit, and there would be no need to gc anything in the end.

Parallel. ... which aligns with DVC parallel execution plans:

Not necessarily. If we make a dir copy for each experiment, then that would be a different DVC repo, and we won't need any parallel processing for a single repo.

@pared
Contributor

pared commented Nov 18, 2019

What worries me the most is the weight of the project. If we decide to go with the dir approach, we need to either make a repo copy for each experiment or somehow link/use DVC-controlled artifacts from the original repo. I think a copy is fine for the first version, but later we need to come up with something that does not duplicate our artifacts. That would probably align with the parallel-execution plans too.

@casperdcl
Contributor

casperdcl commented Nov 18, 2019

About the whole dir copy thing... The bash loop + papermill workflow I gave as an example (granted only works for python notebooks) would create one dir per test, and said dir would only contain a notebook with one different parameter cell, as well as potentially some outputs. All notebooks would use (i.e. import) the same code and data from the root directory. And all you'd need to commit is the bash loop script & the metrics files from the output directories in order to reproduce/track what happened. May need random.seed(1337) or similar to reproduce identically but you get the idea.

My main concern is this all seems very language-, code layout-, and OS-specific and best left to the user to figure out. I think it would be helpful if we gave a concrete example of how dvc could assist in a workflow (e.g. this dummy C++ program training on this MNIST data on linux with a bash script subbing in (or passing in via CLI params) these 10 different params for 10 output dirs, running 2 jobs at a time, outputting metrics.csv, etc...)

I feel like trying to create an app to automate this process in generic scenarios is a bit like trying to create an app to help people use a computer. Sounds more like a manual/course than a product.

@dmpetrov
Member Author

dmpetrov commented Nov 19, 2019

Preferably in a single commit to keep the Git history clean.

Doesn't sound like clean to me. It would be a very messy commit and if that experiments involve code changes it would be way easier to have a commit for each, this way you can git checkout it.

Additionally, if we have a git commit for each experiment we want to save then it would be very easy to save associated artifacts too.

@Suor you are right, but ideally, it should be a user's choice - some folks are very against 50 commits and it would be great to provide some options to avoid this (if we can :) ).

In the dir-per-experiment paradigm, all the experiments might be easily saved in a single commit with all the artifacts (changed files and outputs) since they are separated. What do you think about this approach?

ADDED:

We might simply dvc run --no-commit, and no need to gc anything in the end.

Yeap. An additional, experiment-specific option might be helpful, like dvc repro --exp tune_lr

Parallel. ... which aligns with DVC parallel execution plans:

Not necessarily, if we make a dir copy for each experiment, than that would be a different dvc repo, and we won't need any parallel processing for single repo.

First, it looks like we have somewhat different opinions regarding implementation. I assume that we copy all the artifacts into an experiment dir, which gives us the ability to commit experiments (one by one or in bulk), while you assume that we clone the repo into a dir. We can discuss the pros and cons of these methods. I won't be surprised if we find more options.

Thus, it depends on the implementation. If it is a separate repo as a dir, then we cannot commit it in the main repo. In that case you are right: separate commits will be required.

If we run in a separate dir with no cloning (just copying and instantiating data artifacts), then parallel-run support might be required.

@dmpetrov
Member Author

we need to either make a repo copy for each experiment or somehow link/use dvc controlled artifacts

@pared you are right. I don't think we can afford to make a copy of the data artifacts. So there is only one option, the most complicated one, unfortunately.

@dmpetrov
Member Author

My main concern is this all seems very language-, code layout-, and OS-specific and best left to the user to figure out.

Exactly. A notebook is kind of a language-specific thing. I'd suggest building a language-agnostic version first, based on config-file or code-file changes: copy all the code into a dir, instantiate all the data files, and run an experiment. Later we can introduce something more language/notebook specific.

I think it would be helpful if we gave a concrete example of how dvc could assist in a workflow

Totally! We definitely need an example. This issue was created to initiate the discussion and collect the initial set of requirements. But the development process of MVP should be example-driven.

I feel like trying to create an app to automate this process in generic scenarios is a bit like trying to create an app to help people use a computer. Sounds more like a manual/course than a product.

I see this as an attempt to help users use one of the "best practices" - save all the experiments (in dirs :) ) and compare the results.

@jorgeorpinel
Contributor

What about trying to automatically generate a Git submodule for experiments? 1. Somehow mark code or data files as "under experimentation". 2. Watch those files and make a commit every time they're written (similar to IPython Notebook checkpoints) 3. Tell DVC to stop watching this experiment.

And do we have any ideas on what the interface would look like? Another command, a separate tool, a UI?

a big portion of data scientists (including ex-developers)... prefer to create 50 dirs instead of 50 commits

If this is the case, perhaps a file-linking system or a UI that shows the user a growing set of virtual dirs simultaneously, one per experiment. Either based on the single-commit, multiple-dir strategy, the Git submodule, or something else.

@pared
Contributor

pared commented Nov 19, 2019

@dmpetrov Do you think we could restrict (at least in the beginning) the experiments feature to systems where linking is possible? That would eliminate the risk of experiments eating up disk space. Also, in that case the implementation does not seem too hard. We would just need to create a repo with a default *-link cache type and point the cache to the master project's cache.
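The link idea can be sketched in a few lines. This illustrates only the mechanism (the paths and content-addressed cache layout are made up; real DVC picks hardlink/symlink/reflink/copy according to the configured cache.type):

```python
import os
import tempfile

def link_from_cache(cache_path, workspace_path):
    """'Check out' a cached artifact into a workspace via a hardlink,
    so no data is duplicated on disk. A real implementation would fall
    back to reflink/symlink/copy depending on filesystem support."""
    os.makedirs(os.path.dirname(workspace_path), exist_ok=True)
    os.link(cache_path, workspace_path)

cache = tempfile.mkdtemp()
exp = tempfile.mkdtemp()
blob = os.path.join(cache, "a3", "12b4")  # made-up cache entry path
os.makedirs(os.path.dirname(blob))
with open(blob, "w") as f:
    f.write("data")

link_from_cache(blob, os.path.join(exp, "data.csv"))
# Same inode, so the experiment's copy costs no extra space.
print(os.path.samefile(blob, os.path.join(exp, "data.csv")))
```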

@dmpetrov
Member Author

What about trying to automatically generate a Git submodule for experiments?

@jorgeorpinel Maybe I didn't get the idea, but a Git submodule means a separate repo. So we end up having 50 Git repositories instead of 50 commits. It looks like an even heavier approach than what we currently have.

And do we have any ideas on what the interface would look like? Another command, a separate tool, a UI?

Initially, the command-line one. I see that as part of repro. Something like vi config.yaml && dvc repro --exp tune_lr, which would create a dir with the changed files and new outputs.

@jorgeorpinel
Contributor

No, just one submodule with a single copy of the source code, and 50 commits in it. Although now that I think about it, it's similar to just making a branch, and the latter is probably easier...

@dmpetrov
Member Author

@pared No restriction is needed. We should use the link type that was specified by the user. My point is: we cannot create a copy if the user prefers reflinks. Also, I don't think we need to create any repo. Experiments should work in an existing repo.

@pared
Contributor

pared commented Nov 20, 2019

@dmpetrov I agree that experiments should work in an existing repo. What I had in mind by "creating the new repo" was that we would store each experiment as a "copy" of the current repo in some special directory, like .dvc/exp/tune_lr_v1 and so on. Are we on the same page here? Or do you imagine it differently?

@dmpetrov
Member Author

@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in the project root dir: tune_lr_v1/.

@alexvoronov

By the way, I have not seen anyone mention MLflow. I haven't tried it myself yet, but the description promises to manage the ML lifecycle, including experimentation and reproducibility. How did they solve this issue? Any chance to just integrate/build on top of that or some other similar tool? Or an API for integrating third-party ML lifecycle tools?

@Suor
Contributor

Suor commented Nov 21, 2019

In the dir-per-experiment paradigm, all the experiments might be easily saved in a single commit with all the artifacts (changed files and outputs) since they are separated. What do you think about this approach?

I thought of those dirs as copies of a git/dvc repo. So if you commit its state, probably to a branch, as a separate commit, you can access all the artifacts easily. It will work with gc seamlessly, and so on. A copied DVC repo also retains all the functionality: you may cd into it and explore it. You can diff it against the original with any dir-diff tool, like meld. These copies are supposed to share the cache and use some lightweight links.

Do you suggest a copy of everything in a subdir, but still being the same git/dvc repo? And then committing the whole structure. Not sure how this will work, but I haven't thought about that much.

And, yes if that is the same repo you most probably need parallelized runs.

The thing is with subdirs in a single repo, we can't refer to different versions of an artifact by changing rev, we will also need to change path. And those paths won't be consistent between revs. This might be an issue or not.

Also, how do you mainline some experiment then? Do we need some specific dvc command for that?

If we run in a separate dir with no cloning (just copying and instantiating data artifacts), then parallel-run support might be required.

Checking out artifacts is an issue both implementations have. We can simply check out artifacts for a new copy if we use fast links. But if we use copies, we might want to make some lightweight links to the already-checked-out copies in the original dir. This could be ignored, or at least wait for a while, though.

We have @slow_link_guard to at least keep people informed about that.

What about trying to automatically generate a Git submodule for experiments?

I don't see any advantage of a git submodule over a simple clone. Why should we complicate this?

Initially, the command line one. I see that as part or repro. Line vi config.yaml && dvc repro --exp tune_lr - will create a dir with changed files and new outputs.

I see that a basic building block is creating a dir copy (a clone or just a copy) and checking out artifacts there. Maybe cd there. Then a user may do whatever he/she wants inside:

dvc exp tune_some_thing
dvc repro some_stage.dvc

cd ../..
cd exp/tune_some_thing

# later
vim ...
dvc repro some_stage.dvc

Or maybe it's ok to bundle it from the start, like @dmitry envisions. Not sure --exp under repro or a separate command is better:

dvc experiment <experiment-name> some_stage.dvc
# or
dvc exp <experiment-name> some_stage.dvc
# or even
dvc try <experiment-name> some_stage.dvc  

We will need commands to manage all these, probably. If these are just dirs then we can commit everything as is, which is a plus. But we will still need something to diff, compare metrics, mainline an experiment.

Since these are just dirs (and clones are mostly dirs too) we get some of these for free, which I like a lot:

meld . exp/tune_some_thing  # compare dirs
rm -rf exp/tune_some_thing  # discard experiment
cp -r exp/tune_some_thing . # mainline, not sure this one is correct
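The "compare dirs" part of that free functionality translates directly to code as well; a toy sketch with Python's filecmp, using made-up experiment contents (meld shows the same information interactively):

```python
import filecmp
import os
import tempfile

# Fake a workspace and one experiment dir that differ only in config.yaml.
a = tempfile.mkdtemp()
b = tempfile.mkdtemp()
for d, lr in ((a, "0.1"), (b, "0.01")):
    with open(os.path.join(d, "config.yaml"), "w") as f:
        f.write(f"lr: {lr}\n")
    with open(os.path.join(d, "train.py"), "w") as f:
        f.write("print('train')\n")

cmp = filecmp.dircmp(a, b)
print(cmp.diff_files)  # files present in both dirs but with different content
print(cmp.same_files)  # files identical in both dirs
```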

@casperdcl
Contributor

Regarding the MLflow idea: it looks like an augmented conda env.yml file which supports tracking input CLI params, and you need to use their Python API for logging outputs/results/metrics.

They do have a nice web UI for visualising said logs, though.

@jorgeorpinel
Contributor

jorgeorpinel commented Nov 21, 2019

I don't see any advantage of a git submodule over a simple clone.

The thing is that if you clone a Git repo inside a Git repo and add it, Git just ignores the inner repo's contents. I think it stages a dummy empty file with the name of the embedded repo's dir. So we may be forced to use submodules, depending on the specific needs. Here's Git's output when you clone a repo inside a repo and stage it:

warning: adding embedded git repository: {INNER_REPO}
hint: You've added another git repository inside your current repository.
hint: Clones of the outer repository will not contain the contents of
hint: the embedded repository and will not know how to obtain it.
hint: If you meant to add a submodule, use:
hint: 
hint: 	git submodule add <url> {INNER_REPO}
hint: 
hint: If you added this path by mistake, you can remove it from the
hint: index with:
hint: 
hint: 	git rm --cached {INNER_REPO}
hint: 
hint: See "git help submodule" for more information.

@dmpetrov
Member Author

I thought of those dirs as copies of a git/dvc repo.

@Suor it looks like we are on the same page with that.

Do you suggest a copy of everything in a subdir, but still being the same git/dvc repo? And then committing the whole structure. Not sure how this will work, but I haven't thought about that much.

And, yes if that is the same repo you most probably need parallelized runs.

Right. Yes, I think we should consider this subdir-in-the-same-repo option. It allows a user to commit many subdirs in a single commit, or just remove subdirs using a regular rm -rf tune_lr_v1/.

The thing is with subdirs in a single repo, we can't refer to different versions of an artifact by changing rev, we will also need to change path. And those paths won't be consistent between revs. This might be an issue or not.

If you copy the whole structure and change the paths in the dvc-files, it should not be an issue, except in cases where an absolute path was used, like /Users/dmitry/src/myproj/file.txt. I don't think we should care about that case.
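The absolute-vs-relative distinction can be illustrated with a tiny helper (hypothetical, not part of DVC): relative paths inside a copied dvc-file stay valid because they are resolved against the file's own location, while absolute paths would need re-rooting.

```python
import os

def rebase_path(path, old_root, new_root):
    """Return `path` unchanged if relative; re-root it if absolute."""
    if not os.path.isabs(path):
        return path  # still correct relative to the copied dvc-file
    rel = os.path.relpath(path, old_root)
    return os.path.join(new_root, rel)

print(rebase_path("data/file.txt", "/proj", "/proj/tune_lr_v1"))
print(rebase_path("/proj/data/file.txt", "/proj", "/proj/tune_lr_v1"))
```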

Also, how do you mainline some experiment then? Do we need som specific dvc command for that?

🤷‍♂️ The option I like the most so far: dvc repro --exp tune_lr_v1. A separate command is fine too: dvc exp tune_lr_v1

We can simply checkout artifacts for a new copy if we use fast links. But if we use copy we might want to make some lightweight links to already checked out copies in the original dir.

I don't think we need to invent something new here. We should use the same data-file linking strategy as specified in the repo. From the file-management point of view, the experiment subdirs play the same role as branches and commits and should use the same strategy.

Or maybe it's ok to bundle it from the start, like @dmitry envisions. Not sure --exp under repro or a separate command is better:

Yeah. I'd prefer to create and execute an experiment as a single, simple command. No matter if it is repro or a dedicated one.

Since these are just dirs (and clones are mostly dirs too) we get some of these for free, which I like a lot:

meld . exp/tune_some_thing  # compare dirs
rm -rf exp/tune_some_thing  # discard experiment
cp -r exp/tune_some_thing . # mainline, not sure this one is correct

Exactly! We will get a lot of stuff for free. More than that, it should align well with data scientists' intuition of creating dirs for experiments.

The last command (cp -r exp/...) won't work, unfortunately; we might need a new command, dvc exp propagate exp/tune_some_thing (propagating to the current dir by default).

@dmpetrov
Member Author

Any chance to just integrate/build on top of that or some other similar tool? Or an API for integrating third-party ML lifecycle tools?

@alexvoronov the integration itself is a good idea. Unfortunately, DVC experiments cannot be built on top of MLflow because MLflow has a different purpose and focuses on metrics visualization. But the visualization part can be nicely implemented on top of existing solutions. There are a few more MLflow analogs: Weights & Biases, Comet.ml, and others. It would be great to create a unified integration with these tools.

@casperdcl made a good point about conda env.yml. It might be another integration.

We should definitely keep the UI and visualization in mind, but I would not start with that.

@pared
Contributor

pared commented Nov 28, 2019

Yeah. I'd prefer to create and execute an experiment as a single, simple command. No matter if it is repro or a dedicated one.

@dmpetrov how would this work? What I have in mind is:

  • doing some changes in my repo, not necessarily committing them
  • running dvc repro --exp tune_lr train_model.dvc
  • dvc takes care of creating an experiment directory, moving everything there, and then running it

What I don't like about incorporating experiments into repro is that it assumes we want to run the experiment. Will that always be the case?

What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just dvc experiment run tune_lr tune_lr_v1 tune_lr_v2, go home and get back to finished tasks?
I think the experiment should be a separate command that has three main steps:

  • create a directory with an experiment
  • run the experiment(s)
  • choose experiment(s) which you would like to preserve

The first two could be joined into one with some flag, like dvc experiment --run tune_lr


I want to get back to creating the experiment directory:

@pared yes, it is very likely we will need to store a copy of the current repo. It might be directly in project root dir tune_lr_v1/.

I think it should not be in the project root dir.

  • In the case of several dozen experiments, the root dir will look terrible
  • removing unwanted experiments will be hell if someone uses "creative" naming

If experiments live in a dedicated directory (.experiments, .dvc/exp, or whatever):

  • easy to ignore in both Git and DVC
  • finished with experimenting? no problem, just rm -rf .experiments

@dashohoxha
Contributor

@pared first of all let me make the disclaimer that I have not followed this discussion very carefully and I am not sure that I understand all the ideas presented here. So, it is quite possible that I don't know what I am talking about.

What if I want to prepare a few experiment "drafts" by editing my repo, and then, at the end of the day, just dvc experiment run tune_lr tune_lr_v1 tune_lr_v2, go home and get back to finished tasks?
I think the experiment should be a separate command that has three main steps:

  • create a directory with an experiment
  • run the experiment(s)
  • choose experiment(s) which you would like to preserve

Using a command like dvc experiment ... seems interesting to me.
@pared is it possible to show with a simple bash script or with a simple example what the command dvc experiment create ... is supposed to do? Or is it possible to explain how we could do this manually without using dvc experiment?

If experiments will be in dedicated directory (.experiments, .dvc/exp or whatever)

If these experiments are going to be managed transparently (meaning that the users only use dvc experiment ... to manage them, don't touch them manually), then it seems a good idea to use something like .dvc/experiments/.

@pared
Contributor

pared commented Nov 29, 2019

@dashohoxha
I will try to explain first; if that's not enough, I can try to prepare a draft:

  1. The user makes some changes inside the repo to adjust the repo state to their experiment (e.g., change fully connected model code to a CNN in some image-recognition project, change the number of layers, change the learning-rate adjustment algorithm). These will probably not be committed to the current branch, but that's up for further discussion.

  2. The user runs dvc experiment create {ename}; dvc copies the current repo state to .dvc/experiment/{ename} and links the artifacts properly.

  3. The user can run the experiment with another command.

  4. There is a set of commands for managing experiments (choose "the winner" and move it to the current repo, or choose a few "winners" and [for example] make a branch from each one).

So, in a few words, experiment create would be an advanced cp . .dvc/experiment/{ename}.
What do you think about that?
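A rough sketch of that "advanced cp", assuming the .dvc/experiment layout from the discussion above (the helper name and skip list are made up; a real version would also link data artifacts from the shared cache instead of copying them):

```python
import os
import shutil
import tempfile

def experiment_create(repo_root, ename):
    """Snapshot the working tree into .dvc/experiment/{ename}, skipping
    .git and .dvc (both copies would keep using the shared .dvc/cache)."""
    dest = os.path.join(repo_root, ".dvc", "experiment", ename)
    shutil.copytree(
        repo_root,
        dest,
        ignore=shutil.ignore_patterns(".git", ".dvc"),
    )
    return dest

repo = tempfile.mkdtemp()
os.makedirs(os.path.join(repo, ".dvc", "cache"))
with open(os.path.join(repo, "train.py"), "w") as f:
    f.write("print('train')\n")

exp = experiment_create(repo, "tune_lr_v1")
print(os.path.exists(os.path.join(exp, "train.py")))
```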

@dashohoxha
Contributor

So in few words experiment create would be advanced cp . .dvc/experiment/{ename}.

So, basically you want to clone all the data and DVC-files to an experiment directory, which can use the same cache (.dvc/cache/) as the main project. With a deduplicating/reflink filesystem this should work.

It is not clear whether you modify the pipeline (or the parameters) of an experiment before or after you create it, and how you are going to do it.

[By the way, rsync might be a better option than cp in this case, but this is not relevant to the discussion.]

@pmrowla
Contributor

pmrowla commented Oct 20, 2020

Closed, since experiments functionality has been implemented internally (but not yet released/finalized). At this point it will be more useful for specific experiments-related issues to be handled in their own tickets.

UPDATE: See https://github.com/iterative/dvc/wiki/Experiments

@efiop efiop unpinned this issue Oct 21, 2020
@jorgeorpinel
Contributor

Hi! Question on this @pmrowla, did dvc exp end up interacting with the run-cache at all? Or is there any way to turn a cached run into a checkpoint/experiment? Thanks

@pmrowla
Contributor

pmrowla commented Feb 4, 2021

Hi! Question on this @pmrowla, did dvc exp end up interacting with the run-cache at all? Or is there any way to turn a cached run into a checkpoint/experiment? Thanks

exp run uses run cache in the same way as repro. So if stages that would be run for any checkpoint/experiment are already cached, we will use the cached version unless exp run -f is used.

@jorgeorpinel
Contributor

jorgeorpinel commented Feb 4, 2021

Sounds good. I assume then there's no way to explore/pick run-caches and "promote" them into registered experiments as of now. I.e. they're basically separate features.

@pmrowla
Contributor

pmrowla commented Feb 5, 2021

Yeah, they are really two separate features.
