exp run: Support composing and dumping Hydra config. #8093

daavoo · 2022-08-04T11:36:28Z

The feature will be used depending on whether config.hydra.enabled is True or False.

Uses https://hydra.cc/docs/advanced/compose_api/ to build the config and dump it to params.yaml. The content of the output file will be overwritten.

config.hydra.config_dir and config.hydra.config_name can be used to customize the values passed to the APIs used from hydra (hydra.initialize_config_dir and hydra.compose).

Can be combined with --set-param overrides.

Closes #8082

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

daavoo · 2022-08-04T11:37:32Z

@dberenbaum please take a look

dberenbaum · 2022-08-06T19:51:06Z

Nice work @daavoo! I really like how it achieves two important product goals:

Makes DVC easier and more flexible for complex configuration.
Makes it easier to track final parameters for Hydra users.

I came across two big questions when trying it:

Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop 🤔.
Is this the simplest/best way to handle custom files (not params.yaml)? The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users. Maybe instead of hydra.output_file, we need an option to configure the default params file (see configurable default params.yaml (or templating entire pipelines) #7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

More minor questions:

Should it work for repro 🤔 ? If not, we should be able to explain.
How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?

Interesting side effect: dvc exp run -S parameters persist to the next experiment by default but Hydra will reset to default each experiment. I prefer the Hydra behavior and wonder how we can simplify the experience enough to make it the default (separate from this PR). 🤔

daavoo · 2022-08-08T10:56:28Z

Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop 🤔.

I thought about merging but found that it would be confusing when combined with -set-param to determine which arguments are meant to be passed to hydra and which are meant to override the current status of the output_file.

Perhaps it would be better to clearly establish that hydra output_file will be overwritten on each exp run invocation.
Users can move the hardcoded values of output_file to hydra.config_name and the result would be the same as merging.

daavoo · 2022-08-08T11:59:42Z

Should it work for repro 🤔 ? If not, we should be able to explain.

It should not, because of the same reason --set-param only works for exp run. We are creating/editing a file and it's required to track the changes with git before running the pipeline

daavoo · 2022-08-08T12:12:25Z

How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?

Did you find any particular issue?

I copy-pasted the structure of the link to multiconf folder.

[hydra]
    output_file = params.yaml
    config_dir = multiconf

# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - train.py
    params:
    - params.yaml:

# train.py
import dvc.api

def my_app() -> None:
    print(dvc.api.params_show())

if __name__ == "__main__":
    my_app()

$ dvc exp run
Running stage 'train':                                                
> python train.py
{'server': {'site': {'fb': {'domain': 'facebook.com'}, 'google': {'domain': 'google.com'}}, 'host': 'localhost', 'port': 443}}

$ dvc exp run -S 'server/site=[google,amazon]'
Running stage 'train':                                                
> python train.py
{'server': {'site': {'google': {'domain': 'google.com'}, 'amazon': {'domain': 'amazon.com'}}, 'host': 'localhost', 'port': 443}}

daavoo · 2022-08-08T13:00:30Z

Is this the simplest/best way to handle custom files (not params.yaml)?

My main motivation for going with hydra.output_file was to make it explicit, as I consider the composing and dumping a somehow aggressive operation. In addition, the compose and dump should also happen without --set-param, so users should have a way of toggling the behavior.

The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users.

Is there something specific that resulted obscure to you? I am not sure how much it comes from --set-param defaulting to params.yaml when no args and how much by adding the hydra.output_file.

What do you think about having a special key for --set-param named hydra (dvc exp run --set-param hydra:foo=1), which would explicitly indicate that those are args to be passed for the hydra compose and dump feature??
This could also alleviate #8093 (comment) .

Maybe instead of hydra.output_file, we need an option to configure the default params file (see #7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

Not sure I like the idea of entangling the hydra compose and dump with the DVC defaults parameters file.

dberenbaum · 2022-08-08T17:01:27Z

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

daavoo · 2022-08-08T17:23:42Z

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

daavoo · 2022-08-09T08:12:16Z

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

Another issue with configuring at the .dvc level is that it can be impractical in scenarios with multiple dvc.yaml (i.e. monorepo) unless dvc init --subdir is used.

daavoo · 2022-08-09T09:00:58Z

Some ideas for enabling this at the stage level:

Consider hydra a reserved key inside params.

If hydra is present in params of a stage, trigger the compose and dump functionality.

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra

Q) How to configure?

Something like the following might look appealing but completely change the semantics / internal logic of params:

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra:
            output_file: params.yaml
            config_dir: conf

Introduce callbacks?

There were old discussions about this. But basically, the hydra compose and dump could be implemented/understood as a stage callback:

stages:
  train: 
    cmd: python train.py ${train_config}
    callbacks:
      - hydra:
          output_file: params.yaml
          config_dir: conf
    params:
        - train_config

dberenbaum · 2022-08-09T20:59:20Z

Pros and cons of the dvc.yaml approach:

Pros

More granular control over where and how to use Hydra. For example, I could use the same hydra config_dir in two stages with different output_files for each and specify different --set-param values for each.
More transparent to other users that Hydra is being used.

Cons

If reusing the same config across stages or dvc.yaml files, it becomes repetitive (is this what you meant by "impractical in scenarios with multiple dvc.yaml? @daavoo?). For example, if I want to use the same hydra config across many stages, I need to add the hydra-specific section to each one.
Can't be used outside of stages (such as in foreach or global vars). This restricts its usage and makes it hard to configure for complex scenarios, which is where Hydra is needed most.
Harder to run without Hydra (more tightly coupled).

IMO the most likely and useful way to integrate Hydra and DVC is to have a massive Hydra config dir that gets reused throughout the project. For existing Hydra users, they are more likely to have a single Hydra config dir, and it's less work to integrate Hydra if it doesn't have to be configured at each stage in the pipeline. For that scenario, configuring at the project level in .dvc/config seems better, but we can also try to get feedback from users here.

dberenbaum · 2022-08-12T17:21:48Z

For future reference, I set up https://github.com/dberenbaum/complex_config_example when I was testing this out 😅 (edit: main branch uses current dvc functionality, hydra branch uses this PR)

daavoo · 2022-08-17T14:22:22Z

Latest update:

config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.
By default, composed config will be dumped to params.yaml.
config.hydra.output_file is now optional and can be used for changing the default behavior.
Contents of output_file will be overwritten.

dberenbaum · 2022-08-19T15:40:17Z

By default, composed config will be dumped to params.yaml.
config.hydra.output_file is now optional and can be used for changing the default behavior.
Contents of output_file will be overwritten.

I'm worried this will confuse more people than it helps. Once set, it seems hard to explain how it works and clunky to have to use -S custom_params.yaml:... for all Hydra overrides. If you feel it's needed, maybe you have a better idea of how to clearly explain the behavior?

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?

daavoo · 2022-08-19T15:45:28Z

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?

I don't have a specific use case in mind. I guess I was trying to cover most scenarios.
I can remove the option from the P.R. and wait for people to ask (if ever happens)

d-miketa · 2022-08-20T09:49:39Z

Hi and thank you for working on this! As a user of both Hydra and DVC I thought it might be helpful to contribute a realistic use case. It may already be handled well by the current implementation, I don’t know.

Hydra’s great for modular projects where interchangeable modules may have heterogeneous configuration options. For example, I may want to use either a random forest or a multi-layer perceptron as my model architecture. Its two config files may look like:

# config_dir/model/rf.yaml

_target_: my_package.models.RandomForest
tree_depth: 3

# config_dir/model/mlp.yaml

_target_: my_package.models.MLP
n_hidden: 128

I’d like to now run a sequence of experiments.

RF with tree_depth=4.
MLP with n_hidden=192
RF with tree_depth=4, again (I forgot about running it earlier, say)

The final experiment should be immediately recovered from cache but experiment 2 may have overwritten the value of model.n_hidden inside whatever file’s tracking DVC params. So while 1 was run with model.tree_depth=4 model.n_hidden=128, 3 will be instantiated with model.tree_depth=4 model.n_hidden=192. But the two settings describe fundamentally the same experiment, since n_hidden is not used by RF at all.

Is this case correctly handled by your parameter storage?

daavoo · 2022-08-22T09:58:24Z

Is this case correctly handled by your parameter storage?

@d-miketa I think so, with proper setup. I am making some assumptions here based on what you shared (like the use of hydra.utils.instantiate).

Here is the setup:

`conf` directory

$ tree conf
conf
├── config.yaml
└── model
    ├── mlp.yaml
    └── rf.yaml
$ cat conf/config.yaml 
defaults:
  - model: rf
$ cat conf/model/mlp.yaml 
_target_: models.MLP
n_hidden: 128                                                    
$ cat conf/model/rf.yaml  
_target_: models.RandomForest
tree_depth: 3

`models/__init__.py`

@dataclass
class MLP:
    n_hidden: int

@dataclass
class RandomForest:
    tree_depth: int

`train.py`

import dvc.api
from hydra.utils import instantiate

def my_app() -> None:
    model = instantiate(dvc.api.params_show()["model"])
    print(model)

if __name__ == "__main__":
    my_app()

`dvc.yaml`

stages:
  train:
    cmd: python train.py
    params:
    - params.yaml:

So, with that, I run:

dvc config hydra.config_dir conf

$ dvc exp run -S model.tree_depth=4 
...
> python train.py
RandomForest(tree_depth=4)  
...

$ dvc exp run -S model=mlp -S model.n_hidden=192 
...
> python train.py
MLP(n_hidden=192) 
...

$ dvc exp run -S model.tree_depth=5
...
> python train.py
RandomForest(tree_depth=5)  
...

And:

$ dvc exp show
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                Created    model._target_        model.tree_depth   model.n_hidden  
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                 -          models.RandomForest   5                  -               
  master                    11:47 AM   models.RandomForest   3                  -               
  ├── ea452af [exp-bf727]   11:48 AM   models.RandomForest   5                  -               
  ├── bb97bdf [exp-c8cde]   11:48 AM   models.MLP            -                  192             
  └── 730ecec [exp-5b346]   11:48 AM   models.RandomForest   4                  -               
 ──────────────────────────────────────────────────────────────────────────────────────────────

dberenbaum · 2022-08-27T12:59:30Z

config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.

Sorry to come back to this so late, but I'm still slightly confused.

hydra.config_dir acts as both a way to turn on the feature and to set the dir. However, hydra.config_name is optional, has a default value, and doesn't impact whether the feature is turned on.

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:

Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).
Requiring both flags to be set to turn on the behavior, and raising a helpful error if only one is set.

daavoo · 2022-08-29T08:41:33Z

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:
- Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).

I think this one option has the cleanest UX. Added config.hydra.enabled and made config_dir and config_name both optional with defaults

d-miketa · 2022-08-29T11:44:30Z

@daavoo Ahhh that’s amazing! So good to see the integration’s naturally taking care of this problem - you clearly made some sound architectural choices. 😁

dberenbaum · 2022-08-29T12:03:05Z

@iterative/dvc Could someone review please?

efiop · 2022-08-29T13:19:15Z

@dtrifiro @skshetry Could you both take a look, please?

skshetry · 2022-08-29T14:30:53Z

@daavoo, I have rebased it to main, and added some skips to the tests that does not work in 3.11.

dtrifiro · 2022-08-31T12:51:57Z

dvc/utils/hydra.py

+        overrides: List of `Hydra Override`_ patterns.
+
+    .. _Hydra Override:
+        https://hydra.cc/docs/next/advanced/override_grammar/basic/


This link is dead?

https://hydra.cc/docs/advanced/override_grammar/basic/

dvc/repo/experiments/queue/base.py

The feature will be used depending on whether `config.hydra.enabled` is True or False. Uses `hydra.initialize_config_dir` and `hydra.compose` (from https://hydra.cc/docs/advanced/compose_api/) to build the config and dump it to `params.yaml`. The content of the output file will be overwritten. `config.hydra.config_dir` and `config.hydra.config_name` can be used to customize the values passed to `hydra.initialize_config_dir` and `hydra.compose`. Can be combined with `--set-param` overrides. Closes #8082

daavoo requested a review from dberenbaum August 4, 2022 11:37

daavoo force-pushed the hydra-compose branch 2 times, most recently from 32b94a2 to 0c74211 Compare August 4, 2022 12:07

daavoo force-pushed the hydra-compose branch from c78a361 to c999e73 Compare August 8, 2022 10:47

daavoo force-pushed the hydra-compose branch from c999e73 to d0fdb6f Compare August 8, 2022 11:58

daavoo added A: experiments Related to dvc exp A: params Related to dvc params feature is a feature labels Aug 9, 2022

dberenbaum mentioned this pull request Aug 9, 2022

ref: Hydra support for exp run --set-param iterative/dvc.org#3834

Merged

daavoo force-pushed the hydra-compose branch 2 times, most recently from 5d61212 to a97966e Compare August 17, 2022 14:11

daavoo marked this pull request as ready for review August 17, 2022 14:11

daavoo requested a review from a team as a code owner August 17, 2022 14:11

daavoo requested a review from dtrifiro August 17, 2022 14:11

daavoo self-assigned this Aug 18, 2022

daavoo force-pushed the hydra-compose branch 2 times, most recently from a97966e to b185339 Compare August 19, 2022 15:49

daavoo force-pushed the hydra-compose branch from b185339 to 657f2b6 Compare August 22, 2022 08:04

daavoo force-pushed the hydra-compose branch 2 times, most recently from 9cdb7f1 to a05e7fe Compare August 29, 2022 08:39

dberenbaum approved these changes Aug 29, 2022

View reviewed changes

skshetry force-pushed the hydra-compose branch from a05e7fe to 569740a Compare August 29, 2022 14:30

dtrifiro reviewed Aug 31, 2022

View reviewed changes

daavoo force-pushed the hydra-compose branch from 569740a to 2edd21e Compare August 31, 2022 13:26

dtrifiro reviewed Aug 31, 2022

View reviewed changes

dvc/repo/experiments/queue/base.py Show resolved Hide resolved

dtrifiro approved these changes Aug 31, 2022

View reviewed changes

daavoo enabled auto-merge (rebase) August 31, 2022 14:58

daavoo and others added 2 commits August 31, 2022 16:58

skip tests that depend on hydra.compose in 3.11

765376a

daavoo force-pushed the hydra-compose branch from 2edd21e to 765376a Compare August 31, 2022 14:58

daavoo merged commit 639eeed into main Aug 31, 2022

daavoo deleted the hydra-compose branch August 31, 2022 15:24

This was referenced Sep 12, 2022

Conditional parameters #8283

Closed

Hydra Composition iterative/dvc.org#3960

Merged

dberenbaum mentioned this pull request Dec 21, 2022

Minimal modifications to params.yaml when running --set-param #8720

Open

dberenbaum mentioned this pull request Jul 10, 2023

Advanced experiment configuration iterative/dvc.org#4670

Closed

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

exp run: Support composing and dumping Hydra config. #8093

exp run: Support composing and dumping Hydra config. #8093

daavoo commented Aug 4, 2022 •

edited

Loading

daavoo commented Aug 4, 2022

dberenbaum commented Aug 6, 2022

daavoo commented Aug 8, 2022 •

edited

Loading

daavoo commented Aug 8, 2022 •

edited

Loading

daavoo commented Aug 8, 2022

daavoo commented Aug 8, 2022

dberenbaum commented Aug 8, 2022

daavoo commented Aug 8, 2022 •

edited

Loading

daavoo commented Aug 9, 2022

daavoo commented Aug 9, 2022

dberenbaum commented Aug 9, 2022

dberenbaum commented Aug 12, 2022 •

edited

Loading

daavoo commented Aug 17, 2022

dberenbaum commented Aug 19, 2022

daavoo commented Aug 19, 2022

d-miketa commented Aug 20, 2022

daavoo commented Aug 22, 2022

dberenbaum commented Aug 27, 2022

daavoo commented Aug 29, 2022

d-miketa commented Aug 29, 2022 •

edited

Loading

dberenbaum commented Aug 29, 2022

efiop commented Aug 29, 2022

skshetry commented Aug 29, 2022

dtrifiro Aug 31, 2022

daavoo Aug 31, 2022

exp run: Support composing and dumping Hydra config. #8093

exp run: Support composing and dumping Hydra config. #8093

Conversation

daavoo commented Aug 4, 2022 • edited Loading

daavoo commented Aug 4, 2022

dberenbaum commented Aug 6, 2022

daavoo commented Aug 8, 2022 • edited Loading

daavoo commented Aug 8, 2022 • edited Loading

daavoo commented Aug 8, 2022

daavoo commented Aug 8, 2022

dberenbaum commented Aug 8, 2022

daavoo commented Aug 8, 2022 • edited Loading

daavoo commented Aug 9, 2022

daavoo commented Aug 9, 2022

dberenbaum commented Aug 9, 2022

dberenbaum commented Aug 12, 2022 • edited Loading

daavoo commented Aug 17, 2022

dberenbaum commented Aug 19, 2022

daavoo commented Aug 19, 2022

d-miketa commented Aug 20, 2022

daavoo commented Aug 22, 2022

dberenbaum commented Aug 27, 2022

daavoo commented Aug 29, 2022

d-miketa commented Aug 29, 2022 • edited Loading

dberenbaum commented Aug 29, 2022

efiop commented Aug 29, 2022

skshetry commented Aug 29, 2022

dtrifiro Aug 31, 2022

Choose a reason for hiding this comment

daavoo Aug 31, 2022

Choose a reason for hiding this comment

daavoo commented Aug 4, 2022 •

edited

Loading

daavoo commented Aug 8, 2022 •

edited

Loading

daavoo commented Aug 8, 2022 •

edited

Loading

daavoo commented Aug 8, 2022 •

edited

Loading

dberenbaum commented Aug 12, 2022 •

edited

Loading

d-miketa commented Aug 29, 2022 •

edited

Loading