Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp run: Support composing and dumping Hydra config. #8093

Merged
merged 2 commits into from
Aug 31, 2022
Merged

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented Aug 4, 2022

The feature will be used depending on whether config.hydra.enabled is True or False.

Uses https://hydra.cc/docs/advanced/compose_api/ to build the config and dump it to params.yaml. The content of the output file will be overwritten.

config.hydra.config_dir and config.hydra.config_name can be used to customize the values passed to the APIs used from hydra (hydra.initialize_config_dir and hydra.compose).

Can be combined with --set-param overrides.

Closes #8082


@daavoo daavoo requested a review from dberenbaum August 4, 2022 11:37
@daavoo
Copy link
Contributor Author

daavoo commented Aug 4, 2022

@dberenbaum please take a look

@daavoo daavoo force-pushed the hydra-compose branch 2 times, most recently from 32b94a2 to 0c74211 Compare August 4, 2022 12:07
@dberenbaum
Copy link
Collaborator

Nice work @daavoo! I really like how it achieves two important product goals:

  • Makes DVC easier and more flexible for complex configuration.
  • Makes it easier to track final parameters for Hydra users.

I came across two big questions when trying it:

  • Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop 🤔.
  • Is this the simplest/best way to handle custom files (not params.yaml)? The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users. Maybe instead of hydra.output_file, we need an option to configure the default params file (see configurable default params.yaml (or templating entire pipelines) #7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

More minor questions:

Interesting side effect: dvc exp run -S parameters persist to the next experiment by default but Hydra will reset to default each experiment. I prefer the Hydra behavior and wonder how we can simplify the experience enough to make it the default (separate from this PR). 🤔

@daavoo
Copy link
Contributor Author

daavoo commented Aug 8, 2022

Can we merge with the existing output file instead of overwriting? Seems confusing that the behavior differs with and without the hydra options enabled, but I guess this risks leaving in parameters that you wanted to drop 🤔.

I thought about merging but found that it would be confusing when combined with -set-param to determine which arguments are meant to be passed to hydra and which are meant to override the current status of the output_file.

Perhaps it would be better to clearly establish that hydra output_file will be overwritten on each exp run invocation.
Users can move the hardcoded values of output_file to hydra.config_name and the result would be the same as merging.

@daavoo
Copy link
Contributor Author

daavoo commented Aug 8, 2022

Should it work for repro 🤔 ? If not, we should be able to explain.

It should not, because of the same reason --set-param only works for exp run. We are creating/editing a file and it's required to track the changes with git before running the pipeline

@daavoo
Copy link
Contributor Author

daavoo commented Aug 8, 2022

How can I get it to work for https://hydra.cc/docs/patterns/select_multiple_configs_from_config_group/?

Did you find any particular issue?

I copy-pasted the structure of the link to multiconf folder.

[hydra]
    output_file = params.yaml
    config_dir = multiconf
# dvc.yaml
stages:
  train:
    cmd: python train.py
    deps:
    - train.py
    params:
    - params.yaml:
# train.py
import dvc.api

def my_app() -> None:
    print(dvc.api.params_show())

if __name__ == "__main__":
    my_app()
$ dvc exp run
Running stage 'train':                                                
> python train.py
{'server': {'site': {'fb': {'domain': 'facebook.com'}, 'google': {'domain': 'google.com'}}, 'host': 'localhost', 'port': 443}}
$ dvc exp run -S 'server/site=[google,amazon]'
Running stage 'train':                                                
> python train.py
{'server': {'site': {'google': {'domain': 'google.com'}, 'amazon': {'domain': 'amazon.com'}}, 'host': 'localhost', 'port': 443}}

@daavoo
Copy link
Contributor Author

daavoo commented Aug 8, 2022

Is this the simplest/best way to handle custom files (not params.yaml)?

My main motivation for going with hydra.output_file was to make it explicit, as I consider the composing and dumping a somehow aggressive operation. In addition, the compose and dump should also happen without --set-param, so users should have a way of toggling the behavior.

The relationship between hydra.output_file and --set-param seems a bit awkward and hard to wrap my head around. Seems hard to know when --set-param modifies my Hydra config vs. my params file. If I set hydra.output_file to custom.yaml, then --set-param key=val will modify params.yaml and not custom.yaml, which I think will confuse Hydra users.

Is there something specific that resulted obscure to you? I am not sure how much it comes from --set-param defaulting to params.yaml when no args and how much by adding the hydra.output_file.

What do you think about having a special key for --set-param named hydra (dvc exp run --set-param hydra:foo=1), which would explicitly indicate that those are args to be passed for the hydra compose and dump feature??
This could also alleviate #8093 (comment) .

Maybe instead of hydra.output_file, we need an option to configure the default params file (see #7939) and always output there? I think we should at least default to using params.yaml for hydra.output_file.

Not sure I like the idea of entangling the hydra compose and dump with the DVC defaults parameters file.

@dberenbaum
Copy link
Collaborator

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

@daavoo
Copy link
Contributor Author

daavoo commented Aug 8, 2022

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

@daavoo
Copy link
Contributor Author

daavoo commented Aug 9, 2022

@daavoo I haven't had a chance to look at your responses yet, but @dmpetrov asked why Hydra is configured in .dvc/config instead of at the pipeline/stage level inside dvc.yaml?

No strong reasons. I think the main reason was that I didn't know how it could be configured without requiring a new section or a change in the semantics of the existing sections. Wanted to add it in a way that could reuse the existing semantics.

Another issue with configuring at the .dvc level is that it can be impractical in scenarios with multiple dvc.yaml (i.e. monorepo) unless dvc init --subdir is used.

@daavoo
Copy link
Contributor Author

daavoo commented Aug 9, 2022

Some ideas for enabling this at the stage level:

  • Consider hydra a reserved key inside params.

If hydra is present in params of a stage, trigger the compose and dump functionality.

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra

Q) How to configure?

Something like the following might look appealing but completely change the semantics / internal logic of params:

stages:
  train: 
    cmd: python train.py ${train_config}
    params:
        - hydra:
            output_file: params.yaml
            config_dir: conf 
  • Introduce callbacks?

There were old discussions about this. But basically, the hydra compose and dump could be implemented/understood as a stage callback:

stages:
  train: 
    cmd: python train.py ${train_config}
    callbacks:
      - hydra:
          output_file: params.yaml
          config_dir: conf
    params:
        - train_config

@daavoo daavoo added A: experiments Related to dvc exp A: params Related to dvc params feature is a feature labels Aug 9, 2022
@dberenbaum
Copy link
Collaborator

Pros and cons of the dvc.yaml approach:

Pros

  • More granular control over where and how to use Hydra. For example, I could use the same hydra config_dir in two stages with different output_files for each and specify different --set-param values for each.
  • More transparent to other users that Hydra is being used.

Cons

  • If reusing the same config across stages or dvc.yaml files, it becomes repetitive (is this what you meant by "impractical in scenarios with multiple dvc.yaml? @daavoo?). For example, if I want to use the same hydra config across many stages, I need to add the hydra-specific section to each one.
  • Can't be used outside of stages (such as in foreach or global vars). This restricts its usage and makes it hard to configure for complex scenarios, which is where Hydra is needed most.
  • Harder to run without Hydra (more tightly coupled).

IMO the most likely and useful way to integrate Hydra and DVC is to have a massive Hydra config dir that gets reused throughout the project. For existing Hydra users, they are more likely to have a single Hydra config dir, and it's less work to integrate Hydra if it doesn't have to be configured at each stage in the pipeline. For that scenario, configuring at the project level in .dvc/config seems better, but we can also try to get feedback from users here.

@dberenbaum
Copy link
Collaborator

dberenbaum commented Aug 12, 2022

For future reference, I set up https://github.com/dberenbaum/complex_config_example when I was testing this out 😅 (edit: main branch uses current dvc functionality, hydra branch uses this PR)

@daavoo daavoo force-pushed the hydra-compose branch 2 times, most recently from 5d61212 to a97966e Compare August 17, 2022 14:11
@daavoo daavoo marked this pull request as ready for review August 17, 2022 14:11
@daavoo daavoo requested a review from a team as a code owner August 17, 2022 14:11
@daavoo daavoo requested a review from dtrifiro August 17, 2022 14:11
@daavoo
Copy link
Contributor Author

daavoo commented Aug 17, 2022

Latest update:

  • config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.
  • By default, composed config will be dumped to params.yaml.
    config.hydra.output_file is now optional and can be used for changing the default behavior.
    Contents of output_file will be overwritten.

@daavoo daavoo self-assigned this Aug 18, 2022
@dberenbaum
Copy link
Collaborator

  • By default, composed config will be dumped to params.yaml.
    config.hydra.output_file is now optional and can be used for changing the default behavior.
    Contents of output_file will be overwritten.

I'm worried this will confuse more people than it helps. Once set, it seems hard to explain how it works and clunky to have to use -S custom_params.yaml:... for all Hydra overrides. If you feel it's needed, maybe you have a better idea of how to clearly explain the behavior?

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?

@daavoo
Copy link
Contributor Author

daavoo commented Aug 19, 2022

We are only supporting a single Hydra output file, so why not use the default params file? If people don't want the path params.yaml, we can fix that in #7939. Is there a use case for having a different default parameters file and Hydra parameters file?

I don't have a specific use case in mind. I guess I was trying to cover most scenarios.
I can remove the option from the P.R. and wait for people to ask (if ever happens)

@daavoo daavoo force-pushed the hydra-compose branch 2 times, most recently from a97966e to b185339 Compare August 19, 2022 15:49
@d-miketa
Copy link
Contributor

Hi and thank you for working on this! As a user of both Hydra and DVC I thought it might be helpful to contribute a realistic use case. It may already be handled well by the current implementation, I don’t know.

Hydra’s great for modular projects where interchangeable modules may have heterogeneous configuration options. For example, I may want to use either a random forest or a multi-layer perceptron as my model architecture. Its two config files may look like:

# config_dir/model/rf.yaml

_target_: my_package.models.RandomForest
tree_depth: 3
# config_dir/model/mlp.yaml

_target_: my_package.models.MLP
n_hidden: 128

I’d like to now run a sequence of experiments.

  1. RF with tree_depth=4.
  2. MLP with n_hidden=192
  3. RF with tree_depth=4, again (I forgot about running it earlier, say)

The final experiment should be immediately recovered from cache but experiment 2 may have overwritten the value of model.n_hidden inside whatever file’s tracking DVC params. So while 1 was run with model.tree_depth=4 model.n_hidden=128, 3 will be instantiated with model.tree_depth=4 model.n_hidden=192. But the two settings describe fundamentally the same experiment, since n_hidden is not used by RF at all.

Is this case correctly handled by your parameter storage?

@daavoo
Copy link
Contributor Author

daavoo commented Aug 22, 2022

Is this case correctly handled by your parameter storage?

@d-miketa I think so, with proper setup. I am making some assumptions here based on what you shared (like the use of hydra.utils.instantiate).

Here is the setup:

`conf` directory
$ tree conf
conf
├── config.yaml
└── model
    ├── mlp.yaml
    └── rf.yaml
$ cat conf/config.yaml 
defaults:
  - model: rf
$ cat conf/model/mlp.yaml 
_target_: models.MLP
n_hidden: 128                                                    
$ cat conf/model/rf.yaml  
_target_: models.RandomForest
tree_depth: 3                                                       
`models/__init__.py`
@dataclass
class MLP:
    n_hidden: int

@dataclass
class RandomForest:
    tree_depth: int                                 
`train.py`
import dvc.api
from hydra.utils import instantiate

def my_app() -> None:
    model = instantiate(dvc.api.params_show()["model"])
    print(model)

if __name__ == "__main__":
    my_app()
`dvc.yaml`
stages:
  train:
    cmd: python train.py
    params:
    - params.yaml:

So, with that, I run:

dvc config hydra.config_dir conf
$ dvc exp run -S model.tree_depth=4 
...
> python train.py
RandomForest(tree_depth=4)  
...
$ dvc exp run -S model=mlp -S model.n_hidden=192 
...
> python train.py
MLP(n_hidden=192) 
...
$ dvc exp run -S model.tree_depth=5
...
> python train.py
RandomForest(tree_depth=5)  
...

And:

$ dvc exp show
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  Experiment                Created    model._target_        model.tree_depth   model.n_hidden  
 ────────────────────────────────────────────────────────────────────────────────────────────── 
  workspace                 -          models.RandomForest   5                  -               
  master                    11:47 AM   models.RandomForest   3                  -               
  ├── ea452af [exp-bf727]   11:48 AM   models.RandomForest   5                  -               
  ├── bb97bdf [exp-c8cde]   11:48 AM   models.MLP            -                  192             
  └── 730ecec [exp-5b346]   11:48 AM   models.RandomForest   4                  -               
 ────────────────────────────────────────────────────────────────────────────────────────────── 

@dberenbaum
Copy link
Collaborator

  • config.hydra.config_dir is the config option that enables/disables the hydra compose and dump behavior.

Sorry to come back to this so late, but I'm still slightly confused.

hydra.config_dir acts as both a way to turn on the feature and to set the dir. However, hydra.config_name is optional, has a default value, and doesn't impact whether the feature is turned on.

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:

  • Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).
  • Requiring both flags to be set to turn on the behavior, and raising a helpful error if only one is set.

@daavoo daavoo force-pushed the hydra-compose branch 2 times, most recently from 9cdb7f1 to a05e7fe Compare August 29, 2022 08:39
@daavoo
Copy link
Contributor Author

daavoo commented Aug 29, 2022

This could lead to situations like setting only hydra.config_name and being confused about why it doesn't work. What do you think about either:

- Making a separate flag to turn the feature on and make both of these optional with defaults (conf/config).

I think this one option has the cleanest UX. Added config.hydra.enabled and made config_dir and config_name both optional with defaults

@d-miketa
Copy link
Contributor

d-miketa commented Aug 29, 2022

@daavoo Ahhh that’s amazing! So good to see the integration’s naturally taking care of this problem - you clearly made some sound architectural choices. 😁

@dberenbaum
Copy link
Collaborator

@iterative/dvc Could someone review please?

@efiop
Copy link
Contributor

efiop commented Aug 29, 2022

@dtrifiro @skshetry Could you both take a look, please?

@skshetry
Copy link
Member

@daavoo, I have rebased it to main, and added some skips to the tests that does not work in 3.11.

overrides: List of `Hydra Override`_ patterns.

.. _Hydra Override:
https://hydra.cc/docs/next/advanced/override_grammar/basic/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This link is dead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@daavoo daavoo enabled auto-merge (rebase) August 31, 2022 14:58
daavoo and others added 2 commits August 31, 2022 16:58
The feature will be used depending on whether `config.hydra.enabled` is True or False.

Uses `hydra.initialize_config_dir` and `hydra.compose` (from https://hydra.cc/docs/advanced/compose_api/) to build the config and dump it to `params.yaml`. The content of the output file will be overwritten.

`config.hydra.config_dir` and `config.hydra.config_name` can be used to customize the values passed to `hydra.initialize_config_dir` and `hydra.compose`.

Can be combined with `--set-param` overrides.

Closes #8082
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp A: params Related to dvc params feature is a feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

exp: hydra compose and dump
6 participants