Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow setting params with command-line arguments #267

Merged
merged 5 commits into from
Aug 2, 2023

Conversation

alexdewar
Copy link
Collaborator

@alexdewar alexdewar commented Jul 31, 2023

Description

This PR adds the option to set additional parameters when invoking vr_run by using command-line arguments, without having to create additional config files.

For example, it allows you to do something like this:

vr_run --param hydrology.initial_soil_moisture=0.2 --param some.other.param=value dummy_data/

The main motivation for this is that I want to be able to invoke vr_run in parallel with different parameters as part of my work on the SA (see #239). While that is currently possible, it requires generating config files as an intermediate step, which makes the process substantially more complex.

Closes #266.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

For example, you can now run:

$ vr_run --param hydrology.initial_soil_moisture=0.2

You can specify multiple parameters with multiple --param flags.

Closes #266.
@alexdewar alexdewar changed the title Command line params Allow setting params with command-line arguments Jul 31, 2023
@codecov-commenter
Copy link

codecov-commenter commented Jul 31, 2023

Codecov Report

Merging #267 (1ec1fb0) into develop (382daa1) will decrease coverage by 1.02%.
The diff coverage is 43.24%.

@@             Coverage Diff             @@
##           develop     #267      +/-   ##
===========================================
- Coverage    95.28%   94.27%   -1.02%     
===========================================
  Files           43       43              
  Lines         1781     1816      +35     
===========================================
+ Hits          1697     1712      +15     
- Misses          84      104      +20     
Files Changed Coverage Δ
virtual_rainforest/entry_points.py 51.02% <33.33%> (-23.98%) ⬇️
virtual_rainforest/core/config.py 98.52% <83.33%> (-0.47%) ⬇️
virtual_rainforest/main.py 81.19% <100.00%> (ø)

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks sensible to me, just had a query about how parameters defined in both config files and the command line will be handled.

@@ -432,7 +438,7 @@ def build_config(self) -> None:
"""

# Get the config dictionaries
input_dicts = list(self.toml_contents.values())
input_dicts = list(self.toml_contents.values()) + self.extra_params
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Am I right in thinking that this step adds the extra parameters to the existing config dictionaries? If so is there anything here that handles the case where the config files define a parameter which is then supplied as an extra parameter? I'm not actually 100% sure what the desired behaviour would be in that case?

Copy link
Collaborator Author

@alexdewar alexdewar Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a parameter is set in one of the config files and you then also try to set it via a command-line arg, then you'll get an error, just as you would if you tried to set it twice in different config files.

I can see that there might be a case where a user wants to override a parameter set in a config file, but I don't think that should be the default behaviour as it's a bit error prone. If we want to add this feature later we can do (e.g. we could have a separate --param-override flag for this case).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh yeah that makes sense! Also agree that overriding parameters seems error prone

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is dangerous and I think there is a case for making the override behaviour the explicit mechanism for this approach.

  • The configuration process at the moment takes a bunch of files and uses those to build the configuration tree, substituting default values for any settings that aren't explicitly mentioned. That gives us a base configuration, with everything set.

  • If we want to tweak those settings, we shouldn't need to worry about whether the base files set a value or if it is just being populated by default. That would put us in a position of having to maintain multiple sets of base configuration files, each one omitting the setting that we want to tweak. It would be neater to have a default config that can just be adjusted as needed.

  • Explicitly providing extra params would then be a way to substitute values in a particular run for a sensitivity analysis, by updating the Config object created from the base configuration for a given run. Those extra params aren't then simply added to the TOML contents, but are fed into a separate Config method. That does then need some kind of a validation step though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that method would be something like:

    def override_config(self, extra_params) -> None:
        """Adjust an already created config to update settings

        Raises:
            ValidationError: if duplicate config definitions occur across files.
        """

        # Copy the current dictionary
        master = deepcopy(self.__dict__)
        
        # Update that dictionary
        # TODO - don't know if this works to replace values in master even when conflicts found, 
        #               or whether we need an explicit switch or method to do override.
        master, conflicts = config_merge(master, extra_params, conflicts=tuple())
        
        # Ignore any conflicts - explicitly overriding settings - so now revalidate against the 
        # schema defined in the base config.
        self.update(master)
        self.validate_config()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

At the moment, none of the constants are populated by default in the config (i.e. not using default values set within model schema). It's only after config validation when a set of constants is used to create a model specific constants class that default values come in, These defaults are hardcoded into the various constants classes, and just aren't replaced if there is no value stored in the config to replace them with.

So I don't really see a danger the current approach, unless we change how default constants are handled

Copy link
Collaborator

@davidorme davidorme Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The config_merge does indeed return the updated dictionary when conflicts are present:

In [11]: from virtual_rainforest.core.config import Config, config_merge
In [12]: cfg = Config('tests/core/data/all_config.toml')
In [13]: updated, conflicts = config_merge(c, {'core': {'layers': {'soil_layers': 3}}})
In [14]: updated['core']['layers']['soil_layers']
Out[14]: 3

So, this should work as:

    def override_config(self, extra_params) -> None:
        """Adjust an already created config to update settings

        Raises:
            ValidationError: if an invalid value is provided for any configuration settings.
        """
        
        # Update the dictionary
        updated, conflicts = config_merge(self, extra_params, conflicts=tuple())
        
        # Ignore any conflicts - explicitly overriding settings - so now revalidate against the 
        # schema defined in the base config.
        self.update(updated)
        self.validate_config()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Obviously, if changing the grid size or another part of the config that does have a default setting through this method would be risky, but I don't think we would want to do that?

Copy link
Collaborator

@davidorme davidorme Aug 1, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So configuration settings that are then checked against other inputs/data could cause all sorts of chaos.

But this process should only (?) ever be invoked when the Config is built and validated and bad overrides would cause exactly the same errors as bad initial settings - they can be JSONSchema invalid and get trapped early - or bad/inconsistent in setting up the rest of the simulation and cause a later error.

Exposing a Config.override_config method does raise a route where programmatic use could change the Config half way through a simulation (and we should very clearly say that this is extremely ill-advised!), but in vr_run it would only ever be something like:

config = Config(toml_files)
config.override_config(extra_params)
# Now leave config alone!

Copy link
Collaborator

@davidorme davidorme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works - and that is probably the thing that matters now - but I think having to avoid duplicating configuration settings in TOML files and extra_params is another route to piles of slightly differing configuration TOML files. This approach definitely avoids having multiple files that differ in just one (or two or three...) values, but does require unique config files for each param set being altered in a specific sensitivity analysis.

What do you think? @jacobcook1995 - what do you think the dangers here? We'd have to explicitly validate any overridden values, but we have the mechanism to do that.

Incidentally, @alexdewar, my son Thomas has just gotten quite into Exploding Kittens - the Beard Cat icon was instantly familiar 😄

self, cfg_paths: Union[str, Path, list[Union[str, Path]]], auto: bool = True
self,
cfg_paths: Union[str, Path, list[Union[str, Path]]],
extra_params: Optional[list[dict[str, Any]]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious as to why this uses Optional and None when the custom attribute is immediately set to be an empty list if it isn't provided. As opposed to doing:

Suggested change
extra_params: Optional[list[dict[str, Any]]] = None,
extra_params: list[dict[str, Any]] = [],

That is then a mutable default, which is dangerous, but it could be a tuple of dicts instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. We're deepcopying it anyway, so it doesn't matter if it's mutable.

@@ -432,7 +438,7 @@ def build_config(self) -> None:
"""

# Get the config dictionaries
input_dicts = list(self.toml_contents.values())
input_dicts = list(self.toml_contents.values()) + self.extra_params
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure it is dangerous and I think there is a case for making the override behaviour the explicit mechanism for this approach.

  • The configuration process at the moment takes a bunch of files and uses those to build the configuration tree, substituting default values for any settings that aren't explicitly mentioned. That gives us a base configuration, with everything set.

  • If we want to tweak those settings, we shouldn't need to worry about whether the base files set a value or if it is just being populated by default. That would put us in a position of having to maintain multiple sets of base configuration files, each one omitting the setting that we want to tweak. It would be neater to have a default config that can just be adjusted as needed.

  • Explicitly providing extra params would then be a way to substitute values in a particular run for a sensitivity analysis, by updating the Config object created from the base configuration for a given run. Those extra params aren't then simply added to the TOML contents, but are fed into a separate Config method. That does then need some kind of a validation step though.

@@ -432,7 +438,7 @@ def build_config(self) -> None:
"""

# Get the config dictionaries
input_dicts = list(self.toml_contents.values())
input_dicts = list(self.toml_contents.values()) + self.extra_params
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that method would be something like:

    def override_config(self, extra_params) -> None:
        """Adjust an already created config to update settings

        Raises:
            ValidationError: if duplicate config definitions occur across files.
        """

        # Copy the current dictionary
        master = deepcopy(self.__dict__)
        
        # Update that dictionary
        # TODO - don't know if this works to replace values in master even when conflicts found, 
        #               or whether we need an explicit switch or method to do override.
        master, conflicts = config_merge(master, extra_params, conflicts=tuple())
        
        # Ignore any conflicts - explicitly overriding settings - so now revalidate against the 
        # schema defined in the base config.
        self.update(master)
        self.validate_config()

@jacobcook1995
Copy link
Collaborator

Is there any real danger in terms of the validation? Because the new parameter gets added to the list of config files to validate, i.e. the change is made before any validation takes place

@jacobcook1995
Copy link
Collaborator

@davidorme I guess my view is that overriding existing config settings introduces a fair bit of complexity to this pull request without really addressing any immediate problem?

Like if down the line it becomes very annoying to have to ensure that options have not been populated in config files when doing sensitivity analysis then we can always take the approach you've suggested here, but I'm not sure there's any great need to address it now as a potential future problem (because there are an awful lot of those).

@davidorme
Copy link
Collaborator

This method is a vast improvement on having to write multiple files to run variations for a given SA but I think having to create a separate compatible set of base config files for any given sensitivity analysis is an immediate problem. If there isn't a danger to the override behaviour and if the resulting user experience is better, then I'm not sure why we don't just implement it, given that we already have the mechanisms in place to do so.

@davidorme
Copy link
Collaborator

To give a concrete example:

  • Let's say we run a big long analysis that saves a whole bunch of stuff with some non-default configured settings like plants.beta_value = 0.5 and soil.zeta_value=0.95.
  • Now we want to run an SA to see how much GPP varies if we change plants.beta_value = range(0.4, 0.6, 0.01) and soil.zeta_value= range(0.85, 1.05, 0.1).
  • Those SA runs have to have different config files to our base run, because they cannot contain plants.beta_value = 0.5 and soil.zeta_value=0.95.
  • That is less clear and less reproducible.

@alexdewar
Copy link
Collaborator Author

@davidorme I take your point that that it could be useful to be able to override params so you can reuse config files between simulation runs. I think it would be a mistake to silently override the value the user has put in a config file, though, in case it was accidental; we should at least warn them.

I'm happy to have a go at this, but if it turns out to be finicky, then perhaps we could merge this as is and open a separate issue for the "allow users to override parameters" feature? My main goal atm is to get a minimal version of the SA working so that we can all have a discussion about it. Once that's done we can of course polish things.

@davidorme
Copy link
Collaborator

That sounds good to me - we shouldn't lose the bigger SA picture - but I think this is worth fixing here if we can.

We can warn users - no harm in that - but the context of sensitivity analyses and also the documentation for extra_params should be very clear that we are adjusting configs. Maybe calling it alter_params or override_params to give a clear hint that this is what it will do!

@alexdewar
Copy link
Collaborator Author

Ok, I've had a go. Let me know what you think.

I haven't written tests yet, but can do if people are happy with the implementation.

It turns out override_config() doesn't raise ValidationError after all.
Copy link
Collaborator

@davidorme davidorme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That implementation looks good to me. I'd thought of Config.override_config as something to call post __init__ but embedding it in __init__ also makes sense. I briefly considered suggesting we make it 'private' as Config._override_config to further discourage use, but honestly none of those Config methods are expected to be used except via __init__ in 99% of use cases.

I hadn't thought that the conflicts object provides an automatic list of altered settings - so that is pretty seamless for reporting changes. I guess the thing there is that the conflicts will be against things set in files and things set by default - but honestly the whole point of this mechanism is to make it easy to change settings, so I think that distinction is largely artificial and recording in logs which settings have been changed is useful.

@alexdewar
Copy link
Collaborator Author

Thanks @davidorme.

I think we could probably safely make all of those methods private and remove the auto parameter to __init__. That way we wouldn't have to worry about users using them at all. There aren't any current uses of them outside of __init__ and anyone will want to invoke them directly anyway.

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alexdewar
Copy link
Collaborator Author

I'll merge this now so that I can rebase my other work on top of it.

@alexdewar alexdewar merged commit cf74b38 into develop Aug 2, 2023
@alexdewar alexdewar deleted the command_line_params branch August 2, 2023 08:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vr_run: Allow for setting parameter values from command line
4 participants