Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moving to a single registry and importing on request not automatically #323

Conversation

davidorme
Copy link
Collaborator

@davidorme davidorme commented Sep 29, 2023

Description

Issues #278 and #302 request that modules are loaded on demand and not automatically and that the existing registries for models, constants and schema are integrated. Those are difficult to solve in isolation, so this does both.

This PR is currently a work in progress. I think it fully implements the requested functionality but it does nothing to update the tests or docs to match. If the changes makes sense then we can then roll out those additional commits.

This also fixes #124 - bad schemas will now fail at register_module rather than cause the package import to fail.

Overview

  • Two new core submodules are added and one is renamed:

    • schema.py simply moves the JSONSchema loading and merging code out of config to simplify the file structure.
    • registry.py collects and updates the existing registration functions from base_model (register_model), constants (register_constant_class) and config (register_schema). It then adds an umbrella function register_module that wraps those and the MODULE_REGISTRY itself.
    • In the models, the constants.py submodule is used as the home for constants dataclasses, and is used in registering those classes. We expect to want to do that in core too, so the existing functionality incore.constants has moved to core.constants_loader.
  • The MODULE_REGISTRY includes core as well as models. The registry is structured as a dict keyed by module short name, each of which contains a dict of with keys ['model', 'constants', 'schema']. The entries for each of those elements is identical to the previous ..._REGISTRY object. So, for example, the plants model is at MODULE_REGISTRY['plants']['model']. The core module does not have an entry for model.

  • The register_module function automatically imports the requested module and then searches for a single BaseModel subclass (except in core), any data classes in xyz.constants and then the appropriate JSONSchema. Because it searches for the BaseModel subclass using inspect, individual models now only have to import that object in __init__, and all the registration code in __init__ files is removed.

  • Within the vr_run flow, models cannot be registered until the model config is built and merged. At that point, we should have the model list, so modules need to be registered then before we can validate the config, which is inside Config.build_schema. So that method now calls register_module('virtual_rainforest.core') and then register_module('virtual_rainforest.model.xyz') for the requested models right before building the schema.

  • I've then stripped out all of the previous autodiscovery code and all of the apparatus around the previous SCHEMA_REGISTRY, MODEL_REGISTRY and CONSTANTS_REGISTRY.

So - this does run using vr_run - the mechanisms seem to work as expected, but please give this WIP a solid kicking before we do anything rash like testing or documenting it. There are a lot of changes but a lot of it is just moving code locations.

Fixes #278 and #302

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • [ VERY MUCH NO ] All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • [ AGAIN, NO ] Tests added that prove fix is effective or that feature works

@davidorme davidorme linked an issue Sep 29, 2023 that may be closed by this pull request
@davidorme
Copy link
Collaborator Author

davidorme commented Sep 30, 2023

Just to demonstrate the registry structure and mechanism, which we'll need to work into the tests:

In [1]: from virtual_rainforest.core.registry import register_module, MODULE_REGISTRY
In [2]: register_module('virtual_rainforest.models.plants')
[INFO] - registry - register_module(72) - Registering virtual_rainforest.models.plants module components
[INFO] - registry - register_model(145) - Registered class for plants model: PlantsModel
[INFO] - registry - register_schema(179) - Schema registered for module plants: /Users/dorme/Research/Virtual_Rainforest/virtual_rainforest/virtual_rainforest/models/plants/model_schema.json 
[INFO] - registry - register_constants_classes(222) - Constants class plants.PlantsConsts registered
In [3]: register_module('virtual_rainforest.core')
[INFO] - registry - register_module(72) - Registering virtual_rainforest.core module components
[INFO] - registry - register_schema(179) - Schema registered for module core: /Users/dorme/Research/Virtual_Rainforest/virtual_rainforest/virtual_rainforest/core/core_schema.json 
In [4]: MODULE_REGISTRY
{'core': {
    'constants': {},  # None defined yet.
    'schema': {'properties': {'core': ...}},
    },
 'plants': {
     'constants': {'PlantsConsts': <class 'virtual_rainforest.models.plants.constants.PlantsConsts'>},
     'model': <class 'virtual_rainforest.models.plants.plants_model.PlantsModel'>,
     'schema': {'properties': {'plants': ... }}
}

@davidorme
Copy link
Collaborator Author

Point of concern

This revision does not populate the registry for a module until that register_module function is called - and importing a module doesn't call that. That works fine within the context of vr_run, when the sequence of events is enforced, but I might have pulled this back too centrally, and we still want register_module to be run when a module is imported?

I think that would also likely reduce the amount of register_module calls required in testing too?

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change seems sensible to me!

I do wonder whether core should really be contained in the registry though as it kind of creates the impression that virtual_rainforest.core is a module that could be switched out for an alternative "core", which I don't think is something we would ever want to support? I guess the core constants existing outside the registry would be a bit weird though?

virtual_rainforest/models/abiotic_simple/__init__.py Outdated Show resolved Hide resolved
virtual_rainforest/models/animals/__init__.py Show resolved Hide resolved
@davidorme
Copy link
Collaborator Author

I do wonder whether core should really be contained in the registry though as it kind of creates the impression that virtual_rainforest.core is a module that could be switched out for an alternative "core", which I don't think is something we would ever want to support? I guess the core constants existing outside the registry would be a bit weird though?

I did think about having a separate CORE_REGISTRY, but really having a single place to look for module components seemed like less code to maintain. I have just been wondering if we want a core model to hold config validation stuff, but that would really be confusing the use cases!

Copy link
Collaborator

@alexdewar alexdewar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely an improvement. Good work!

I do have some suggestions and queries though.

virtual_rainforest/core/constants_loader.py Outdated Show resolved Hide resolved
Comment on lines 32 to 44
MODULE_REGISTRY: dict[str, Any] = {}
"""Information about module schemas, constants and model, keyed by module name.

The registry is a dictionary keyed by the final part of the module name. The entry for
each module is itself a dictionary with 'model', 'constants' and 'schema' entries. For
example:

MODULE_REGISTRY['plants']['model'] = <virtual_rainforest.models.plants.PlantsModel>
MODULE_REGISTRY['plants']['constants'] = [
'PlantsConsts': <virtual_rainforest.models.plants.constants.PlantsConsts>
]
MODULE_REGISTRY['plants']['schema'] = {schema_dict}
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If each element of MODEL_REGISTRY is a dict with the same members, couldn't you use a dataclass for this instead? That way you get better typing etc.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meant to raise this point - you had this structure in an earlier post. The problem I ran into here is that importing those types to construct the dataclass causes some horrific circularity issues. Part of the reason I have separated schema and registry here is that they are then very low level and don't require much else from VR. Once you import BaseModel or things like it then I was getting circular imports all over the place.

It would certainly be much neater to type this more explicitly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't the relevant types be put into their own .py files then? I can't imagine the MODEL_REGISTRY would be needed in those files, so there shouldn't be any circular import problems.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is that BaseModel has a whole bunch of higher level imports:

from virtual_rainforest.core.axes import AXIS_VALIDATORS
from virtual_rainforest.core.config import Config
from virtual_rainforest.core.data import Data

All of those structures come after configuration, so the circular imports from importing BaseModel seem really hard to avoid.

virtual_rainforest/core/schema.py Show resolved Hide resolved
virtual_rainforest/core/schema.py Outdated Show resolved Hide resolved
virtual_rainforest/main.py Outdated Show resolved Hide resolved
virtual_rainforest/models/animals/__init__.py Show resolved Hide resolved
Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've given a first pass and mentioned a couple of things than look a bit odd to me. Maybe I've missunderstood somehinght.

Other than that, I think loading models on-demand as needed by the simulation to be run is a good move and much cleaner than loading everything. So, while there might be things to polish on the implementation side, my 👍 on purpose of the PR.

virtual_rainforest/core/config.py Outdated Show resolved Hide resolved
virtual_rainforest/core/registry.py Outdated Show resolved Hide resolved
virtual_rainforest/core/registry.py Outdated Show resolved Hide resolved
virtual_rainforest/core/registry.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Oct 12, 2023

Codecov Report

Merging #323 (6e2d949) into develop (cb19229) will decrease coverage by 0.71%.
The diff coverage is 97.22%.

@@             Coverage Diff             @@
##           develop     #323      +/-   ##
===========================================
- Coverage    96.43%   95.73%   -0.71%     
===========================================
  Files           48       51       +3     
  Lines         2440     2484      +44     
===========================================
+ Hits          2353     2378      +25     
- Misses          87      106      +19     
Files Coverage Δ
virtual_rainforest/__init__.py 100.00% <ø> (ø)
virtual_rainforest/core/base_model.py 100.00% <100.00%> (ø)
virtual_rainforest/core/config.py 97.51% <100.00%> (-0.53%) ⬇️
virtual_rainforest/core/constants.py 100.00% <100.00%> (ø)
virtual_rainforest/core/constants_class.py 100.00% <100.00%> (ø)
virtual_rainforest/core/constants_loader.py 100.00% <100.00%> (ø)
virtual_rainforest/core/schema.py 100.00% <100.00%> (ø)
virtual_rainforest/main.py 98.21% <100.00%> (-1.79%) ⬇️
...rtual_rainforest/models/abiotic_simple/__init__.py 100.00% <100.00%> (ø)
...rest/models/abiotic_simple/abiotic_simple_model.py 100.00% <100.00%> (ø)
... and 15 more

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

@davidorme davidorme marked this pull request as ready for review October 12, 2023 22:03
@davidorme
Copy link
Collaborator Author

davidorme commented Oct 12, 2023

Round 4

OK - so this now implements a revised mode of operation, following discussion, that explicitly separates the registration process from module import. The MODULE_REGISTRY is populated using register_module - that does import the module to be registered in order to discover components, but module are not registered in their own __init__.py anymore.

The test suite is now updated to work with this new approach and all tests pass. That has included the addition of a global autouse=True fixture that ensures that the MODULE_REGISTRY is cleared before all tests. This is because the registry contents persist between tests and we want to ensure that all tests can stand alone, with the correct module registration, rather than simply passing by accident due to earlier registration.

The docs are update too now. There is one odd sphinx warning that I can't figure out right now, which is something to do with the id_of property of JSONSchema validators.

I've turned this from WIP into a full PR. I know it's a lot of changes but the key things that need attention are really just the latest register_module function in core/registry.py. If that seems solid then the test changes are all pretty clear. I have updated some test structures to make them easier to follow and to use the Config process corretly.

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@@ -31,8 +32,8 @@ def test_vr_run(capsys):
str(example_dir),
"--outpath",
str(example_dir),
"--logfile",
str(example_dir / "vr_example.log"),
# "--logfile",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these supposed to be commented out?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No - but there's a whole bunch of fixing over in #331 which will replace much of this.

),
)
from virtual_rainforest.models.litter.constants import LitterConsts
from virtual_rainforest.models.litter.litter_model import LitterModel
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a particular value in importing these within (almost) every test rather than just at the top of the script. I can definitely see the value for things only used in specific tests, but if LitterModel can't be imported then basically every test will break anyway?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood @dalonsoa correctly, I think the idea is that as much as possible of the environment used by each test should be isolated within it? With some vr things - notably custom Exceptions - we have to import them at the module level so they can be used in parameterisation, but otherwise tests should be self contained? Is that right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct. As much as possible, setting up the tests (as oppose to running them) should not depend on the package under test itself, and that means having all the imports - even if repeated - within the tests. This avoids side effects (keeping appropriate isolation) as well as preventing the whole test suit to fail at setup time just because a failure of the package under test.

Having said that, as David points out, there are specific use cases where that is not possible, but it should be the general approach.

@davidorme davidorme merged commit cd8daac into develop Oct 16, 2023
22 checks passed
@davidorme davidorme deleted the 278-importing-virtual_rainforest-package-always-causes-models-to-be-loaded branch October 16, 2023 15:33
@davidorme davidorme mentioned this pull request Oct 16, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
5 participants