Add example data #279

alexdewar · 2023-08-03T10:35:17Z

Description

This PR adds some example data as a subpackage inside virtual_rainforest, as suggested by @davidorme. This allows end users to try VR out without having to construct their own dataset or get access to the RDS. This does include some binary files, though they are small in size (< 100Kb total).

I based the dataset on the one at /rds/project/virtual_rainforest/live/dummy_data, with the paths in config files changed to be relative (now that #269 is merged).

I added a new flag to vr_run (--example) to allow users to run a simulation with the example data. I'm not entirely sure about this solution -- it would be nicer to have users just point vr_run at wherever the data is located -- but it might not be easy to find if e.g. VR was installed via pip. Suggestions welcome.

There is also now an example_data_path member in the virtual_rainforest package for convenience, though if you import it then all the models will also be loaded (see #278).

Closes #265.

Type of change

New feature (non-breaking change which adds functionality)
Optimization (back-end change that speeds up the code)
Bug fix (non-breaking change which fixes an issue)

Key checklist

Make sure you've run the pre-commit checks: $ pre-commit run -a
All tests pass: $ poetry run pytest

Further checks

Code is commented, particularly in hard-to-understand areas
Tests added that prove fix is effective or that feature works

Closes #265.

mypy was complaining about lists being invariant, whatever that means.

davidorme · 2023-08-03T10:44:51Z

I was thinking more of having a completely parallel package within the same repo, rather than embedding it within vr. So that we would have a separate package structure, an extra entry in pyproject.toml and then have import virtual_rainforest_example_data as vr_data or use importlib.resources.

This search page shows how the example from pyrealm is structured through the repo:
https://github.com/search?q=repo%3AImperialCollegeLondon%2Fpyrealm%20pyrealm_build_data&type=code

jacobcook1995

LGTM!

Just had a query, how much of a problem will updating those binary files be. Like I know we don't want to be changing them super frequently, but they will need to be altered every time model inputs are changed, which is still going to happen with reasonable (weekly?) frequency. Is the best approach just not to worry and to update them every time a new model input gets added (or removed or changed)?

davidorme · 2023-08-03T10:57:17Z

We don't want to update them too much, because you end up with complete copies of each binary file in the repo history. The files aren't huge, so not a big deal, but it wouldn't take too many updates to bloat the repo.

One thing here might be to develop the csv reader to allow example data with incremental not binary updates - but that is a bit of a sledgehammer/nut solution.

dalonsoa · 2023-08-03T11:14:28Z

From time to time, you can purge the repo from binary files in the history if it becomes too big.

dalonsoa

LGTM!

alexdewar · 2023-08-03T13:29:39Z

@davidorme I hear what you're saying, but I'm not sure that we want the data in a separate repo (for now), for a couple of reasons.

Downloading an additional dataset repo is an extra (unnecessary) job to do before users can actually try out VR. For the sake of the minimal amount of data savings, it's just not worth it, and just makes the project less accessible
VR is currently under very active development and there's nothing like a stable format for data files atm. This means that the dataset repo will also need to be regularly updated in order to be made compatible with the latest version of VR and it means that if you want to try VR with the example data you'll have to make sure you get exactly the right version of the data for exactly the right version of VR or it just won't work

If VR had been around for years and there was a stable data format and you had, e.g. 1GB of example model data, then it might make sense to store that somewhere else (e.g. using git LFS). Even then though, I think it's still nice to have a minimal example dataset to go along with code, so that users can test it out.

Yes, you shouldn't be committing loads and loads of binary data to your git repo as a matter of course, but it's fine to have some small binary files in there, even if they change semi-regularly. People routinely use git to store websites, even where there are image files in there that change all the time. My copy of the VR repo is currently at ~5MB. Even if it ends up being >100MB a few years down the line, that's honestly totally fine too. Let's not make our lives unnecessarily difficult here.

davidorme · 2023-08-03T14:03:30Z

I agree that the binary files here are not likely to be an issue - mostly responding to @jacobcook1995's query 😄

I take your point about keeping the data in sync while vr is in alpha/beta. Ultimately I'm not bothered about the minor increase in the virtual_rainforest package size and just think the separate package is cleaner, but that shouldn't stop us doing this now. I think it is worth adding a long range data package issue to revisit this at some point.

alexdewar · 2023-08-03T14:35:40Z

@davidorme Sounds good to me.

alexdewar added 7 commits August 3, 2023 11:12

Add dummy_data

bf419e0

Remove data readme file

45b6e91

Use relative paths for dummy data config

6d3f8c6

Add example_data package to virtual_rainforest

0356463

Closes #265.

Use Sequence type for vr_run() and Config()

0aebf95

mypy was complaining about lists being invariant, whatever that means.

Allow for running example config with --example flag

5c4667f

Update docs to mention example data

4245af9

alexdewar requested review from dalonsoa, davidorme, jacobcook1995, vgro, LivDaniel and jaideep777 August 3, 2023 10:35

jacobcook1995 approved these changes Aug 3, 2023

View reviewed changes

dalonsoa approved these changes Aug 3, 2023

View reviewed changes

alexdewar merged commit 65f5d38 into develop Aug 3, 2023

alexdewar deleted the add_example_data branch August 3, 2023 14:35

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add example data #279

Add example data #279

alexdewar commented Aug 3, 2023

davidorme commented Aug 3, 2023

jacobcook1995 left a comment

davidorme commented Aug 3, 2023

dalonsoa commented Aug 3, 2023

dalonsoa left a comment

alexdewar commented Aug 3, 2023

davidorme commented Aug 3, 2023

alexdewar commented Aug 3, 2023

Add example data #279

Add example data #279

Conversation

alexdewar commented Aug 3, 2023

Description

Type of change

Key checklist

Further checks

davidorme commented Aug 3, 2023

jacobcook1995 left a comment

Choose a reason for hiding this comment

davidorme commented Aug 3, 2023

dalonsoa commented Aug 3, 2023

dalonsoa left a comment

Choose a reason for hiding this comment

alexdewar commented Aug 3, 2023

davidorme commented Aug 3, 2023

alexdewar commented Aug 3, 2023