Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add example data #279

Merged
merged 7 commits into from
Aug 3, 2023
Merged

Add example data #279

merged 7 commits into from
Aug 3, 2023

Conversation

alexdewar
Copy link
Collaborator

Description

This PR adds some example data as a subpackage inside virtual_rainforest, as suggested by @davidorme. This allows end users to try VR out without having to construct their own dataset or get access to the RDS. This does include some binary files, though they are small in size (< 100Kb total).

I based the dataset on the one at /rds/project/virtual_rainforest/live/dummy_data, with the paths in config files changed to be relative (now that #269 is merged).

I added a new flag to vr_run (--example) to allow users to run a simulation with the example data. I'm not entirely sure about this solution -- it would be nicer to have users just point vr_run at wherever the data is located -- but it might not be easy to find if e.g. VR was installed via pip. Suggestions welcome.

There is also now an example_data_path member in the virtual_rainforest package for convenience, though if you import it then all the models will also be loaded (see #278).

Closes #265.

Type of change

  • New feature (non-breaking change which adds functionality)
  • Optimization (back-end change that speeds up the code)
  • Bug fix (non-breaking change which fixes an issue)

Key checklist

  • Make sure you've run the pre-commit checks: $ pre-commit run -a
  • All tests pass: $ poetry run pytest

Further checks

  • Code is commented, particularly in hard-to-understand areas
  • Tests added that prove fix is effective or that feature works

@davidorme
Copy link
Collaborator

I was thinking more of having a completely parallel package within the same repo, rather than embedding it within vr. So that we would have a separate package structure, an extra entry in pyproject.toml and then have import virtual_rainforest_example_data as vr_data or use importlib.resources.

This search page shows how the example from pyrealm is structured through the repo:
https://github.com/search?q=repo%3AImperialCollegeLondon%2Fpyrealm%20pyrealm_build_data&type=code

Copy link
Collaborator

@jacobcook1995 jacobcook1995 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Just had a query, how much of a problem will updating those binary files be. Like I know we don't want to be changing them super frequently, but they will need to be altered every time model inputs are changed, which is still going to happen with reasonable (weekly?) frequency. Is the best approach just not to worry and to update them every time a new model input gets added (or removed or changed)?

@davidorme
Copy link
Collaborator

We don't want to update them too much, because you end up with complete copies of each binary file in the repo history. The files aren't huge, so not a big deal, but it wouldn't take too many updates to bloat the repo.

One thing here might be to develop the csv reader to allow example data with incremental not binary updates - but that is a bit of a sledgehammer/nut solution.

@dalonsoa
Copy link
Collaborator

dalonsoa commented Aug 3, 2023

From time to time, you can purge the repo from binary files in the history if it becomes too big.

Copy link
Collaborator

@dalonsoa dalonsoa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@alexdewar
Copy link
Collaborator Author

@davidorme I hear what you're saying, but I'm not sure that we want the data in a separate repo (for now), for a couple of reasons.

  1. Downloading an additional dataset repo is an extra (unnecessary) job to do before users can actually try out VR. For the sake of the minimal amount of data savings, it's just not worth it, and just makes the project less accessible
  2. VR is currently under very active development and there's nothing like a stable format for data files atm. This means that the dataset repo will also need to be regularly updated in order to be made compatible with the latest version of VR and it means that if you want to try VR with the example data you'll have to make sure you get exactly the right version of the data for exactly the right version of VR or it just won't work

If VR had been around for years and there was a stable data format and you had, e.g. 1GB of example model data, then it might make sense to store that somewhere else (e.g. using git LFS). Even then though, I think it's still nice to have a minimal example dataset to go along with code, so that users can test it out.

Yes, you shouldn't be committing loads and loads of binary data to your git repo as a matter of course, but it's fine to have some small binary files in there, even if they change semi-regularly. People routinely use git to store websites, even where there are image files in there that change all the time. My copy of the VR repo is currently at ~5MB. Even if it ends up being >100MB a few years down the line, that's honestly totally fine too. Let's not make our lives unnecessarily difficult here.

@davidorme
Copy link
Collaborator

I agree that the binary files here are not likely to be an issue - mostly responding to @jacobcook1995's query 😄

I take your point about keeping the data in sync while vr is in alpha/beta. Ultimately I'm not bothered about the minor increase in the virtual_rainforest package size and just think the separate package is cleaner, but that shouldn't stop us doing this now. I think it is worth adding a long range data package issue to revisit this at some point.

@alexdewar
Copy link
Collaborator Author

@davidorme Sounds good to me.

@alexdewar alexdewar merged commit 65f5d38 into develop Aug 3, 2023
@alexdewar alexdewar deleted the add_example_data branch August 3, 2023 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add example config files to the repo
4 participants