Add Git LFS for Binary Data Handling (acceptance test data) #1974
In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity. If this PR is still important to you, then please comment on this PR and the stale label will be removed. Otherwise this PR will be automatically closed in 30 days time.
bump
I have updated the README.md with details about git LFS.
dcc3f5e to 183c35b
Ultimately, it would make sense to move away from file checksum comparisons for IMPROVER acceptance testing (but I won't suggest radically overhauling the testing in the same PR, unless we absolutely need to). I'm not familiar with the testing framework of IMPROVER; @gavinevans, do you have any ideas as to the reason behind the checksum failures? I have rebased this dev branch against the latest IMPROVER master, and pulled in git LFS test data from metoppv/improver_test_data@4435108 (i.e. latest).
Thanks @cpelley 👍
I think that the usage of `git lfs` as you've outlined on this PR makes sense. I've added a few comments to help with some of the failures.
@@ -0,0 +1,16 @@
+# IMPROVER acceptance test data
+
+This directory represents that data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.
Suggested change:
- This directory represents that data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.
+ This directory represents the data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.
I think that this typo is still present.
@@ -260,7 +260,7 @@ def kgo_root():
     try:
         test_dir = os.environ[ACC_TEST_DIR_ENVVAR]
     except KeyError:
-        return ACC_TEST_DIR_MISSING
+        test_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources")
I think that this change has caused the acceptance test data to be picked up as part of the GitHub Actions, and therefore the checksums have been compared. We haven't included the acceptance test data within the GitHub Actions previously but this would be a potential benefit of including the acceptance test data within this repo using git lfs.
I think that the checksum comparisons are failing because the acceptance test data within this repo when run on GitHub Actions are actually the text pointer files, rather than the actual netCDF files. I think that the options therefore are:
- Disable the acceptance tests on GitHub Actions. I think that this can be done using `pytest -m "not acc"` within the `pytest without coverage` and `pytest with coverage` sections in GitHub Actions.
- Add `git lfs pull` to the `pytest without coverage` and `pytest with coverage` sections in GitHub Actions. I've tried this here and it seems like this has worked and the acceptance test has actually run on GitHub Actions. This would probably be preferable.
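A quick way to confirm the diagnosis above is to check whether the files in the checkout are still pointer files. The following is a minimal sketch, assuming only that pointer files begin with the version line defined by the Git LFS pointer specification; the helper names `is_lfs_pointer` and `find_unfetched_files` are hypothetical and not part of IMPROVER.

```python
from pathlib import Path

# First line of every Git LFS pointer file, per the Git LFS pointer spec.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1"


def is_lfs_pointer(path):
    """Return True if the file looks like a Git LFS pointer, not real data."""
    with open(path, "rb") as handle:
        return handle.read(len(LFS_POINTER_PREFIX)) == LFS_POINTER_PREFIX


def find_unfetched_files(root):
    """List files under root that are still LFS pointers (hypothetical helper)."""
    return sorted(
        str(p) for p in Path(root).rglob("*") if p.is_file() and is_lfs_pointer(p)
    )
```

If `find_unfetched_files` returns anything for the `resources` directory, `git lfs pull` has not been run in that checkout, which would explain checksum mismatches against real netCDF files.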
The checksum calculation is also catching the README.md and LICENSE files which don't exist in the reference copy, nor should they. We should exclude these files from the checksum calculation.
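As a rough illustration of the exclusion, here is a sketch of a directory checksum that skips named files. It is not IMPROVER's actual checksum helper (which is not shown in this PR); `directory_checksums` and `EXCLUDED_NAMES` are hypothetical names, and SHA-256 is an assumed choice of hash.

```python
import hashlib
from pathlib import Path

# Files that sit alongside the data but are not part of the reference copy.
EXCLUDED_NAMES = {"README.md", "LICENSE"}


def directory_checksums(root, excluded=EXCLUDED_NAMES):
    """Map relative path -> SHA-256 hex digest, skipping excluded file names."""
    root = Path(root)
    checksums = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.name in excluded:
            continue
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        checksums[path.relative_to(root).as_posix()] = digest
    return checksums
```

Matching on file name keeps the exclusion independent of where the data directory lives, at the cost of also skipping any data file that happened to be called `README.md` or `LICENSE`.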
The `test_init_files_exist` test is now failing: https://github.com/metoppv/improver/blob/master/improver_tests/test_source_code.py#L67 as we usually expect every directory to contain an `__init__.py` file. As this requirement doesn't make sense for the acceptance test data, you'll need to update the test to exclude the `improver_tests/acceptance/resources` directory.
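One possible shape for that exclusion is sketched below. This is not the existing `test_init_files_exist` implementation; `dirs_missing_init` and `EXCLUDED_DIRS` are hypothetical names, and only the excluded path comes from the comment above.

```python
import os

# Data-only tree that should not be required to contain __init__.py files.
EXCLUDED_DIRS = {os.path.join("improver_tests", "acceptance", "resources")}


def dirs_missing_init(repo_root, package="improver_tests", excluded=EXCLUDED_DIRS):
    """Return package directories lacking __init__.py, skipping excluded trees."""
    missing = []
    for dirpath, dirnames, filenames in os.walk(os.path.join(repo_root, package)):
        rel = os.path.relpath(dirpath, repo_root)
        if rel in excluded:
            dirnames[:] = []  # prune: do not descend into the data tree
            continue
        if "__init__.py" not in filenames:
            missing.append(rel)
    return sorted(missing)
```

Pruning `dirnames` in place means the subdirectories of the data tree are never visited, so they do not need to be listed individually.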
Thanks for this Carwyn. I can see the advantages of this approach, particularly with @gavinevans's suggestions allowing us to run the acceptance tests as part of the actions (though we should be mindful of the number of actions minutes this might consume). My experience of using this is not quite as described. Checking out the branch, or cloning the repository clean in single branch mode (as if this branch were master):
😱 @bayliffe, thanks for doing this experiment. Yeah, so nothing is required to get the data; it is seamless, without extra steps. If you would rather not pull down the data on clone:
Or globally:
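For illustration only (this is not necessarily what was posted in the thread): Git LFS honours the `GIT_LFS_SKIP_SMUDGE` environment variable, which leaves LFS-tracked files as pointers during clone or checkout. A minimal Python sketch follows; the wrapper name `clone_without_lfs_content` and its injectable `run` parameter are hypothetical.

```python
import os
import subprocess


def clone_without_lfs_content(url, dest, run=subprocess.run):
    """Clone a repository, leaving LFS-tracked files as pointer files."""
    # GIT_LFS_SKIP_SMUDGE=1 tells the LFS smudge filter not to download content.
    env = dict(os.environ, GIT_LFS_SKIP_SMUDGE="1")
    return run(["git", "clone", url, dest], env=env, check=True)
```

The global equivalent in Git LFS is `git lfs install --skip-smudge`, after which the data can still be fetched on demand with `git lfs pull`.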
Thanks @bayliffe, @gavinevans, I'll investigate and check back in with details of my findings 👍
Looks good to me. I get 76.2MB for data under this branch and for data in metoppv/improver_test_data@4435108
183c35b to ff051fe
@gavinevans, @bayliffe, all the acceptance tests pass for me locally. I also included LFS caching support to mitigate concerns about the increased footprint. For now, I'm going to opt for falling back to excluding the acceptance tests in actions, since they weren't run before (running them was a bonus).
# git LFS cache: https://github.com/actions/checkout/issues/165#issuecomment-1639209867
- name: Create LFS file list
  run: git lfs ls-files --long | cut -d ' ' -f1 | sort > .lfs-assets-id

- name: LFS Cache
  uses: actions/cache@v3
  with:
    path: .git/lfs/objects
    key: ${{ runner.os }}-lfs-${{ hashFiles('.lfs-assets-id') }}
    restore-keys: |
      ${{ runner.os }}-lfs-

- name: Git LFS Pull
  run: git lfs pull
`git lfs pull` hooked into cache.
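The intent of the cache key in the workflow is that it changes only when the set of LFS object IDs changes. The sketch below illustrates that idea in Python; it does not reproduce how GitHub's `hashFiles` computes its digest, and `lfs_cache_key` is a hypothetical name.

```python
import hashlib


def lfs_cache_key(runner_os, lfs_object_ids):
    """Build a cache key that changes only when the set of LFS object IDs changes."""
    # Sorting mirrors the `sort` in the workflow step, so ordering never alters the key.
    listing = "".join(f"{oid}\n" for oid in sorted(lfs_object_ids))
    digest = hashlib.sha256(listing.encode("utf-8")).hexdigest()
    return f"{runner_os}-lfs-{digest}"
```

When the key misses, the `restore-keys` prefix in the workflow still allows an older cache to be restored, so `git lfs pull` only needs to download objects that are new.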
Through our experiments we have utilised all the bandwidth :( I don't think this would ordinarily be a problem so long as we don't clone the repository all the time. Next week when our quota resets, I'll have a go at trying to change the default behaviour so that it doesn't fetch the lfs data on checkout/clone. Under a
or if
|
Converted this to draft until some indeterminate date where implications can be better characterised and understood.
In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity. If this PR is still important to you, then please comment on this PR and the stale label will be removed. Otherwise this PR will be automatically closed in 30 days time.
This stale PR has been automatically closed due to a lack of activity. If you still care about this PR, then please re-open this PR.
Description
This change has been on my radar for a significant time. Making the test data public was a prerequisite for this change, which is why this PR is only being raised now.
If approved, we can then eventually remove https://github.com/metoppv/improver_test_data (perhaps that repo. is kept for historic purposes i.e. read-only so no new changes made).
Acceptance data comes from metoppv/improver_test_data@4435108 (to be updated to the latest commit from improver_test_data before merging).
Purpose
This pull request aims to enhance the repository's handling of binary data, specifically in the 'resources' directory under the 'tests' folder. Git LFS (Large File Storage) is introduced to efficiently manage large binary files, providing benefits such as faster cloning, reduced storage requirements, and improved overall repository performance.
Changes Made
Benefits of Using Git LFS
How Git LFS Works
Git LFS replaces large files with text pointers in the repository, while the actual binary content is stored externally. This allows Git to handle binary data more efficiently, without affecting the core functionality of version control. The .gitattributes file is used to specify which files should be managed by Git LFS.
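To make the pointer idea concrete, here is a minimal sketch of reading the fields of a pointer file in Python. The three keys (`version`, `oid`, `size`) follow the Git LFS pointer format; the function name `parse_lfs_pointer` is hypothetical, and the `oid` in the example is a made-up placeholder value.

```python
def parse_lfs_pointer(text):
    """Parse Git LFS pointer text into a dict with version, hash, oid and size."""
    fields = dict(line.split(" ", 1) for line in text.splitlines() if line)
    if not fields.get("version", "").startswith("https://git-lfs.github.com/spec/"):
        raise ValueError("not a Git LFS pointer")
    algorithm, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algorithm": algorithm,
        "oid": digest,
        "size": int(fields["size"]),  # size in bytes of the real file
    }


# Placeholder pointer content; the oid is illustrative, not a real object.
EXAMPLE_POINTER = (
    "version https://git-lfs.github.com/spec/v1\n"
    "oid sha256:" + "0" * 64 + "\n"
    "size 12345\n"
)
pointer = parse_lfs_pointer(EXAMPLE_POINTER)
```

The pointer is only a few lines of text, which is why the repository stays small while the binary content is stored externally and fetched by object ID.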
End-user experience/impact
`git lfs` handles the distinction for you).
`git lfs fetch`.
`git lfs pull`, which fetches both the pointers and the content in one step.
Testing
Checklist
Issues