Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Git LFS for Binary Data Handling (acceptance test data) #1974

Closed
wants to merge 5 commits into from

Conversation

cpelley
Copy link
Contributor

@cpelley cpelley commented Jan 2, 2024

Description

This change has been on my radar for a significant time. The test data being made public being a pre-requisite for realising this change, hence only now having this PR.
If approved, we can then eventually remove https://github.com/metoppv/improver_test_data (perhaps that repo. is kept for historic purposes i.e. read-only so no new changes made).

Acceptance data comes form metoppv/improver_test_data@4435108 (to be updated to whatever is the latest commit from improver_test_data) data before merging).

Purpose

This pull request aims to enhance the repository's handling of binary data, specifically in the 'resources' directory under the 'tests' folder. Git LFS (Large File Storage) is introduced to efficiently manage large binary files, providing benefits such as faster cloning, reduced storage requirements, and improved overall repository performance.

Changes Made

  • Added/modified the .gitattributes file to specify Git LFS tracking for files in the 'resources' directory and its subdirectories.
  • Excluded specific text files (LICENSE, README.md) from Git LFS to maintain readability in the repository.

Benefits of Using Git LFS

  • Efficient Storage: Git LFS stores large files outside the main repository, reducing the size of the repository and speeding up cloning for collaborators.
  • Improved Performance: Smoother repository operations as Git LFS fetches binary files on demand, not burdening the version control system with unnecessary file history.
  • Versioning Integrity: Large files are stored separately, preserving the integrity and efficiency of version control for source code.

How Git LFS Works

Git LFS replaces large files with text pointers in the repository, while the actual binary content is stored externally. This allows Git to handle binary data more efficiently, without affecting the core functionality of version control. The .gitattributes file is used to specify which files should be managed by Git LFS.

End-user experience/impact

  • Everything under the improver_test/resources directory is automatically tracked using git lfs as described above and so no impact to the end user here i.e. they treat data files in the same way as any other file under the repository (git lfs handles the distinction for you).
  • The only requirement is that the user have git-lfs installed on their system (available out of the box on systems at the MO).
  • When it has implications to the user:
    • Initial clone of the repository.
      • During the clone, only the metadata and pointers for the large files are downloaded, not the actual binary content.
      • If the user wants to access the actual binary content of the large files, they can run: git lfs fetch.
      • Alternatively, the user can run git lfs pull, which fetches both the pointers and the content in one step.
  • When it has no implications to the user:
    • Otherwise... switching branches, checkouts or merging.
    • Git LFS will automatically fetch the necessary LFS objects for the files in the new branch if they are not already present locally.

Testing

  • Verified Git LFS configurations locally.
  • Confirmed that binary files are properly tracked by Git LFS during various operations.

Checklist

  • The .gitattributes file has been updated to reflect the new Git LFS configurations.
  • Local testing has been conducted to ensure proper functioning.
  • Excluded text files from Git LFS tracking for clarity.
  • Fluff out project root README file with some relevant details.

Issues

@cpelley cpelley changed the title Git lfs test acceptance data Git lfs acceptance test data Jan 2, 2024
@cpelley cpelley marked this pull request as ready for review January 2, 2024 12:52
@cpelley cpelley self-assigned this Jan 2, 2024
@cpelley cpelley added MO review required PRs opened by non-Met Office developers that require a Met Office review BoM review required PRs opened by non-BoM developers that require a BoM review labels Jan 2, 2024
@cpelley cpelley changed the title Git lfs acceptance test data Add Git LFS for Binary Data Handling (acceptance test data) Jan 2, 2024
Copy link

In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity.

If this PR is still important to you, then please comment on this PR and the stale label will be removed.

Otherwise this PR will be automatically closed in 30 days time.

@github-actions github-actions bot added the Stale label Apr 20, 2024
@PaulAbernethy
Copy link
Contributor

bump

@github-actions github-actions bot removed the Stale label Apr 25, 2024
@cpelley
Copy link
Contributor Author

cpelley commented Apr 25, 2024

I have updated the README.md with details about git LFS.

image

@cpelley cpelley force-pushed the git_lfs_test_data branch 2 times, most recently from dcc3f5e to 183c35b Compare April 25, 2024 10:50
@cpelley
Copy link
Contributor Author

cpelley commented Apr 25, 2024

Ultimately, it would make sense to move away from file checksum comparisons for IMPROVER acceptance testing (but I wont suggest radically overhauling the testing in the same PR, unless we abs. need to).

I'm not familiar with the testing framework of IMPROVER, @gavinevans do you have any ideas as the the reason behind the checksum failures?

I have rebased this dev. branch against latest IMPROVER master, and pulled in git LFS test data from metoppv/improver_test_data@4435108 (i.e. latest).

Copy link
Contributor

@gavinevans gavinevans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @cpelley 👍

I think that the usage of git lfs as you've outlined on this PR makes sense. I've added a few comments to help with some of the failures.

@@ -0,0 +1,16 @@
# IMPROVER acceptance test data

This directory represents that data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
This directory represents that data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.
This directory represents the data required to run the [IMPROVER](https://github.com/metoppv/improver) acceptance tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this typo is still present.

@@ -260,7 +260,7 @@ def kgo_root():
try:
test_dir = os.environ[ACC_TEST_DIR_ENVVAR]
except KeyError:
return ACC_TEST_DIR_MISSING
test_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "resources")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this change has caused the acceptance test data to be picked up as part of the GitHub Actions, and therefore the checksums have been compared. We haven't included the acceptance test data within the GitHub Actions previously but this would be a potential benefit of including the acceptance test data within this repo using git lfs.

I think that the checksum comparisons are failing because the acceptance test data within this repo when run on GitHub Actions are actually the text pointer files, rather than the actual netCDF files. I think that the options therefore are:

  1. Disable the acceptance tests on GitHub Actions. I think that this can be done using pytest -m "not acc" within the pytest without coverage and pytest with coverage sections in GitHub Actions.
  2. Add git lfs pull to the pytest without coverage and pytest with coverage sections in GitHub Actions. I've tried this here and it seems like this has worked and the acceptance test has actually run on GitHub Actions. This would probably be preferable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The checksum calculation is also catching the README.md and LICENSE files which don't exist in the reference copy, nor should they. We should exclude these files from the checksum calculation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test_init_files_exist test is now failing: https://github.com/metoppv/improver/blob/master/improver_tests/test_source_code.py#L67 as we usually expect every directory to contain an __init__.py file. As this requirement doesn't make sense for the acceptance test data, you'll need to update the test to exclude the improver_tests/acceptance/resources directory.

@bayliffe
Copy link
Contributor

Thanks for this Carwyn. I can see the advantages of this approach, particularly with @gavinevans suggestions allowing us to run the acceptance tests as part of the actions (though we should be mindful of the number of actions minutes this might consume).

My experience of using this is not quite as described. Checking out the branch, or cloning the repository clean in single branch mode (as if this branch were master: git clone -b git_lfs_test_data --single-branch [email protected]:metoppv/improver.git) both led to all of the acceptance test data files being retrieved immediately. The resources directory is ~150MB and I could interrogate the data immediately, without performing the suggested git lfs specific steps. Gavin found his resources directory to be somewhat smaller than the current acceptance test data repository. Perhaps I'm just special, but I'd appreciate any thoughts on why this is?

@cpelley
Copy link
Contributor Author

cpelley commented May 14, 2024

My experience of using this is not quite as described.

😱@bayliffe, thanks for doing this experiment. Yeah, so nothing is required to get the data, just seamless without extra steps.
I have no idea why I was under the impression an explicit git lfs fetch was needed now 🤷
I'll update the documentation to reflect this.

If desiring to not pull down data on clone:

GIT_LFS_SKIP_SMUDGE=1 git clone ...

Or globally:

git lfs install --skip-smudge

Gavin found his resources directory to be somewhat smaller than...

Thanks @bayliffe, @gavinevans, I'll investigate and check back in with details of my findings 👍

@cpelley
Copy link
Contributor Author

cpelley commented May 14, 2024

The resources directory is ~150MB... Gavin found his resources directory to be somewhat smaller than...

Looks good to me. I get 76.2MB for data under this branch and for data in metoppv/improver_test_data@4435108

@cpelley cpelley force-pushed the git_lfs_test_data branch from 183c35b to ff051fe Compare May 14, 2024 13:04
@cpelley
Copy link
Contributor Author

cpelley commented May 14, 2024

@gavinevans, @bayliffe, all the acceptance tests pass for me locally. I also included an LFS caching support to mitigate against concerns with increased footprint.
I see that the conda environment yaml files are pinned but don't specify exact builds. This will certainly lead to differences in environment compared to our pre-built one locally. I suspect this to be the route cause of the remaining failures.

For now, I'm going to opt for falling back to excluding the acceptance tests in actions since they weren't run before (was a bonus).
I'll leave in the github action workflow refactor and LFS caching in-place for when someone comes back to enabling the acceptance tests through actions.

Comment on lines +18 to +31
# git LFS cache: https://github.com/actions/checkout/issues/165#issuecomment-1639209867
- name: Create LFS file list
run: git lfs ls-files --long | cut -d ' ' -f1 | sort > .lfs-assets-id

- name: LFS Cache
uses: actions/cache@v3
with:
path: .git/lfs/objects
key: ${{ runner.os }}-lfs-${{ hashFiles('.lfs-assets-id') }}
restore-keys: |
${{ runner.os }}-lfs-

- name: Git LFS Pull
run: git lfs pull
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

git lfs pull hooked into cache.

@cpelley
Copy link
Contributor Author

cpelley commented May 23, 2024

Through our experiments we have utilised all the bandwidth :(

image

I don't this would be ordinarily a problem so long as we don't clone the repository all the time.
Having said that, I don't think we want to make the repository problematic to clones by 3rd parties.

Next week when our quota resets, I'll have a go at trying to change default behaviour so that it doesn't fetch the lfs data on checkout/clone.

Under a .lfsconfig file of the root:

[lfs]
    skip-smudge = true

or if skip-smudge isn't supported via the lfsconfig, then:

    fetchexclude = *

@cpelley cpelley marked this pull request as draft June 13, 2024 09:55
@cpelley
Copy link
Contributor Author

cpelley commented Jun 13, 2024

Converted this to draft until some indeterminate date where implications can be better characterised and understood.

Copy link

In order to maintain a backlog of relevant PRs, we automatically label them as stale after 60 days of inactivity.

If this PR is still important to you, then please comment on this PR and the stale label will be removed.

Otherwise this PR will be automatically closed in 30 days time.

@github-actions github-actions bot added the Stale label Aug 13, 2024
Copy link

This stale PR has been automatically closed due to a lack of activity.

If you still care about this PR, then please re-open this PR.

@github-actions github-actions bot closed this Sep 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BoM review required PRs opened by non-BoM developers that require a BoM review MO review required PRs opened by non-Met Office developers that require a Met Office review Paused Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants