[Feature] Use a registry file for test data instead of hard-coding paths in conftest.py
#439
For reference, I looked in pooch and it seems you do have to download one file at a time. So for a folder we would need to parse the registry ourselves (or, more accurately, get it from the registry dict as below) and call `fetch` for each file. This function should work (not tested):

```python
import pooch


def fetch_pooch_directory(registry: pooch.Pooch, directory_name: str,
                          processor=None, downloader=None, progressbar=False):
    """Fetch every registry file under directory_name; return its local path."""
    for name in registry.registry_files:
        if name.startswith(f"{directory_name}/"):
            registry.fetch(name, processor=processor, downloader=downloader,
                           progressbar=progressbar)
    return str(registry.abspath / directory_name)
```

And you'd use it, e.g. with the conftest set up as:

```python
import pooch
import pytest


@pytest.fixture
def test_data_registry():
    """
    Create a test data registry for BrainGlobe.

    Returns:
        pooch.Pooch: The test data registry object.
    """
    registry = pooch.create(
        path=None,
        base_url="https://gin.g-node.org/BrainGlobe/...",
        env="BRAINGLOBE_TEST_DATA_DIR",
    )
    registry.load_registry(".../pooch_registry.txt")
    return registry


def test_something(test_data_registry):
    root = fetch_pooch_directory(test_data_registry, "cellfinder/my_images")
    assert root
```

The one thing is that each repo should have its own registry file, containing only the files used in that repo. So we don't need to list, e.g., all the cellfinder data in the brainglobe-utils registry file; only the files used in brainglobe-utils (whether or not they all live in a single GIN repo).
Can the individual files themselves not be compressed? I'm wary of our workflows grinding to a halt on slow network connections.
I do compress the tifs. But on the CI we should always cache the registry data, so that it only takes a while to download the first time after a change in the registry, for a new Python version, or the first time running locally (GIN seems to be very slow, actually). After that, GitHub will just restore it from the cache.
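On GitHub Actions, that caching could look something like the following sketch. The step name, cache path, and registry filename are assumptions; the path must match wherever `BRAINGLOBE_TEST_DATA_DIR` points:

```yaml
- name: Cache pooch test data
  uses: actions/cache@v4
  with:
    # Must match the directory the pooch registry downloads into
    path: ~/.brainglobe-test-data
    # A new cache is created whenever the registry file (and so the data) changes
    key: test-data-${{ runner.os }}-py${{ matrix.python-version }}-${{ hashFiles('**/pooch_registry.txt') }}
```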
Following up from #440.
I wasn't sure if you were asking for suggestions on hosting. The obvious ones such as AWS et al. come to mind. Maybe your university already has some preferred storage solution that they sponsor, but it may not be free and may not outlive the lab. Perhaps use some cloud provider on CI with a secret key to speed up CI, and fall back to G-Node elsewhere if someone is trying to run the tests locally. And the same for getting the trained model files. GitHub can host files within these repo limits, but I don't think git is the best place to manage test data. A hack I did
I'm very hesitant to move storage platforms, for all the reasons you mention: cost, longevity, etc. GIN is quite slow, but it works well for our other projects once the data is cached. Is it possible to reduce the size of the test data in some way?
Is your feature request related to a problem? Please describe.
Currently, we use a `pooch` registry for test data. This is great, but we have plans to add some zip files that have lots of (compressed) subfolders. This is opaque (you need to download them before seeing what's in them), but will be needed in https://github.com/matham/brainglobe-utils/pull/2/files. We also hard-code the relative paths to test data on GIN in `conftest.py`, which isn't great.

Describe the solution you'd like
(Credit to @matham, who suggested most, if not all, of this 🙏)

- Use `pooch` to create a registry csv file with hashes
- Add a function that parses this registry (unless `pooch` has a nicer way of doing this) and fetches all files in a desired subfolder.

This will allow:
Describe alternatives you've considered
\
Additional context
We want to move all test data out of GitHub repositories. If this plan is successful, we can roll it out to other BrainGlobe repos.
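For generating the registry file itself, pooch provides `pooch.make_registry(directory, output)`. A minimal stdlib-only sketch of the same idea follows; the directory layout is illustrative, and it mirrors pooch's default of writing bare SHA-256 hex digests:

```python
import hashlib
from pathlib import Path


def make_registry(directory: str, output: str) -> None:
    """Write one '<relative path> <sha256 hex>' line per file under
    `directory`, mirroring the format pooch.make_registry produces."""
    root = Path(directory)
    lines = [
        f"{path.relative_to(root).as_posix()} "
        f"{hashlib.sha256(path.read_bytes()).hexdigest()}"
        for path in sorted(p for p in root.rglob("*") if p.is_file())
    ]
    Path(output).write_text("\n".join(lines) + "\n")
```

Each repo would run this once over its own test-data folder and commit only the resulting text file, not the data.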
Related to brainglobe/BrainGlobe#5