Implement cache for URL downloads #27

Open · stscieisenhamer opened this issue Jan 9, 2019 · 10 comments
Labels: help wanted (Extra attention is needed)

Comments

@stscieisenhamer (Collaborator)

Issue

Implement a cache for URL downloads.

Strawman implementation would be controlled by two environment variables:

  • BIGDATA_CACHE
    Points to a "more accessible" folder used to cache downloaded data. If not set or not accessible, no caching would occur.
  • BIGDATA_TTL
    If caching is enabled, the time, in minutes, that cached files are allowed to live. If not specified, the default would be 24 hours (24 * 60 minutes). If set to something like "-1" or "0", the cache lives indefinitely.

Cache file names would be a hash based on the requested URL.
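
A minimal sketch of how this strawman could behave, assuming nothing beyond the two environment variables described above (the function name `cached_download` and the choice of SHA-256 for the URL hash are illustrative, not part of the proposal):

```python
import hashlib
import os
import time
import urllib.request


def cached_download(url):
    """Fetch ``url``, honoring the strawman BIGDATA_CACHE/BIGDATA_TTL scheme.

    Returns a path to a local copy of the file.
    """
    cache_dir = os.environ.get("BIGDATA_CACHE")
    if not cache_dir or not os.access(cache_dir, os.W_OK):
        # Caching disabled: fall back to a plain download in the cwd.
        filename = os.path.basename(url)
        urllib.request.urlretrieve(url, filename)
        return filename

    # Cache file name is a hash based on the requested URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached = os.path.join(cache_dir, key)

    # TTL in minutes; default is 24 hours; 0 or -1 means "live indefinitely".
    ttl = float(os.environ.get("BIGDATA_TTL", 24 * 60))
    if os.path.exists(cached):
        age_minutes = (time.time() - os.path.getmtime(cached)) / 60
        if ttl <= 0 or age_minutes < ttl:
            return cached  # Fresh cache hit.
        os.remove(cached)  # Expired entry; re-download below.

    urllib.request.urlretrieve(url, cached)
    return cached
```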

@pllim (Collaborator) commented Jan 9, 2019

Isn't this basically CRDS? I thought for jwst, @jdavies-st already has some shared storage set up? What is the use case that prompted this request?

@pllim (Collaborator) commented Jan 9, 2019

p.s. If you want caching, the download functions in astropy.utils.data already do it, though they don't have a timer for cache invalidation.
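
For reference, the astropy caching mentioned above is `astropy.utils.data.download_file`; it looks like this (the URL is a placeholder):

```python
from astropy.utils.data import download_file

# First call downloads into the astropy cache and returns a local path;
# later calls with cache=True reuse the cached copy instead of re-downloading.
local_path = download_file("https://example.com/big_file.fits", cache=True)

# There is no TTL, but a refresh can be forced explicitly:
refreshed = download_file("https://example.com/big_file.fits", cache="update")
```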

@jhunkeler (Contributor)

I know what he's getting at. If you execute tests locally it will re-download the upstream file(s) for each run. That can become a massive time sink if you're actively developing a test with large data, so I agree... This thing should have an option to cache downloaded files somewhere.

On the RT server this would be unwanted, but for all other cases it's definitely worth it.

@pllim (Collaborator) commented Jan 10, 2019

I thought for local tests, we are supposed to use a local clone of the Artifactory stuff with jfrog CLI? I am confused.

@jhunkeler (Contributor) commented Jan 10, 2019

Off the top of my head, I think this is a valid use case...

A developer is writing a new set of tests for data that doesn't exist upstream, so they create the directory structure they intend to use (anywhere on disk) and set BIGDATA to the local path. Now they can run their test as many times as they want with no overhead.

When the developer decides to run a different test suite (for whatever reason) ci_watson downloads each file individually to the current _jail directory. When the developer re-runs that test suite those files are downloaded again (and again).

If a cache were in place, the developer would only incur X minutes of overhead once. As files are deleted due to the TTL or a header check, they'll be re-fetched from the server the next time they're needed. This is useful because sometimes you might not want BIGDATA fully populated due to sheer size, but you still might have to run some things repeatedly, which also takes time.

So if you know you're going to run your non-upstream tests, and a subset of another test tree's data that does live upstream, the cache will speed up execution dramatically. Of course, if you run the entire suite you'll end up with all of the files; however, those files are considered temporary instead of an ever-growing cyst on your local filesystem.
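
To make that workflow concrete, with the strawman sketch above a developer's repeated runs might look like this (the module name, cache path, and URL are all hypothetical):

```python
import os

# Hypothetical module holding the cached_download sketch from the issue body.
from bigdata_cache import cached_download

os.environ["BIGDATA_CACHE"] = "/data/ci_cache"  # hypothetical local cache dir
os.environ["BIGDATA_TTL"] = "0"                 # 0 = keep cached files indefinitely

# The first run pays the download cost; every re-run during development is local.
path = cached_download("https://example.com/test_bigdata/input.fits")
```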

pllim added the help wanted label Jan 10, 2019
@stscieisenhamer (Collaborator, Author)

And though a dev/user may have jfrog access, such caching allows one to get only the data needed for whatever test suite is being run, without the dev/user hunting around manually.

Also, another use case involves concurrent development and the fact that the data itself is not versioned. With caching, if dev A is working on a module, say for bug fixing, dev A would now have a local and, if the cache is set never to clear, permanent data set to test against. If dev B then works on code that uses some common set of data (not necessarily the same code) and changes that data, those changes would not affect dev A's work until dev A is about to merge. At that point, dev A should clear their local cache and re-check regression.

All of this would be with the convenience of just setting/unsetting an environment variable (in the strawman case), as opposed to lots of explicit hunting/copying/deleting.

@pllim (Collaborator) commented Mar 8, 2022

I don't think this is relevant anymore with #53

pllim closed this as completed Mar 8, 2022
@stscieisenhamer (Collaborator, Author)

FYI: This has nothing to do with CRDS. This is about locally caching files retrieved from Artifactory.

@pllim (Collaborator) commented Mar 9, 2022

CRDS handles the cache now.

pllim reopened this Mar 9, 2022
@pllim (Collaborator) commented Mar 9, 2022

OK, I misunderstood. This is for Artifactory data, not CRDS ref files.
