Implement cache for URL downloads #27

Open · stscieisenhamer opened this issue Jan 9, 2019 · 10 comments
Labels: help wanted (Extra attention is needed)

Comments

@stscieisenhamer (Collaborator)

Issue

Implement a cache for URL downloads.

Strawman implementation would be controlled by two environment variables:

  • BIGDATA_CACHE
    Points to a "more accessible" folder used to cache downloaded data. If not set or not accessible, no caching would occur.
  • BIGDATA_TTL
    If caching is enabled, the time, in minutes, that cached files are allowed to live. If not specified, the default would be 24 hours (24 * 60 minutes). If set to something like "-1" or "0", the cache lives indefinitely.

Cache file names would be a hash based on the requested URL.
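
A minimal sketch of how this strawman could behave, assuming nothing beyond the two environment variables described above (the function name `cached_download` and the choice of SHA-256 for the URL hash are illustrative, not part of the proposal):

```python
import hashlib
import os
import time
import urllib.request


def cached_download(url):
    """Fetch ``url``, honoring the strawman BIGDATA_CACHE/BIGDATA_TTL scheme.

    Returns a path to a local copy of the file.
    """
    cache_dir = os.environ.get("BIGDATA_CACHE")
    if not cache_dir or not os.access(cache_dir, os.W_OK):
        # Caching disabled: fall back to a plain download in the cwd.
        filename = os.path.basename(url)
        urllib.request.urlretrieve(url, filename)
        return filename

    # Cache file name is a hash based on the requested URL.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cached = os.path.join(cache_dir, key)

    # TTL in minutes; default is 24 hours; 0 or -1 means "live indefinitely".
    ttl = float(os.environ.get("BIGDATA_TTL", 24 * 60))
    if os.path.exists(cached):
        age_minutes = (time.time() - os.path.getmtime(cached)) / 60
        if ttl <= 0 or age_minutes < ttl:
            return cached  # Fresh cache hit.
        os.remove(cached)  # Expired entry; re-download below.

    urllib.request.urlretrieve(url, cached)
    return cached
```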

@pllim (Collaborator) commented Jan 9, 2019

Isn't this basically CRDS? I thought for jwst, @jdavies-st already has some shared storage set up? What is the use case that prompted this request?

@pllim (Collaborator) commented Jan 9, 2019

p.s. If you want caching, the download functions in astropy.utils.data already do it, though they don't have a timer for cache invalidation.
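
For reference, the astropy caching mentioned above is `astropy.utils.data.download_file`; it looks like this (the URL is a placeholder):

```python
from astropy.utils.data import download_file

# First call downloads into the astropy cache and returns a local path;
# later calls with cache=True reuse the cached copy instead of re-downloading.
local_path = download_file("https://example.com/big_file.fits", cache=True)

# There is no TTL, but a refresh can be forced explicitly:
refreshed = download_file("https://example.com/big_file.fits", cache="update")
```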

@jhunkeler (Contributor)

I know what he's getting at. If you execute tests locally it will re-download the upstream file(s) for each run. That can become a massive time sink if you're actively developing a test with large data, so I agree... This thing should have an option to cache downloaded files somewhere.

On the RT server this would be unwanted, but for all other cases it's definitely worth it.

@pllim (Collaborator) commented Jan 10, 2019

I thought for local tests, we are supposed to use a local clone of the Artifactory stuff with jfrog CLI? I am confused.

@jhunkeler (Contributor) commented Jan 10, 2019

Off the top of my head, I think this is a valid use case...

A developer is writing a new set of tests for data that doesn't exist upstream, so they create the directory structure they intend to use (anywhere on disk) and set BIGDATA to the local path. Now they can run their test as many times as they want with no overhead.

When the developer decides to run a different test suite (for whatever reason) ci_watson downloads each file individually to the current _jail directory. When the developer re-runs that test suite those files are downloaded again (and again).

If a cache were in place, the developer would only incur X minutes of overhead once. As files are deleted due to the TTL or a header check, they'll be re-fetched from the server the next time they're needed. This is useful because sometimes you might not want BIGDATA fully populated due to sheer size, but you still might have to run some things repeatedly, which also takes time.

So if you know you're going to run your non-upstream tests, and a subset of another test tree's data that does live upstream, the cache will speed up execution dramatically. Of course, if you run the entire suite you'll end up with all of the files; however, those files are considered temporary instead of an ever-growing cyst on your local filesystem.
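
To make that workflow concrete, with the strawman sketch above a developer's repeated runs might look like this (the module name, cache path, and URL are all hypothetical):

```python
import os

# Hypothetical module holding the cached_download sketch from the issue body.
from bigdata_cache import cached_download

os.environ["BIGDATA_CACHE"] = "/data/ci_cache"  # hypothetical local cache dir
os.environ["BIGDATA_TTL"] = "0"                 # 0 = keep cached files indefinitely

# The first run pays the download cost; every re-run during development is local.
path = cached_download("https://example.com/test_bigdata/input.fits")
```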

pllim added the help wanted label Jan 10, 2019
@stscieisenhamer (Collaborator, Author)

And though a dev/user may have jfrog access, such caching allows one to get only the data needed for whatever test suite is being run, without the dev/user hunting around manually.

Also, another use case involves concurrent development and the fact that the data itself is not versioned. With caching, if dev A is working on a module, say for bug fixing, dev A would now have a local and, if the cache is set never to clear, permanent data set to test against. If dev B then works on code that uses some common set of data (not necessarily the same code) and changes that data, those changes would not affect dev A's work until dev A is about to merge. At that point, dev A should clear their local cache and re-check regression.

All of this would be with the convenience of just setting/unsetting an environment variable (in the strawman case), as opposed to lots of explicit hunting/copying/deleting.

@pllim (Collaborator) commented Mar 8, 2022

I don't think this is relevant anymore with #53

pllim closed this as completed Mar 8, 2022
@stscieisenhamer (Collaborator, Author)

FYI: This has nothing to do with CRDS. This is about locally caching files retrieved from Artifactory.

@pllim (Collaborator) commented Mar 9, 2022

CRDS handles the cache now.

pllim reopened this Mar 9, 2022
@pllim (Collaborator) commented Mar 9, 2022

OK, I misunderstood. This is for Artifactory data, not CRDS ref files.
