Locally caching remote or inline resources via "cache" key #243

danfowler · 2016-02-04T15:29:44Z

As of 1.0.0-beta.15, only one of path, url, or data is allowed on a given resource. In discussing this, it was suggested that it is useful to have a local path to which to download a remote (url) or inline (data) resource so that non-datapackage-aware tooling can access. As a result of discussion in #223, it has been suggested that a new key, cache, be used on a resource to specify a local path to "cache" remote or inline resources to disk.

Examples:

{
  "url": "http://xyz.com/dataset.csv",
  "cache": "dataset.csv"
}

{
  "data": [
    {"a": "1", "b": "2"}
  ],
  "cache": "dataset.csv"
}

So, when a resource has a url or data key, it SHOULD also have a cache key. An implementation that is reading the data in the data package, should then first check cache for the data, if it exists, and, if not, save the data as specified in url or data to cache. Otherwise, cache acts just like path. When a resource has a path key, an implementation should ignore cache.

WDYT? @morty @amercader @rgrp @pwalsh

The text was updated successfully, but these errors were encountered:

pwalsh · 2016-02-04T15:37:56Z

Big +1 on cache.

Neutral on the description of what implementations should do. What an implementation should do, in this case, is likely highly specific to the requirements of the implementation, and not an issue for the spec to address.

aubergene · 2016-02-12T19:10:14Z

How about storing the ETAG or some other hash of the document from the last time it was retrieved?

pwalsh · 2016-03-07T05:54:05Z

@rgrp I think I'd want to modify this, that cache can be a path or a url. WDYT?

pwalsh · 2016-03-07T05:56:03Z

@aubergene about ETAG or hash of document serves a different need IMHO. The need for a cache comes from the possibility of a resource hosted elsewhere becoming unavailable.

rossjones · 2016-03-07T07:41:33Z

I'd be really keen to support url as an option for the cache, even though it opens up a can of worms where the cached copy (of file or url) could potentially be out of date. Although it complicates thing, I agree with @aubergene that having a content-hash for the cache would be useful in determining if it is out of date - which may or may not be acceptable, but at least the user would know.

rufuspollock · 2016-03-07T14:41:20Z

OK, this seems fairly sensible and straightforward. Only question is whether cache is an object or a single path value. Object would be something like:

"cache": "path-to-cached-file"

# or cache as object
"cache": {
  "path": ...
  "hash": ...
  "etag / lastCached": timestamp when cache object was last stored ...
}

* `hash`: I am not sure about including a hash on the cache item since, in theory, this easily re-computed and then compared against the source object. However, I'm not using this in practice myself right now so like to here from potential implementors.
* `etag / lastCached`: i'm always wary of these things going out of date (i.e. a chance is made but you don't update the datapackage.json). Wrong info is worse than no info. Again any sense of the value of this would be useful in deciding whether it was worth the additional complexity.

pwalsh · 2016-03-07T14:47:24Z

+1 "cache": "path-to-cached-file"
-1 on hash and etag for cache as part of spec.

rossjones · 2016-03-07T14:52:33Z

Our recent look at 60k files we have archived showed < 50% had an etag, so it may not actually be that useful.

Agree the hash can be calculated by the client, but where the cache is a URL I need to fetch it first - the useful thing with the hash and lastCached date here is determining if it actually IS out of date. If the hash doesn't match the primary and/or the date doesn't match my expectations, then I know I need to investigate further. If there cache is a naked url, or file path, I have absolutely no idea without processing it whether it may be different than the original data.

rufuspollock · 2016-03-08T11:30:32Z

@rossjones ok so etag can be dropped. I guess my question here would be to do a progressive enhancement:

Right now: add a simple "cache": "path"
If it looks like storing the hash is pretty useful to several people then we can add that later (this can be backwards compatible by allowing simple string or object as value for cache)

Any thoughts?

@pwalsh one aside is interaction with cache and parts idea #228 - i guess in that cache might be an array but i think we can deal with that in #22 8

rufuspollock · 2016-03-23T11:35:22Z

OK so something like:

cache: a string relative path or url to another "cached" version of the data in this resource. Often used by data package consumers who wish to cache a copy of the data in case the data moves or disappears.

Only remaining question is whether this is a pattern or in formal spec (idea being we may leave a standard pattern and add to spec later if interest is large enough - want to keep spec as streamlined as possible).

Logic for pattern is this more for consumers than publishers (publishers would rarely implement ...).

pwalsh · 2016-07-12T08:58:14Z

@rgrp as mentioned in other issues, I'm partial to this being in the spec.

roll · 2016-08-06T20:19:47Z

Consider data package:

# http://example.com/datapackage.py
resources: [
  {
    path: "http://example.com/data/data.csv",
    cache: "data.csv"
  }
]

What an implementation should do on a concrete user machine after dp = DataPackage('http://example.com/datapackage.py')? If it's an implementation decision how to provide caching then just cacheable flag on dp/resource level will be enough. Because for me {AS_IMPLEMENTOR) the best way looks like to use url hashes for some global cache regardless of cache_path set by data packager. But it's not give a chance to other tools to get access to data files. So it could work only for local data packages or where base_path is passed?

Also there is a problem of cache invalidation.. And storing cached files inside the datapackage could mess with git. Also name cache is a little bit confusing - it looks like it's about optimization. So many questions as I see it.

roll · 2016-08-09T11:01:44Z

Also really related to #250
And I suppose it's important to resolve this both as soon as possible because it affects implementations.

Have we considered path + local_path instead of path + cache? Because cache sounds really like optimization thing..

rufuspollock · 2016-08-11T13:09:28Z

OK, the blocker here will be someone drafting some specific suggested language. I'm +1 on having this - either in core spec or an extension. Language should be drafted as markdown and posted here in a comment or in a gist.

pwalsh · 2016-08-11T15:18:20Z

@roll cache can be local path or url, just like path, so path and local_path are not an option.

pwalsh · 2016-08-11T15:19:42Z

@rgrp I'm away for 2 weeks, but yes it would be great to have a draft. Also, I'm +1 on have cache as an optional property in the core spec, not as an extension.

roll · 2016-08-11T15:39:42Z

@pwalsh
if 'cache` is url - what does it mean for implementations?

+1 for core spec

pwalsh · 2016-08-11T15:47:39Z

@roll some of the info is lost as it is spread out over other issues (eg: #223 (comment) ).

However, bottom line is:

datapackages can have resources on remote servers. remote servers, or the resources themselves, can disappear.
to deal with the case where a remote resource much disappear, some system much keep a copy. how, or what they do with updating this copy, is not a concern for the spec. the spec just says, if you keep a copy, put it on cache
then, implementing code can fallback to cache if path errors

waylonflinn · 2016-10-09T16:33:01Z

👍

pwalsh · 2016-11-10T08:25:35Z

@rgrp and others, any further comments on this?

I note that calling it cache has possibly led to a small amount of confusion above (e.g.: cache invalidation). I don't think we need to change the terminology because of this, rather, we have to ensure it is worded in a way that corresponds with usage in spoken English, rather than as a programming construct.

I'm happy to do the PR on this one.

Any comments from the @frictionlessdata/specs-working-group before I do that?

jpmckinney · 2016-11-11T01:44:47Z

I may be misunderstanding.

It seems to me like any tools reading data packages may want to handle caching in their own way - down to where they store the cached file. Imagine a tool that reads remote CSVs into a database. The database already serves as a cache. There's no need to store the remote CSV as a local file at the path described by cache. What's the harm in letting each tool handle its own cache?

If by 'cache' we mean 'alternative path', then why not make path an array of all the possible paths for that file?

I don't think cache should mean both "dear implementations, please store a copy of the file at this path" and also "please try to read this path if the path doesn't work". Implementations don't have write access to all the remote servers that can be set in cache, for instance.

I also think there's a security risk if the interpretation of cache is that implementations must store a copy of the file at that path.

But maybe I'm confused by this proposal.

pwalsh · 2016-11-13T07:33:29Z

hey @jpmckinney

By cache we mean alternative path.

path as an array can't be used for this, as we are overloading path as an array for data distributed across multiple files #228

But actually, you've just articulated the best and most clear reason why cache can't be part of the spec itself - the entire use case for this, from myself and related from others, is so that implementing platforms can hold their own copy of the source, as a fallback for if the original source becomes unavailable (ignoring how or why it could become unavailable). If cache is exposed as a user-facing part of the spec, then, it prevents platforms from overwriting this key for their internal use.

So, I'll instead present this as a "pattern" for usage, outside of the spec, and consider adding support for it to the python and javascript libraries for usage in platforms implementing the spec. ( cc @akariv @roll @amercader )

roll · 2016-11-17T06:42:18Z

I'm +1 for any spec simplification instead of complication.

rufuspollock · 2016-11-17T11:11:30Z

Removing from v1 milestone as now a pattern item.

pwalsh · 2016-12-19T13:16:06Z

This was done in #331 and will close when it is merged.

[patterns][l]: Add patterns section with patterns for: private properties, caching, languages and more. Fixes #243 Fixes #321 Fixes #190 Fixes #171

danfowler added the spec-datapackage label Feb 4, 2016

pwalsh mentioned this issue Feb 18, 2016

Spec now mandates only and only one of "data", "path" or "url" frictionlessdata/datapackage-py#44

Closed

pwalsh mentioned this issue Mar 7, 2016

Multiple URLs for resources #247

Closed

pwalsh mentioned this issue Mar 20, 2016

Add the cache property when handling CSV-from-URL openspending/openspending#870

Closed

rufuspollock added Ready for PR labels Mar 23, 2016

pwalsh mentioned this issue Jul 12, 2016

Merging path and url for resources #250

Closed

roll added backlog and removed backlog labels Aug 8, 2016

roll self-assigned this Aug 9, 2016

roll added backlog and removed progress labels Aug 11, 2016

roll removed their assignment Aug 11, 2016

roll mentioned this issue Aug 11, 2016

Finalize a few critical to implementations changes to specs frictionlessdata/frictionlessdata.io#267

Closed

5 tasks

roll removed the backlog label Aug 29, 2016

jpmckinney added the blocker label Sep 28, 2016

pwalsh self-assigned this Nov 10, 2016

roll added this to the Version-1 milestone Nov 16, 2016

roll removed the blocker label Nov 16, 2016

rufuspollock removed the spec-datapackage label Nov 16, 2016

rufuspollock mentioned this issue Nov 17, 2016

Create a Patterns / FAQ section #321

Closed

rufuspollock removed the Status: Ready-For-PR label Nov 17, 2016

rufuspollock modified the milestones: Backlog, Version-1 Nov 17, 2016

pwalsh mentioned this issue Dec 11, 2016

Add patterns #331

Merged

rufuspollock closed this as completed in #331 Dec 20, 2016

rufuspollock added a commit that referenced this issue Dec 20, 2016

Merge pull request #331 from frictionlessdata/feature/patterns

4766869

[patterns][l]: Add patterns section with patterns for: private properties, caching, languages and more. Fixes #243 Fixes #321 Fixes #190 Fixes #171

roll added this to Open Knowledge Jun 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Locally caching remote or inline resources via "cache" key #243

Locally caching remote or inline resources via "cache" key #243

danfowler commented Feb 4, 2016

pwalsh commented Feb 4, 2016

aubergene commented Feb 12, 2016

pwalsh commented Mar 7, 2016

pwalsh commented Mar 7, 2016

rossjones commented Mar 7, 2016

rufuspollock commented Mar 7, 2016

pwalsh commented Mar 7, 2016

rossjones commented Mar 7, 2016

rufuspollock commented Mar 8, 2016

rufuspollock commented Mar 23, 2016

pwalsh commented Jul 12, 2016

roll commented Aug 6, 2016 •

edited

Loading

roll commented Aug 9, 2016 •

edited

Loading

rufuspollock commented Aug 11, 2016

pwalsh commented Aug 11, 2016

pwalsh commented Aug 11, 2016

roll commented Aug 11, 2016 •

edited

Loading

pwalsh commented Aug 11, 2016

waylonflinn commented Oct 9, 2016 •

edited

Loading

pwalsh commented Nov 10, 2016

jpmckinney commented Nov 11, 2016

pwalsh commented Nov 13, 2016

roll commented Nov 17, 2016 •

edited

Loading

rufuspollock commented Nov 17, 2016

pwalsh commented Dec 19, 2016

Locally caching remote or inline resources via "cache" key #243

Locally caching remote or inline resources via "cache" key #243

Comments

danfowler commented Feb 4, 2016

pwalsh commented Feb 4, 2016

aubergene commented Feb 12, 2016

pwalsh commented Mar 7, 2016

pwalsh commented Mar 7, 2016

rossjones commented Mar 7, 2016

rufuspollock commented Mar 7, 2016

pwalsh commented Mar 7, 2016

rossjones commented Mar 7, 2016

rufuspollock commented Mar 8, 2016

rufuspollock commented Mar 23, 2016

pwalsh commented Jul 12, 2016

roll commented Aug 6, 2016 • edited Loading

roll commented Aug 9, 2016 • edited Loading

rufuspollock commented Aug 11, 2016

pwalsh commented Aug 11, 2016

pwalsh commented Aug 11, 2016

roll commented Aug 11, 2016 • edited Loading

pwalsh commented Aug 11, 2016

waylonflinn commented Oct 9, 2016 • edited Loading

pwalsh commented Nov 10, 2016

jpmckinney commented Nov 11, 2016

pwalsh commented Nov 13, 2016

roll commented Nov 17, 2016 • edited Loading

rufuspollock commented Nov 17, 2016

pwalsh commented Dec 19, 2016

roll commented Aug 6, 2016 •

edited

Loading

roll commented Aug 9, 2016 •

edited

Loading

roll commented Aug 11, 2016 •

edited

Loading

waylonflinn commented Oct 9, 2016 •

edited

Loading

roll commented Nov 17, 2016 •

edited

Loading