Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Locally caching remote or inline resources via "cache" key #243

Closed
danfowler opened this issue Feb 4, 2016 · 25 comments
Closed

Locally caching remote or inline resources via "cache" key #243

danfowler opened this issue Feb 4, 2016 · 25 comments
Assignees

Comments

@danfowler
Copy link
Contributor

As of 1.0.0-beta.15, only one of path, url, or data is allowed on a given resource. In discussing this, it was suggested that it is useful to have a local path to which to download a remote (url) or inline (data) resource so that non-datapackage-aware tooling can access. As a result of discussion in #223, it has been suggested that a new key, cache, be used on a resource to specify a local path to "cache" remote or inline resources to disk.

Examples:

{
  "url": "http://xyz.com/dataset.csv",
  "cache": "dataset.csv"
}
{
  "data": [
    {"a": "1", "b": "2"}
  ],
  "cache": "dataset.csv"
}

So, when a resource has a url or data key, it SHOULD also have a cache key. An implementation that is reading the data in the data package, should then first check cache for the data, if it exists, and, if not, save the data as specified in url or data to cache. Otherwise, cache acts just like path. When a resource has a path key, an implementation should ignore cache.

WDYT? @morty @amercader @rgrp @pwalsh

@pwalsh
Copy link
Member

pwalsh commented Feb 4, 2016

Big +1 on cache.

Neutral on the description of what implementations should do. What an implementation should do, in this case, is likely highly specific to the requirements of the implementation, and not an issue for the spec to address.

@aubergene
Copy link

How about storing the ETAG or some other hash of the document from the last time it was retrieved?

@pwalsh
Copy link
Member

pwalsh commented Mar 7, 2016

@rgrp I think I'd want to modify this, that cache can be a path or a url. WDYT?

@pwalsh
Copy link
Member

pwalsh commented Mar 7, 2016

@aubergene about ETAG or hash of document serves a different need IMHO. The need for a cache comes from the possibility of a resource hosted elsewhere becoming unavailable.

@rossjones
Copy link

I'd be really keen to support url as an option for the cache, even though it opens up a can of worms where the cached copy (of file or url) could potentially be out of date. Although it complicates thing, I agree with @aubergene that having a content-hash for the cache would be useful in determining if it is out of date - which may or may not be acceptable, but at least the user would know.

@rufuspollock
Copy link
Contributor

OK, this seems fairly sensible and straightforward. Only question is whether cache is an object or a single path value. Object would be something like:

"cache": "path-to-cached-file"

# or cache as object
"cache": {
  "path": ...
  "hash": ...
  "etag / lastCached": timestamp when cache object was last stored ...
}

* `hash`: I am not sure about including a hash on the cache item since, in theory, this easily re-computed and then compared against the source object. However, I'm not using this in practice myself right now so like to here from potential implementors.
* `etag / lastCached`: i'm always wary of these things going out of date (i.e. a chance is made but you don't update the datapackage.json). Wrong info is worse than no info. Again any sense of the value of this would be useful in deciding whether it was worth the additional complexity.

@pwalsh
Copy link
Member

pwalsh commented Mar 7, 2016

  • +1 "cache": "path-to-cached-file"
  • -1 on hash and etag for cache as part of spec.

@rossjones
Copy link

Our recent look at 60k files we have archived showed < 50% had an etag, so it may not actually be that useful.

Agree the hash can be calculated by the client, but where the cache is a URL I need to fetch it first - the useful thing with the hash and lastCached date here is determining if it actually IS out of date. If the hash doesn't match the primary and/or the date doesn't match my expectations, then I know I need to investigate further. If there cache is a naked url, or file path, I have absolutely no idea without processing it whether it may be different than the original data.

@rufuspollock
Copy link
Contributor

@rossjones ok so etag can be dropped. I guess my question here would be to do a progressive enhancement:

  • Right now: add a simple "cache": "path"
  • If it looks like storing the hash is pretty useful to several people then we can add that later (this can be backwards compatible by allowing simple string or object as value for cache)

Any thoughts?

@pwalsh one aside is interaction with cache and parts idea #228 - i guess in that cache might be an array but i think we can deal with that in #22 8

@rufuspollock
Copy link
Contributor

OK so something like:

cache: a string relative path or url to another "cached" version of the data in this resource. Often used by data package consumers who wish to cache a copy of the data in case the data moves or disappears.

Only remaining question is whether this is a pattern or in formal spec (idea being we may leave a standard pattern and add to spec later if interest is large enough - want to keep spec as streamlined as possible).

Logic for pattern is this more for consumers than publishers (publishers would rarely implement ...).

@pwalsh
Copy link
Member

pwalsh commented Jul 12, 2016

@rgrp as mentioned in other issues, I'm partial to this being in the spec.

@roll
Copy link
Member

roll commented Aug 6, 2016

Consider data package:

# http://example.com/datapackage.py
resources: [
  {
    path: "http://example.com/data/data.csv",
    cache: "data.csv"
  }
]

What an implementation should do on a concrete user machine after dp = DataPackage('http://example.com/datapackage.py')? If it's an implementation decision how to provide caching then just cacheable flag on dp/resource level will be enough. Because for me {AS_IMPLEMENTOR) the best way looks like to use url hashes for some global cache regardless of cache_path set by data packager. But it's not give a chance to other tools to get access to data files. So it could work only for local data packages or where base_path is passed?

Also there is a problem of cache invalidation.. And storing cached files inside the datapackage could mess with git. Also name cache is a little bit confusing - it looks like it's about optimization. So many questions as I see it.

@roll roll added backlog and removed backlog labels Aug 8, 2016
@roll roll self-assigned this Aug 9, 2016
@roll
Copy link
Member

roll commented Aug 9, 2016

Also really related to #250
And I suppose it's important to resolve this both as soon as possible because it affects implementations.


Have we considered path + local_path instead of path + cache? Because cache sounds really like optimization thing..

@rufuspollock
Copy link
Contributor

OK, the blocker here will be someone drafting some specific suggested language. I'm +1 on having this - either in core spec or an extension. Language should be drafted as markdown and posted here in a comment or in a gist.

@pwalsh
Copy link
Member

pwalsh commented Aug 11, 2016

@roll cache can be local path or url, just like path, so path and local_path are not an option.

@pwalsh
Copy link
Member

pwalsh commented Aug 11, 2016

@rgrp I'm away for 2 weeks, but yes it would be great to have a draft. Also, I'm +1 on have cache as an optional property in the core spec, not as an extension.

@roll
Copy link
Member

roll commented Aug 11, 2016

@pwalsh
if 'cache` is url - what does it mean for implementations?

+1 for core spec

@pwalsh
Copy link
Member

pwalsh commented Aug 11, 2016

@roll some of the info is lost as it is spread out over other issues (eg: #223 (comment) ).

However, bottom line is:

  • datapackages can have resources on remote servers. remote servers, or the resources themselves, can disappear.
  • to deal with the case where a remote resource much disappear, some system much keep a copy. how, or what they do with updating this copy, is not a concern for the spec. the spec just says, if you keep a copy, put it on cache
  • then, implementing code can fallback to cache if path errors

@waylonflinn
Copy link

waylonflinn commented Oct 9, 2016

👍

@pwalsh pwalsh self-assigned this Nov 10, 2016
@pwalsh
Copy link
Member

pwalsh commented Nov 10, 2016

@rgrp and others, any further comments on this?

I note that calling it cache has possibly led to a small amount of confusion above (e.g.: cache invalidation). I don't think we need to change the terminology because of this, rather, we have to ensure it is worded in a way that corresponds with usage in spoken English, rather than as a programming construct.

I'm happy to do the PR on this one.

Any comments from the @frictionlessdata/specs-working-group before I do that?

@jpmckinney
Copy link

I may be misunderstanding.

It seems to me like any tools reading data packages may want to handle caching in their own way - down to where they store the cached file. Imagine a tool that reads remote CSVs into a database. The database already serves as a cache. There's no need to store the remote CSV as a local file at the path described by cache. What's the harm in letting each tool handle its own cache?

If by 'cache' we mean 'alternative path', then why not make path an array of all the possible paths for that file?

I don't think cache should mean both "dear implementations, please store a copy of the file at this path" and also "please try to read this path if the path doesn't work". Implementations don't have write access to all the remote servers that can be set in cache, for instance.

I also think there's a security risk if the interpretation of cache is that implementations must store a copy of the file at that path.

But maybe I'm confused by this proposal.

@pwalsh
Copy link
Member

pwalsh commented Nov 13, 2016

hey @jpmckinney

By cache we mean alternative path.

path as an array can't be used for this, as we are overloading path as an array for data distributed across multiple files #228

But actually, you've just articulated the best and most clear reason why cache can't be part of the spec itself - the entire use case for this, from myself and related from others, is so that implementing platforms can hold their own copy of the source, as a fallback for if the original source becomes unavailable (ignoring how or why it could become unavailable). If cache is exposed as a user-facing part of the spec, then, it prevents platforms from overwriting this key for their internal use.

So, I'll instead present this as a "pattern" for usage, outside of the spec, and consider adding support for it to the python and javascript libraries for usage in platforms implementing the spec. ( cc @akariv @roll @amercader )

@roll roll added this to the Version-1 milestone Nov 16, 2016
@roll roll removed the blocker label Nov 16, 2016
@roll
Copy link
Member

roll commented Nov 17, 2016

I'm +1 for any spec simplification instead of complication.

@rufuspollock
Copy link
Contributor

Removing from v1 milestone as now a pattern item.

@rufuspollock rufuspollock modified the milestones: Backlog, Version-1 Nov 17, 2016
@pwalsh pwalsh mentioned this issue Dec 11, 2016
@pwalsh
Copy link
Member

pwalsh commented Dec 19, 2016

This was done in #331 and will close when it is merged.

rufuspollock added a commit that referenced this issue Dec 20, 2016
[patterns][l]: Add patterns section with patterns for: private properties, caching, languages and more.

Fixes #243
Fixes #321
Fixes #190
Fixes #171
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

No branches or pull requests

8 participants