Merging path and url for resources #250

rufuspollock · 2016-03-07T11:52:28Z

At the moment we have path and url. I originally had this to make it super easy for tool implementors (no lists of web protocols to match against `http://, https://, ftp://, etc).

At the same time it adds cognitive complexity to the spec and for publishers and confusion about whether one could use both #223 #232

The approach would be probably going forward just to have path and define like:

the location of the resource specified either as a fully qualified url or a unix-style path.
Data Packages may be found online or locally on disk and path should be interpreted
relative to that context (on disk it is a file path, online it is a url path). 

If a relative path is used it is interpreted relative to the base directory for the data
package (the parent directory of the datapackage.json).

Examples:

http://ex.datapackages.org/big-csv/my-big.csv
/examples/my-csv.csv  # absolute path
examples/my-csv.csv   # relative path

The text was updated successfully, but these errors were encountered:

pwalsh · 2016-03-07T11:58:45Z

+1.

Easy enough to implement code and check if we have a protocol scheme or just a plain old path to file.

rufuspollock · 2016-03-23T11:21:46Z

Questions:

Do we change to a new name e.g. location - this is breaking!
How do implementors distinguish fully qualified urls from (relative) paths?

My sense if we do do this, approach would be:

Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
Deprecate url
Require implementors to do some regex'ing to distinguish urls from relative paths ...

Aside:

Must paths always be relative to the data package base directory to avoid security risks

ivotron · 2016-07-09T07:30:04Z

For use cases where the data can't be fetched (e.g. too big), one would still like to retrieve a datapackage.json that has a URL pointing to where the data resides. Would this issue (merging path and URL) cover this use case?

pwalsh · 2016-07-10T04:59:23Z

@ivotron yes, that issue is covered right now, and would not be removed by the proposal in this issue.

pwalsh · 2016-07-12T08:53:22Z

So @rgrp I'm in agreement with your points:

Merge to path and allow it to be a full url (makes semantic sense: "path to the resource file")
Deprecate url
Require implementors to do some regex'ing to distinguish urls from relative paths

Additionally, cache should be implemented, and follow the same rules. (worth mentioning here for the possibility that users were relying on presence of both url and path to keep local copies of remote resources.)

@roll @amercader @akariv @vitorbaptista you are all writing code around this in various places. What do you think?

roll · 2016-07-12T09:13:40Z

tabulator has just path for anything and OPTIONAL scheme and format to explicitly describe the path kind.

vitorbaptista · 2016-07-15T18:01:20Z

Sorry if this was discussed before. We've agreed on removing the base path on #232, however there are some relative paths being described here. How does the path resolution would work with this proposed change? Consider both the cases when a datapackage is remote (e.g. http://somewhere.com/datapackage.json) and local (e.g. /home/vitor/data/datapackage.json)

pwalsh · 2016-07-18T18:42:57Z

@vitorbaptista I think, always relative to the directory that contains datapackage.json for local filepaths, and, remote resources should always be fully qualified URLs.

vitorbaptista · 2016-07-19T09:27:38Z

So, imagine we have this datapackage.json:

{
    "name": "foo",
    "resources": [
        { "path": "data/foo.csv" }
    ],
}

If this datapackage is at http://example.org/datapackage.json, its resource would be at http://example.org/data/foo.csv. If it's in /home/vitor/foo/datapackage.json, the resource would be at /home/vitor/foo/data/foo.csv. All good. Now, consider this other datapackage:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

This resource then doesn't depend on the path for the datapackage itself. Even if the datapackage.json lives in /home/vitor/bar/datapackage.json, its resource would still live at http://example.org/bar/data/bar.csv.

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Am I missing something?

pwalsh · 2016-07-19T09:29:48Z

@vitorbaptista those examples are correct as I understand it, but still:

If those examples are correct, we will always have to download the resources from their URLs, even if the datapackage is inside a ZIP file.

Not quite sure what you are getting it here.

rufuspollock · 2016-07-19T11:38:13Z

@pwalsh relative paths work for remote urls too - that's really important. In both cases you'd compute relative to the datapackage.json base directory.

@vitorbaptista your examples are correct and 👍 to @pwalsh comment.

Note: I'm really thinking explaining this stuff will go in the patterns or FAQs section on the fd.io site (we could add to the spec but it gets kind of bulky).

vitorbaptista · 2016-07-19T14:52:06Z

@pwalsh Consider the second example:

{
    "name": "bar",
    "resources": [
        { "path": "http://example.org/bar/data/bar.csv" }
    ],
}

Also consider that we have the files in the local filesystem:

/home/vitor/bar/datapackage.json
/home/vitor/bar/data/datapackage.json

Then I open python and try to load this, as such:

import datapackage
dp = datapackage.DataPackage('/home/vitor/bar/datapackage.json')

Where should datapackage.DataPackage look for the datapackage's resources?

Going through its resources array, it'll only find { "path": "http://example.org/bar/data/bar.csv" }. As far as I can see, there's no (easy) way for the datapackage library to know that this resource also lives in ./data/bar.csv relative to the datapackage.json, so it'll end up downloading it from the remote URL even though we have a local copy.

rufuspollock · 2016-07-19T17:28:51Z

@vitorbaptista it should load the value of path which is the url.

vitorbaptista · 2016-07-19T17:49:23Z

@rgrp So if a datapackage.json has a resource with a full URL in its path, the resource should always be downloaded from the Internet, even if we have the data locally, right?

Sounds fine to me, just making sure this was the intended behaviour.

rufuspollock · 2016-07-19T18:07:29Z

@vitorbaptista yes.

roll · 2016-08-05T09:31:48Z

format field is already available so what about scheme field? Sometimes there could be a need to set a scheme explicitly. And implementations could be able to pass this path,format,scheme as it is to tabulator.

rufuspollock · 2016-08-09T09:09:53Z

OK, I think this is ready to go in. It is a breaking change so I think I will need to add an actual warning to the spec mentioning that to support data packages created under earlier version of the spec implementors should convert url to path.

Welcome final +1 / -1 as this is a significant change (please put these as explicit comments rather than emoticons on this comment!)

/cc @pwalsh @paulfitz @danfowler @vitorbaptista @Floppy @jpmckinney ...

akariv · 2016-08-09T09:39:13Z

Sorry for joining this discussion so late, but I seem to be missing something. I realise that this PR is the result of very long discussions in various threads - but when trying to compare the proposed change vs. the current state, I get this:

In the current state, we have:

url - absolute path to resource
path - relative path to resource
(only one of these attributes can appear)

In the proposed state, we will have

path - relative or absolute path to resource
cache - relative path to resource
(both of these attributes can appear together)

So it looks like we've introduced some ambiguity to the spec, complicated the implementation (which needs to detect network schemas), and it's also a breaking change. Plus it doesn't feel to me that the original goal of reducing cognitive complexity is being achieved.

Just an idea - If we allowed path and url to co-exist, and also added a recommended resolving order (path before url) and a few examples and common use cases to the spec, we might be able to achieve the same functionality without breaking the spec or burdening implementors.

roll · 2016-08-09T09:57:40Z

Do we handle cache in this proposal?

akariv · 2016-08-09T10:12:00Z

Not in this specific change, but based on @pwalsh 's comment I understand
that this is the final intention.

On Tue, 9 Aug 2016 at 12:58 roll [email protected] wrote:

Do we handle cache in this proposal?

—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#250 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AAQMde9s0_mS_FVYVv9eiv0Qfm-CaiH9ks5qeE8WgaJpZM4HqtgY
.

rufuspollock · 2016-08-09T14:21:28Z

@akariv whilst it is conceptually simpler for the "experts", for many users there is confusion about which of the two to use and when to use them (or whether both can be used at once - see the issues referenced in the description). Also people get confused whether to use path or url for stuff online, or whether they can use url offline. Having one value I hope will reduce this confusion a bit.

It is true that consumers of data packages (esp tools) may need to do a bit more work: they now need to work out when they have a full qualified url vs a path. However, for publishers it is simpler and the overall spec is simpler.

rufuspollock · 2016-09-28T06:37:30Z

OK. I think we want to move this forward.

@akariv i see your point about path being guaranteed to be relative. The challenge is people end up using both url and path sometimes and there is some confusion. As someone actively using this in the wild your views are very important so please reiterate if this would be a real issue in your opinion.
cache is unrelated to this change IMO. It may happen but it is not a dependency.

AGREED: We are going to implement this.

Question is what we do for backwards compatibility. I suggest we explicitly state that clients should remap url to path if path not present. Otherwise behaviour is undefined and should raise an error.

If people are happy i will start a pull request so people can comment on the changes.

roll · 2016-09-28T08:47:46Z

@rgrp
awesome

No plans to introduce scheme keyword in terms of path=[scheme]://path-core.[format] to have an explicit ability to force exact schemes?

rufuspollock · 2016-09-28T10:05:36Z

@roll what is use case for implementors? How hard is it to work out scheme.

roll · 2016-09-28T11:14:54Z

@rgrp
Scheme is inferrable from path with no problem so I mean optional explicit schema options as we have for format.

tabulator from born support explicit schema (loading) and format (parsing) options. But it's mostly been done because of some nice symmetry in schema-format parts of path (an url analogy). But also on implementation level it's important to have ability to ensure what loader for file will be used. For example goodtables using tabulator.Stream('path', scheme='http') is sure this call is safe. Explicitly saying the source is remote. This approach something like other flavor of path-url separation approach. Solving to same problem.

But you're right for now I don't know a real use case for it on a spec level (may be @danfowler knows when publishers need to ensure path is ONLY local or remote) -> without a use case no reason to consider it.

roll · 2016-09-28T11:30:11Z

Just to clarify what I mean - consider there is a filesystem supporting many letters drive names - path: drive://path.csv. In this case implantation just fail with UnknownScheme error but if publisher will provide path: drive://path.csv, scheme: file it will be correctly opened as a local file. I suppose thing to investigate in a future.

waylonflinn · 2016-10-09T16:19:07Z

👍

Since the intent is to modify the spec with this change (and I'm working on the relevant code today), I'm going to go ahead and do this in Dataship.

waylonflinn · 2016-10-09T16:44:09Z

I have a question about an edge case. Does the following seem correct?

When a path doesn't have a protocol descriptor but starts with /, it MUST be treated as an absolute path referring to the root of the resource from which the datapackage is accessed.

Example

datapackage at http://www.example.com/user/foo/datapackage.json
contains:

{
  "name" : "foo"
  "resource" : [
    {
      "name" : "bar"
      "path" : "/data/bar.csv"
    }
  ]

The resource bar MUST be found at http://www.example.com/data/bar.csv

rufuspollock · 2016-10-10T09:12:57Z

@waylonflinn great question - and i've actually worked on a draft here and came up with exactly this case. I'm coming to the view that you do not allow absolute paths without a scheme - only relative paths (and also no ../). So basically any path is:

EITHER a url
OR: a relative path (which is interpreted relative to the directory in which datapackage.json is found

This also addresses security concerns.

roll · 2016-10-10T09:23:47Z

HTTP, zip etc uses / as a pointer to a root of container. Is it an option also for us? If / means datapackage.json dir - no security problems at all (just ban .. or getting out of dp dir).

waylonflinn · 2016-10-10T12:45:45Z

@rgrp Thanks for giving this some thought and coming up with a reasonable answer.

Having the absolute path (/) available definitely helps my use case. Is it possible to change this?

rufuspollock · 2016-10-16T11:36:34Z

@waylonflinn really useful feedback. BTW why do you want "root paths" i.e. /...?

Lesson for me: people definitely want to retain allowing / as a starter in paths. I'm ok with that.

Does anyone want to allow ../ paths? I think not though it could be a bit complex to start having these rules in the spec. Maybe more of a thing for "notes for implementors".

For security conscious implementors: they can add a switch to disallow all absolute and ../ paths.

roll · 2016-10-19T06:44:53Z

In my opinion a secure recommendation for implementors will be option to enable absolute/dotted pathes disabled by default

pwalsh · 2016-11-10T08:21:17Z

@rgrp do you have a PR in the wings for this? I think the wording for the spec should be quite straight forward. I think we should also have a set of clear examples alongside that that addresses some of the above points.

Again here, I think in the various places this has been discussed, there is general agreement that it simplifies the prep of a data package for a publisher, and reduces ambiguity in multiple properties that do essentially the same thing, at the relatively small cost of implementing code needing to handle the file pointer disambiguation (fs vs http).

Any comments from the @frictionlessdata/specs-working-group before we start the PR process?

rufuspollock · 2016-11-17T12:43:52Z

OK, here's a specific proposal for the text change:

NOTE: we will disallow absolute paths like /... and parent relative paths ../ for security reasons and simplicity of implementation.

Required Fields - Data Location

Resource information MUST contain a property describing the location of the
data associated to this resource.

Data associated to a resource can be located online or locally on disk or
inline in the datapackage.json itself.

The location of resource data MUST be specified by the presence of one
(and only one) of these two properties:

path: for data in files online or locally.
data: for data inline in the datapackage.json descriptor itself.

Data in Files - `path`

path the location of the data specified either as a fully qualified url or a relative
unix-style path ('/' as separator). A relative path should be interpreted
relative relative to the directory or URL in which the descriptordatapackage.json listing this resource resides.

Examples:

# fully qualified url
"path": "http://ex.datapackages.org/big-csv/my-big.csv"

# relative path
"path": "examples/my-csv.csv"

`data`: (inline) a field containing the data directly inline in the
  `datapackage.json` file. Further details below.

Inline Data - `data`

...

rufuspollock · 2016-11-18T15:42:24Z

@waylonflinn note we are planning to disallow / at the start of paths as paths must be relative or fully qualified url. Do you have any comments on this - i note you had required / being permitted for URLs. If you could provide details on your use case that would be valuable.

…rictionlessdata#250.

danfowler added the spec-datapackage label Jul 1, 2016

rufuspollock added the Ready for PR label Jul 19, 2016

roll mentioned this issue Aug 6, 2016

Finalize a few critical to implementations changes to specs frictionlessdata/frictionlessdata.io#267

Closed

5 tasks

roll added backlog and removed backlog labels Aug 8, 2016

roll modified the milestone: specs-v1 Aug 8, 2016

roll added progress and removed priority labels Aug 9, 2016

roll mentioned this issue Aug 9, 2016

Locally caching remote or inline resources via "cache" key #243

Closed

roll added backlog and removed progress labels Aug 11, 2016

roll removed their assignment Aug 11, 2016

roll removed the backlog label Aug 29, 2016

rufuspollock modified the milestone: Current Sep 27, 2016

jpmckinney added the blocker label Sep 28, 2016

roll removed the blocker label Nov 16, 2016

rufuspollock self-assigned this Nov 17, 2016

rufuspollock added a commit to rufuspollock/fd-specs that referenced this issue Nov 28, 2016

[dp][xs]: addendum to last commit to remove unnecessary text - refs f…

d653bdf

…rictionlessdata#250.

rufuspollock closed this as completed in 2aab215 Nov 28, 2016

roll added this to Open Knowledge Jun 27, 2023

Merging path and url for resources #250

Merging path and url for resources #250

Comments

rufuspollock commented Mar 7, 2016 • edited Loading

pwalsh commented Mar 7, 2016

rufuspollock commented Mar 23, 2016

ivotron commented Jul 9, 2016

pwalsh commented Jul 10, 2016

pwalsh commented Jul 12, 2016

roll commented Jul 12, 2016 • edited Loading

vitorbaptista commented Jul 15, 2016

pwalsh commented Jul 18, 2016

vitorbaptista commented Jul 19, 2016

pwalsh commented Jul 19, 2016

rufuspollock commented Jul 19, 2016

vitorbaptista commented Jul 19, 2016

rufuspollock commented Jul 19, 2016

vitorbaptista commented Jul 19, 2016

rufuspollock commented Jul 19, 2016

roll commented Aug 5, 2016

rufuspollock commented Aug 9, 2016

akariv commented Aug 9, 2016

roll commented Aug 9, 2016

akariv commented Aug 9, 2016

rufuspollock commented Aug 9, 2016

rufuspollock commented Sep 28, 2016

roll commented Sep 28, 2016 • edited Loading

rufuspollock commented Sep 28, 2016

roll commented Sep 28, 2016 • edited Loading

roll commented Sep 28, 2016 • edited Loading

waylonflinn commented Oct 9, 2016 • edited Loading

waylonflinn commented Oct 9, 2016 • edited Loading

Example

rufuspollock commented Oct 10, 2016

roll commented Oct 10, 2016 • edited Loading

waylonflinn commented Oct 10, 2016

rufuspollock commented Oct 16, 2016

roll commented Oct 19, 2016

pwalsh commented Nov 10, 2016

rufuspollock commented Nov 17, 2016 • edited Loading

Required Fields - Data Location

Data in Files - path

Inline Data - data

rufuspollock commented Nov 18, 2016

rufuspollock commented Mar 7, 2016 •

edited

Loading

roll commented Jul 12, 2016 •

edited

Loading

roll commented Sep 28, 2016 •

edited

Loading

roll commented Sep 28, 2016 •

edited

Loading

roll commented Sep 28, 2016 •

edited

Loading

waylonflinn commented Oct 9, 2016 •

edited

Loading

waylonflinn commented Oct 9, 2016 •

edited

Loading

roll commented Oct 10, 2016 •

edited

Loading

rufuspollock commented Nov 17, 2016 •

edited

Loading

Data in Files - `path`

Inline Data - `data`