Paths as URIs #243

Open; wants to merge 44 commits into base: main



TomNicholas
Member

@TomNicholas TomNicholas commented Oct 2, 2024

This PR closes #242 at the data model level - all paths are coerced to absolute URIs (i.e. file:///directory/test.nc or s3://bucket/test.nc) as they go into the Manifest.

As this forbids constructing manifests using relative paths, it requires minor changes to many tests (e.g. test.nc -> /test.nc). It will also require slightly more invasive changes to any tests that involve kerchunk references.
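A minimal sketch of the coercion described above, using only the standard library. The function name `coerce_to_uri` and the `REMOTE_SCHEMES` allow-list are hypothetical and are not taken from this PR; the sketch also assumes POSIX-style local paths and does no percent-encoding.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical allow-list of remote schemes; the PR text does not define one.
REMOTE_SCHEMES = {"s3", "gs", "http", "https"}


def coerce_to_uri(path: str) -> str:
    """Return an absolute URI for `path`, or raise if the path is relative."""
    scheme = urlparse(path).scheme
    if scheme == "file" or scheme in REMOTE_SCHEMES:
        # Already a URI with an explicit scheme: pass it through unchanged.
        return path
    if not PurePosixPath(path).is_absolute():
        raise ValueError(f"relative paths are not allowed in a manifest: {path!r}")
    # Absolute local path: prefix with the file scheme (empty authority).
    return "file://" + path
```

Under these assumptions, `/directory/test.nc` becomes `file:///directory/test.nc`, `s3://bucket/test.nc` is left alone, and a bare `test.nc` is rejected, which is why the tests need the `test.nc -> /test.nc` change.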

Sub-tasks:

@TomNicholas TomNicholas marked this pull request as ready for review October 19, 2024 02:33
@TomNicholas
Member Author

What should we do with the paths in kerchunk references? Are they always meant as absolute? I guess we should assume they are absolute, unless they have ./ or ../ at the start? Would fsspec ever produce relative paths like that?

cc @martindurant
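The heuristic floated in the question above can be sketched in a few lines. The name `looks_relative` is hypothetical; this only illustrates the proposal and is not how kerchunk or fsspec classify paths.

```python
def looks_relative(path: str) -> bool:
    """Proposed heuristic: only an explicit ./ or ../ prefix marks a path as relative."""
    return path.startswith(("./", "../"))
```

Note that a bare name such as `test.nc` would be treated as absolute under this rule, which is the ambiguous case the question is really about.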

@martindurant
Member

> Are they always meant as absolute?

They are always meant "as interpreted by the target filesystem". The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required. This means that relative paths do work if the target happens to be the local filesystem (file://), but of the other filesystems, I think only ssh supports this concept at all. I would not expect this to be meaningful for basically any practical case.

Note that the dir:// filesystem adds prefixes to URLs for any filesystem, if that's useful at all.

@martindurant
Member

(I am happy to require absolute paths even if it makes some tests slightly more verbose)

@TomNicholas
Member Author

Thanks @martindurant !

> The nature of that filesystem might be implied by the protocol of a path alone, but commonly additional arguments are also required.

But is the nature of the filesystem explicitly recorded in the kerchunk references format anywhere? Obviously if the prefix is explicit (e.g. s3://) then you know but otherwise?

> that relative paths do work

> I guess we should assume they are absolute, unless they have ./ or ../ at the start?

Would this approach work then?

> (I am happy to require absolute paths even if it makes some tests slightly more verbose)

This might be helpful if the above approach doesn't work.

@martindurant
Member

> is the nature of the filesystem explicitly recorded in the kerchunk references format

No. The original intention was to have these in the "templates", but in practice, the remote_protocol, remote_options and fss arguments to ReferenceFileSystem are used (and often encoded in Intake prescriptions) in cases of ambiguity.
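To illustrate why a caller-supplied protocol is needed, here is a rough standard-library sketch of qualifying a scheme-less reference path. The function `qualify` is hypothetical and is not fsspec's implementation; its `remote_protocol` parameter merely borrows the name of the ReferenceFileSystem argument mentioned above, and local paths are assumed to be absolute.

```python
from typing import Optional
from urllib.parse import urlparse


def qualify(path: str, remote_protocol: Optional[str] = None) -> str:
    """Attach a scheme to a reference path that does not carry one itself."""
    if urlparse(path).scheme:
        # Self-describing, e.g. "s3://bucket/key": nothing to add.
        return path
    if remote_protocol is None:
        # The references alone do not say which filesystem is meant.
        raise ValueError(f"cannot tell which filesystem {path!r} refers to")
    if remote_protocol == "file":
        # Assumes an absolute local path, so the leading slash is kept.
        return "file://" + path
    return f"{remote_protocol}://{path}"
```

The same string, such as `bucket/test.nc`, resolves differently depending on the protocol passed in, which is the ambiguity the thread is describing.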

@TomNicholas TomNicholas mentioned this pull request Nov 15, 2024
21 tasks
@TomNicholas
Member Author

TomNicholas commented Nov 22, 2024

Okay, important question: Are we trying to support manifests with http URLs in them? Right now we have some tests which create virtual datasets containing http:// URLs as the path, e.g.:

virtualizarr/tests/test_backend.py::TestReadFromURL::test_read_from_url[netcdf3-https://github.com/pydata/xarray-data/raw/master/air_temperature.nc-HDF5VirtualBackend]

But these tests (cc @scottyhq) don't actually try to read the data back as loadable xarray variables. AFAIK Icechunk could not read data from a manifest containing http URLs (cc @mpiannucci), but fsspec presumably could?

Handling this case in the manifest validation is possible but a bit annoying, as cloudpathlib doesn't support http paths (drivendataorg/cloudpathlib#455; at least not yet, see drivendataorg/cloudpathlib#468), and pathlib will incorrectly conclude that the http URL is a relative POSIX path.
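The pathlib behaviour mentioned above is easy to reproduce with the standard library alone. This is only an illustration of the pitfall, with a scheme check via `urllib.parse` as one possible workaround; it is not the validation code added in this PR.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

url = "https://github.com/pydata/xarray-data/raw/master/air_temperature.nc"

# pathlib knows nothing about URL schemes, so the URL does not start with "/"
# and is classified as a relative POSIX path.
looks_absolute = PurePosixPath(url).is_absolute()

# Inspecting the scheme first avoids that misclassification.
scheme = urlparse(url).scheme
is_http = scheme in {"http", "https"}
```

Here `looks_absolute` is `False` while `is_http` is `True`, so any validation that relies on pathlib alone would wrongly reject the URL as relative.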

@TomNicholas
Member Author

> AFAIK Icechunk could not read data from a manifest containing http URLs

Apparently Icechunk plans to support HTTP.

> Handling this case in the manifest validation is possible

I added support for HTTP in fefab90

@scottyhq
Contributor

> Are we trying to support manifests with http URLs in them?

> I added support for HTTP in fefab90

If I'm following correctly, I'd say yes! There are lots of datasets out there that are not in cloud buckets, but are on servers that support http range requests. Agreed that it would be nice if cloudpathlib handled http:// paths
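As a rough illustration of what an http range request for one chunk looks like, the sketch below builds (but does not send) such a request with the standard library. The helper `range_request` is hypothetical, and the idea that a chunk is described by a byte offset and length is an assumption about the manifest entries rather than something stated in this thread.

```python
from urllib.request import Request


def range_request(url: str, offset: int, length: int) -> Request:
    """Build a GET request for `length` bytes of `url` starting at byte `offset`."""
    if offset < 0 or length <= 0:
        raise ValueError("offset must be >= 0 and length must be > 0")
    # HTTP byte ranges are inclusive at both ends.
    last = offset + length - 1
    return Request(url, headers={"Range": f"bytes={offset}-{last}"})


req = range_request("https://example.com/data.nc", offset=100, length=50)
range_header = req.get_header("Range")
```

A server that supports range requests answers with status 206 (Partial Content) and only the requested bytes, which is what makes reading individual chunks from a plain http server feasible.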

@TomNicholas TomNicholas added the Kerchunk (relating to the kerchunk library / specification itself) and DMR++ labels Nov 23, 2024
Labels: DMR++, internals, Kerchunk (relating to the kerchunk library / specification itself)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Forbid relative paths, and use file URI scheme internally?
5 participants