
Implement fsspec-based UrlOperation class #210

Open

mih opened this issue Jan 12, 2023 · 1 comment
mih commented Jan 12, 2023

This is used by https://github.com/datalad/datalad-fuse

The aim here would be simpler: wrap fsspec.open() to provide the standard operations download, upload, sniff, and possibly delete.

This should be relatively simple in principle. More challenging would be provisioning the correct credentials.
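A minimal sketch of what such a wrapper could look like (the class name, method names, and chunked streaming are assumptions for illustration, not the actual datalad-next interface):

import fsspec
from fsspec.core import url_to_fs

class FsspecUrlOperations:
    """Sketch of URL operations backed by fsspec."""

    def sniff(self, url):
        # Open the remote and report properties; many fsspec file
        # objects expose a .size attribute.
        with fsspec.open(url, 'rb') as f:
            return {'content-length': f.size}

    def download(self, from_url, to_path):
        # Stream remote content to a local file in chunks.
        with fsspec.open(from_url, 'rb') as src, open(to_path, 'wb') as dst:
            while chunk := src.read(65536):
                dst.write(chunk)

    def upload(self, from_path, to_url):
        # Stream a local file to the remote URL.
        with open(from_path, 'rb') as src, fsspec.open(to_url, 'wb') as dst:
            while chunk := src.read(65536):
                dst.write(chunk)

    def delete(self, url):
        # Deletion lives on the filesystem object, not the file handle.
        fs, path = url_to_fs(url)
        fs.rm(path)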

The list of supported protocols is impressively long:

>>> print(fsspec.available_protocols())
['file', 'memory', 'dropbox', 'http', 'https', 'zip', 'tar', 'gcs', 
 'gs', 'gdrive', 'sftp', 'ssh', 'ftp', 'hdfs', 'arrow_hdfs', 
 'webhdfs', 's3', 's3a', 'wandb', 'oci', 'asynclocal', 'adl', 'abfs',
 'az', 'cached', 'blockcache', 'filecache', 'simplecache', 'dask', 
 'dbfs', 'github', 'git', 'smb', 'jupyter', 'jlab', 'libarchive', 
 'reference', 'generic', 'oss', 'webdav', 'dvc', 'root']

Individual protocols may require additional software dependencies to be installed. But even the list of built-ins is long: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations

Demo:

>>> import fsspec
>>> of = fsspec.open('github://datalad:datalad@/README.md')
>>> fp = of.open()
>>> fp.size
23922
>>> fp.read(125)
b'     ____            _     ....

This is pretty much all that is needed to implement basic support for all unauthenticated transports.

For AnyUrlOperations we would need to come up with a way to identify URLs that this handler should or could handle.
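One conceivable approach (a sketch only, not a settled design): derive the protocol(s) from the URL, including fsspec's '::'-chaining syntax, and check them against fsspec's protocol registry:

import fsspec

def can_handle(url: str) -> bool:
    # Chained URLs like 'zip://member::sftp://host/file' name the
    # outermost protocol first; split on '::' and inspect each segment.
    protocols = {seg.split('://', 1)[0] for seg in url.split('::') if '://' in seg}
    return bool(protocols) and protocols <= set(fsspec.available_protocols())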

This IO support could be the basis for RemoteArchive (#184).

But even without it, this tech enables partial archive content access.

Here is a demo. big.zip is 5 GB in size, and contains a 4-byte text file and a large binary blob. Here I am opening the text file from inside the archive over an SSH connection:

>>> timeit fsspec.open('zip://dummyzip/tiny.txt::sftp://myhost/home/mih/big.zip').open().read()
629 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> fsspec.open('zip://dummyzip/tiny.txt::sftp://juseless.inm7.de/home/mih/big.zip').open().read()
b'123\n'

Size reporting is still possible, hence progress reporting would be possible here too!

>>> fsspec.open('zip://dummyzip/fmriprep-20.0.3.simg::sftp://juseless.inm7.de/home/mih/big.zip').open().size
4873539615
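
A sketch of how the reported size could drive progress reporting (the chunk size and print-based reporting are just illustrative):

import fsspec

def download_with_progress(url, to_path, chunk_size=1 << 20):
    # Stream the remote file locally, using the reported total size
    # to compute a completion percentage along the way.
    with fsspec.open(url, 'rb') as src, open(to_path, 'wb') as dst:
        total = src.size  # may be None for some backends
        done = 0
        while chunk := src.read(chunk_size):
            dst.write(chunk)
            done += len(chunk)
            if total:
                print(f'\r{done / total:.0%}', end='', flush=True)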
mih commented Jan 16, 2023

We need to support additional settings for S3-compatible stores, because the respective URLs won't carry this information. This includes, at minimum, an endpoint_url to be passed via client_kwargs: https://s3fs.readthedocs.io/en/latest/#s3-compatible-storage

Such an endpoint_url could also be used as a means to discover a related credential, likely in (optional) combination with a bucket name. An absent endpoint_url could be interpreted as AWS S3. A distinction by endpoint and bucket name would also remove the need for the credential-type inflation presently done in datalad-core (e.g., aws-s3, nda-s3); we could simply use an s3 credential for all of them.

Such an s3 credential is essentially a user/password combo. It is probably not very useful to handle session tokens due to their lifetime limitations.
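
A sketch of how such settings could be passed through to fsspec/s3fs (the bucket, endpoint URL, and credential values below are placeholders):

import fsspec

# Placeholder values; a real implementation would look these up in a
# credential store keyed by endpoint (and optionally bucket name).
of = fsspec.open(
    's3://mybucket/path/to/object',
    key='ACCESS_KEY_ID',         # the 'user' part of the s3 credential
    secret='SECRET_ACCESS_KEY',  # the 'password' part
    client_kwargs={'endpoint_url': 'https://s3.example.com'},
)
with of as f:
    data = f.read()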
