
Implement fsspec-based UrlOperation class #210

Open

mih opened this issue Jan 12, 2023 · 1 comment
mih commented Jan 12, 2023

This is used by https://github.com/datalad/datalad-fuse

The aim here would be simpler: wrap fsspec.open() to provide the standard operations download, upload, sniff, and possibly delete.

This should be relatively simple in principle. More challenging would be provisioning the correct credentials.
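A minimal sketch of what such a wrapper could look like (the class name, method names, and chunked streaming are assumptions for illustration, not the actual datalad-next interface):

import fsspec
from fsspec.core import url_to_fs

class FsspecUrlOperations:
    """Sketch of URL operations backed by fsspec."""

    def sniff(self, url):
        # Open the remote and report properties; many fsspec file
        # objects expose a .size attribute.
        with fsspec.open(url, 'rb') as f:
            return {'content-length': f.size}

    def download(self, from_url, to_path):
        # Stream remote content to a local file in chunks.
        with fsspec.open(from_url, 'rb') as src, open(to_path, 'wb') as dst:
            while chunk := src.read(65536):
                dst.write(chunk)

    def upload(self, from_path, to_url):
        # Stream a local file to the remote URL.
        with open(from_path, 'rb') as src, fsspec.open(to_url, 'wb') as dst:
            while chunk := src.read(65536):
                dst.write(chunk)

    def delete(self, url):
        # Deletion lives on the filesystem object, not the file handle.
        fs, path = url_to_fs(url)
        fs.rm(path)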

The list of supported protocols is impressively long:

>>> print(fsspec.available_protocols())
['file', 'memory', 'dropbox', 'http', 'https', 'zip', 'tar', 'gcs', 
 'gs', 'gdrive', 'sftp', 'ssh', 'ftp', 'hdfs', 'arrow_hdfs', 
 'webhdfs', 's3', 's3a', 'wandb', 'oci', 'asynclocal', 'adl', 'abfs',
 'az', 'cached', 'blockcache', 'filecache', 'simplecache', 'dask', 
 'dbfs', 'github', 'git', 'smb', 'jupyter', 'jlab', 'libarchive', 
 'reference', 'generic', 'oss', 'webdav', 'dvc', 'root']

Individual protocols may require additional software dependencies to be installed. But even the list of built-ins is long: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations

Demo:

>>> import fsspec
>>> of = fsspec.open('github://datalad:datalad@/README.md')
>>> fp = of.open()
>>> fp.size
23922
>>> fp.read(125)
b'     ____            _     ....

This is pretty much all that is needed to implement basic support for all unauthenticated transports.

For AnyUrlOperations we would need to come up with a way to identify URLs that this handler should or could handle.
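One conceivable approach (a sketch only, not a settled design): derive the protocol(s) from the URL, including fsspec's '::'-chaining syntax, and check them against fsspec's protocol registry:

import fsspec

def can_handle(url: str) -> bool:
    # Chained URLs like 'zip://member::sftp://host/file' name the
    # outermost protocol first; split on '::' and inspect each segment.
    protocols = {seg.split('://', 1)[0] for seg in url.split('::') if '://' in seg}
    return bool(protocols) and protocols <= set(fsspec.available_protocols())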

This IO support could be the basis for RemoteArchive (#184).

But even without it, this tech enables partial archive content access.

Here is a demo. big.zip is 5 GB in size, and contains a 4-byte text file and a large binary blob. Here I am opening the text file from inside the archive over an SSH connection:

>>> timeit fsspec.open('zip://dummyzip/tiny.txt::sftp://myhost/home/mih/big.zip').open().read()
629 ms ± 8.38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
>>> fsspec.open('zip://dummyzip/tiny.txt::sftp://juseless.inm7.de/home/mih/big.zip').open().read()
b'123\n'

Size reporting is still possible, hence progress reporting would be possible here too!

>>> fsspec.open('zip://dummyzip/fmriprep-20.0.3.simg::sftp://juseless.inm7.de/home/mih/big.zip').open().size
4873539615
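
A sketch of how the reported size could drive progress reporting (the chunk size and print-based reporting are just illustrative):

import fsspec

def download_with_progress(url, to_path, chunk_size=1 << 20):
    # Stream the remote file locally, using the reported total size
    # to compute a completion percentage along the way.
    with fsspec.open(url, 'rb') as src, open(to_path, 'wb') as dst:
        total = src.size  # may be None for some backends
        done = 0
        while chunk := src.read(chunk_size):
            dst.write(chunk)
            done += len(chunk)
            if total:
                print(f'\r{done / total:.0%}', end='', flush=True)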
mih commented Jan 16, 2023

We need to support additional settings for S3-compatible stores, because the respective URLs won't carry this information. This includes, at minimum, an endpoint_url to be passed via client_kwargs: https://s3fs.readthedocs.io/en/latest/#s3-compatible-storage

Such an endpoint_url could also be used as a means to discover a related credential, likely in (optional) combination with a bucket name. An absent endpoint_url could be interpreted as AWS S3. A distinction by endpoint and bucket name would also remove the need for the credential-type inflation presently done in datalad-core (e.g., aws-s3, nda-s3); we could simply use an s3 credential for all of them.

Such an s3 credential is essentially a user/password combo. It is probably not very useful to handle session tokens due to their lifetime limitations.
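
A sketch of how such settings could be passed through to fsspec/s3fs (the bucket, endpoint URL, and credential values below are placeholders):

import fsspec

# Placeholder values; a real implementation would look these up in a
# credential store keyed by endpoint (and optionally bucket name).
of = fsspec.open(
    's3://mybucket/path/to/object',
    key='ACCESS_KEY_ID',         # the 'user' part of the s3 credential
    secret='SECRET_ACCESS_KEY',  # the 'password' part
    client_kwargs={'endpoint_url': 'https://s3.example.com'},
)
with of as f:
    data = f.read()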
