Draft of FsspecUrlOperations #215

Closed · wants to merge 28 commits into from

Conversation

mih (Member) commented Jan 13, 2023

For now this is just covering downloads and has an implementation for authentication against S3 endpoints.

The included code enables downloads of selected archived content, without having to
download the entire archive first.

See datalad/datalad#373

For ZIP/TAR archives and GitHub projects, this is hooked into AnyUrlOperations and
thereby accessible via the datalad download command.

Demo:

```sh
❯ datalad download 'zip://datalad-datalad-cddbe22/requirements-devel.txt::https://zenodo.org/record/7497306/files/datalad/datalad-0.18.0.zip?download=1 -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]

❯ datalad download 'tar://datalad-0.18.0/requirements-devel.txt::https://files.pythonhosted.org/packages/dd/5e/9be11886ef4c3c64e78a8cdc3f9ac3f27d2dac403a6337d5685cd5686770/datalad-0.18.0.tar.gz -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]
```

As demoed in the code, depending on the capabilities of the
particular filesystem abstraction, custom handling of the actual
download process is needed after open() was called.
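
To make this a bit more concrete, here is a minimal, illustrative sketch (not the PR's actual code) of what such handling can boil down to for a filesystem that supports plain byte reads; the `fetch` helper, its parameters, and the example URL are hypothetical:

```python
# Illustrative sketch only; names and chunk size are arbitrary choices.
import fsspec

def fetch(url: str, dest: str, chunk_size: int = 1024 * 1024) -> None:
    # fsspec.open() also accepts chained URLs such as
    # 'zip://member.txt::https://example.com/archive.zip' (hypothetical URL)
    with fsspec.open(url, mode='rb') as src, open(dest, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
```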

Closes #210
Closes #179
Closes #217

TODO

  • Come up with some meaningful suites of extra requirements
  • Under some circumstances boto wants to authenticate even if a bucket is public. Investigate and possibly switch to starting with anon=True. However, that would be expensive, because it would prevent automatically using a session token provided via the environment. If that is needed, we should add additional logic.
  • Check S3 versioned access. datalad download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz?versionId=+WMIWSpgtnESd8J2k.BfgJ3Xo7qpQ1Kjm demo.nii.gz' works, but it would be good to have a test case for accessing a non-recent version of a file. Setting version_aware=False prevents handling of URLs with version tags. What needs testing is:
    • whether not turning it on specifically would make version tags in URLs get ignored (leading to wrong outcomes)
    • whether not turning it off specifically leads to needless requests
      For an openneuro URL a version tag is reported regardless of the version_aware setting and regardless of whether the requesting URL had one.
  • verify that anon-access can be turned off when explicitly provisioning a credential
  • Investigate speed differences: download-url from S3 seems to be 3x faster than download at times.
    It turns out to be mostly a question of how data are read from the "file" pointer. The default was/is to simply iterate over the file object. This leaves the decision making to fsspec. However, not all filesystem implementations support this iteration anyway. Switching to reading chunks directly (of a size declared explicitly from the outside) removes the slowdown entirely, and actually makes the implementation about 15% faster than the default behavior of the downloader in datalad-core. See the sketch below this list.
  • Add error handling in download() for the case where stat'ing has already shown that a remote URL does not exist
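
For illustration, here is a sketch of the two read strategies discussed in the TODO item above (illustrative code, not taken from the PR; `fobj` stands for an already-opened fsspec file object, `dst` for a local file handle):

```python
# 1) Iterate over the file object: chunking is left to fsspec, and not every
#    filesystem implementation supports iteration in the first place.
def copy_by_iteration(fobj, dst):
    for block in fobj:
        dst.write(block)

# 2) Read explicitly sized chunks: the caller decides the chunk size, which
#    the exploration above found to remove the slowdown.
def copy_by_chunks(fobj, dst, chunk_size=1024 * 1024):
    while chunk := fobj.read(chunk_size):
        dst.write(chunk)
```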

codecov bot commented Jan 17, 2023

Codecov Report

Patch coverage: 90.22% and project coverage change: +1.07 🎉

Comparison is base (0cb44b0) 90.70% compared to head (9a68f85) 91.77%.

❗ Current head 9a68f85 differs from pull request most recent head d0e3689. Consider uploading reports for the commit d0e3689 to get more accurate results

Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #215      +/-   ##
==========================================
+ Coverage   90.70%   91.77%   +1.07%     
==========================================
  Files         122       87      -35     
  Lines        9100     7664    -1436     
==========================================
- Hits         8254     7034    -1220     
+ Misses        846      630     -216     
```
| Impacted Files | Coverage Δ |
|---|---|
| datalad_next/url_operations/file.py | 92.92% <0.00%> (+5.17%) ⬆️ |
| datalad_next/url_operations/fsspec_s3.py | 80.00% <80.00%> (ø) |
| datalad_next/url_operations/fsspec.py | 90.90% <90.90%> (ø) |
| datalad_next/url_operations/tests/test_fsspec.py | 98.30% <98.30%> (ø) |
| datalad_next/url_operations/any.py | 73.62% <100.00%> (-12.90%) ⬇️ |

... and 82 files with indirect coverage changes


@mih force-pushed the fsspec branch 2 times, most recently from f22fb06 to ad69071 on January 18, 2023 at 10:42
mih (Member, Author) commented Jan 18, 2023

Using the new config-based handler definitions, I took a look at performance and which levers can impact it. While performance can be impacted, the underlying mechanisms remain elusive to me.

The examples below use features introduced with 42148e0 -- see the commit message for details.

All examples download a 17.3 MB file from S3 in North America to my laptop in Germany.

Here is the reference: download-url from core.

```sh
❯ rm demo.nii.gz ; time datalad download-url s3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz -O demo.nii.gz --nosave
-> datalad download-url  -O demo.nii.gz --nosave  0.36s user 0.12s system 4% cpu 9.991 total
```

Smooth progress reporting, 10s.

Now download via fsspec:

```sh
❯ rm demo.nii.gz; time datalad \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"fs_kwargs": {"s3": {"anon": true}}}' \
  download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'
datalad -c  -c  download   0.68s user 0.17s system 4% cpu 19.657 total
```

All on default settings, with the same anonymous access that download-url does. Choppy progress reporting (one update every 5 MB, which seems to match fsspec's default cache block size for S3), 20s.

Make it download everything at once by increasing the cache size beyond the file size:

```sh
❯ rm demo.nii.gz; time datalad \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"fs_kwargs": {"s3": {"anon": true, "default_block_size": 20000000, "default_cache_type": "readahead"}}}' \
  download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'
datalad -c  -c  download   0.62s user 0.13s system 8% cpu 8.456 total
```

No meaningful progress reporting, less than 10s.

Now turning off fsspec's readahead cache (which should not do anything for a complete download anyway), and instead using the same 20 MB block size to read everything at once:

```sh
❯ rm demo.nii.gz; time datalad \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"block_size": 20000000, "fs_kwargs": {"s3": {"anon": true, "default_cache_type": "none"}}}' \
  download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'
datalad -c  -c  download   0.61s user 0.15s system 3% cpu 23.602 total
```

Same progress behavior as before, more than twice the runtime.

Again no caching, but a 0.5 MB chunk size for meaningful progress reporting (an update every half MB):

```sh
❯ rm demo.nii.gz; time datalad \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' \
  -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"block_size": 500000, "fs_kwargs": {"s3": {"anon": true, "default_cache_type": "none"}}}' \
  download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'
datalad -c  -c  download   0.62s user 0.07s system 2% cpu 25.003 total
```

Proper progress reporting, and more or less the same runtime of ~25s. So the per-chunk processing in the handler does not add much, but it still takes 2.5x longer than with the downloader from datalad-core.
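
For reference, the Python-level equivalent of the config-based handler definitions above would look roughly like this (a sketch; the keyword arguments mirror the -c options exercised in this PR, but their exact spelling may still change):

```python
from datalad_next.url_operations.fsspec import FsspecUrlOperations

# mirrors: {"block_size": 500000, "fs_kwargs": {"s3": {"anon": true, "default_cache_type": "none"}}}
ops = FsspecUrlOperations(
    block_size=500_000,
    fs_kwargs={'s3': {'anon': True, 'default_cache_type': 'none'}},
)
ops.download(
    's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz',
    'demo.nii.gz',
    hash=['md5'],
)
```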

For now just downloads via unauthenticated connections. The included
code enables downloads of selected archived content, without having to
download the entire archive first.

See datalad/datalad#373

For ZIP and TAR archives this is hooked into `AnyUrlOperations` and
thereby accessible via the `datalad download` command.

Demo:

```sh
❯ datalad download 'zip://datalad-datalad-cddbe22/requirements-devel.txt::https://zenodo.org/record/7497306/files/datalad/datalad-0.18.0.zip?download=1 -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]

❯ datalad download 'tar://datalad-0.18.0/requirements-devel.txt::https://files.pythonhosted.org/packages/dd/5e/9be11886ef4c3c64e78a8cdc3f9ac3f27d2dac403a6337d5685cd5686770/datalad-0.18.0.tar.gz -'
 # Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
 # Since we use requirements.txt ATM only for development IMHO it is ok but
 # we need to figure out/complaint to pip folks
 -e .[devel]
```

As demoed in the code, depending on the capabilities of the
particular filesystem abstraction, custom handling of the actual
download process is needed after `open()` was called.
This includes, but is not limited to, being able to specify
whether or not anonymous access should be attempted (first).

This change paves the way for endpoint customizations and
anything else that FSSPEC exposes.

Moreover, when an explicit `credential` identifier is given,
the boto-based attempt to locate credentials or try
anonymous access first is skipped, and a credential is looked
up and provisioned immediately.
This facilitates exploitation of the numerous filesystem-specific
features provided by FSSPEC.
This change demos the utility of this feature for FSSPEC-based
access to S3 resources. By default anonymous access
is attempted, but additional handlers could change this behavior
(for particular S3 targets).
Requires particular permissions and comes with a potential
performance penalty.
They are mostly needed for the tests (and more were needed), and we want
to keep the actual core dependencies small.
See docs inside. At present it is unclear how relevant this will be
in practice. However, different filesystem caching settings
impact performance substantially (empirical observation). If caching
is turned off entirely, this parameter is the only way to specify
chunk sizes (for reading). Hence I consider it a useful thing to have
in general -- and it is cheap.

This change also flips the default download method from iteration
over a file pointer to reading chunks of a specific size. This has been
found to be more performant in some cases.
Such an explicit option conflicts with the processing of S3 URLs that
specify a version explicitly.
```
@@ -36,6 +36,9 @@ devel =
httpsupport =
requests
requests_toolbelt
remotefs =
fsspec
requests
```

I discovered in #223 that a dependency missing on my system was aiohttp in addition to those listed here

When trying S3 URLs, I was missing the s3fs dependency. However, the reporting for this was very nice and user-friendly:

```sh
datalad -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"fs_kwargs": {"s3": {"anon": true}}}' download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'

download(error): demo.nii.gz [download failure] [Install s3fs to access S3]
```

This error comes from fsspec directly: https://github.com/fsspec/filesystem_spec/blob/e180aa859ef081215882b2d1b67d4bc33c040330/fsspec/registry.py#L114.

Interestingly, this mapping of protocols to dependencies and errors also exists for HTTP (https://github.com/fsspec/filesystem_spec/blob/e180aa859ef081215882b2d1b67d4bc33c040330/fsspec/registry.py#L70), and I did see it in #223 - but only in the special remote debug output, not bubbled up. It would be nice to find out what makes the s3 code handle this error better, and mirror it in the generic fsspec code.
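
One way to mirror that behavior in the generic code path could be to catch the ImportError that fsspec's registry raises when a protocol's package is missing, and re-raise it with the URL for context. This is only a sketch of the idea, not the PR's implementation; `open_fs` is a hypothetical helper:

```python
from fsspec.core import url_to_fs

def open_fs(url, **kwargs):
    try:
        fs, urlpath = url_to_fs(url, **kwargs)
    except ImportError as e:
        # fsspec raises ImportError with a hint such as
        # "Install s3fs to access S3" when the protocol package is absent;
        # re-raise with context so the hint is not swallowed by higher layers
        raise RuntimeError(f'Cannot access {url!r}: {e}') from e
    return fs, urlpath
```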

Ah, it might have been because #223 concerned a plain git-annex call. Potentially there is already stuff in place that bubbles it up when wrapped in a datalad call...

adswa (Member) commented May 3, 2023

I was looking into the open questions about versioned s3 URLs by writing a unit test based on @mslw's bucket and code snippets - thanks much!

I found that handling versioned URLs works in general. However, in the case of S3, there is a problem if we configure a URL handler to be not version-aware (ops = FsspecUrlOperations(fs_kwargs={'version_aware': False})) but provide it with a versioned URL (such as s3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2):

The outcome is an error that reads like this:

E               datalad_next.url_operations.UrlOperationsRemoteError: UrlOperationsRemoteError for 's3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2'

Internally, it is a dictionary update in fsspec that fails:

```
path = 's3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2'
kwargs = {'version_aware': False}

    def _un_chain(path, kwargs):
        x = re.compile(".*[^a-z]+.*")  # test for non protocol-like single word
        bits = (
            [p if "://" in p or x.match(p) else p + "://" for p in path.split("::")]
            if "::" in path
            else [path]
        )
        # [[url, protocol, kwargs], ...]
        out = []
        previous_bit = None
        kwargs = kwargs.copy()
        for bit in reversed(bits):
            protocol = kwargs.pop("protocol", None) or split_protocol(bit)[0] or "file"
            cls = get_filesystem_class(protocol)
            extra_kwargs = cls._get_kwargs_from_urls(bit)
            print(extra_kwargs)
            kws = kwargs.pop(protocol, {})
            if bit is bits[0]:
                kws.update(kwargs)
>           kw = dict(**extra_kwargs, **kws)
E           TypeError: dict() got multiple values for keyword argument 'version_aware'

../../env/next/lib/python3.11/site-packages/fsspec/core.py:331: TypeError
```

This error results from the fact that s3fs's S3 filesystem class has a method _get_kwargs_from_urls which parses S3 URLs for version strings, and if it finds one, it tries to tell fsspec to become version-aware:

```python
    def _get_kwargs_from_urls(urlpath):
        """
        When we have a urlpath that contains a ?versionId=

        Assume that we want to use version_aware mode for
        the filesystem.
        """
        url_storage_opts = infer_storage_options(urlpath)
        url_query = url_storage_opts.get("url_query")
        out = {}
        if url_query is not None:
            from urllib.parse import parse_qs

            parsed = parse_qs(url_query)
            if "versionId" in parsed:
                out["version_aware"] = True
        return out
```

Since we already set this key, fsspec crashes. So it seems that the case "configure version awareness to be off, but supply versioned URLs nevertheless" can't be supported. I'm wondering if this is something that needs documentation, or if we can do some clever handling - but parsing URLs pre-emptively for version strings seems a bit costly, IMO.
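
For what it's worth, the "clever handling" could be as small as dropping a user-supplied version_aware=False when the URL itself carries a versionId, so it no longer clashes with s3fs's own detection. This is only an illustration of that idea; `reconcile_version_awareness` is a hypothetical helper, not part of the PR:

```python
from urllib.parse import parse_qs, urlparse

def reconcile_version_awareness(url: str, s3_kwargs: dict) -> dict:
    query = parse_qs(urlparse(url).query)
    if 'versionId' in query and s3_kwargs.get('version_aware') is False:
        # s3fs will force version_aware=True for this URL anyway; also passing
        # False makes fsspec's kwarg merge fail with the TypeError shown above
        s3_kwargs = {k: v for k, v in s3_kwargs.items() if k != 'version_aware'}
    return s3_kwargs
```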

adswa added 2 commits May 3, 2023 13:24
When working with S3 URLs, s3fs employs internal URL parsing to detect
whether the S3 file has versioning enabled. Based on the outcome of this
evaluation, it sets the version_aware value itself.
If a user sets this key's value to False in fs_kwargs of the URL handler,
but then supplies a versioned URL, s3fs's attempt to set this key to True
is sabotaged, and the download crashes:

datalad_next.url_operations.UrlOperationsRemoteError: UrlOperationsRemoteError for 's3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2'

To leave a trace of this, even if only temporary, this commit adds a short
paragraph to the URL handler's docstring to warn about setting this key.
adswa (Member) commented May 3, 2023

I have for now settled on the following:

  • there is a unit test that confirms that version awareness is able to retrieve the file versions specified in URLs. It uses @mslw's S3 bucket, so it's probably temporary, but maybe we have means to create a similar bucket under a datalad account?
  • I have found that setting the 'version_aware' key is generally a bad idea, as it interferes with s3fs's autodetection and setting of this key. I have added a docstring amendment to warn users about this - I think it isn't optimally phrased, and maybe also in the wrong place; please advise on improvements.

adswa (Member) commented May 3, 2023

Pretty weird failure in crippledFS tests:


```
=================================== FAILURES ===================================
_____________________________ test_fsspec_download _____________________________

tmp_path = PosixPath('/crippledfs/pytest-of-runner/pytest-0/test_fsspec_download0')

    def test_fsspec_download(tmp_path):
        # test a bunch of different (chained) URLs that point to the same content
        # on different persistent storage locations
        ops = FsspecUrlOperations()
        for url in (
            # included in a ZIP archive
            'zip://datalad-datalad-cddbe22/requirements-devel.txt::https://zenodo.org/record/7497306/files/datalad/datalad-0.18.0.zip?download=1',
            # included in a TAR archive
            'tar://datalad-0.18.0/requirements-devel.txt::https://files.pythonhosted.org/packages/dd/5e/9be11886ef4c3c64e78a8cdc3f9ac3f27d2dac403a6337d5685cd5686770/datalad-0.18.0.tar.gz',
            # pushed to github
            '***0.18.0/requirements-devel.txt',
        ):
>           props = ops.download(url, tmp_path / 'dummy', hash=['md5'])

../../../../.local/lib/python3.9/site-packages/datalad_next/url_operations/tests/test_fsspec.py:30: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../../../.local/lib/python3.9/site-packages/datalad_next/url_operations/fsspec.py:215: in download
    fs, urlpath, props = self._get_fs(from_url, credential=credential)
../../../../.local/lib/python3.9/site-packages/datalad_next/url_operations/fsspec.py:363: in _get_fs
    fs, urlpath, props = get_fs(
../../../../.local/lib/python3.9/site-packages/datalad_next/url_operations/fsspec.py:63: in get_fs_generic
    fs, urlpath = url_to_fs(url, **kwargs)
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/core.py:375: in url_to_fs
    fs = filesystem(protocol, **inkwargs)
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/registry.py:257: in filesystem
    return cls(**storage_options)
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/spec.py:76: in __call__
    obj = super().__call__(*args, **kwargs)
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/implementations/zip.py:54: in __init__
    fo = fsspec.open(
/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/core.py:439: in open
    return open_files(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <List of 0 OpenFile instances>, item = 0

    def __getitem__(self, item):
>       out = super().__getitem__(item)
E       IndexError: list index out of range

/opt/hostedtoolcache/Python/3.9.16/x64/lib/python3.9/site-packages/fsspec/core.py:194: IndexError
```

mih added 2 commits May 4, 2023 08:15
Mostly to minimize the diff for conflict avoidance, after a lot of typos
were fixed via datalad#315
mih (Member, Author) commented May 4, 2023

Thanks @adswa for the S3 access test. I think this is spot on!

adswa (Member) commented May 4, 2023

The crippledFS failure also shows up in #223, but I can't reproduce it locally. After a bit of digging, it looks like it is a cross-platform incompatibility between Windows and Unix systems in fsspec's zip implementation. I have filed an issue to find out more: fsspec/filesystem_spec#1256

mih and others added 6 commits May 4, 2023 12:00
Get the updates to -core for the CI
If the props variable does not get populated during _get_fs(), access to the URL has
failed for some reason, and further processing of props in download() will lead to crashes.
This change adds error handling to raise early with a more informative error.
@adswa force-pushed the fsspec branch 3 times, most recently from a471406 to 80038f6 on May 5, 2023 at 13:45
jsheunis (Member) commented May 9, 2023

Some initial usage notes:

@mih mentioned this pull request May 29, 2023
mih added a commit to mih/datalad-next that referenced this pull request May 30, 2023
This is a replacement for the implementation of the `datalad-archives`
remote. Compared to its predecessor, it reduces the storage overhead
from 200% to 100% by doing partial extraction from fully downloaded
archives.

Ultimately, it will support sparse/partial access to remote archives,
avoiding any storage overhead, and the requirement to unconditionally
download full archives (see datalad#215).

This implementation is trying to be efficient by:

- trying to fulfill requests using locally present archives
- trying to download smaller archives before larger ones

This implementation is aiming to be extensible by:

- using `ArchiveOperations` as a framework to uniformly implement
  (partial) access to local (or remote) archives.

Support for a number of corner cases is implemented (e.g., registered
availability from an archive that actually does not contain a given
key), but there are presently no tests for this.
mih (Member, Author) commented Jul 27, 2023

I think we have learned a lot here, but nobody was able to figure out the speed issues. This will need to be picked up later, or replaced by something else entirely.

Thanks to everyone who tried.

adswa (Member) commented Sep 13, 2023

FTR, I'm linking two relevant comments to this PR:

@christian-monch found, I believe similarly to what @mih found in #215 (comment), that the block size makes a difference for download speeds, in this particular case for partial 7z archives: datalad/datalad-ria#50 (comment).

In @christian-monch's explorations, a block size of 1 resulted in a general speed-up in different use cases.

I tried to test this in this PR, too, rediscovered @mih's initial comment about this, and found no such general improvements for s3 downloads: datalad/datalad-ria#50 (comment)
