Draft of FsspecUrlOperations
#215
Conversation
Force-pushed from 09e0450 to 98398ad.
Codecov Report

```
@@            Coverage Diff             @@
##             main     #215      +/-   ##
==========================================
+ Coverage   90.70%   91.77%    +1.07%
==========================================
  Files         122       87       -35
  Lines        9100     7664     -1436
==========================================
- Hits         8254     7034     -1220
+ Misses        846      630      -216
```
Force-pushed from f22fb06 to ad69071.
Using the new config-based handler definitions I took a look at performance and which levers can impact it. While performance can be impacted, the underlying mechanisms remain elusive to me. The below uses features introduced with 42148e0 -- see the commit message for details. All examples download a 17.3 MB file from S3 in North America to my laptop in Germany. Here is the reference:

Smooth progress reporting, 10s. Now all on default, with the same anonymous access that `download-url` is doing: choppy progress (every 5 MB, which seems to match fsspec's default cache block size for S3), 20s. Make it download everything at once by increasing the cache size beyond the file size:

No meaningful progress reporting, less than 10s. Now turning off fsspec's readahead cache (which should not do anything for a complete download), and instead using the same 20 MB block size to read everything at once:

Same progress behavior as before, more than twice the runtime. Again no caching, but a 0.5 MB chunk size for meaningful progress reporting (every half MB):

Proper progress reporting, more or less the same runtime of ~25s. So the per-chunk processing in the handler does not add much, but it still takes 2.5x longer than with the downloader from datalad-core.
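For orientation, here is a minimal sketch (not the handler code) of the kind of comparison described above, using fsspec directly. The URL is a placeholder; `block_size` and `cache_type` are the standard arguments of fsspec's `open()`, and anonymous S3 access assumes s3fs is installed:

```python
import time
import fsspec

# placeholder URL; the actual test used a ~17.3 MB file in an S3 bucket in North America
URL = "s3://some-bucket/some-17mb-file"

# anonymous access, matching the reference setup (requires s3fs)
fs = fsspec.filesystem("s3", anon=True)

def timed_read(**open_kwargs):
    """Download the whole object with the given open() settings and report the runtime."""
    start = time.time()
    with fs.open(URL, mode="rb", **open_kwargs) as f:
        data = f.read()
    print(f"{open_kwargs or 'defaults'}: {len(data)} bytes in {time.time() - start:.1f}s")

timed_read()                                          # fsspec defaults (readahead cache)
timed_read(block_size=20 * 2**20)                     # cache block larger than the file
timed_read(cache_type="none", block_size=20 * 2**20)  # no read cache, same block size
timed_read(cache_type="none", block_size=512 * 1024)  # no cache, 0.5 MB blocks
```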
Force-pushed from 8f03822 to c46959c.
For now just downloads via unauthenticated connections. The included code enables downloads of selected archived content, without having to download the entire archive first. See datalad/datalad#373

For ZIP and TAR archives this is hooked into `AnyUrlOperations` and thereby accessible via the `datalad download` command. Demo:

```sh
❯ datalad download 'zip://datalad-datalad-cddbe22/requirements-devel.txt::https://zenodo.org/record/7497306/files/datalad/datalad-0.18.0.zip?download=1 -'
# Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
# Since we use requirements.txt ATM only for development IMHO it is ok but
# we need to figure out/complaint to pip folks
-e .[devel]
❯ datalad download 'tar://datalad-0.18.0/requirements-devel.txt::https://files.pythonhosted.org/packages/dd/5e/9be11886ef4c3c64e78a8cdc3f9ac3f27d2dac403a6337d5685cd5686770/datalad-0.18.0.tar.gz -'
# Theoretically we don't want -e here but ATM pip would puke if just .[full] is provided
# Since we use requirements.txt ATM only for development IMHO it is ok but
# we need to figure out/complaint to pip folks
-e .[devel]
```

As demo'ed in the code, depending on the capabilities of the particular filesystem abstraction, custom handling of the actual download process is needed after `open()` was called.
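The underlying fsspec feature this builds on is URL chaining; a minimal sketch with placeholder URLs (whether the full archive gets downloaded first depends on the outer filesystem's support for range requests):

```python
import fsspec

# read a single member of a remote ZIP archive via a chained URL;
# both the member path and the archive URL are placeholders
with fsspec.open(
    "zip://some/member.txt::https://example.com/some-archive.zip",
    mode="rt",
) as f:
    print(f.read())
```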
This includes, but is not limited to, being able to specify whether or not anonymous access should be attempted (first). This change paves the way for endpoint customizations and anything else that FSSPEC exposes. Moreover, when an explicit `credential` identifier is given, the boto-based attempt to locate credentials or try anonymous access first is skipped, and a credential is looked up and provisioned immediately.
This facilitates exploitation of the numerous filesystem-specific features provided by FSSPEC.
This change demos the utility of this feature for FSSPEC-based access to S3 resources. By default anonymous access is attempted, but additional handlers could change this behavior (for particular S3 targets).
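For illustration, this is roughly what such a handler ends up doing under the hood when anonymous access is configured; a minimal sketch using s3fs directly, with a placeholder object:

```python
import s3fs

# unsigned/anonymous access to a public bucket; a handler with an explicit
# credential would pass key/secret (or a profile) here instead
fs = s3fs.S3FileSystem(anon=True)

# report basic properties of a (placeholder) object
print(fs.info("some-public-bucket/some/key"))
```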
Requires particular permissions and comes with a potential performance penalty.
They are mostly needed for the tests (and more were needed), and we want to keep the actual core dependencies small.
See docs inside. At present it is unclear how relevant this will be in practice. However, different filesystem caching settings impact performance substantially (empirical observation). If caching is turned off entirely, this parameter is the only way to specify chunk sizes (for reading). Hence I consider it a useful thing to have in general -- and it is cheap. This change also flips the default download method from iteration over a file pointer to reading chunks of a specific size. This has been found to be more performant in some cases.
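A minimal sketch of the two read strategies this refers to; `fobj` stands for any file-like object returned by fsspec's `open()`, and the 0.5 MB default below is just the chunk size used in the experiments above:

```python
def read_by_iteration(fobj, dst):
    # previous default: iterate over the file object and let fsspec
    # decide how much data each step yields
    for chunk in fobj:
        dst.write(chunk)

def read_by_chunks(fobj, dst, chunk_size=512 * 1024):
    # new default: read explicitly sized chunks, independent of any
    # fsspec caching configuration
    while True:
        chunk = fobj.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
```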
Such an explicit option conflicts with the processing of S3 URLs that specify a version explicitly.
Expand test coverage too
```diff
@@ -36,6 +36,9 @@ devel =
 httpsupport =
     requests
     requests_toolbelt
+remotefs =
+    fsspec
+    requests
```
I discovered in #223 that a dependency missing on my system was aiohttp, in addition to those listed here.
When trying S3 URLs, I was missing the s3fs dependency. However, the reporting for this was very nice and user-friendly:
```sh
datalad -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.class=datalad_next.url_operations.fsspec.FsspecUrlOperations' -c 'datalad.url-handler.(^|.*::)s3://openneuro.org.kwargs={"fs_kwargs": {"s3": {"anon": true}}}' download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz demo.nii.gz'
download(error): demo.nii.gz [download failure] [Install s3fs to access S3]
```
This error comes from fsspec directly: https://github.com/fsspec/filesystem_spec/blob/e180aa859ef081215882b2d1b67d4bc33c040330/fsspec/registry.py#L114.
Interestingly, this mapping of protocols to dependencies and errors also exists for HTTP (https://github.com/fsspec/filesystem_spec/blob/e180aa859ef081215882b2d1b67d4bc33c040330/fsspec/registry.py#L70), and I did see it in #223 - but only in the special remote debug output, not bubbled up. It would be nice to find out what makes the S3 code handle this error better, and mirror it in the generic fsspec code.
Ah, it might have been because #223 concerned a plain git-annex call. Potentially there is already stuff in place that bubbles it up when wrapped in a datalad call...
I was looking into the open questions about versioned S3 URLs by writing a unit test based on @mslw's bucket and code snippets - thanks much! I found that handling versioned URLs works in general. However, in the case of S3, there is a problem if we configure a URL handler to be not version-aware (i.e., `version_aware` set to `False` in `fs_kwargs`). The outcome is an error like the `UrlOperationsRemoteError` quoted in the commit message below.

Internally, it is a dictionary update (inside fsspec) that fails. This error results from the fact that s3fs derives `version_aware` from the URL itself:

```python
def _get_kwargs_from_urls(urlpath):
    """
    When we have a urlpath that contains a ?versionId=
    Assume that we want to use version_aware mode for
    the filesystem.
    """
    url_storage_opts = infer_storage_options(urlpath)
    url_query = url_storage_opts.get("url_query")
    out = {}
    if url_query is not None:
        from urllib.parse import parse_qs

        parsed = parse_qs(url_query)
        if "versionId" in parsed:
            out["version_aware"] = True
    return out
```

Since we already set this key, fsspec crashes. So it seems that the case "configure version awareness to be off, but supply versioned URLs nevertheless" can't be supported. I am wondering if this is something that needs documentation, or if we can do some clever handling - but parsing URLs pre-emptively for version strings seems a bit costly, IMO.
When working with S3 URLs, s3fs employs internal URL parsing to detect whether the S3 file has versioning enabled. Based on the outcome of this evaluation, it sets the `version_aware` value itself. If a user sets this key's value to `False` in `fs_kwargs` of the URL handler, but then supplies a versioned URL, s3fs's attempt to set this key to `True` is sabotaged, and the download crashes:

```
datalad_next.url_operations.UrlOperationsRemoteError: UrlOperationsRemoteError for 's3://mslw-datalad-test0-versioned/3versions-allversioned.txt?versionId=Tro_UjqVFJfr32v5tuPfjwtOzeqYCxi2'
```

To leave a trace of this, even if only temporary, this commit adds a short paragraph to the URL handler's docstring to warn about setting this key.
I have for now settled on the following:
Pretty weird failure in crippledFS tests:
Mostly to minimize the diff for conflict avoidance, after a lot of typos were fixed via datalad#315
Thanks @adswa for the S3 access test. I think this is spot on!
The crippledFS failure also shows up in #223, but I can't reproduce it locally. After a bit of digging, it looks like it is a cross-platform incompatibility between Windows and Unix systems in fsspec's zip implementation. I have filed an issue to find out more: fsspec/filesystem_spec#1256
Get the updates to -core for the CI
If the `props` variable does not get populated during `_get_fs()`, access to the URL has failed for some reason, and further processing of `props` in `download` will lead to crashes. This change adds error handling to raise early with a more informative error.
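A hedged sketch of the guard this describes; the method signature, the `_get_fs()` return value, and the use of `UrlOperationsRemoteError` follow the naming in this commit message and the traceback quoted earlier, not necessarily the final code:

```python
from datalad_next.url_operations import UrlOperationsRemoteError

class FsspecUrlOperationsSketch:
    """Illustrative fragment only, not the actual class."""

    def download(self, from_url, to_path, *, credential=None):
        # _get_fs() is assumed to return the filesystem, the path within it,
        # and a `props` dict with stat-like information about the target
        fs, urlpath, props = self._get_fs(from_url, credential=credential)
        if not props:
            # accessing the URL failed for some reason; raise early instead of
            # crashing later when `props` is consumed (e.g., for progress reporting)
            raise UrlOperationsRemoteError(from_url)
        # ... proceed with the actual download ...
```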
Force-pushed from a471406 to 80038f6.
Some initial usage notes:
This is a replacement for the implementation of the `datalad-archives` remote. In addition to its predecessor, it reduces the storage overhead from 200% to 100% by doing partial extraction from fully downloaded archives. Ultimately, it will support sparse/partial access to remote archives, avoiding any storage overhead, and the requirement to unconditionally download full archives (see datalad#215).

This implementation is trying to be efficient by:

- trying to fulfill requests using locally present archives
- trying to download smaller archives before larger ones

This implementation is aiming to be extensible by:

- using `ArchiveOperations` as a framework to uniformly implement (partial) access to local (or remote) archives.

Support for a number of corner cases is implemented (e.g., registered availability from an archive that actually does not contain a given key), but there are presently no tests for this, yet.
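The partial-extraction idea (pull only the requested member out of an already downloaded archive instead of unpacking everything) can be sketched with the standard library alone; this is only an illustration of the concept, not the `ArchiveOperations` code:

```python
import tarfile
from pathlib import Path

def extract_one_member(archive: Path, member: str, dest: Path) -> Path:
    """Stream a single (regular-file) member of a local TAR archive to `dest`."""
    with tarfile.open(archive) as tar:
        with tar.extractfile(member) as src, open(dest, "wb") as out:
            # only this member is read; the rest of the archive is never unpacked
            while chunk := src.read(1024 * 1024):
                out.write(chunk)
    return dest
```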
I think we have learned a lot here, but nobody was able to figure out the speed issues. This will need to be picked up later, or replaced by something else entirely. Thanks to everyone who tried.
FTR, I'm linking two relevant comments to this PR: @christian-monch found, I believe similar to what @mih found in #215 (comment), that the block size makes a difference for download speeds, in this particular case for partial 7z archives: datalad/datalad-ria#50 (comment). In @christian-monch's explorations, a block size of 1 resulted in a general speed-up in different use cases. I tried to test this in this PR, too, rediscovered @mih's initial comment about this, and found no such general improvements for S3 downloads: datalad/datalad-ria#50 (comment)
For now this is just covering downloads and has an implementation for authentication against S3 endpoints.
The included code enables downloads of selected archived content, without having to
download the entire archive first.
See datalad/datalad#373
For ZIP/TAR archives, and GitHub projects this is hooked into `AnyUrlOperations` and thereby accessible via the `datalad download` command. Demo: see the session shown above.

As demo'ed in the code, depending on the capabilities of the particular filesystem abstraction, custom handling of the actual download process is needed after `open()` was called.

Closes #210
Closes #179
Closes #217
TODO

- `anon=True`. However, that would be expensive, because it would prevent automatically using a session token provided via the environment. If that is needed, we should add additional logic.
- `datalad download 's3://openneuro.org/ds004393/sub-26/func/sub-26_task-ImAcq_acq-task_run-05_bold.nii.gz?versionId=+WMIWSpgtnESd8J2k.BfgJ3Xo7qpQ1Kjm demo.nii.gz'` works, but it would be good to have a testcase for accessing a non-recent version of a file. Setting `version_aware=False` prevents handling of URLs with version tags. What needs testing is: for an openneuro URL a version tag is reported regardless of the `version_aware` setting and regardless of whether the requesting URL had one.
- `download-url` from S3 seems to be 3x faster than `download` at times. It turns out to be mostly a question of how data are read from the "file" pointer. The default was/is to simply iterate over the file object. This leaves the decision making to fsspec. However, not all filesystem implementations support this iteration anyway. Switching to reading chunks of an explicitly declared size removes the slowdown entirely (and actually makes the implementation about 15% faster than the default behavior of the downloader in datalad-core).
- `download()` when a remote URL is already known not to be around from stat'ing