feat: fsspec source implementation #692
# Remove uproot-specific options (should be done earlier)
# TODO: is timeout always valid?
opts = {
    k: v
    for k, v in kwargs.items()
    if k
    not in {
        "file_handler",
        "xrootd_handler",
        "http_handler",
        "object_handler",
        "max_num_elements",
        "num_workers",
        "num_fallback_workers",
        "begin_chunk_size",
        "minimal_ttree_metadata",
    }
}
I would propose to pop uproot-specific options in uproot.reading.open instead of passing all of them to the source. I think it would also be nice to deprecate specific *_handler functions in favor of some sort of filesystem= argument which would override the default irrespective of what the default is. Some of these arguments are source-related (e.g. timeout or num_workers), in which case they would better be defaulted keyword arguments in the respective source constructors.
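The proposal above could be sketched roughly as follows. This is only an illustration, not uproot's actual code: the function name split_options and the constant UPROOT_ONLY_OPTIONS are hypothetical, and the set of option names is simply copied from the diff under review (which of them are truly source-related, such as num_workers, is still an open question in this thread).

```python
# Hypothetical sketch: separate the options listed in the diff from
# everything else *before* a Source is constructed, so the Source only
# ever sees keyword arguments it can understand.
UPROOT_ONLY_OPTIONS = frozenset(
    {
        "file_handler",
        "xrootd_handler",
        "http_handler",
        "object_handler",
        "max_num_elements",
        "num_workers",
        "num_fallback_workers",
        "begin_chunk_size",
        "minimal_ttree_metadata",
    }
)


def split_options(options):
    """Return (uproot_options, source_options) without mutating the input."""
    uproot_options = {k: v for k, v in options.items() if k in UPROOT_ONLY_OPTIONS}
    source_options = {k: v for k, v in options.items() if k not in UPROOT_ONLY_OPTIONS}
    return uproot_options, source_options
```

With something like this done once at the entry point, the dictionary comprehension in the Source constructor would no longer be needed.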
The *_handler arguments are wrong for a Source like fsspec, I agree. (And didn't see it coming.)
If we go the route of handling all remote file I/O with fsspec, then the natural thing to do would be to replace these arguments with a filesystem or fs argument (as is becoming standard in some other projects).
The role currently played by file_handler and object_handler would need some equivalent, though. There are two ways one might want to use a file_handler: as memmap and as conventional file object(s). I expect that memmap would usually be better, but worse through NFS. For object_handler, there's only one option (not something for a user to select), but it is very distinct from handlers for an object that you know is a file: if it's a generic object, we can't assume that we can seek and read in parallel.
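To make the three cases concrete, here is a minimal standard-library sketch. None of these names are uproot's: read_range_mmap, read_range_file, and LockedReader are hypothetical helpers, and the lock is just one assumed way of coping with an object that cannot be read in parallel.

```python
import io
import mmap
import os
import tempfile
import threading


def read_range_mmap(path, start, stop):
    """Read bytes [start, stop) through a read-only memory map."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return mapped[start:stop]


def read_range_file(path, start, stop):
    """Read bytes [start, stop) with a conventional file object.

    Each call opens its own handle, so calls can run in parallel.
    """
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(stop - start)


class LockedReader:
    """Wrap a generic file-like object that only promises seek() and read().

    There is a single shared cursor, so seek+read pairs are serialized.
    """

    def __init__(self, obj):
        self._obj = obj
        self._lock = threading.Lock()

    def read_range(self, start, stop):
        with self._lock:
            self._obj.seek(start)
            return self._obj.read(stop - start)


# Small demonstration file containing the bytes 0..255.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(256)))
demo_path = tmp.name
```

All three return the same bytes for the same range; the difference is in who owns the cursor and whether concurrent reads are safe.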
This is looking good. But how will an fsspec-capable Source be enabled? The other Sources are identified by URI scheme, but a URI with The backend that runs can't be determined by what packages are installed (i.e. attempt to If it's opt-in (only used if FSSpecSource is explicitly selected in If it's something that Coffea or coffea-casa triggers when the time is right, that's fine.
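For readers unfamiliar with scheme-based selection, the idea can be sketched like this. It is an assumption-laden toy, not uproot's dispatch code: SOURCE_BY_SCHEME, pick_source, the string labels, and the choice to send unknown schemes to fsspec are all hypothetical. Note that the choice depends only on the URI and an explicit override, never on which packages happen to be installed.

```python
from urllib.parse import urlparse

# Hypothetical registry mapping URI schemes to a label for the Source
# that would handle them.
SOURCE_BY_SCHEME = {
    "": "file",
    "file": "file",
    "http": "http",
    "https": "http",
    "root": "xrootd",
}


def pick_source(path, override=None):
    """Choose a Source label for `path`.

    An explicit `override` always wins (the opt-in case); otherwise the
    URI scheme decides, and schemes not in the registry fall back to fsspec.
    """
    if override is not None:
        return override
    scheme = urlparse(path).scheme
    return SOURCE_BY_SCHEME.get(scheme, "fsspec")
```

A real implementation would also have to treat Windows drive letters specially, since urlparse reports them as one-letter schemes.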
One option is to simply make uproot depend on fsspec. Then fsspec is responsible for picking the right implementation for a given protocol (we could provide an override option). Of course, this means that we have to make sure the fsspec http and xrootd implementations are as robust and performant as the current implementations. This is something @ScottDemarest was interested in looking into. I know one potential (possibly premature) optimization is to notify downstream executors as soon as some portion of data is available. We can do this within the constraints of the fsspec public API by using an async filesystem implementation with an event loop created and held as a resource by uproot (essentially copying the sync wrapper implementation that fsspec uses internally). If we were to do that, it would also be nice to drop the custom executor and futures implementation here and adopt the standard library one, now that Python 2 is not supported.
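A rough sketch of that pattern, using only the standard library: an event loop is owned by a background thread, coroutines are submitted to it, and the caller receives ordinary concurrent.futures.Future objects that complete as each chunk arrives. LoopThread, fetch_range, and read_chunks are hypothetical names, and fetch_range is a stand-in for an async filesystem read, not a real fsspec call.

```python
import asyncio
import threading
from concurrent.futures import as_completed


class LoopThread:
    """Owns an event loop running in a daemon thread (hypothetical helper)."""

    def __init__(self):
        self.loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self.loop.run_forever, daemon=True)
        self._thread.start()

    def submit(self, coro):
        # Returns a concurrent.futures.Future, usable from synchronous code.
        return asyncio.run_coroutine_threadsafe(coro, self.loop)

    def close(self):
        self.loop.call_soon_threadsafe(self.loop.stop)
        self._thread.join()
        self.loop.close()


async def fetch_range(start, stop):
    """Stand-in for an async byte-range read; longer ranges take longer."""
    await asyncio.sleep(0.01 * (stop - start))
    return bytes(range(start, stop))


def read_chunks(ranges):
    """Request all ranges at once and collect each as soon as it is ready."""
    runner = LoopThread()
    try:
        futures = {runner.submit(fetch_range(a, b)): (a, b) for a, b in ranges}
        results = {}
        for future in as_completed(futures):
            # A downstream consumer could be notified here, chunk by chunk.
            results[futures[future]] = future.result()
        return results
    finally:
        runner.close()
```

Because the futures are the standard-library kind, helpers such as as_completed work without any custom executor or future classes.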
I get pulled in both directions about whether to include batteries or not. On the one hand,
On the other hand,
Currently, the PyPI versions of Uproot and Awkward Array have minimal dependencies, while the conda-forge recipes are batteries included. We've also considered
pip install uproot[complete] # and
pip install awkward[complete]
but then the
Enlarging the scope of this, and perhaps answering my previous question: how about if fsspec-enabled remote access replaces the current HTTPSource and XRootDSource?
The HTTPSource was written using only the Python standard library in an extremely dependency-averse mood. It should have been based on requests. (I wasn't familiar with how basic that is to the ecosystem, at least on par with NumPy.) I think we would have had fewer issues with HTTP redirects if we had gone this route from the start.
I have had so much help from user-developers fixing up the XRootDSource, which I'm grateful for, but it indicates how tricky it is. By now, there's a lot of code in XRootDSource that I don't understand. If that was used as a model to develop xrootd-fsspec, then maybe we should discontinue this implementation in favor of that one.
The story for users would be simpler: "If you want to access remote files (any non-file URI), you need fsspec (or install We could make this part of the Uproot 4 → 5 switch. What do you think?
(Our comments crossed. It sounds like we have similar thoughts on this.)
Yes, Uproot could strictly depend on fsspec, since that's pure Python and small. I don't know if this slippery slope extends to lz4 and xxhash (not pure Python, but small), since LZ4 is a not-uncommon ROOT compression setting. Or ZStandard. Maybe it does, since these are things a user might not even know they need: they just open a TTree and it fails. Dependencies for extra user-visible features like hist and Dask are different, so the slippery slope does stop at some point.
I do sometimes wonder if "easy to install" means simply that pip worked, or that the user was happy to see that pip did not bring in more than one package. But that's a bit beside the point.
It should be noted that fsspec is lightweight because it is very much not batteries-included. For example, in a fresh environment, if you try to run
with fsspec.open("http://google.com") as f:
    print(f.read(10))
you will get an error:
and if you go install those, then you end up with 12 packages!
The same goes for fsspec-xrootd, but doubly so: the first error will ask you to install
I think that's true, so having a tree of transitive dependencies is not off the table. (Earlier, I had been considering cases of a hard-to-access DAQ-or-whatever, without outside network/pip access, but not anymore. There are freezing options.) It is necessary for all of those transitive dependencies to successfully install, however. Some cases of lz4 installation have broken because they assume liblz4.so is available on the OS somewhere. I don't know if that's still true.
Having Uproot strictly depend on fsspec, and then fsspec asks the user to install requests and aiohttp, would still be an improvement over not having Uproot strictly depend on fsspec, since it would only be one round of telling the user to go install something else, rather than two rounds. I can imagine that getting very annoying.
Edit: Got it: you said all of that, but include both fsspec and fsspec-xrootd, since they're at the same level.
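The "one round of telling the user to go install something" idea is commonly implemented as a deferred import that fails with an actionable message. The sketch below is generic and hypothetical (the helper name require and the hint text are made up for illustration); it is not taken from uproot or fsspec.

```python
import importlib


def require(module_name, install_hint):
    """Import an optional dependency, or fail once with install instructions.

    `install_hint` is free text such as a pip command; it is supplied by the
    caller and is not validated here.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as err:
        raise ModuleNotFoundError(
            f"this feature needs the optional package {module_name!r}; "
            f"try: {install_hint}"
        ) from err
```

Listing every package needed for a given protocol in a single hint is what would keep this to one round rather than two.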
I'm personally +1 on making fsspec a dependency:
Hello @jpivarski, allow me a quick comment: these are important matters and something that IMO you should raise more widely, nice with org admins and most usefully with users via Gitter/... I would be a bit scared myself if you were going, by default (meaning for standard releases), to significantly increase the dependencies. Indeed, as you said, the success of uproot has been "small and easy". I won't go into big thoughts here, so I'll make a single one at this point: ask yourself what percentage of (the very large number of) uproot users actually need fsspec ... or have even just heard of it! Adding something that effectively brings in an order of magnitude more dependencies for a permil user base seems awkward.
To be clear, adding fsspec as a dependency will bring just 1 new package in by default (namely, fsspec). What one then needs to do is bring in additional packages depending on their remote IO needs, all in an opt-in fashion.
Of course we still should solicit a lot more feedback on this. @chrisburr in particular may have thoughts.
Remaining test failures seem unrelated?
@kkothari2001, are the failing Dask tests ones that require a particular version of dask-awkward, and should therefore be protected by
I had a look, it is breaking due to either changes made in dask-awkward code itself, or from updating to awkward On my computer, I tried backporting to dask-awkward version
On second thought, there is known to be some discrepancy between my personal dev setup and the PR tests (because I've been installing the latest dask-awkward/main), so let's just skip the tests for now with
The failing test can be fixed by merging in #694.
We have not published fsspec for Python 3.6. Should we do this or will uproot soon drop support for 3.6?
Uproot will drop support for Python 3.6: #687. Only inertia has kept this from happening so far: if it's the slightest bit easier for the fsspec source, that would definitely push it over the edge. Then we're getting on a schedule of dropping Python versions when they're end-of-lifed (can't find the GitHub reference where that was stated, but it's the plan).
@nsmith-, I'm marking this PR as "inactive." I would like to follow up on it someday, though it may get to the point where you'd want to re-apply these changes to a new PR, branched off of
Closed in favor of #967
No description provided.