Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: fsspec source implementation #692

Closed
wants to merge 6 commits into from
Closed

Conversation

nsmith-
Copy link
Collaborator

@nsmith- nsmith- commented Aug 25, 2022

No description provided.

@nsmith- nsmith- changed the title fsspec source implementation feat: fsspec source implementation Aug 25, 2022
Comment on lines +19 to +36
# Remove uproot-specific options (should be done earlier)
# TODO: is timeout always valid?
opts = {
k: v
for k, v in kwargs.items()
if k
not in {
"file_handler",
"xrootd_handler",
"http_handler",
"object_handler",
"max_num_elements",
"num_workers",
"num_fallback_workers",
"begin_chunk_size",
"minimal_ttree_metadata",
}
}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to pop uproot-specific options in uproot.reading.open instead of passing all of them to the source. I think it would also be nice to deprecate specific *_handler functions in favor of some sort of filesystem= argument which would override the default irrespective of what the default is. Some of these arguments are source-related (e.g. timeout or num_workers), in which case they would better be defaulted keyword arguments in the respective source constructors.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The *_handler arguments are wrong for a Source like fsspec, I agree. (And didn't see it coming.)

If we go the route of handling all remote file I/O with fsspec, then the natural thing to do would be to replace these arguments with a filesystem or fs argument (as is becoming standard in some other projects).

The role currently played by file_handler and object_handler would need some equivalent, though. There are two ways one might want to use a file_handler: as memmap and as conventional file object(s). I expect that memmap would usually be better, but worse through NFS. For object_handler, there's only one option (not something for a user to select), but it is very distinct from handlers for an object that you know is a file: if it's a generic object, we can't assume that we can seek and read in parallel.

@jpivarski
Copy link
Member

This is looking good. But how will an fsspec-capable Source be enabled? The other Sources are identified by URI scheme, but a URI with https:// or root:// could be handled by either FSSpecSource or the non-fsspec ones. (s3:// would be unambiguous.)

The backend that runs can't be determined by what packages are installed (i.e. attempt to import fsspec and then run different code, depending on whether it raises MissingModuleException) because that would make debugging user issues more complicated.

If it's opt-in (only used if FSSpecSource is explicitly selected in uproot.open), then there will be people who would have wanted to use it that never find out about it.

If it's something that Coffea or coffea-casa triggers when the time is right, that's fine.

pyproject.toml Show resolved Hide resolved
@nsmith-
Copy link
Collaborator Author

nsmith- commented Aug 25, 2022

One option is to simply make uproot depend on fsspec. Then fsspec is responsible for picking the right implementation for a given protocol. (we could provide an override option) Of course, this means that we have to make sure the fsspec http and xrootd implementations are as robust and performant as the current implementations. This is something @ScottDemarest was interested in looking into.

I know one potential (possibly premature) optimization is to notify downstream executors as soon as some portion of data is available. We can do this within the constraints of the fsspec public API by using async filesystem implementation with an event loop created and held as a resource by uproot (essentially copying the sync wrapper implementation that fsspec uses internally). If we were to do that, it would also be nice to drop the custom executor and futures implementation here and adopt the standard library one, now that python 2 is not supported.

@jpivarski
Copy link
Member

If we want uproot to be "batteries included" we should add these to the regular install requirements if we adopt FSSpec as the standard source.

I get pulled in both directions about whether to include batteries or not.

On the one hand,

  • "Easy install" was the top reason that people gave for why they use Uproot in the first place. Adding dependencies jeopardizes that, though the case against fsspec is overstated. fsspec is a very easy dependency, and it's pure Python and everything.
  • @agoose77 is WASM-building Awkward for use in Thebe-powered tutorials. We need to keep dependencies minimal for the web (mostly pure Python and not too large to download). Again, using this as a case against fsspec, even if we decide to do a similar in-browser tutorial for Uproot (haven't talked about it), would be overstating it. fsspec is small (141 kB).

On the other hand,

  • It's annoying to install a package, try to use it, and immediately be told that you have to install another package. It's even worse if you aren't using the package interactively (or directly: it's a hidden dependency of the one you are using). It could even be a multi-pass cycle of install, try to use, install more, try to use again, install yet more.
  • The Python world has well-established dependency resolution tools, but optional dependencies require us to effectively reimplement that, including minimum versions.
  • Some of these dependencies implement rather core features. In particular, I'm recently swayed by the fact that we're now saying Awkward's primary on-disk format is Parquet, but to use Parquet you have to install pyarrow (and fsspec). I'm only being held back on that one by the WASM target: pyarrow is huge and a monster to compile.

Currently, the PyPI versions of Uproot and Awkward Array have minimal dependencies, while the conda-forge recipes are batteries included. We've also considered

pip install uproot[complete]   # and
pip install awkward[complete]

but then the [complete] labels would basically be synonyms of [dev] and [test]. I don't know what should be in one label and not the others (except maybe [test] should include pytest, the others not).


Enlarging the scope of this, and perhaps answering my previous question: how about if fsspec-enabled remote access replaces the current HTTPSource and XRootDSource?

The HTTPSource was written using only the Python standard library in an extremely dependency-averse mood. It should have been based on requests. (I wasn't familiar with how basic that is to the ecosystem, at least on par with NumPy.) I think we would have had fewer issues with HTTP redirects if we had gone this route from the start.

I have had so much help from user-developers fixing up the XRootDSource, which I'm grateful for, but it indicates how tricky it is. By now, there's a lot of code in XRootDSource that I don't understand. If that was used as a model to develop xrootd-fsspec, then maybe we should discontinue this implementation in favor of that one.

The story for users would be simpler: "If you want to access remote files (any non-file URI), you need fsspec (or install uproot[complete])." And then that comes with its optional dependencies, like s3fs. Then Uproot is out of the business of remote access and relies on the same remote access libraries as everybody else.

We could make this part of the Uproot 4 → 5 switch. What do you think?

(Our comments crossed. It sounds like we have similar thoughts on this.)

@jpivarski
Copy link
Member

The story for users would be simpler: "If you want to access remote files (any non-file URI), you need fsspec (or install uproot[complete])."

One option is to simply make uproot depend on fsspec.

Yes, Uproot could strictly depend on fsspec, since that's pure Python and small. I don't know if this slippery slope extends to lz4 and xxhash (not pure Python, but small), since LZ4 is a not-uncommon ROOT compression setting. Or ZStandard. Maybe it does, since these are things a user might not even know they need: they just open a TTree and it fails. Dependencies for extra user-visible features like hist and Dask are different, so the slippery slope does stop at some point.

@nsmith-
Copy link
Collaborator Author

nsmith- commented Aug 25, 2022

I do sometimes wonder if "easy to install" means simply that pip worked, or that the user was happy to see that pip did not bring in more than one package. But that's a bit beside the point

It should be noted that fsspec is lightweight because it is very much not batteries-included. For example, in a fresh environment, if you try to pip install fsspec and then run

with fsspec.open("http://google.com") as f:
    print(f.read(10))

you will get an error:

ImportError: HTTPFileSystem requires "requests" and "aiohttp" to be installed

and if you go install those, then you end up with 12 packages!

Installing collected packages: urllib3, multidict, idna, frozenlist, charset-normalizer, certifi, attrs, async-timeout, yarl, requests, aiosignal, aiohttp
Successfully installed aiohttp-3.8.1 aiosignal-1.2.0 async-timeout-4.0.2 attrs-22.1.0 certifi-2022.6.15 charset-normalizer-2.1.1 frozenlist-1.3.1 idna-3.3 multidict-6.0.2 requests-2.28.1 urllib3-1.26.12 yarl-1.8.1

The same goes for fsspec-xrootd, but doubly-so: the first error will ask you to install fsspec-xrootd, then immediately afterwards it will again ask you to figure out how to install xrootd (because it still is not pip installable). In that spirit, I would say adding fsspec and fsspec-xrootd to the explicit dependencies of uproot is OK because they are very lightweight and themselves will ask the user to install further packages.

@jpivarski
Copy link
Member

jpivarski commented Aug 25, 2022

I do sometimes wonder if "easy to install" means simply that pip worked, or that the user was happy to see that pip did not bring in more than one package.

I think that's true, so having a tree of transitive dependencies is not off the table. (Earlier, I had been considering cases of a hard-to-access DAQ-or-whatever, without outside network/pip access, but not anymore. There are freezing options.)

It is necessary for all of those transitive dependencies to successfully install, however. Some cases of lz4 installation have broken because they assume liblz4.so is available on the OS somewhere. I don't know if that's still true.

Having Uproot strictly depend on fsspec, and then fsspec asks the user to install requests and aiohttp, would still be an improvement over not having Uproot strictly depend on fsspec, since it would only be one round of telling the user to go install something else, rather than two rounds. I can imagine that getting very annoying.

Edit: Got it: you said all of that, but include both fsspec and fsspec-xrootd, since they're at the same level.

@agoose77
Copy link
Collaborator

agoose77 commented Aug 25, 2022

I'm personally +1 on making fsspec a dependency:

  • IO is very important in uproot
  • fsspec is lightweight RE dependencies
  • fsspec is near "standard" for IO in Python ecosystem
  • we can drop code 🥳

@eduardo-rodrigues
Copy link
Member

eduardo-rodrigues commented Aug 26, 2022

Hello @jpivarski, allow me a quick comment - these are important matters and something that IMO you should raise more widely, nice with org admins and most usefully with users via Gitter/... I would be a bit scared myself if you were going to by default, meaning for standard releases, to increase significantly the dependencies. Indeed, as you said, the success of uproot has been "small and easy". I won't go into big thoughts here, so making a single one at this point: ask yourself the percentage of (the very large number of) uproot users who actually need ffspec ... or have even just heard of it! Adding something that brings in effectively an order magnitude more dependencies for a permil user base seems awkward.

@nsmith-
Copy link
Collaborator Author

nsmith- commented Aug 26, 2022

To be clear, adding fsspec as a dependency will bring just 1 new package in by default (namely, fsspec) What one then needs to do is bring in additional packages depending on their remote IO needs, all in an opt-in fashion.
But this is not really so much about users wanting fsspec: this is a library that makes uproot development easier by abstracting storage. With this dependency, it would be much easier to, for example, implement the much-desired glob feature, or to add headers (token auth!) to http IO, etc. And we would not be rolling our own http, as the standard library is not nearly sufficient to handle everything, case in point #121 #176 etc.

@nsmith-
Copy link
Collaborator Author

nsmith- commented Aug 26, 2022

Of course we still should solicit a lot more feedback on this. @chrisburr in particular may have thoughts

@nsmith-
Copy link
Collaborator Author

nsmith- commented Aug 26, 2022

Remaining test failures seem unrelated?

FAILED tests/test_0652_dask-for-awkward.py::test_dask_concatenation - TypeError: could not convert data into an ak.Array
FAILED tests/test_0652_dask-for-awkward.py::test_delay_open - TypeError: could not convert data into an ak.Array

@jpivarski
Copy link
Member

@kkothari2001, are the failing Dask tests ones that require a particular version of dask-awkward, and should therefore be protected by pytest.importorskip or maybe pytest.mark.skipif(dask_awkward.version)?

@kkothari2001
Copy link
Collaborator

I had a look, it is breaking due to either changes made in dask-awkward code itself, or from updating to awkward v1.9.0rc5 to v1.9.0rc10 in commit 7b5ca3 in dask-awkward.

On my computer, I tried backporting to dask-awkward version 2022.7a0 it passes, ig for now we can work with that, then figure the reason for the error.

@kkothari2001
Copy link
Collaborator

On second thought, there is known to be some discrepancy between my personal dev setup and the PR tests (because I've been installing the latest dask-awkward/main) let's just skip the tests with now with pytest.mark.skip and I'll resolve this issue in my currently open PR #679.

@jpivarski
Copy link
Member

The failing test can be fixed by merging in #694.

@nsmith-
Copy link
Collaborator Author

nsmith- commented Sep 26, 2022

We have not published fsspec for python 3.6. Should we do this or will uproot soon drop support for 3.6?

@jpivarski
Copy link
Member

Uproot will drop support for Python 3.6: #687. Only inertia has kept this from happening so far: if it's the slightest bit easier for the fsspec source, that would definitely push it over the edge.

Then we're getting on a schedule of dropping Python versions when they're end-of-lifed (can't find the GitHub reference where that was stated, but it's the plan).

@jpivarski jpivarski added the inactive A pull request that hasn't been touched in a long time label Nov 28, 2022
@jpivarski
Copy link
Member

@nsmith-, I'm marking this PR as "inactive." I would like to follow up on it someday, though it may get to the point where you'd want to re-apply these changes to a new PR, branched off of main.

@jpivarski jpivarski mentioned this pull request Jul 12, 2023
2 tasks
lobis added a commit to lobis/uproot5 that referenced this pull request Oct 3, 2023
@nsmith-
Copy link
Collaborator Author

nsmith- commented Oct 4, 2023

Closed in favor of #967

@nsmith- nsmith- closed this Oct 4, 2023
@nsmith- nsmith- deleted the fsspecsrc branch October 4, 2023 19:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
inactive A pull request that hasn't been touched in a long time
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants