AICSImageIO 5.0 and Roadmap #372

evamaxfield · 2022-01-25T20:54:49Z

evamaxfield
Jan 25, 2022
Maintainer

Hello everybody! We are starting a discussion here as a means to figure out the general interest / a potential plan for an AICSImageIO 5.0.

Over the past couple of months we have seen a couple of new readers added, we have experienced some licensing issues, and we have heard some general gripes and wishes for this library.

Reasons for a 5.0

TL;DR:

We want to move to a reader plugin model to encourage more community reader implementations
We want to allow readers the option to support different array formats and reading modes (numpy, dask, cupy, etc.)
We want to make the API even easier to understand and explicit about interaction behaviors
We want to remove some of the requirements for writing a file format reader to make it even easier on contributors and maintainers

We also want to minimize API changes to minimize the upgrade burden for existing code

Full Details

Make it easier for members of the community to implement their own readers
While we originally thought that our base reader specification was minimal in what a contributor would need to provide, there have been issues. The first is the many complications of writing and maintaining a reader implementation (more on this later), but this specific target is meant to make it easier to contribute a reader that follows our spec. To that end, we look to the plugin model (inspiration comes from fsspec and napari hookspec). Some goals and ideas for this are:
- The reader spec / ABC could live outside of aicsimageio (although it may not be necessary - TBD)
- Reader implementations are plugins that when installed, can be picked up by aicsimageio
- Reader implementations can have their own licensing and exist as separate packages
- We can create a cookiecutter to help people get started with a new file format reader
- Remove the requirement on file readers to close file handles (see here)
Reduce the maintenance burden on us / a single library
As we have added more readers to the library, we have become worried (as rightfully pointed out by members of the community) that it may become a problem if someone contributes a reader to the library but fails to maintain it and fix bugs.
- If readers are plugins, individual file format readers have independent builds and releases. We don't need to fix them; the plugin author / the repo owner maintains "ownership".
- The plugin interface (reader base class) will of course be carefully constructed so as to be maximally future-proof for the problem domain of scientific image loading, to reduce the burden on plugin maintainers. We think we can do this with only small changes to the existing reader API.
- Note: TIFF and Zarr formats will still be baked into the core library as they are community standards and have sufficiently open licensing. These readers would follow the same plugin interface. It is possible that core aicsimageio provides other readers if they have appropriate licensing, and the maintainers of aicsimageio are able to support them.
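To make the plugin-interface idea a bit more concrete, here is a minimal sketch of what a reader base class and a third-party subclass could look like. Every name in it (`BaseReader`, `supported_extensions`, `ToyTextReader`) is hypothetical and chosen for illustration only; this is not the actual aicsimageio base class, and pixel data is returned as nested lists to keep the example dependency-free.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import List, Tuple


class BaseReader(ABC):
    """Hypothetical minimal reader contract; not the real aicsimageio base class."""

    # Extensions the plugin claims to handle (lowercase, with leading dot).
    supported_extensions: Tuple[str, ...] = ()

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    @classmethod
    def is_supported_image(cls, path: str) -> bool:
        # Cheap default check based on the file name only.
        return path.lower().endswith(cls.supported_extensions)

    @abstractmethod
    def get_image_data(self) -> List[List[int]]:
        """Return pixel data (nested lists here, to avoid extra dependencies)."""


class ToyTextReader(BaseReader):
    """Toy plugin: whitespace-separated integers, one image row per line."""

    supported_extensions = (".toy.txt",)

    def get_image_data(self) -> List[List[int]]:
        # The file handle is opened and closed inside the call, so the
        # reader never has to keep track of open handles.
        with open(self.path) as f:
            return [[int(v) for v in line.split()] for line in f if line.strip()]
```

A plugin package would then only need to ship a subclass like `ToyTextReader`, with its own license and release cycle.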
Allow variability in which array-types and reading modes are supported
One of the things we noticed from new reader implementers was confusion as to "what all was needed." Some even asked: "Why can't I just support numpy.ndarray as the output type?" This is a totally valid question. Additionally, many users have expressed confusion about what happens when they call get_image_data vs get_image_dask_data, and as we grow our supported dtypes, do we simply keep adding methods? This somewhat relates to how we already allow reader implementations to determine if they want to support mosaic file stitching or not.

In 5.0, reader authors can choose to support output to numpy, dask, or cupy arrays. If a reader is used in a mode that is not implemented, an exception should be raised. (Maybe a reader should be able to report which modes it supports, to avoid trial and error.) This should be extensible to new array types, which would be unimplemented in readers by default.
- Proposing the dtype and mode parameters to specify the underlying memory model: img = AICSImage("path", dtype="numpy", mode="delayed"). (Note: there is probably a better name than dtype, to avoid confusion)
- The combination of these parameters would allow implementers to decide what they want to support: numpy-in-memory, numpy-delayed, cupy-in-memory, cupy-delayed, etc.
- The fundamental API call of Reader.get_image_data will remain the key entry point for getting pixel data out. The base class and utility code in aicsimageio will provide important basic supporting functionality around chunking, transpositions, etc.
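As a rough illustration of the dtype / mode proposal, the sketch below shows a reader that declares which combinations it supports and raises for everything else. All names (`ArrayType`, `Mode`, `UnsupportedModeError`, `SketchReader`) are assumptions made up for this example, and a plain callable stands in for a real lazy array such as a dask array.

```python
from enum import Enum
from typing import List


class ArrayType(str, Enum):
    NUMPY = "numpy"
    CUPY = "cupy"


class Mode(str, Enum):
    IN_MEMORY = "in-memory"
    DELAYED = "delayed"


class UnsupportedModeError(NotImplementedError):
    """Hypothetical error for an (array type, mode) pair a reader lacks."""


class SketchReader:
    # This toy reader only implements numpy-style output, eager or lazy.
    supported = {
        (ArrayType.NUMPY, Mode.IN_MEMORY),
        (ArrayType.NUMPY, Mode.DELAYED),
    }

    def __init__(self, path: str, dtype: str = "numpy", mode: str = "in-memory"):
        key = (ArrayType(dtype), Mode(mode))
        if key not in self.supported:
            raise UnsupportedModeError(
                f"{type(self).__name__} does not support {dtype}/{mode}"
            )
        self.path = path
        self.array_type, self.mode = key

    def _load(self) -> List[List[int]]:
        # Stand-in for real pixel decoding.
        return [[0, 1], [2, 3]]

    def get_image_data(self):
        if self.mode is Mode.DELAYED:
            # Stand-in for a lazy array: nothing is read until it is called.
            return self._load
        return self._load()
```

With this shape, `SketchReader("img.tiff", dtype="cupy")` fails immediately with a clear error instead of failing somewhere deep inside a read.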
Minor API and user interaction improvements:
AICSImageIO really tries to determine a lot of behavior for the user without them knowing. Some of those things are chunk_dims, mosaic handling and, crucially, even which Reader will be selected as the default. But these have caused problems and confusion because the user has no information before attempting to load the file. A specific example is a large mosaic image that may take minutes to completely stitch together before the AICSImage object is returned. The user should be able to determine how they want to load the object before they attempt to do so, with the defaults generally still being "good practice."
- Add functions to readers that enable a fast and cheap check for dims and other metadata, to help users decide which mode and dtype they want their data stored in. For example, Reader.get_read_modes_available should return a list of the modes available for that reader for that file, and Reader.get_metadata(file) should pop open the file, parse the metadata, and return information that helps users decide which mode to open the file in, or lets them simply access the metadata without loading any pixel data.
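A small sketch of the "cheap check before loading" idea follows. The toy file format (a `DIMS Y X` header line followed by pixel rows), the class name `HeaderOnlyReader`, and the `max_in_memory_pixels` threshold are all invented for this example; only the two method names come from the proposal above.

```python
from typing import Dict, List


class HeaderOnlyReader:
    """Hypothetical reader for a toy format: line 1 is 'DIMS Y X', the rest is pixels."""

    @staticmethod
    def get_metadata(path: str) -> Dict[str, int]:
        # Read only the header line; the pixel rows are never touched.
        with open(path) as f:
            _, y, x = f.readline().split()
        return {"Y": int(y), "X": int(x)}

    @classmethod
    def get_read_modes_available(
        cls, path: str, max_in_memory_pixels: int = 1_000_000
    ) -> List[str]:
        # Assumption for this sketch: delayed reading always works, and
        # in-memory reading is only offered for images under a size threshold.
        dims = cls.get_metadata(path)
        modes = ["delayed"]
        if dims["Y"] * dims["X"] <= max_in_memory_pixels:
            modes.append("in-memory")
        return modes
```

A user could call `get_read_modes_available` first and only then decide how to construct the image object, which avoids the "minutes of stitching before anything is returned" surprise.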

Feedback

If you have any thoughts, ideas, concerns, or anything else you wish to share about these ideas please let us know! We are hoping to get a lot of feedback for a potential 5.0 API throughout the design and implementation process.

(in-place edits by @toloudis)

toloudis · 2022-01-25T23:28:45Z

toloudis
Jan 25, 2022
Maintainer

Just a couple of additional random high level thoughts.

This is also an opportunity to make sure we have thought about functionality like mosaic, pyramid, multiscene, and RGB, while making sure the 80% code path has the simplest API.

As we develop the spec of the new API, again, a goal is to keep the API simple and sensible, with easy-to-use defaults for most scalar-valued bio images, and to minimize API changes so that existing code needs as few updates as possible.

With the changes we discuss here, AICSImageIO would consist entirely of:

the AICSImage class
the reader base class and utility functionality common to all readers
a system for finding/discovering/registering external readers
the open file format readers and writers (OME-TIFF and Zarr)

0 replies

toloudis · 2022-03-09T21:02:14Z

toloudis
Mar 9, 2022
Maintainer

If we switch to a plugin model, we do this in part to reduce the maintenance burden of a single central repo that has ALL the file format readers. I wonder how we would "hand back" the currently implemented readers to their rightful maintainers (who we assume to be the domain experts on those particular file formats).

We expect aicsimageio to maintain at least a couple of readers and writers, for open file formats like TIFF and Zarr, but it's easy to see a future where third parties implement even more optimized versions for the same formats.

Another key problem we have identified with a discoverable plugin system is assigning precedence when there is more than one plugin that can read the same file format.

0 replies

evamaxfield · 2022-07-13T16:49:08Z

evamaxfield
Jul 13, 2022
Maintainer Author

Prep for mini-hackathon:

in the case of multiple TIFF-reading plugins, use the most recently installed as the default; otherwise allow the user to specify a reader
have a single get_image_data function that has a mode parameter for in-memory or delayed
maybe a function called get_xarray that has a mode parameter, that attempts to generate the xarray obj on demand
add pint to the library
entry-points are the current plan for plugins
create secondary package with just the Reader interface / protocol and typing
change properties over to normal functions
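For reference, entry-point based discovery can be sketched with the standard library alone. The group name `aicsimageio.readers` is a placeholder assumption (the notes above do not name one), and the function name `discover_readers` is likewise made up for this example.

```python
from importlib.metadata import entry_points
from typing import Dict

# Hypothetical group name; the notes above do not say which group would be used.
READER_GROUP = "aicsimageio.readers"


def discover_readers(group: str = READER_GROUP) -> Dict[str, object]:
    """Return {entry point name: loaded object} for every installed plugin in `group`."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ selection API.
        selected = eps.select(group=group)
    else:
        # Python 3.8 / 3.9 return a dict keyed by group name.
        selected = eps.get(group, [])

    readers: Dict[str, object] = {}
    for ep in selected:
        try:
            readers[ep.name] = ep.load()
        except Exception as exc:  # one broken plugin should not break discovery
            print(f"skipping {ep.name}: {exc}")
    return readers
```

A plugin would then advertise itself under the same group in its setup.py, for example `entry_points={"aicsimageio.readers": ["tiff = tiff_reader:ReaderMetadata"]}` (module and group names again hypothetical).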

0 replies

toloudis · 2022-07-13T16:53:24Z

toloudis
Jul 13, 2022
Maintainer

reader precedence:
reader explicitly provided to the get data command
supported extensions
mime type?
most recently installed
loop over all installed, or try a few and then raise

0 replies

evamaxfield · 2022-07-13T18:32:52Z

evamaxfield
Jul 13, 2022
Maintainer Author

Homework for us both: https://github.com/danielballan/pims2-prototype

Read

0 replies

evamaxfield · 2022-07-18T23:18:22Z

evamaxfield
Jul 18, 2022
Maintainer Author

Updates from Hackathon

Done

We made a repo called base-image-reader that is meant to store only the Reader base class, type information, constants, etc. New plugins will only need to depend on this base-image-reader library.
We made two plugins. The main idea is that plugins should really only have to define three things: (1) a ReaderMetadata object that is default exported from the library, (2) the actual Reader implementation, and (3) the entrypoint in setup.py.
a. tiff-reader: the exact same code as the current aicsimageio.readers.tiff_reader.TiffReader but uses the base-image-reader library.
b. aicsimageio-ometiffreader: the exact same code as the current aicsimageio.readers.ome_tiff_reader.OmeTiffReader but uses the tiff-reader library.
We wrote a plugin detection and sorting prototype:
a. Plugins are detected using stdlib Python entrypoints
b. Plugins are added to a LUT with their "supported extensions" as the primary key
c. Each extension key has a value of a list of matching plugins, sorted by installation datetime. The most recently installed reader with the longest matching extension will be the selected reader. (The idea here is: "ahh, my file doesn't work with this reader, I can install this other one, and it is now the default.")
d. This code can be seen in our v5-proto branch
Showed that these plugins and this installation detection pattern work for TIFF and OME-TIFF files (no tests just yet)
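The lookup table and selection rule described above can be sketched roughly as follows. This is not the code from the v5-proto branch; `PluginInfo`, `build_lut`, and `select_reader` are hypothetical names, and install times are passed in by hand rather than detected.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class PluginInfo:
    name: str
    extensions: Tuple[str, ...]  # e.g. (".tiff", ".tif")
    installed: datetime


def build_lut(plugins: List[PluginInfo]) -> Dict[str, List[PluginInfo]]:
    """Map each supported extension to its plugins, most recently installed first."""
    lut: Dict[str, List[PluginInfo]] = {}
    for plugin in plugins:
        for ext in plugin.extensions:
            lut.setdefault(ext.lower(), []).append(plugin)
    for candidates in lut.values():
        candidates.sort(key=lambda p: p.installed, reverse=True)
    return lut


def select_reader(path: str, lut: Dict[str, List[PluginInfo]]) -> Optional[PluginInfo]:
    """Pick the newest plugin registered for the longest extension matching `path`."""
    name = path.lower()
    matches = [ext for ext in lut if name.endswith(ext)]
    if not matches:
        return None
    return lut[max(matches, key=len)][0]
```

Because the longest matching extension wins first, a file named `img.ome.tiff` goes to a plugin registered for `.ome.tiff` even if a plain `.tiff` plugin was installed more recently.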

Planned Soon

Move each reader into its own plugin repo
a. Make cookiecutter-aicsimageio-plugin
b. Not sure how to handle TiffGlobReader (maybe it's as simple as: the "extensions" are still "tif" and "tiff", but the is_supported_image function checks for glob patterns).
c. Determine where ArrayLikeReader should live (likely in AICSImageIO)
d. Likely make a new organization to manage all of these repos
e. Move over all reader tests (and unique data) to the individual repos
Write a bad-image-reading-plugin for fuzz testing. It claims to support every file extension, but when it tries to read a file, it fails every time. Useful for testing.
Release 5.0.dev0 with plugin installation

Planned Later

Settle on 5.0 API changes
Implement 5.0 API changes in base-image-reader library
Individually implement 5.0 API changes for each reader
a. We can now do this, because they all live separately 🎉

0 replies

eode · 2022-07-21T17:27:22Z

eode
Jul 21, 2022

Oh -- as far as plugins go, is there a necessity to standardize permissible error types a (functional) plugin is permitted to / expected to use?

2 replies

evamaxfield Jul 21, 2022
Maintainer Author

Good question and one we haven't thought about! I personally want to leave it up to the plugin author to manage that.

We can provide good defaults such as UnsupportedFileFormatError and such but if they want to have highly specific errors I say they should be able to.

We sorta already allow this in v4, each reader can somewhat do whatever and we just catch all errors in the AICSImage object init.
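A minimal sketch of that "catch everything, fall through to the next reader" pattern, assuming nothing about the real AICSImage internals. The exception class below is a local stand-in with the same name as the default mentioned above, and `PickyReader`, `CrashingReader`, and `first_working_reader` are made up for illustration.

```python
class UnsupportedFileFormatError(Exception):
    """Local stand-in for the default error a reader can raise."""


class PickyReader:
    def __init__(self, path: str) -> None:
        if not path.endswith(".tiff"):
            raise UnsupportedFileFormatError(f"PickyReader cannot read {path}")
        self.path = path


class CrashingReader:
    def __init__(self, path: str) -> None:
        # A plugin is free to raise its own, more specific error types.
        raise RuntimeError("plugin-specific failure")


def first_working_reader(path, reader_classes):
    """Try each reader; swallow whatever it raises and report all failures at the end."""
    failures = {}
    for cls in reader_classes:
        try:
            return cls(path)
        except Exception as exc:  # deliberately broad: plugins may raise anything
            failures[cls.__name__] = repr(exc)
    raise UnsupportedFileFormatError(f"no reader could open {path}: {failures}")
```

Collecting the per-reader failures in the final message keeps plugin-specific errors visible even though they are caught along the way.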

eode Jul 21, 2022

Fair enough.

SeanLeRoy · 2023-06-21T18:46:21Z

SeanLeRoy
Jun 21, 2023
Maintainer

@evamaxfield @toloudis @BrianWhitneyAI

Dan, Brian, & I briefly talked about how to determine which plug-in is best to accept when there are multiple plug-ins that support the same file format (like TIFF reader vs bioformats). This is the precedence we drafted:

If a specific reader was supplied by the user:
- Use the specified reader
Otherwise find all readers that declare that they support the given file extension:
- If exactly 1 reader is found, use it.
- If 0 readers are found and the user supplied a parameter allowing exhaustive searching (like exhaustive_reader_search=True, by default False) then search against all readers asking if they support the file. If none do or if the user didn't allow exhaustive searching then raise an error.
- If > 1 readers are found, use the most recently installed one.
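The drafted precedence can be written down as a short function. This is only a sketch: `Candidate` and `choose_reader` are hypothetical names, `LookupError` is an arbitrary choice of error type, and the draft does not say what happens when an exhaustive search finds several readers, so the sketch assumes "most recently installed" applies there too.

```python
from datetime import datetime
from typing import Callable, List, NamedTuple, Optional, Tuple


class Candidate(NamedTuple):
    name: str
    extensions: Tuple[str, ...]
    installed: datetime
    can_read: Callable[[str], bool]  # deeper check, used only in exhaustive search


def choose_reader(
    path: str,
    installed: List[Candidate],
    reader: Optional[Candidate] = None,
    exhaustive_reader_search: bool = False,
) -> Candidate:
    # 1. A reader supplied by the user always wins.
    if reader is not None:
        return reader
    # 2. Otherwise, collect readers that declare support for the extension.
    found = [c for c in installed if path.lower().endswith(c.extensions)]
    # 3. Nothing declared support: optionally ask every reader directly.
    if not found and exhaustive_reader_search:
        found = [c for c in installed if c.can_read(path)]
    if not found:
        raise LookupError(f"no installed reader supports {path}")
    # 4. One match is returned as-is; several are resolved by install time.
    return max(found, key=lambda c: c.installed)
```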

See also this issue in bioio that describes how some plug-ins may be filtered out.

My initial thoughts are:

I like this design
I wonder if it would be excessive/unnecessary to allow the exhaustive search option, since if the plug-ins don't declare support for the file extension then what are the chances it'll actually work out. One of the reasons this was suggested (I believe) is that sometimes files don't have an extension; I think in those cases the user should likely supply which reader to use though.
On a quick search I'm unsure if we can determine the date a plug-in was installed as opposed to just the last time it was modified which could have been from an upgrade. Maybe this doesn't matter though.

0 replies

toloudis · 2023-06-22T00:49:35Z

toloudis
Jun 22, 2023
Maintainer

I tend to agree that the "exhaustive search" (which lets every possible installed plugin try to open the file and read some bits to try to tell what type it is) is potentially unnecessary. I wonder what workflows truly depend on this, where they are batching over many files and they don't have file extensions and don't know what the types are.

Also be sure to note that extension ".ome.tiff" is different than ".tiff" and higher precedence because it is more specific.

0 replies

evamaxfield · 2023-06-22T17:04:38Z

evamaxfield
Jun 22, 2023
Maintainer Author

> If a specific reader was supplied by the user:
> - Use the specified reader
>
> Otherwise find all readers that declare that they support the given file extension:
> - If exactly 1 reader is found, use it.
> - If 0 readers are found and the user supplied a parameter allowing exhaustive searching (like exhaustive_reader_search=True, by default False) then search against all readers asking if they support the file. If none do or if the user didn't allow exhaustive searching then raise an error.
> - If > 1 readers are found, use the most recently installed one.

I think this is in line with our thoughts from long ago. I can think of a few potential use cases for exhaustive search (usually involving "no-file-extension blob storage"), but I think for a pre-release v5 it can be ignored and added later, in a release candidate or something.

> On a quick search I'm unsure if we can determine the date a plug-in was installed as opposed to just the last time it was modified which could have been from an upgrade. Maybe this doesn't matter though.

I personally don't think the update / created datetime difference matters. Updated datetime hopefully means most recently upgraded which works for me.

> Also be sure to note that extension ".ome.tiff" is different than ".tiff" and higher precedence because it is more specific.

Ahhh yes. I forget how the current implementation works but I think priority should be: "length of suffix" match first, then datetime installed / updated, etc. etc.

Last little idea, which I would say to hold on until release candidate season, is something like a use_readers parameter that is a list of "user allowed readers to use". I.e. instead of trying all of the plugins / readers known to the library, the user can say, "I only want to use these three plugins / readers." Again, I would say this can be held until everything else is done.
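To illustrate how the suffix-length priority and a use_readers allow-list might combine, here is a small ranking sketch. The function name `rank_matches` and the tuple layout are assumptions for this example; only the `use_readers` parameter name comes from the comment above.

```python
from datetime import datetime
from typing import Iterable, List, Optional, Tuple

# (reader name, matched suffix, install or update time)
Match = Tuple[str, str, datetime]


def rank_matches(
    matches: Iterable[Match], use_readers: Optional[List[str]] = None
) -> List[str]:
    """Order reader names by suffix length first, then by install/update time."""
    allowed = None if use_readers is None else set(use_readers)
    kept = [m for m in matches if allowed is None or m[0] in allowed]
    # Longer suffix wins; ties go to the more recently installed reader.
    kept.sort(key=lambda m: (len(m[1]), m[2]), reverse=True)
    return [name for name, _, _ in kept]
```

Passing `use_readers=["bioformats", "tiff"]` simply drops every other reader before ranking, so the precedence rules themselves stay unchanged.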

1 reply

SeanLeRoy Jun 22, 2023
Maintainer

> Ahhh yes. I forget how the current implementation works but I think priority should be: "length of suffix" match first, then datetime installed / updated, etc. etc.

I like the length of suffix idea 👍
