Dask, memmap, lazy #350
-
I am asking because I have only a vague idea of how to achieve chunked random reading of the Bruker format, but I don't see clearly "the holy grail", i.e. what that ability should ultimately serve.
Could both somehow be married?
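Purely as an illustration of how the two could be married, not an existing API: each chunk read could be wrapped in `dask.delayed` and the pieces assembled into one lazy dask array. The `read_strip` helper, the file name, the shapes and the dtype below are all hypothetical.

```python
import numpy as np
import dask
import dask.array as da

def read_strip(path, row_start, row_stop):
    # Hypothetical helper: decode only navigation rows [row_start, row_stop)
    # of the Bruker file and return a numpy array of shape
    # (row_stop - row_start, nav_x, n_channels).
    ...

nav_y, nav_x, n_channels = 512, 512, 2048
rows_per_chunk = 64

strips = [
    da.from_delayed(
        dask.delayed(read_strip)("map.bcf", start, start + rows_per_chunk),
        shape=(rows_per_chunk, nav_x, n_channels),
        dtype=np.uint16,
    )
    for start in range(0, nav_y, rows_per_chunk)
]

# One lazy array; a strip is only decoded when its chunk is actually computed.
lazy_data = da.concatenate(strips, axis=0)
```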
-
These are valid points/questions that touch on what we try to achieve with rosettasciio! Coming back to the Bruker format, it seems that there are functionalities that would be useful to others (for example, as mentioned in #36).
This is a wrong/misleading implementation of lazy loading in the case of the Bruker format. Another example of inconsistent lazy loading is the Velox emd reader, which is not fully lazy but is still useful; sometimes it is a matter of balancing needs and efforts!
Yes, the
What does the "unpacked" content mean here?
Yes, what we recommend is to convert to something like a zarr format, typically zspy. @CSSFrancis will most likely have a better understanding than me on this, but in the case of the Bruker format, I would expect that using the …
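A minimal sketch of that recommended conversion, assuming a reader that can hand back a dask-backed lazy signal (the file name below is illustrative):

```python
import hyperspy.api as hs

# Load lazily: the signal data is a dask array rather than an in-memory numpy array.
s = hs.load("map.bcf", lazy=True)

# Saving with a .zspy extension writes to the zarr-based zspy format,
# which stores the data chunk by chunk.
s.save("map.zspy")
```

For this to stay out-of-core, the reader itself has to produce a chunked dask array; if "lazy" only delays a full in-memory read, the conversion will still need the whole array in RAM, which is exactly the problem discussed here.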
-
I see some issues (#345, #198, #18, #241, #211, ...), and the Bruker reader also has a rudimentary lazy implementation using dask, but I never got a complete picture of how lazy, memmap and dask are interconnected and how to use them correctly (dask has many options). The behaviour of the "lazy" flag feels inconsistent to me. E.g. with Bruker, "lazy" only delays loading; that would be fine for a 5D hypercube (e.g. FIB-sliced EDS), but it still does not solve the problem of a single Bruker file unpacking into an array larger than RAM. Other formats seem to use other features of dask, and in some cases it looks redundant (e.g. tiff files use memmap, and that is then wrapped in dask?).
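(For context, a minimal sketch of how memmap and dask relate for a raw/uncompressed file; the file name, shape and dtype are made up. The memmap on its own already gives on-demand, page-level access, and wrapping it in a dask array adds chunking and a task graph on top, which is presumably why some readers use both.)

```python
import numpy as np
import dask.array as da

# A raw/uncompressed file exposed as a numpy memmap: the OS pages data in
# on demand, so nothing is read until a slice is actually accessed.
raw = np.memmap("frames.raw", dtype=np.uint16, mode="r", shape=(1000, 512, 512))

# Wrapping the memmap in a dask array adds chunking and lazy, parallel,
# out-of-core computation on top of that on-demand access.
lazy = da.from_array(raw, chunks=(100, 512, 512))

mean_frame = lazy.mean(axis=0)   # still lazy, nothing read yet
result = mean_frame.compute()    # reads and reduces chunk by chunk
```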
This is so complicated that I probably can't even formulate the right questions. Lazy, dask and memmap: it is all quite murky water for me. As for the lazy loading of the Bruker format implemented so far, as we see in #241, it is useless. Even conversion to another format is doomed at the moment, because the whole file has to be loaded into memory first.
So the only way to save (convert) such huge files would be to get a chunked dask array, and the writer would need to support chunked writing, right?
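A dask-level sketch of what such chunked writing could look like (the zeros array is only a stand-in for whatever a chunked reader would return; the output path is illustrative):

```python
import dask.array as da

# Stand-in for a chunked dask array produced by a lazy reader.
lazy_data = da.zeros((512, 512, 2048), dtype="uint16", chunks=(64, 512, 2048))

# A chunk-aware writer streams the array to disk one block at a time,
# so the full array never has to fit in RAM at once.
da.to_zarr(lazy_data, "map.zarr")
```

Saving a lazy HyperSpy signal to .zspy works along the same lines.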