How widely is column-major supported? #41

tbenst · 2022-01-11T18:18:57Z

We’ve been having a discussion over at Zarr.jl as to whether the package should default to writing order=“F” aka column-major data instead of order=“C” aka row-major. We have the impression that most Zarr datasets out in the wild are saved in the row-major order, as that is the default choice of the popular Zarr-Python package. Currently, Zarr.jl writes to row-major, permuting the dimensions on read & write.

Since Julia is column-major, it would seem natural to write in this order by default, as the desired memory ordering can be maintained without permuting dimensions. However, there is some concern about how widely Zarr implementations support column-major data.
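For readers less familiar with the two layouts, here is a minimal sketch of how a multi-dimensional index maps to a flat buffer offset in each order. The function names `offsetC` and `offsetF` are made up for illustration and do not come from any Zarr implementation.

```javascript
// Row-major (C order): the LAST axis varies fastest in memory.
function offsetC(index, shape) {
  let offset = 0;
  for (let axis = 0; axis < shape.length; axis++) {
    offset = offset * shape[axis] + index[axis];
  }
  return offset;
}

// Column-major (F order): the FIRST axis varies fastest in memory.
function offsetF(index, shape) {
  let offset = 0;
  for (let axis = shape.length - 1; axis >= 0; axis--) {
    offset = offset * shape[axis] + index[axis];
  }
  return offset;
}

const shape = [2, 3];
console.log(offsetC([1, 0], shape)); // 3
console.log(offsetF([1, 0], shape)); // 1
```

The same logical element lands at a different position in the buffer depending on the order, which is why a reader that assumes C order cannot simply consume F-ordered chunks.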

I know that zarr-python supports reading row-major or column-major. Curious if folks could advise about other major implementations? Would writing to column major by default reduce interoperability in practice?


rabernat · 2022-01-11T18:34:13Z

Good question @tbenst! The Zarr.jl thread is interesting. I think it makes sense for Julia applications to write F order arrays, since this will give the best performance. However, I share your concern that such arrays may not be interoperable with other zarr implementations.

One great thing about Zarr is its native javascript support, which opens up all kinds of cool usage patterns on the web. It looks like zarr.js does not support F order

Ordering
Only C order zarr arrays (default for numpy/zarr) are supported right now. NestedArrays will be C-ordered and little-endian (regardless of the store endianness). (contributions are welcome!)

http://guido.io/zarr.js/#/getting-started/remote-data?id=ordering

Same for zarr-js

It supports reading arrays with zlib compression or no compression and C order little endian arrays.

https://github.com/freeman-lab/zarr-js#zarr-js

😞

I wonder how much effort would be required to enable F order for these implementations.

joshmoore · 2022-01-14T11:28:59Z

@gzuidhof @manzt @freeman-lab @jhamman : any thoughts?

rabernat · 2022-01-14T12:52:11Z

Likewise with Z5:

Supports only little endianness and C-order for the zarr format.

https://github.com/constantinpape/z5#current-limitations--todos

Someone could easily go through the other implementations listed here and check: https://github.com/zarr-developers/zarr_implementations

joshmoore · 2022-01-14T16:05:15Z

and/or add an order="C" / order="F" test in that repo 😉

jakirkham · 2022-01-14T18:13:39Z

cc @gzuidhof

jhamman · 2022-01-15T05:30:44Z

Same for zarr-js

It supports reading arrays with zlib compression or no compression and C order little endian arrays.

https://github.com/freeman-lab/zarr-js#zarr-js

This is correct. I chatted briefly about this with @freeman-lab today. Our general consensus is that it should be possible to implement F-order array support in zarr-js but we would need to do some extra diligence to confirm things work with the compression libs and whatnot.

gzuidhof · 2022-01-17T15:40:16Z

I reckon that it's possible to add support for column ordered arrays in zarr.js by adding some sort of transpose (or a transposed view), a similar amount of effort as zarr-js I imagine.
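A rough sketch of what a "transposed view" could look like: the decoded chunk stays as it is, and only the strides used for indexing change. `stridedView` is a hypothetical helper written for this illustration, not part of zarr.js or zarr-js.

```javascript
// Hypothetical helper: wrap a flat buffer with explicit strides, without copying.
function stridedView(data, shape, order) {
  const strides = new Array(shape.length);
  let step = 1;
  if (order === "F") {
    // Column-major: the first axis has stride 1.
    for (let axis = 0; axis < shape.length; axis++) {
      strides[axis] = step;
      step *= shape[axis];
    }
  } else {
    // Row-major: the last axis has stride 1.
    for (let axis = shape.length - 1; axis >= 0; axis--) {
      strides[axis] = step;
      step *= shape[axis];
    }
  }
  return {
    shape,
    strides,
    get(...index) {
      let offset = 0;
      for (let axis = 0; axis < index.length; axis++) {
        offset += index[axis] * strides[axis];
      }
      return data[offset];
    },
  };
}

// Logical 2x3 array [[1, 2, 3], [4, 5, 6]] stored column-major.
const fChunk = Int32Array.from([1, 4, 2, 5, 3, 6]);
const view = stridedView(fChunk, [2, 3], "F");
console.log(view.get(0, 2)); // 3
console.log(view.get(1, 0)); // 4
```

Whether a view like this is enough would depend on what consumes the array afterwards; code that expects a contiguous row-major buffer would still need a copy.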

A PR is welcome for it of course, but I don't think anyone has actually needed that feature so far. Usually one has at least some control over the dataset that should be served to the browser, so writing your data as a C-ordered array seems like the easier fix for most datasets.

joshmoore · 2022-01-17T16:43:56Z

Usually one has at least some control over the dataset that should be served to the browser, so writing your data as a C-ordered array seems like the easier fix for most datasets.

I'd imagine this is more a result of there being far fewer datasets currently that are F-ordered. As soon as the likelihood goes up, there will be a (stronger) driver to not require users to download & convert in order to access a dataset.

manzt · 2022-01-17T20:33:51Z

Just agreeing with what's already been stated here. Certainly possible to add to current implementations, but the scarcity of F-order arrays "in the wild" has made this less of a priority to implement. In addition, the WebGL graphics libraries we've used on the web have generally expected row-major arrays.

It's worth mentioning that zarrita.js supports C/F ordered arrays today, along with some other features missing from zarr-js and zarr.js.

import { get_array } from "zarrita/v2"; // version 2 protocol
import FetchStore from "zarrita/storage/fetch";
import { get } from "zarrita/ndarray";

let store = new FetchStore("http://localhost:8080/data.F.zarr");
let f = await get_array(store).then(get); // returns F-strided array
let c = await get_array(store).then(arr => get(arr, null, { order: "C" })); // force C ordering

I'd spoken to @jhamman & @freeman-lab previously about this implementation but have been too busy with grad school things to advertise, etc. Perhaps it would be a good time soon to get together and share updates.

meggart · 2022-01-18T08:32:44Z

Thanks a lot for all the replies. In no way did we intend to push people into implementing F-ordering, we just wanted to know how reasonable it would be to make F-order the default for the Julia implementation. I think most people will use the default and it might cause some confusion if the exported arrays can not be accessed with other libraries. So I think this leaves us with the following options and it would be good to find some consensus for column-major languages (R, Matlab, Octave, Fortran people here?):

1) Current behavior: when parsing and exporting metadata, we reverse the shape and chunks tuples, so that e.g. a matrix with size 10x20 in Julia will be saved as is, but have the size 20x10 when read with another implementation. This is equivalent to what NetCDF and HDF5 packages are doing. However, people might be confused when the dimensions are switched for the same array accessed from different programming languages. Much of this confusion could be reduced if zarr adds a notion of named dimensions, like in NetCDF or what xarray does with the _ARRAY_DIMENSIONS attribute, so that it is obvious inside the application which axis of the array refers to what.
2) We make 'F' the default storage order when creating zarr arrays in Julia and save arrays and metadata as-is. This would have the downside that arrays might not be accessible by implementations that don't support 'F' storage order. In addition, I don't know how well-optimized libraries like xarray are for processing this kind of data, especially when applying ufuncs with input_core_dims; I am not sure if there are implicit assumptions on the ordering of the underlying numpy array. Maybe @rabernat can comment?
3) Instead of reversing the dimension order in the metadata, we could transpose the data after reading/before writing to the correct order, which I think currently happens in the python implementation when saving a file in 'F' order. This might be the least confusing option when switching between languages, but it would have severe performance implications, because the resulting data has to be copied and permuted. The analyses that follow might then not be optimised, because the record dimension is modified by this step, etc.
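To make the difference between reversing the metadata and permuting the data concrete, here is a small sketch on a 2x3 array. It uses plain arrays rather than any Zarr API, and `toRowMajor` is a hypothetical name chosen for the example.

```javascript
// A 2x3 array A with A[i][j] = 10 * i + j, laid out as a column-major
// language would store it (first axis varies fastest).
const shapeF = [2, 3];
const buffer = [0, 10, 1, 11, 2, 12];

// Reversing the shape in the metadata leaves the bytes untouched.
// A row-major reader then sees a 3x2 array B with B[j][i] === A[i][j].
const shapeC = [...shapeF].reverse(); // [3, 2]
const readC = (j, i) => buffer[j * shapeC[1] + i];

// Keeping the shape and permuting the data instead needs a full copy.
function toRowMajor(buf, [rows, cols]) {
  const out = new Array(buf.length);
  for (let i = 0; i < rows; i++) {
    for (let j = 0; j < cols; j++) {
      out[i * cols + j] = buf[j * rows + i];
    }
  }
  return out;
}
const permuted = toRowMajor(buffer, shapeF);

console.log(readC(2, 1)); // 12, i.e. A[1][2]
console.log(permuted); // [0, 1, 2, 10, 11, 12]
```

In the first case no data moves and only the order of the axes differs between languages; in the second every element is rewritten on each read and write.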

Personally I still prefer option 1) but I am happy to be convinced by a majority saying that this behavior is not according to spec.

rabernat · 2022-01-27T13:58:43Z

Based on the discussion at yesterday's Zarr dev meeting with @meggart, I now realize that my earlier comment:

I think it makes sense for Julia applications to write F order arrays, since this will give the best performance.

is completely wrong. There is no performance benefit for column-major languages to use F-order. Fortran has been writing C-order netCDF / HDF5 arrays for decades without any performance penalty. This is purely about the language-specific conventions regarding what order dimensions should appear in. Given that, I'm inclined to support the proposal in zarr-developers/zarr-specs#126.

Alexander-Barth · 2023-12-04T10:27:43Z

I also like the current behavior of Zarr.jl (reversing the python dimension lat/lon into lon/lat, and not transposing the data), because this is also the current behavior when reading NetCDF (and HDF5) files in Julia, and it makes the transition and interoperability from NetCDF to Zarr much easier. As far as I know, other column-major languages like Fortran, octave, matlab, R, ... read a 2D netCDF file as lon/lat while it would be lat/lon in a row-major language like python or C. So there is a lot of code in these languages assuming a particular dimension order.

While numpy does support both ordering schemes, I am wondering how well arrays with F ordering are supported in the python (extension) ecosystem. Should a C extension typically provide an implementation for both cases (as the loop ordering is different)?
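As a small illustration of the loop-ordering point, the sketch below records which buffer offsets a nested loop touches for each layout. `visitOffsets` is a hypothetical helper for this example only; `innerAxis` selects which axis the inner loop runs over.

```javascript
// Offsets touched when iterating a rows x cols array stored in the given
// layout ("C" or "F"), with the inner loop running over innerAxis (0 or 1).
function visitOffsets(rows, cols, layout, innerAxis) {
  const offsets = [];
  const at = (i, j) => (layout === "C" ? i * cols + j : j * rows + i);
  if (innerAxis === 1) {
    for (let i = 0; i < rows; i++) {
      for (let j = 0; j < cols; j++) offsets.push(at(i, j));
    }
  } else {
    for (let j = 0; j < cols; j++) {
      for (let i = 0; i < rows; i++) offsets.push(at(i, j));
    }
  }
  return offsets;
}

console.log(visitOffsets(2, 3, "C", 1)); // [0, 1, 2, 3, 4, 5]
console.log(visitOffsets(2, 3, "F", 1)); // [0, 2, 4, 1, 3, 5]
console.log(visitOffsets(2, 3, "F", 0)); // [0, 1, 2, 3, 4, 5]
```

The loop nest that walks a C-ordered buffer sequentially jumps around in an F-ordered one, so an extension that wants contiguous access for both layouts would need either two loop nests or a stride-aware one.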

(Requiring the support of C and F layout pushes a lot of complexity into upstream applications. I would be fine with settling on the C layout. And I say that as somebody having used almost exclusively Fortran-layout languages: Fortran, matlab, octave, and now julia :-). But I realize that we have to deal with it, as it is standardized in the Zarr v2 format.)

joshmoore mentioned this issue Jan 12, 2022

Zarr.jl zarr-developers/zarr_implementations#42

Open

4 tasks

joshmoore mentioned this issue Jan 26, 2022

Community feedback process (e.g. ZEP) zarr-developers/governance#14

Closed

meggart mentioned this issue Jan 27, 2022

Remove 'order' from Specs and make 'C' default zarr-developers/zarr-specs#126

Closed

rabernat mentioned this issue Feb 9, 2022

Data array is flattened in numcodecs which reduced the compression ratio that ZFP can provide on multi-dimension arrays zarr-developers/numcodecs#303

Closed
