How widely is column-major supported? #41

Open
tbenst opened this issue Jan 11, 2022 · 12 comments

Comments

@tbenst

tbenst commented Jan 11, 2022

We’ve been having a discussion over at Zarr.jl as to whether the package should default to writing order=“F” aka column-major data instead of order=“C” aka row-major. We have the impression that most Zarr datasets out in the wild are saved in the row-major order, as that is the default choice of the popular Zarr-Python package. Currently, Zarr.jl writes to row-major, permuting the dimensions on read & write.

Since Julia is column-major, it would seem natural to write in this order by default, as the desired memory ordering can be maintained without permuting dimensions. However, there is some concern about how widely Zarr implementations support column-major data.

I know that zarr-python supports reading row-major or column-major. Curious if folks could advise about other major implementations? Would writing to column major by default reduce interoperability in practice?
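For readers less familiar with the two layouts, here is a minimal sketch in plain Python (no Zarr library involved) of how the same index maps to a different flat offset under each order. The helper names `c_strides`, `f_strides`, and `offset` are illustrative only and are not part of any Zarr implementation.

```python
def c_strides(shape):
    """Element strides for row-major (C) layout: the last axis varies fastest."""
    strides = [1] * len(shape)
    for axis in range(len(shape) - 2, -1, -1):
        strides[axis] = strides[axis + 1] * shape[axis + 1]
    return tuple(strides)


def f_strides(shape):
    """Element strides for column-major (F) layout: the first axis varies fastest."""
    strides = [1] * len(shape)
    for axis in range(1, len(shape)):
        strides[axis] = strides[axis - 1] * shape[axis - 1]
    return tuple(strides)


def offset(index, strides):
    """Position of `index` in the flat buffer, counted in elements."""
    return sum(i * s for i, s in zip(index, strides))


shape = (2, 3)
print(c_strides(shape))                  # (3, 1)
print(f_strides(shape))                  # (1, 2)
print(offset((0, 1), c_strides(shape)))  # 1
print(offset((0, 1), f_strides(shape)))  # 2
```

A reader that assumes C strides for a buffer written with F strides will therefore return elements in the wrong positions, which is the interoperability concern raised above.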

@rabernat

rabernat commented Jan 11, 2022

Good question @tbenst! The Zarr.jl thread is interesting. I think it makes sense for Julia applications to write F order arrays, since this will give the best performance. However, I share your concern that such arrays may not be interoperable with other zarr implementations.

One great thing about Zarr is its native javascript support, which opens up all kinds of cool usage patterns on the web. It looks like zarr.js does not support F order:

Ordering
Only C order zarr arrays (default for numpy/zarr) are supported right now. NestedArrays will be C-ordered and little-endian (regardless of the store endianness). (contributions are welcome!)

http://guido.io/zarr.js/#/getting-started/remote-data?id=ordering

Same for zarr-js

It supports reading arrays with zlib compression or no compression and C order little endian arrays.

https://github.com/freeman-lab/zarr-js#zarr-js

😞

I wonder how much effort would be required to enable F order for these implementations.

@joshmoore
Member

joshmoore commented Jan 14, 2022

@gzuidhof @manzt @freeman-lab @jhamman : any thoughts?

@rabernat

Likewise with Z5:

Supports only little endianness and C-order for the zarr format.

https://github.com/constantinpape/z5#current-limitations--todos

Someone could easily go through the other implementations listed here and check: https://github.com/zarr-developers/zarr_implementations
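When checking an implementation, the relevant field is `order` in the Zarr v2 `.zarray` metadata. As a rough sketch, a reader that only handles C order could detect an F-order array up front like this. The function name `is_readable`, the `SUPPORTED_ORDERS` set, and the sample metadata are assumptions for illustration, not code from any of the listed implementations.

```python
import json

# Assumption: a hypothetical reader that, like the ones quoted above, handles only C order.
SUPPORTED_ORDERS = {"C"}


def is_readable(zarray_text):
    """Return True if the array's memory order is one this reader supports."""
    meta = json.loads(zarray_text)
    return meta["order"] in SUPPORTED_ORDERS


# Made-up .zarray content using the keys defined by the v2 spec.
sample = json.dumps({
    "zarr_format": 2,
    "shape": [20, 10],
    "chunks": [10, 5],
    "dtype": "<f8",
    "compressor": None,
    "fill_value": 0,
    "filters": None,
    "order": "F",
})

print(is_readable(sample))  # False
```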

@joshmoore
Member

and/or add an order="C" / order="F" test in that repo 😉

@jakirkham
Member

cc @gzuidhof

@jhamman
Member

jhamman commented Jan 15, 2022

Same for zarr-js

It supports reading arrays with zlib compression or no compression and C order little endian arrays.

https://github.com/freeman-lab/zarr-js#zarr-js

This is correct. I chatted briefly about this with @freeman-lab today. Our general consensus is that it should be possible to implement F-order array support in zarr-js but we would need to do some extra diligence to confirm things work with the compression libs and whatnot.

@gzuidhof

I reckon that it's possible to add support for column ordered arrays in zarr.js by adding some sort of transpose (or a transposed view), a similar amount of effort as zarr-js I imagine.
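To make the "transpose or transposed view" idea concrete, here is a small sketch in plain Python rather than JavaScript. `FOrderView` is a hypothetical class, not part of zarr.js or zarr-js: indexing goes through F-order strides without copying, while `to_c_order` is the copying variant.

```python
from itertools import product


class FOrderView:
    """Read-only view over a flat column-major (F-order) buffer; indexing copies nothing."""

    def __init__(self, flat, shape):
        self.flat = flat
        self.shape = tuple(shape)
        strides, step = [], 1
        for n in self.shape:
            strides.append(step)  # first axis gets stride 1
            step *= n
        if step != len(flat):
            raise ValueError("buffer length does not match shape")
        self.strides = tuple(strides)

    def __getitem__(self, index):
        return self.flat[sum(i * s for i, s in zip(index, self.strides))]

    def to_c_order(self):
        """Materialise a row-major flat list (the copying transpose)."""
        return [self[idx] for idx in product(*(range(n) for n in self.shape))]


# The 2x3 array [[1, 2, 3], [4, 5, 6]] stored column by column.
view = FOrderView([1, 4, 2, 5, 3, 6], (2, 3))
print(view[0, 2])         # 3
print(view.to_c_order())  # [1, 2, 3, 4, 5, 6]
```

The view costs nothing up front but makes every access pay for the stride arithmetic; the copy costs time and memory once and then behaves like any C-ordered array.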

A PR is welcome for it of course, but I don't think anyone has actually needed that feature so far. Usually one has at least some control over the dataset that should be served to the browser, so writing your data as a C-ordered array seems like the easier fix for most datasets.

@joshmoore
Member

Usually one has at least some control over the dataset that should be served to the browser, so writing your data as a C-ordered array seems like the easier fix for most datasets.

I'd imagine this is more a result of there being far fewer datasets currently that are F-ordered. As soon as the likelihood goes up, there will be a (stronger) driver to not require users to download & convert in order to access a dataset.

@manzt
Member

manzt commented Jan 17, 2022

Just agreeing with what's already been stated here. It is certainly possible to add this to current implementations, but the scarcity of F-order arrays "in the wild" has made it less of a priority to implement. In addition, the WebGL graphics libraries we've used on the web have generally expected row-major arrays.

It's worth mentioning that zarrita.js supports C/F ordered arrays today, along with some other features missing from zarr-js and zarr.js.

import { get_array } from "zarrita/v2"; // version 2 protocol
import FetchStore from "zarrita/storage/fetch";
import { get } from "zarrita/ndarray";

let store = new FetchStore("http://localhost:8080/data.F.zarr");
let f = await get_array(store).then(get); // returns F-strided array
let c = await get_array(store).then(arr => get(arr, null, { order: "C" })); // force C ordering

I'd spoken to @jhamman & @freeman-lab previously about this implementation but have been too busy with grad school things to advertise, etc. Perhaps it would be a good time soon to get together and share updates.

@meggart
Member

meggart commented Jan 18, 2022

Thanks a lot for all the replies. In no way did we intend to push people into implementing F-ordering; we just wanted to know how reasonable it would be to make F-order the default for the Julia implementation. I think most people will use the default, and it might cause some confusion if the exported arrays cannot be accessed with other libraries. So I think this leaves us with the following options, and it would be good to find some consensus for column-major languages (R, Matlab, Octave, Fortran people here?):

  1. Current behavior: when parsing and exporting metadata, we reverse the shape and chunks tuples, so that e.g. a matrix with size 10x20 in Julia will be saved as is, but will have the size 20x10 when read with another implementation. This is equivalent to what the NetCDF and HDF5 packages are doing. However, people might be confused when the dimensions are switched for the same array accessed from different programming languages. Much of this confusion could be reduced if zarr adds a notion of named dimensions, like in NetCDF or what xarray does with the _ARRAY_DIMENSIONS attribute, so that it is obvious inside the application which axis of the array refers to what.

  2. We make 'F' the default storage order when creating zarr arrays in Julia and save arrays and metadata as-is. This would have the downside that arrays might not be accessible by implementations that don't support 'F' storage order. In addition, I don't know how well-optimized libraries like xarray are for processing this kind of data, especially when applying ufuncs with input_core_dims; I am not sure if there are implicit assumptions on the ordering of the underlying numpy array. Maybe @rabernat can comment?

  3. Instead of reversing the dimension order in the metadata, we could transpose the data after reading/before writing to the correct order, which I think currently happens in the python implementation when saving a file in 'F' order. This might be the least confusing option when switching between languages, but it would have severe performance implications, because the resulting data has to be copied and permuted. Subsequent analyses might then not be optimised, because the record dimension is modified by this step, etc.

Personally I still prefer option 1) but I am happy to be convinced by a majority saying that this behavior is not according to spec.
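Option 1 can be illustrated with a small sketch. It assumes the v2 metadata keys `shape`, `chunks`, and `order`; the helpers `reverse_dims`, `c_offset`, and `f_offset` are hypothetical names, and this is not the actual Zarr.jl code. The point is that a column-major 10x20 array and a row-major 20x10 array place every element at the same flat offset, so only the metadata changes and no data is permuted.

```python
def reverse_dims(meta):
    """Return a copy of the array metadata with shape and chunks reversed."""
    out = dict(meta)
    out["shape"] = meta["shape"][::-1]
    out["chunks"] = meta["chunks"][::-1]
    return out


def c_offset(index, shape):
    """Flat offset in row-major order (last axis fastest)."""
    off = 0
    for i, n in zip(index, shape):
        off = off * n + i
    return off


def f_offset(index, shape):
    """Flat offset in column-major order (first axis fastest)."""
    off, step = 0, 1
    for i, n in zip(index, shape):
        off += i * step
        step *= n
    return off


# What the column-major side sees, using the 10x20 example from option 1.
native = {"shape": [10, 20], "chunks": [5, 10]}
# What gets written to disk, declared as C order.
stored = dict(reverse_dims(native), order="C")

same_layout = all(
    f_offset((i, j), native["shape"]) == c_offset((j, i), stored["shape"])
    for i in range(10)
    for j in range(20)
)
print(stored["shape"], stored["chunks"])  # [20, 10] [10, 5]
print(same_layout)                        # True
```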

@rabernat

rabernat commented Jan 27, 2022

Based on the discussion at yesterday's Zarr dev meeting with @meggart, I now realize that my earlier comment:

I think it makes sense for Julia applications to write F order arrays, since this will give the best performance.

is completely wrong. There is no performance benefit for column-major languages to use F-order. Fortran has been writing C-order netCDF / HDF5 arrays for decades without any performance penalty. This is purely about the language-specific conventions regarding what order dimensions should appear in. Given that, I'm inclined to support the proposal in zarr-developers/zarr-specs#126.

@Alexander-Barth

Alexander-Barth commented Dec 4, 2023

I also like the current behavior of Zarr.jl (reversing the python dimensions lat/lon into lon/lat, and not transposing the data), because this is also the current behavior when reading NetCDF (and HDF5) files in Julia, and it makes the transition and interoperability from NetCDF to Zarr much easier. As far as I know, other column-major languages like Fortran, Octave, MATLAB, R, ... read a 2D netCDF file as lon/lat, while it would be lat/lon in a row-major language like python or C. So there is a lot of code in these languages assuming a particular dimension order.

While numpy does support both ordering schemes, I am wondering how well arrays with F ordering are supported in the python (extension) ecosystem. Should a C extension typically provide an implementation for both cases (as the loop ordering is different)?
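To illustrate what "the loop ordering is different" means, here is a sketch in plain Python. The helpers `loop_order` and `visit_offsets` are hypothetical names, not taken from numpy or any extension. The idea is to nest the loops so that the innermost one runs over the axis with the smallest stride, which gives a sequential walk through memory for either layout.

```python
from itertools import product


def loop_order(strides):
    """Axes from outermost to innermost loop: largest stride first."""
    return sorted(range(len(strides)), key=lambda axis: -strides[axis])


def visit_offsets(shape, strides):
    """Flat offsets touched when the innermost loop runs over the smallest stride."""
    order = loop_order(strides)
    offsets = []
    # product() varies its last argument fastest, i.e. the innermost axis.
    for counters in product(*(range(shape[axis]) for axis in order)):
        offsets.append(sum(c * strides[axis] for c, axis in zip(counters, order)))
    return offsets


shape = (2, 3)
print(loop_order((3, 1)))            # [0, 1]  C layout: axis 1 innermost
print(loop_order((1, 2)))            # [1, 0]  F layout: axis 0 innermost
print(visit_offsets(shape, (3, 1)))  # [0, 1, 2, 3, 4, 5]
print(visit_offsets(shape, (1, 2)))  # [0, 1, 2, 3, 4, 5]
```

Code that hard-codes the C nesting would still produce correct results on F-ordered input when it indexes through strides, but it would jump around in memory instead of walking it sequentially.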

(Requiring the support of both C and F layout pushes a lot of complexity into upstream applications. I would be fine with settling on the C layout. And I say that as somebody who has used almost exclusively Fortran-layout languages: Fortran, MATLAB, Octave, and now Julia :-). But I realize that we have to deal with it, as it is standardized in the Zarr v2 format.)
