Remove 'order' from Specs and make 'C' default #126
@constantinpape, do you know how this plays on the N5 side of things?
Good question. The n5 data format uses F order, so I assume that F order still needs to be supported. But when writing the zarr data format with n5-zarr, it writes in C order. (I am not sure whether it supports reading F order or has optional support for writing in F order.) Maybe @bogovicj would know more. In summary, I don't see any big concern in removing 'order' from the spec from the n5 perspective.
Correct that n5-zarr writes in C order by default. There was a time when n5-zarr "handled" F order differently, but the current release errs on the side of doing less and letting downstream applications decide what to do. On writing, yes, it writes in C order by default, but it can write in F order as well.
Note: TensorStore and Neuroglancer support both C and F order with zarr. While it is true that you can always just permute the overall dimensions of the array to achieve the desired storage layout, it seems to me that there is value in being able to control the storage layout independently from the dimension order presented to the user, e.g. in order to present dimensions in an expected canonical order. On the other hand, it would be more natural if the chunk keys used the same order as used within the chunk, which is what you get if you permute the overall dimensions.

Regardless of what is decided as far as allowed storage orders, I would strongly urge that zarr implementations not automatically permute the dimension order relative to what is stored in the array metadata json. I know that Zarr.jl and the zarr-python n5 module do reverse the dimension order, but in my view that is a major source of confusion and interoperability problems, and I hope we can avoid that being an issue with zarr v3. As one example of a problem: if other custom attributes are used to specify per-dimension information, then even if the implementation permutes the array dimensions, it will not be able to permute the custom attributes.
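To make the custom-attribute problem concrete, here is a minimal sketch in Python (the attribute name voxel_size_nm is purely illustrative, not from any spec):

import zarr

# attribute listing one value per dimension, in metadata (spec) order
z = zarr.zeros((10, 20, 30), chunks=(5, 5, 5), dtype='u1')
z.attrs['voxel_size_nm'] = [4, 4, 40]   # pairs with dimensions (10, 20, 30)

# An implementation that silently reverses dimensions presents shape
# (30, 20, 10), but it cannot know that voxel_size_nm must also be
# reversed, so the attribute now pairs with the wrong dimensions.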
I see your point. However, this is contrary to what is done with the NetCDF and HDF5 libraries. Those libraries only ever store C order but permute the dimension order when presenting data to column-major languages such as Fortran and Julia. This exacerbates the user confusion, because those users may think that the storage library natively supports F order when in fact it does not. I agree that permutation needs to be done consistently with named dimensions.
I can understand the desire for consistency with HDF5 and NetCDF, but in my view a key aim of zarr is to correct the shortcomings of HDF5, and we should therefore not allow compatibility with HDF5 to result in sub-optimal design decisions. I think zarr v3 should support user-specified storage orders, and I also think the confusion and interoperability issues that result from reversing the dimension order in some implementations strongly outweigh any compatibility benefits with existing formats like HDF5. Dimension names are not necessarily the only per-dimension attribute that users may wish to specify --- for custom attributes the implementation will naturally be unable to permute them, leading to a high chance of problems. I don't know about Fortran, but Julia does support a general StridedArray. Additionally, with codecs, filters, chunking, and any number of additional indexing operations applied, there is in any case not necessarily much correspondence between the on-disk format and the layout of the in-memory arrays that a user may pass to read and write operations.
As I mentioned above, all HDF5 and NetCDF implementations I have come across in Matlab, R, Octave and Julia reverse the dimension order they present to the user compared to how dimensions are returned by the underlying C lib. So I would still have to be convinced that everything breaks when we decide to do this.
What would be the alternative? I think the only alternatives are to a) transpose the array or b) present some transposed view to the user (for example a PermutedDimsArray in Julia).
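For readers less familiar with option b), this is what a zero-copy transposed view looks like in numpy (a sketch; the same idea applies to PermutedDimsArray in Julia):

import numpy as np

a = np.arange(6, dtype='i4').reshape(2, 3)  # C-order array
v = a.T                                     # transposed view, shape (3, 2)
assert v.base is a                          # no data was copied
v[0, 1] = 99                                # writes through to a
assert a[1, 0] == 99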
This is true, and for me it is the only technical reason that speaks against reversing dimensions. Of course, downstream libraries can handle this manually, as is e.g. done here https://github.com/meggart/YAXArrayBase.jl/blob/ef90b2b07e53540d4612711b2f51eeb169dd796c/src/datasets/zarr.jl#L9 where the xarray dimension names are parsed and reversed. Maybe we can solve this through another convention/addition to the zarr specs? Can we label attributes that refer to the dimensions of an array in a special way? Then all implementations could check that the number of entries in this attribute equals the number of dims of the array, and for implementations that potentially reverse the dimension order it would be a sign to reverse the order of the attribute as well.
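A minimal sketch of what such a labeling convention could look like (the _dimension_attrs key is hypothetical, not part of any spec):

# hypothetical convention: a reserved key lists which attributes are
# per-dimension, so implementations know to permute them together
# with the dimensions
attrs = {
    "_dimension_attrs": ["units", "resolution"],
    "units": ["nm", "nm", "ms"],        # one entry per dimension, spec order
    "resolution": [4.0, 4.0, 40.0],
}

ndim = 3
for name in attrs["_dimension_attrs"]:
    assert len(attrs[name]) == ndim     # the sanity check proposed above
    # a dimension-reversing implementation would also do:
    # attrs[name] = list(reversed(attrs[name]))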
A Zarr array is already a different representation from an in-memory array --- it is already a form of view. I am simply proposing that any operations on it treat the dimensions in the same order they are specified in the array metadata file --- I wouldn't consider that a "transposed" view. When creating a new zarr array from Julia, it probably would indeed make sense to default to Fortran order. And it is true that if you attempt to read a single chunk that is stored in C order (and not encoded with some special codec that already may transpose the order) into a Julia Fortran order array, then there would indeed be a transpose happening as part of the read operation. But already you have to account for different strides when copying between the stored chunk representation and the output array. I'm not too familiar with Julia, but it seems it would be best if Zarr.jl could support reading/writing directly to/from e.g. StridedArray rather than just the basic DenseArray. Then the user has full control over the resultant data layout, but all transposing happens explicitly rather than implicitly.
Already with zarr v2 we have support for both Fortran and C order, and therefore it is already important to efficiently support multiple dimension orders. Additionally, in many cases you wouldn't want to work directly with the zarr array itself, but rather may want to work with a view that applies various indexing operations, including permuting the dimensions. So already it is necessary to deal with arbitrary dimension orders.
I think this is a great idea, because it also allows transformed views of various sorts to correctly transform the attributes. But I still think it will be problematic to cover all of the cases --- e.g. maybe you want to store a matrix where both the rows and columns are indexed by dimensions, or you are storing an array of coordinates, etc. Much better to force the user to explicitly create a transposed view if they want one; that way they are aware that the transpose is happening and will be more prepared to transpose any attributes and coordinates as well.
I think that the operation you describe here, reading a single chunk that is stored in C order, is something that happens at least as often as reading many chunks of a Zarr array into a single array. Any efficient mapreduce-like operation will read the data chunk by chunk and operate on these chunks, so this is not a rare case. And regarding your second point: yes, there are data structures in Julia that can represent transposed views of arrays (although not StridedArray, which is just a name for a union of a lot of array types (views, reshapes, etc.) that can be passed to BLAS for strided operations). However, my point was that we cannot assume that the rest of the ecosystem can deal with these arrays as well as it can deal with DenseArrays. I would claim this is true for Python as well; I don't know how many libraries that consume numpy arrays are written with optimizations for the case that one might have Fortran-ordered arrays. So I completely agree with you on the following point:
This is exactly what I am saying: let us present the data to the user as-is, in the native array type of the respective programming language, and if the language supports different orderings (I doubt that JavaScript does), then let the user decide to use them. I think we have similar design goals in mind, but we disagree on whether the same zarr array should "look" exactly the same in every programming language. As someone switching between Python and Julia, I am completely used to the convention that an x,y,z array in Python will appear as z,y,x in Julia; I would be much more surprised by the opposite. However, I can understand that users who have lived in only one of these worlds will be surprised when they see this for the first time.
The whole idea of zarr is to provide an abstraction --- the data is chunked and encoded, but we are viewing it as a single cohesive array. The conflict here is that the data "as-is" has a given dimension order listed in the metadata, and a given storage order (at least if the chunks are uncompressed or compressed as raw byte streams). Since Julia code is normally written assuming Fortran dimension order, there is sometimes a conflict between presenting a Fortran dimension order and presenting the data "as is".

To me the basic issue is that it is very important to be able to agree on the "dimension order" of an array. It is fine to work with a transposed view, but it should be explicit that you are using a transposed view, so that there is no confusion as to what the "real" dimension order is. With some zarr implementations transposing the dimensions by default, though, the concept of a "real" dimension order becomes entirely lost, and when specifying dimension order you must qualify "dimension order as seen from Julia" vs. "dimension order as seen from Python". From discussions with @axtimwalde I understand that this problem already exists with the zarr n5 module, which also transposes the dimensions. Users seem to be forced to rely on named dimensions in order to unambiguously identify dimensions. I certainly appreciate the value of named dimensions, but I also think the zarr standard should allow the dimensions to be unambiguously identified without relying on named dimensions. For an array indexed by xyz, perhaps there is already sufficient intuition that Julia normally uses xyz and Python normally uses zyx that this is less confusing, but that assumption may not always hold; for more general arrays there may not even be such intuition to rely upon. I think if the zarr Julia implementation forced users to explicitly request a transposed view, the dimension order would remain unambiguous.

As far as other languages: I believe that virtually all of the operations built into NumPy itself work equally well on Fortran and C order arrays. However, it is certainly possible that more specialized operations in scipy, scikit-image, etc. may only work efficiently with certain dimension orders. JavaScript has no built-in support for multi-dimensional arrays, and no non-standard library has reached widespread usage like NumPy, so it is still up to each JavaScript tool/library that supports zarr to determine how to represent multi-dimensional arrays. Neuroglancer, for example, by its nature already needs to support arbitrary affine transforms of the data and is therefore agnostic to the dimension order. I think what I had in mind as far as "strided array" support in Julia is provided by this package: Strided.jl.
I just wanted to chime in as a user of Zarr-Python and someone who contributes to Julia packages: there is already precedent for what @meggart describes in other areas where we interop with Python/C. ONNX and DLPack are good examples I'm familiar with, and in both we just reverse the logical order of dimensions. The problem with a package like Strided.jl is that a) everyone has to use it, and b) it doesn't work with all array types. Thus changing the default ordering of dimensions to look "correct" vis-à-vis other languages would be at best murderous for downstream performance.
Not being a Julia user myself, I suppose I have no business saying whether Julia should use a reversed dimension order by default, if that is the convention that users expect. But I do think there is a real problem to be addressed as far as being able to unambiguously specify dimensions and coordinates in a concise way. I noticed that Zarr.jl also uses 1-based indexing, so there is an additional issue there. We can't just say "coordinates (1, 2, 3) in zarr array at path xxx"; we instead must say "1-based reverse-order coordinates (1, 2, 3) in zarr array at path xxx", or "0-based zarr spec-order coordinates (1, 2, 3) in zarr array at path xxx". Do you have any suggestions on what conventions could be adopted to resolve that issue? This is an issue for both textual communication (e.g. emails, chat) as well as machine-readable formats, like attributes, coordinate arrays, etc. Additionally, I am advocating for the following two proposals: supporting a user-specified (possibly non-zero) origin in the array metadata, and supporting an arbitrary dimension/storage order.
I would say there are clear use cases for both of these features, and they are not technically difficult to support in any zarr implementation. However, there are some obvious conflicts with the existing Zarr.jl API: if Zarr.jl is already using an origin of 1 by default, it is problematic to support a non-default origin specified in the metadata. Supporting an arbitrary dimension order causes no API problems in itself, but it would mean Zarr.jl can no longer assume that reversing the dimension order guarantees that Fortran order matches the chunk storage order.
Now that #149 is out, I want to return to this discussion. I really appreciate @jbms's thoughtful and detailed comments. Jeremy raises some very good points. I disagree, but I hope I can do so with respect and care. I think my viewpoint here reflects the same fundamental difference of opinion as I expressed in #144 (comment): I don't believe it is Zarr's job to reconcile different language conventions around how arrays are presented to users. Ultimately this is a question about the scope of Zarr. I am arguing for the core spec to have a smaller scope, limited to simply passing arrays from the storage to the language's natural, idiomatic array representation in the most straightforward possible way.
For readers of this thread, I think it is super important to note that the actual users of F-major languages (mostly Julia here, but also the Unidata Fortran group) are in support of this proposal (to remove "order" from the spec and make C the default). The main person arguing against it is self-admittedly not a user of F-major languages. Jeremy's arguments are on the grounds of seeking some type of inter-language consistency that Julia users neither want nor expect.
Yes, I agree that this issue is very closely related to 0- vs 1-based indexing. If you support the idea that all Zarr implementations should present the same on-disk array using the same exact shape in both row-major and column-major languages, then it is consistent to also expect Zarr to mediate indexing, such that the same Zarr indexing operation addresses the same element in every language. However, this would also be strongly inconsistent with what actual users of those languages expect. I think it is simply not Zarr's job to resolve the fact that different programming languages use different conventions (row-major vs. column-major, 0-based vs. 1-based indexing) for arrays. Attempting to do so will introduce undesirable complexity and confusion into Zarr.

A Minimal Example

I think it's useful to have a very minimal example which concisely illustrates the way that Python and Julia represent the same on-disk data. I'll create the data from Python because that's the language I know best.

import numpy as np
# create array using default row-major shape
# the dimension with length 3 is the most contiguous dimension
shape = (2, 3)
data = np.arange(6, dtype='i4').reshape(shape)
# -> array([[0, 1, 2],
# [3, 4, 5]], dtype=int32)
# write to disk in a very explicit way
as_bytes = data.tobytes(order='C')
assert len(as_bytes) == data.size * data.dtype.itemsize
with open('data_c_order.bin', mode='wb') as fp:
    fp.write(as_bytes)
# verify we can read it back from python
with open('data_c_order.bin', mode='rb') as fp:
    read_bytes = fp.read()
data_read = np.frombuffer(read_bytes, dtype='i4').reshape(shape)
np.testing.assert_equal(data, data_read)

Now let's read that same data from Julia in the most standard way:

# we use the shape 3, 2 because that's what Julia expects: the most contiguous array dimension is FIRST
data = Array{Int32}(undef, 3, 2)
read!("data_c_order.bin", data)
# -> 3×2 Array{Int32,2}:
# 0 3
# 1 4
# 2 5
# to verify that the array is "right", we can iterate it in the idiomatic julia way
for i in eachindex(data)
print(i, ": ", data[i], "\n")
end
# 1: 0
# 2: 1
# 3: 2
# 4: 3
# 5: 4
# 6: 5

This example concisely illustrates how the same six int32 values on disk are naturally presented with shape (2, 3) in Python and shape (3, 2) in Julia, with each language iterating the data in its native contiguous order.
I believe that Zarr should not attempt to reconcile these differences. On disk, there is really no such thing as C order or F order. Once you pick an order, the shape of the array becomes determined. In the Zarr spec and metadata, I propose we default to describing everything in C-order conventions. F-order implementations should handle the shape permutation, as already done by Zarr.jl.
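To restate that point in code: the same bytes admit two equally valid descriptions, and picking the order fixes the shape (a numpy sketch):

import numpy as np

buf = np.arange(6, dtype='i4').tobytes()
# same bytes, two equally valid (order, shape) descriptions:
c = np.frombuffer(buf, dtype='i4').reshape((2, 3))             # C order, shape (2, 3)
f = np.frombuffer(buf, dtype='i4').reshape((3, 2), order='F')  # F order, shape (3, 2)
assert np.array_equal(c.T, f)  # identical data, reversed shape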
I see your example as more of a motivation for defining array formats and standards like zarr, which provide an abstraction over the on-disk binary representation, than for any particular storage order within zarr. E.g. if we run your same examples but write on a little-endian machine and read on a big-endian machine, the numbers printed out will differ. In my view we can say that a particular array in memory or on disk has a storage order, but I don't agree that in general we can say that a programming language has a single array storage order, at least for many popular programming languages like Python, C++, and Java --- just that perhaps a particular array storage order is more commonly used. I think it is a particular characteristic of Julia, Fortran, and Matlab that they are oriented around column-major order.

One relevant precedent to consider is image formats, which use various on-disk encodings. For example, the BMP format stores pixels with all color channels interleaved, approximately row-major HWC (height, width, channel) order, usually from the bottom up. JPEG uses a transformed representation, with each color channel encoded separately, that would most closely map to CHW order. Regardless of the stored representation, though, image libraries provide a consistent view of the image.
That's a useful point of comparison, and I think it points clearly at the crux of our disagreement. I don't think it's Zarr's job to do what image libraries do (rearrange dimension order based on a domain-specific data model); I see that as the job of tools higher up in the stack. Zarr implementations should simply expose the data in a way that makes sense for that language. I think the comparison with big vs. little endian is spurious: no one would ever want a big-endian value to be interpreted as a little-endian one, as the result would be nonsensical. But Julia users want the most contiguous dimension to be the first one, while Python users want it to be the last one. Let's give the users what they want! (Rather than trying to define some abstract, inter-language standard for how to index arrays.)
In my view, Zarr's role is to map the stored representation to an abstract array data model. We had some disagreement over exactly what that data model should be (e.g. whether arrays should be allowed to have a specified/non-zero origin), but I don't think any of those disagreements are relevant to this issue. There seems to be pretty broad agreement that this mapping from stored representation to the abstract array may potentially involve almost arbitrary transformations, either through codecs or through filters. However, thinking about this more, assuming, as in #153, that we support a list of codecs/filters rather than just a single compressor, I would say that it would be logical to eliminate the `order` metadata field in favor of a transpose codec.
Actually, can we? Storage is ultimately one-dimensional: an array is just a sequence of bytes. In order to have a storage order, we need to introduce the concept of a multi-dimensional array. That is done at the level of the programming language or library; the programming language defines what we mean by "first" and "last" dimension, and C/Python and Julia/Fortran define them oppositely. There are only two ways to have the same on-disk array have the same shape in both languages: a) physically transpose the data when reading or writing, or b) present a transposed (non-native-layout) view to the user.
According to https://en.wikipedia.org/wiki/Row-_and_column-major_order, the difference between row-major and column-major order is simply which dimension varies fastest in linear storage.
This is interesting. But would the transpose codec be applied by all implementations (regardless of their "native" order)? If so, how would this solve the problem?
Yes, all implementations would apply it, just like any other codec. It would be up to the implementation to decide whether it wants to do anything special with regard to this codec. Julia probably would not do anything special, and would just do a physical transpose in memory. zarr-python might have special logic that recognizes when the requested in-memory order already matches the transpose codec's encoded order, and skips the physical transpose in that case.
Julia would still presumably do its usual dimension reversal, and maintain the invariant that the dimension order in Julia is reversed compared to Python and most other implementations.
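For concreteness, here is a minimal sketch of what such a transpose codec could look like; the class name and encode/decode interface are illustrative assumptions, not an actual zarr API:

import numpy as np

class TransposeCodec:
    # hypothetical codec: permute dimensions before serializing a chunk
    def __init__(self, order):
        self.order = tuple(order)  # permutation applied on encode

    def encode(self, chunk):
        # e.g. order=(1, 0) on a 2-D chunk serializes it in F order
        return np.ascontiguousarray(chunk.transpose(self.order)).tobytes()

    def decode(self, data, shape, dtype):
        encoded_shape = tuple(shape[i] for i in self.order)
        arr = np.frombuffer(data, dtype=dtype).reshape(encoded_shape)
        inverse = np.argsort(self.order)   # invert the permutation
        return arr.transpose(inverse)      # back to metadata dimension order

An implementation could skip the physical copy in decode whenever the caller asks for the encoded layout anyway.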
To be clear, by "array" I was referring to a multi-dimensional array.
If we have a numpy array with shape (x, y, z) stored in C order, the very same memory can be viewed as an F-order array with shape (z, y, x); no copy is required, only the description changes.
For libraries / languages that support only a single memory layout for arrays (which may be true for Matlab and Julia, but is certainly not true of Python), then I would agree that a physical transpose would by default be required as part of the encode/decode process. However, the zarr library could provide the user with a way to create a virtual transposed view of the array. Then users of the library could explicitly request a transposed view of the data, and thereby get the data in their desired order, and if this agrees with the stored order, also avoid any physical transpose operation. At the same time, by making this transpose explicit, it avoids confusion about the dimension order.
Note that a numpy array with a given shape is internally described by a base pointer, the dimension sizes, and per-dimension byte strides.
If the byte strides happen to exactly match those that would be generated for a C order array, then we say the array is C order, and similarly for F order. But there is no preferred order in the representation, other than the default used by various array creation functions. A transpose operation (which converts a C order array to a corresponding F order array with reversed dimensions) simply permutes both the dimension sizes and the byte strides.
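A quick numpy illustration of that representation:

import numpy as np
a = np.zeros((3, 5))                # C order by default
print(a.shape, a.strides)           # (3, 5) (40, 8): last dimension contiguous
t = a.T                             # transpose: a view, no data copied
print(t.shape, t.strides)           # (5, 3) (8, 40): sizes and strides both permuted
print(t.flags['F_CONTIGUOUS'])      # True: the permuted strides match F order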
I think Julia and Matlab and Fortran definitely have a "native" order. NumPy supports both orders equally well, though it does have a default. C and C++ have built-in multi-dimensional array support, but as it is limited to arrays whose size is known at compile time, it is basically unused in practice outside of toy examples or small test cases, and can therefore be ignored. Eigen (C++ library for multi-dimensional arrays) supports both storage orders but defaults to Fortran order. Vigra (C++ library for multi-dimensional arrays) mostly just supports Fortran order, though it does also allow views with arbitrary order. xtensor (C++ library for multi-dimensional arrays) supports both storage orders but defaults to C order. opencv (C++ library) supports only C order. Tensorflow supports only C order. JAX supports C and Fortran order. TensorStore supports arbitrary storage orders like NumPy, but defaults to C order. Neuroglancer (JavaScript) supports arbitrary orders, and doesn't really have a preferred order at all. N5 (Java) uses Fortran order. HDF5 (Java) uses C order (I believe).
That makes a lot of sense, Jeremy. Thanks for the review of all the different language conventions. I have come across numerical code in Python that does not work well with Fortran-ordered arrays. I am 👍 on the transpose filter. I am also open to keeping the `order` option. I think it would be great to get some more voices from @zarr-developers/python-core-devs on this important topic.
I'll chime in with a few anecdotes about Julia, strides, and C-ordered arrays. I'm an active contributor to HDF5.jl, so some of this does sound familiar. I've also written packages such as NumPyArrays.jl and Napari.jl where these issues arise. In general, these packages offer mechanisms to access the underlying data via both Julia's conventions as well as foreign conventions. For example, consider how PythonCall's PyArray wraps a numpy array:

julia> using PythonCall
julia> np = pyimport("numpy");
julia> A = np.zeros((3,5), order = 'C'); A[2,1] = 1; A # using Python's 0-based, C-ordered indexing
Python ndarray:
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 1., 0., 0., 0.]])
julia> typeof(A)
Py
julia> B = PyArray(A)
3×5 PyArray{Float64, 2}:
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 0.0
julia> B[3,2] # using Julia's 1-based, F-order indexing
1.0
julia> strides(B)
(5, 1)

julia> B[3,5] = 4  # writing through the Julian view mutates the numpy array
4

julia> A
Python ndarray:
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 1., 0., 0., 4.]])
julia> C = np.zeros((3,5), order = 'F'); C[2,1] = 2; C # using Python's 0-based, C-ordered indexing
Python ndarray:
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 2., 0., 0., 0.]])
julia> D = PyArray(C) # using Julia's 1-based, F-order indexing
3×5 PyArray{Float64, 2}:
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 2.0 0.0 0.0 0.0
julia> D[3,2]
2.0
julia> strides(D)
(1, 3)
julia> D[3,5] = 9 # note that D is a Julian view of C
9
julia> C
Python ndarray:
array([[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 2., 0., 0., 9.]])
julia> E = PermutedDimsArray(D, [2,1])
5×3 PermutedDimsArray(::PyArray{Float64, 2}, (2, 1)) with eltype Float64:
0.0 0.0 0.0
0.0 0.0 2.0
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 9.0
julia> strides(E)
(3, 1)
As was pointed out above, Strided.jl is not necessarily needed for strided array support in Julia. The key interface is the generic `strides` function. The immediate consequence of this is that a PyArray supports `strides` without being a StridedArray:

julia> typeof(B)
PyArray{Float64, 2, true, false, Float64}
julia> typeof(B) <: StridedArray
false
julia> applicable(strides, B)
true
julia> f(x::AbstractArray) = f(x, Val(applicable(strides,x)))
f (generic function with 6 methods)
julia> f(x::AbstractArray, isstrided::Val{true}) = println("Using strided array optimizations...")
f (generic function with 6 methods)
julia> f(x::AbstractArray, isstrided::Val{false}) = println("The array is not strided!")
f (generic function with 6 methods)
julia> f(B)
Using strided array optimizations...

Presently, the StridedArray union cannot be extended by external packages, which is why PyArray is not a subtype of it even though it advertises strides.

Also note that Julia dispatches indexing based on type. A common type to use in Julia is a CartesianIndex:

julia> B
3×5 PyArray{Float64, 2}:
0.0 0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0 0.0
0.0 1.0 0.0 0.0 4.0
julia> CartesianIndices(B)[6]
CartesianIndex(3, 2)
julia> ci = CartesianIndex(3,2)
CartesianIndex(3, 2)
julia> B[ci]
1.0

Thus, another approach would be to define a custom ZarrIndex type:

julia> struct ZarrIndex{N}
           index::Dims{N}
       end

julia> ZarrIndex(args::Integer...) = ZarrIndex(args)
julia> Base.getindex(aa::AbstractArray, zi::ZarrIndex) = getindex(aa, (reverse(zi.index) .+ 1)...)
julia> zi = ZarrIndex(1,2)
ZarrIndex{2}((1, 2))
julia> B[zi]
1.0

With regard to indexing, Julia is actually incredibly flexible. Above I have outlined two mechanisms to address foreign indexing conventions.
In general, for an AbstractArray, both the index type and the axes can be customized, so foreign conventions are expressible without copying data.

I'm not sure why the focus turned to Julia here, since the issue is not really about Julia's limitations. There are a number of ways to address the Julia interface issues, and those should probably be discussed at https://github.com/JuliaIO/Zarr.jl/issues . Rather, the issue here is whether other languages can support column-major storage order or not. The interface and storage issues are separable: an implementation could support a particular kind of interface, or there may be multiple implementations for a language, each presenting a distinct interface. There are good arguments both for an interface that follows the language's default conventions and for one that follows some canonical convention. To me the question largely depends on the user. The question here, in this pull request, is what compliant implementations store on disk and whether other implementations will be able to read it. If the spec settles on a single storage order, that question is answered, and the interface questions can be left to each language.
Thanks for the clarification and added context, Mark. One point of clarification: you said that the PyArray type provided by your library provides a Julia-style 1-based, F-order indexing view. However, it looks like it just shifts to 1-based indexing but does not change the dimension order at all (different from HDF5.jl and Zarr.jl, which I believe both shift to 1-based indexing and reverse the dimension order). Is that correct? In general I think it would be very beneficial to the zarr v3 ecosystem if we can avoid any implementation reversing the dimension order by default, so that there is a simple, unambiguous way to describe the dimensions of a zarr array. Given the capabilities of Julia that you described, do you think that would be a possibility for a Julia zarr v3 implementation? The decision to reverse the dimension order in some C++ and Python n5 implementations has effectively brought major ambiguity regarding dimension order to the n5 ecosystem, and I'd very much like to avoid that in the zarr v3 ecosystem if possible.
Thanks a lot @mkitti for this helpful perspective from Julia.
I think I agree with everything you say here. So I think the question for the Zarr spec is whether we need to permit multiple storage orders, or whether we can get away with saying "we always use row-major storage order". It's clear that, when reading data, Julia is flexible enough to present the data with whatever order / indexing conventions the user wants. The situation when writing is less clear to me: given a base Julia array using the default column-major order, how should it be stored in Zarr? Are there significant performance penalties for converting to row-major order for storage?
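For intuition on the write-path cost: converting an F-order array to a C-order serialization is one full pass that permutes elements (a numpy sketch; the Julia situation is analogous):

import numpy as np
a = np.asfortranarray(np.arange(12, dtype='i4').reshape(3, 4))
c_bytes = a.tobytes(order='C')   # walks the array logically; permutes on the fly
f_bytes = a.tobytes(order='F')   # straight dump of the existing memory
assert c_bytes != f_bytes        # the two serializations really do differ

Whether that one extra copy matters depends on array size and on whether compression and I/O dominate the pipeline anyway.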
👍 I'll try to add another perspective: besides the language, the layout of the already-available data is also important. If data is available in a specific order (before storing it as zarr), it is much more efficient to persist the zarr array with the original order (if one wants to keep the original dimension order). This argument might not hold for frameworks or languages that only support C or F order, but for many it does (e.g. Python with numpy).
Just a brief comment to say that the original motivation for introducing the ability to change the way the data for a chunk is serialised into a contiguous sequence of bytes was compression. I have some arrays in my genomics work where the compression ratio is much better if chunk data are serialised using F rather than C order, although these are less common. This is usually because you get much longer runs of zeros with F order than with C order. In principle this improvement to compression could be achieved via a "transpose" filter, always using C order for the initial serialisation of chunk data when encoding. However, if you did this it might then be harder for an implementation to figure out when it can skip the transpose operation because it isn't needed, either when writing or reading data. If so, this could introduce more memory copies into the chunk encoding and/or decoding process, which might be tolerable, but I generally tried to avoid memory copies where possible for performance reasons.
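A toy illustration of the compression effect (a sketch assuming zlib as the compressor; the array is contrived so that nonzero values cluster along one axis):

import zlib
import numpy as np

a = np.zeros((1000, 1000), dtype='i4')
a[:, 0] = np.arange(1000)            # nonzeros confined to the first column

c_size = len(zlib.compress(a.tobytes(order='C')))  # zeros interleaved with values
f_size = len(zlib.compress(a.tobytes(order='F')))  # values first, then one huge run of zeros
print(c_size, f_size)                # f_size is typically noticeably smaller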
The point was to show how the indexing conventions can be flexible in Julia as well as in Python. I can see how the message became confused. Alternatively, I could wrap a normal Julia Array around the numpy array's buffer:

julia> pytype(A)
Python type: <class 'numpy.ndarray'>
julia> A.__array_interface__["data"][0]
Python int: 35843792
julia> ptr = Ptr{Float64}( pyconvert(UInt64, A.__array_interface__["data"][0]) )
Ptr{Float64} @0x000000000222eed0
julia> B_prime = unsafe_wrap(Array, ptr, (5,3))
5×3 Matrix{Float64}:
0.0 0.0 0.0
0.0 0.0 1.0
0.0 0.0 0.0
0.0 0.0 0.0
0.0 0.0 0.0
julia> B_prime[2,3]
1.0
julia> strides(B_prime)
(1, 5)

Regarding HDF5, let's save an array with h5py.

In [1]: import h5py, numpy as np
In [2]: E = np.zeros((3,4,5))
In [3]: E[2,3,4] = 1
In [4]: E
Out[4]:
array([[[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.]],
[[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1.]]])
In [5]: with h5py.File("test.h5", "w") as h5f:
   ...: h5f["data"] = E

Then load it with HDF5.jl.

julia> E = h5open("test.h5") do h5f
h5f["data"][]
end
5×4×3 Array{Float64, 3}:
[:, :, 1] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 2] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 3] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0
julia> E[5,4,3]
1.0
julia> h5open("test.h5") do h5f
h5f["data"][5,4,3]
end
1.0

Yes, you are correct. The convention of HDF5.jl is to "shift to 1-based indexing and reverse the dimension order". In particular, array data read into memory by HDF5.jl is returned as a normal Julia Array.
If I wanted a view of E with Python's dimension order and 0-based indexing, I could combine PermutedDimsArray with OffsetArrays:

julia> strides(E)
(1, 5, 20)
julia> using OffsetArrays
julia> Z = OffsetArray(PermutedDimsArray(E, [3,2,1]), (-1, -1, -1))
3×4×5 OffsetArray(PermutedDimsArray(::Array{Float64, 3}, (3, 2, 1)), 0:2, 0:3, 0:4) with eltype Float64 with indices 0:2×0:3×0:4:
[:, :, 0] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 1] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 2] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 3] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
[:, :, 4] =
0.0 0.0 0.0 0.0
0.0 0.0 0.0 0.0
0.0 0.0 0.0 1.0
julia> strides(Z)
(20, 5, 1)
julia> Z[2,3,4]
1.0
julia> typeof(Z)
OffsetArray{Float64, 3, PermutedDimsArray{Float64, 3, (3, 2, 1), (3, 2, 1), Array{Float64, 3}}}
julia> typeof(Z) <: Array
false
julia> typeof(Z) <: AbstractArray
true As long as you also change the representation, the Julia user should be fine regardless of what's in memory or in disk. The physical layout is rarely exposed directly. Also note that neither Should C-order 0-based indexing be the default? For Zarr.jl, that would really be up to @meggart because he has the most code invested based on the current convention. In my opinion, this would be quite difficult. The path to do so would probably involve deprecating most of the current API, adding an explicit indexing parameter to the type and its constructors, and then later making the indexing and representation implicit. An easier approach for a new default might be to start a Zarr3.jl package. However, it would probably be best to make the permutation of dimensions explicit in that case. A low effort approach would be to wrap zarr-python. |
@alimanfoo and I propose to proceed with having support for both C and F order in the v3 core spec. This still leaves room for changes such as having a transpose codec instead of the current `order` metadata field. I'm asking everyone involved in this conversation to signal whether this resolution seems acceptable to them to move forward with v3, via 👍/👎 on this comment. This discussion has been going on for almost 11 months without resolution, so we hope that this might be an acceptable compromise for all participants. cc @meggart @jakirkham @constantinpape @rabernat @jbms @ToucheSir @mkitti @bogovicj
I am okay with deferring arbitrary order support to an extension. I think that a transpose codec would be a reasonable way to provide it.
Note: zarr v2 effectively supports encoding orders that are neither C nor F order when using a structured dtype. Consider the following .zarray metadata:

{"zarr_format": 2,
"shape": [100, 200],
"chunks": [10, 20],
"dtype": ["a", "<u2", [2, 3]],
"compressor": null,
"fill_value": null,
"filters": [],
"order": "F"} Here each chunk has an outer array of shape |
👍 Would you like to add a PR for this, @jbms?
I will create a PR.
#189 was merged and implements the compromise proposed above. Since there were no further objections, I'm closing this issue.
Following up on the discussion in yesterday's community call, I want to suggest removing or deprecating the 'order' field in the metadata and making 'C' order the default.
I think the main reason is that most if not all published datasets use 'C' storage order, so it would be hard to argue that this is a feature that a large part of the community needs. In addition, AFAIK zarr-python is currently the only zarr implementation that supports 'F' storage; all others do not.
The initial reason to add Fortran order as an option was that compression might be more efficient in some cases (zarr-developers/zarr-python#7), but I don't know if that has ever been tested or used systematically.
There was a lot of discussion on the question of how programming languages that are column-major by default should deal with the storage order (zarr-developers/community#41), and I think some of the confusion can be resolved by simply not allowing 'F' storage order and always reversing the dimension order when reading/writing array metadata (just like the current behavior). Note that reversing the dimensions is also the default behavior in all HDF5 and NetCDF packages that I have worked with in column-major languages (R, Matlab and Julia).