-
Notifications
You must be signed in to change notification settings - Fork 143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi-dimensional arrays have reversed order if loaded by h5py in Python3 #785
Comments
Hmm I think this is intentional since Julia uses column-major ordering vs. C/Python row-major ordering. Altough we should probably fix this so it's consistent. |
Hi @musm, since you are considering a fix, I propose that we swap the dimensions (of any array with dimension > 1) by Although C/Python are more popular these days, do you want to make this an optional argument, say May I do the fix? |
Sure, I think you would basically need @jmert what do you think about this? |
Exactly, just checked. Let Julia construct a column-major 3D-array with julia> reshape(collect(1:4*3*2), (2, 3, 4))
2×3×4 Array{Int64,3}:
[:, :, 1] =
1 3 5
2 4 6
[:, :, 2] =
7 9 11
8 10 12
[:, :, 3] =
13 15 17
14 16 18
[:, :, 4] =
19 21 23
20 22 24 Save it to a using HDF5
test3d = reshape(collect(1:4*3*2), (2, 3, 4)) # by default Julia reshape it into column-major
h5write("test3d.h5", "col", test3d)
h5write("test3d.h5", "row", permutedims(test3d, reverse(1:ndims(test3d)))) Read it in Python3: import h5py
with h5py.File("./test3d.h5", 'r') as f3d:
print(f3d['col'][:], f3d['col'][:].shape) # should be row-major now in Python
print(f3d['row'][:], f3d['row'][:].shape) # should be column-major now in Python we have the expected output: [[[ 1 2]
[ 3 4]
[ 5 6]]
[[ 7 8]
[ 9 10]
[11 12]]
[[13 14]
[15 16]
[17 18]]
[[19 20]
[21 22]
[23 24]]] (4, 3, 2)
[[[ 1 7 13 19]
[ 3 9 15 21]
[ 5 11 17 23]]
[[ 2 8 14 20]
[ 4 10 16 22]
[ 6 12 18 24]]] (2, 3, 4) |
It's something I had in the back of my mind, but I haven't yet take the time to dig through development history to see why this particular choice was made. In my own use case, there's no ambiguity (I've been most interested in reading serialized CSC sparse matrices which get stored as multiple vectors and some metadata) so I haven't worried about it too much. The immediate complication/downside I see to performing any implicit permuting of dimensions is that memory mapping an array won't equal reading the array. |
Ah, I see. For me, people in my lab work on 3D images packed into 4D arrays, so I never thought of compression. I think TensorFlow/PyTorch users (including me) start to try Flux now, so it might be good if HDF5.jl have an out-of-the-box optional arg on read/write.
Maybe we can have a slow version with |
Isn't
It is unfortunate that HDF5 does not store a flag that indicates whether an array is stored in a row-major or column-major way. Given this, what I describe above is the standard approach. |
Thank you for sharing some ideas! Indeed But changing the order is not only about writing but also about reading, and I wouldn't expect Julia users in the future seeing an HDF5 file have to know Python. Also, the (lazy and probably-better-in-someway) I guess some users have to face this problem in some ways, with or without this repo. Shall we close the issue or do you think we should get a built-in optional argument? @musm @jmert |
So if I understand correctly the issue is explicitly with hdf5 arrays created in numpy and that there's no issue with hdf5.jl reading hdf5 arrays created in say fortran w.r.t their ordering. Does numpy store any optional metadata that we could use to help make a decision on whether we need to wrap it in a tranpose op or not? |
Yes there is a problem reading HDF5 arrays created by h5py. I have not tried HDF5.jl on Fortran-written HDF5 files.
Good point! There is a |
Hi @musm, I tried to check with a file created by Python's h5py, but it seems that there is no information about the shape unless you look at the object itself: julia> h5open("data/plant1_0hrs.h5", "r") do file
println(keys(file))
println(size(file["raw"]))
println(HDF5.attributes(file["raw"]))
end
["label", "raw"]
(512, 512, 134)
Attributes of HDF5.Dataset: /raw (file: data/plant1_0hrs.h5 xfer_mode: 0) I believe the shape information is not explicitly written into the file. |
I first found this discourse comment on HDF5's intentions for storing arrays — in my reading, HDF5.jl is doing exactly what was intended and is reversing the order of dimensions to have the "fastest moving" (unit stride) dimension be the last listed dimension in HDF5's dataspace. (This is backed up by how libhdf5's own Fortran interface is documented as behaving: see section 7.3.2.5 of the HDF5 user guide.) That being said, I too find the situation frustrating — On the one hand, they're right in that the dataspace dimension reversal preserves the [relative] structuring of the data (i.e. adjacent elements are still adjacent, strides between elements are preserved, etc), but it leaves an ambiguity in the "linear algebra"/"image" ordering of the dimensions. They've gone through careful effort in things like providing separate data types for different endianesses, so it would have been nice to have some sort of "dimensions reversed" flag on the data space to indicate whether the dimensions in matrix/math notation and the dataspace agree (C ordering) or are reversed (Fortran ordering). I guess that means, I'm arguing to err on the side of leaving things as they are since it agrees with how the official Fortran interface behaves (and Julia's column major ordering matches Fortran's). I don't think I'm in support for adding flags since the official libhdf5 docs do clearly state what is intended, and that kind of knowledge probably should be required of the user [of HDF5.jl] — |
Indeed there is no elegant solution. I agree that making the users aware of this is one of the best things we can do. The comments here are perhaps very informative and helpful already! Shall we close the issue? |
Agreed. I think the best we could do is document this behavior here. @qin-yu any chance you want to take a stab at it? |
Sure, though I'm not an expert on this. Maybe we can have a short notice in README + a paragraph in documentation + a sentence in docstring for reading functions? |
The main doc file is simply a markdown file located https://github.com/JuliaIO/HDF5.jl/blob/master/docs/src/index.md |
Hi, I noticed that if I save a
3×473×600 Array{UInt8,3}
array usingh5write()
in Julia and then load this dataset into Python3 withh5py.File()
, the loaded numpy array has shape(600, 473, 3)
. Is it a known issue? Also, is it a design choice?The text was updated successfully, but these errors were encountered: