
Numpy Data Indexing Convention #75

Closed

addisonElliott opened this issue Dec 10, 2018 · 13 comments

Comments

@addisonElliott
Collaborator

Please provide your feedback on this idea. I'm interested in starting a discussion and getting your input on the issue.

Background

The vast majority of Python numerical libraries use what is called C-ordered indexing rather than Fortran-ordered indexing. In simple terms, with C-ordered indexing the dimensions go from slowest varying to fastest varying; Fortran-ordered indexing is exactly the opposite. I've cited some sources that talk about this and discuss it in much greater detail than I am going to.

Long story short, let's say you have a 4D image with x, y, z, and time.

Here's how you would index it with C-order: data[t, z, y, x]

Here's how you would index it with Fortran-order: data[x, y, z, t]

MATLAB & Fortran use Fortran-order, obviously, and the benefit is that it is intuitive to have the dimensions in that order. C-order is used in C/C++ because you typically declare multi-dimensional arrays like this:

int data[10][5][4];
// Note: the fastest-varying dimension is the last one, so you would typically
// index as z, y, x in that case
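
To make the contrast concrete, here is a minimal NumPy sketch (not pynrrd-specific) showing that the same flat buffer can be viewed either way; only the index convention differs:

import numpy as np

vals = np.arange(24)

c_arr = vals.reshape((2, 3, 4))             # C order: last index varies fastest
f_arr = vals.reshape((4, 3, 2), order='F')  # Fortran order: first index varies fastest

# Same buffer, same traversal; the two views are just transposes of each other
assert np.array_equal(c_arr, f_arr.T)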

Issue

The NRRD specification uses Fortran-order (see here). I've tested with 3D Slicer, and it adheres to the standard: it stores the data shape as X, Y, Z.

However, most of the Python community seems to be using C-order, and I don't think pynrrd handles it appropriately.

List of Python libraries that use C-order:

  • Numpy uses C ordering by default. (It has an order option, but that is more about the memory layout than indexing.)
  • OpenCV (images are returned in (row, col, channel) or (y, x, d))
  • Scikit-image
  • Scipy
  • PIL, imageio, and imread all return (rows, cols, depth)
  • pydicom

The only exception I found was Nibabel which uses Fortran ordering.

Sources & Additional Information:

If we read in a NRRD file from 3D Slicer whose shape is (x, y, z), the shape will be the same when read into Numpy. If we were using C-ordering, the shape would instead be (z, y, x).
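
As a quick sketch of what I mean (the file name is just a placeholder for any volume exported from 3D Slicer):

import nrrd

# Hypothetical file whose sizes field is, say, 512 512 120
data, header = nrrd.read('slicer_volume.nrrd')

print(header['sizes'])  # e.g. [512 512 120] -> (x, y, z) straight from the NRRD header
print(data.shape)       # (512, 512, 120)    -> same order, i.e. Fortran-style indexing
# With C-style indexing the shape would instead be (120, 512, 512), i.e. (z, y, x)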

Solution

It's an easy fix but I'm more interested in what would be the best approach for users of pynrrd. I'm thinking of adding some sort of parameter/default option that can be set to transpose the data array.

The downside is that all of the dimension-related arrays will need to be flipped to correspond to the right axes.
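
A rough sketch of the kind of manual conversion I have in mind today (the file name is a placeholder; none of this is a pynrrd API yet):

import nrrd

data, header = nrrd.read('slicer_volume.nrrd')  # indexed (x, y, z)

# Transpose to C-style (z, y, x) indexing; this is only a view, not a copy
data_c = data.T

# Per-data-axis header fields would need to be flipped to match, e.g.:
sizes_c = header['sizes'][::-1]
space_directions_c = header['space directions'][::-1]  # reverse the row (axis) order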

@ihnorton
@tashrifbillah
@jcnils

@ihnorton
Contributor

My use cases for pynrrd are pretty much identical to nibabel's; so the existing situation, which matches both the data on disk and the nibabel choice, works very well for me.

The downside is that all of the dimension-related arrays will need to be flipped to correspond to the right axes.

It's not just the arrays. Images can have associated frame-of-reference characteristics which also must be accounted for in any reordering. For example, diffusion images may have associated acquisition gradients in a specific frame of reference. In FSL-style Nifti DWI, gradients are always reoriented so that the reference is the image plane, but even that can get dicey very quickly! NRRD DWI is more flexible but also more complicated, because the gradients may be stored relative to the measurement frame field.

@ihnorton
Contributor

Thanks for starting the discussion, by the way. This is valuable to discuss and document in any case.

@addisonElliott
Collaborator Author

addisonElliott commented Dec 10, 2018

Yeah, the fix is not as 'easy' as I thought it would be. My current thoughts are that we keep pynrrd the same but include documentation similar to what other Python libraries have.

It'll be up to the user to transpose the data if they desire. This will probably be what I do because there are a lot of libraries where you end up transposing anyway to get it to work correctly.

Edit: Upon further thought, most of the fields would be easy to flip. I'm envisioning sometime in the future having an index_order or order argument that can be Fortran or C (similar to Numpy). If the order is set to C-order, then we transpose the data and invert all relevant dimensions; i.e., something like header['space origin'] = header['space origin'][::-1] should do it.

You bring up a good point in terms of NRRD DWIs. If anyone has any ideas, I'm open to hearing them for support of DWIs and other 'frame-of-reference characteristics'.

@jcnils
Contributor

jcnils commented Dec 17, 2018

Thanks for inviting me to the conversation; sadly, my knowledge of medical imaging is limited.

But I think I can help by sharing how I've used it so far.

When I was using only Python, I used this code after reading the NRRD and before applying any calculations or methods to the matrix:

array = np.flip(array, 0)
array = np.flip(array, 1)

But then they asked me to port everything to 3D Slicer, and I basically removed those lines to get the same result (I was still loading some files using pynrrd).

Most libraries I use are wrappers around C/C++, so it makes sense to use their orientation, but when working with NRRD I was just following the info in the header and acting accordingly. Something like:

from collections import OrderedDict

header = OrderedDict([
    ('type', 'uint8'),
    ('dimension', 4),
    ('space', 'right-anterior-superior'),
    ('sizes', [24, 200, 220, 180]),
    ('kinds', ['list', 'domain', 'domain', 'domain']),
    ('encoding', 'gzip'),
    ('space origin', [16., 16., 0.])
])

I would break it into frames (24) and change the volume with the first code snippet in pure Python, but in Slicer those steps are not needed.

In any case, I like the idea of having the option to use both, because it could be friendlier for people coming from other libraries or MATLAB, as long as it is easy to identify in the code what is happening.

@addisonElliott
Collaborator Author

Thanks for your input! It's helpful to see how pynrrd is being used in the "field".

My current thoughts are to keep pynrrd the same but to include some documentation about C-order vs Fortran order because it can be very confusing to think about. Plus, a lot of the information out there can be sort of misleading in terms of how to deal with it.

As a summary, pynrrd currently uses Fortran-order indexing because that is what the NRRD specification says to use. Thus, a 3D volume would be organized such that you would index it as (x, y, z).

Most Python libraries out there, with Numpy being the most widely used, use C-order indexing, where a 3D volume would be indexed as (z, y, x). The one exception is Nibabel (used for loading Nifti files), which uses Fortran-ordering.

Since the other libraries use C-order, I want to show how simple it is to "convert" between the two. In this case, everything is the same except that I had to flip the spacing around.

nrrdHeaderDict = {
    'space': volume.space,
    # 'space directions': volume.orientation * volume.spacing[::-1, None],
    'space directions': np.identity(3) * volume.spacing[::-1, None],
    'space origin': volume.origin
}
nrrd.write(self.destEdit.text(), volume.data.T, nrrdHeaderDict)
logger.info('Saved NRRD file!')

@AAAArcus

AAAArcus commented Mar 4, 2019

My problem is that I have a 3D array that I want to keep the shape of, and that I need to be C-contiguous, but I want to use pynrrd for saving and loading. As far as I know, it is impossible to get something C-contiguous directly from nrrd.read() (it does not matter how I save the data).
The best solution seems to be saving an F-contiguous array with nrrd.write() (since that is MUCH faster than trying to save a C-contiguous array), then loading and using numpy.ascontiguousarray() to get something C-contiguous without messing with the shape (and consequently with the spacings and orientations, as shown above).

Would be great if this could be handled within pynrrd instead.

Note that converting a C-contiguous array with numpy.asfortranarray() before saving with nrrd.write() seems significantly faster than writing the C-contiguous array directly.
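
For reference, a rough sketch of the workaround I'm describing (the array and file name are just placeholders):

import numpy as np
import nrrd

vol = np.zeros((120, 256, 256), dtype=np.int16)  # placeholder C-contiguous volume

# Converting to an F-contiguous copy up front seems much faster than letting
# nrrd.write() serialize the C-contiguous array itself
nrrd.write('volume.nrrd', np.asfortranarray(vol))

# Reading back gives an F-contiguous array; force C-contiguity without
# changing the shape (or the spacings/orientations in the header)
data, header = nrrd.read('volume.nrrd')
data = np.ascontiguousarray(data)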

@addisonElliott
Collaborator Author

Thanks for the input @AAAArcus. It's always helpful to get second opinions...

My inclination at this point is not to add any new functionality, but to document the current functionality and mention ways to convert between the two. I'm happy to hear arguments/comments/suggestions otherwise though!

As I mentioned, the NRRD specification itself specifies that Fortran order should be used (see here). Thus, I think it makes sense to read the metadata information in Fortran order as well. For example, it makes much more sense for the space origin field to be in (x, y, z) order rather than (z, y, x). I think that should hold true even if the array is loaded in C-order. Plus, I think it is rather trivial to invert the point on the off chance that you need it that way (spaceOrigin[::-1]). In addition, the data itself can be converted to the so-called C-order by taking the transpose of the data, which is what I'm doing in my code.

However, from your comment, it seems that you want the data to be C-order in memory whilst keeping the shape the same, which is different from what I am talking about. So, if I'm understanding correctly, you want your data to have a shape of (x, y, z) but actually have it accessed in the reverse order (z, y, x) under the hood. In your case, the reason it doesn't matter how you save the data is that the Numpy data is converted to Fortran order before being saved (see here). Then, when the data is loaded with nrrd.read, I assume it is read in as Fortran order. My justification for not having functionality to convert the data back to C-contiguous is that it can be done with np.ascontiguousarray, as you mentioned.

So, let me make a few points...

Personally, I'm surprised that converting to a Fortran array using np.asfortranarray and then saving is faster than just saving directly. Per the spec, I think we should convert to a Fortran array, but if it is much slower, I would be willing to accept a PR to fix this if possible. The current implementation calls numpy.tostring(order='F'); this must be where the additional slowdown is occurring.
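
If anyone wants to confirm where the time goes, something along these lines should do it (a rough sketch; the array size is arbitrary):

import numpy as np
from timeit import timeit

vol = np.random.rand(256, 256, 120)  # arbitrary C-contiguous test volume

# What the writer effectively does today: serialize a C-contiguous array in Fortran order
t_direct = timeit(lambda: vol.tostring(order='F'), number=5)

# The reportedly faster path: make an F-contiguous copy first, then serialize
t_converted = timeit(lambda: np.asfortranarray(vol).tostring(order='F'), number=5)

print(t_direct, t_converted)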

In terms of reading a C-contiguous array, what do you propose for adding this feature to the library? Is it worth adding the feature when it can be done with one additional function call in the Numpy library? Also, is your use case for requiring C-contiguous data common in your opinion?

We're discussing two different problems: yours is related to how the data is laid out in memory, and mine is how the data is handled in terms of the axes and such.

@simeks
Contributor

simeks commented Mar 8, 2019

Hi,

Interesting discussion (and I like the project as well). Maybe I'm a bit late, but I had a brief discussion with @AAAArcus about using pynrrd in another project. I was a bit surprised about the F-order indexing at first. My thought is that you should move to C-order indexing since this is what numpy uses (https://docs.scipy.org/doc/numpy-1.13.0/user/basics.indexing.html). What you are doing now is reading a C-contiguous array (from an NRRD file) and flagging it as a Fortran array, basically tricking numpy into using Fortran-order.

As for the discussion about NRRD using Fortran-order, I think you got it wrong. NRRD uses Fortran-order only in the fields used for meta-data; the same would go for OpenCV (width, height, depth) and the like. The image data itself is specified in C-contiguous order, which makes sense since the reference implementation of NRRD was originally written in C.
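
To illustrate with a small sketch (the sizes are made up): the raw data stream can be viewed either way, and the two views are just transposes of each other:

import numpy as np

sizes = [4, 3, 2]  # NRRD 'sizes' field: listed from fastest to slowest axis
raw = np.arange(np.prod(sizes), dtype=np.uint8)  # stand-in for the raw bytes on disk

f_view = raw.reshape(sizes, order='F')        # Fortran-style indexing, shape (x, y, z)
c_view = raw.reshape(sizes[::-1], order='C')  # C-style indexing, shape (z, y, x)

assert np.array_equal(f_view, c_view.T)       # same bytes, transposed index order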

Also, if you stick with the Fortran-ordering, I think it's important to be very clear about this. For instance, right now nrrd.write is broken if I pass it a C-contiguous array: no implicit conversion or even a warning.

@addisonElliott
Collaborator Author

addisonElliott commented Mar 8, 2019

Hello @simeks

You're not late to the conversation since nothing has been decided on what to do yet. I'm always open to comments and suggestions.

I want to say that your comment was very enlightening to me. I've spent a lot of time reading about Fortran and C-ordering and it can easily get confusing.

Yes, after further thought, you are correct that the data is stored as C-contiguous. The current approach for reading the array is to reshape it into Fortran order, like you mentioned. Saving the data likewise converts it to bytes in Fortran order.

The confusion for me lies mainly in how Numpy handles ordering. In certain cases Numpy uses the order parameter to indicate the memory ordering, or how the data is laid out in memory. However, there are a few functions where the order parameter indicates the index ordering.

My thoughts now are that maybe using the order parameter is the wrong approach altogether, since the purpose is only to change the index ordering from C-order to Fortran order. Rather, maybe a better approach would be to have an index_order parameter in the read and write functions that, if set to Fortran ordering, will simply transpose the data before reading/writing. The default value for this parameter would need to be Fortran ordering to keep backwards compatibility, but at a later date the default could be switched to C-ordering. Then, the reshape & tostring calls would not specify an order parameter and would use the default based on the data.

What are your thoughts?

Here are the current lines of code that are causing problems and that we need to rethink.

reader.py

data = np.reshape(data, tuple(header['sizes']), order='F')

writer.py

raw_data = data.tostring(order='F')

Edit:

To be clear, this is what I'm kinda envisioning. Curious to see if there are any problems with this method, or whether there is a more optimal approach.

reader.py

data = np.reshape(data, tuple(header['sizes'][::-1]))  # C-index shape is the reversed sizes

if index_order == 'F':
    data = data.T

writer.py

if index_order == 'F':
    data = data.T

raw_data = data.tostring()
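
Usage would then look something like this (hypothetical; the parameter name and defaults are still up for discussion):

import nrrd

data_f, header = nrrd.read('volume.nrrd', index_order='F')  # shape (x, y, z), current behaviour
data_c, header = nrrd.read('volume.nrrd', index_order='C')  # shape (z, y, x)

nrrd.write('copy.nrrd', data_c, header, index_order='C')    # symmetric call on the write side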

@simeks
Contributor

simeks commented Mar 9, 2019

I want to say that your comment was very enlightening to me. I've spent a lot of time reading about Fortran and C-ordering and it can easily get confusing.

Glad to help. This ordering business has always been confusing for me as well. I think one source of confusion is the fact that index ordering and memory ordering are inherently the same. You flip one and the other follows.

What are your thoughts?

The idea with index_order makes sense. This respects the underlying memory order while keeping backwards compatibility. I think this could make things a bit faster as well, at least within pynrrd, since the transpose won't actually change the underlying data, as compared to the previous case of doing nrrd.write on a C-contiguous array.
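
A quick sketch of why the transpose is essentially free:

import numpy as np

a = np.zeros((2, 3, 4))         # C-contiguous
b = a.T                         # shape (4, 3, 2): just a view with swapped strides
print(b.flags['F_CONTIGUOUS'])  # True
print(b.base is a)              # True -- no copy of the underlying buffer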

@addisonElliott
Collaborator Author

Sounds good.

You're welcome to make a PR if you'd like. Otherwise, it'll be a while before I can get some time. We should also document the new argument to prevent confusion in the future.

@simeks
Contributor

simeks commented Mar 10, 2019

Sure, I should be able to take a look at it tomorrow.

@AAAArcus

Thanks for the feedback @addisonElliott. Sorry for being inactive, but it seems we have some progress! It's easy to get confused with contiguous here and order there (I'm still a bit confused...), so it would be great if we could make it less confusing for the user of pynrrd. Thanks for getting involved @simeks!
