Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove deepcopies when slicing or manipulating cubes #939

Closed

Conversation

shoyer
Copy link
Contributor

@shoyer shoyer commented Jan 14, 2014

This is a first attempt at removing unnecessary (and very slow) deepcopy operations with slicing or otherwise manipulating cubes and coordinates. See #914.

Note: A few of the unit tests are failing, because they insist on checking the order (Fortran or C) of numpy arrays. I think these checks should be removed, because it is a waste of computational effort to always ensure arrays are contiguous. If some code needs to interface with external modules code that requires continguous arrays, it should use np.ascontiguousarray or np.asfortranarray at the immediate level of the wrapper.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 14, 2014

OK, I think I have fixed the Travis failures, except for the data array order checks.

It looks like to remove those we would need to remove order from Cube._xml_element and also from all the cml files in tests/results? I will hold off on that for now since I'm not entirely sure if these XML representations are used for anything other than unit tests.

@esc24
Copy link
Member

esc24 commented Jan 15, 2014

@shoyer - this is clearly going to have knock on effects as we're exchanging simplicity (at least from a mental model perspective) for improved performance (as you describe in the associated issue). My biggest concern is that cubes are more likely to become coupled through shared objects leading to unexpected behaviour. I'm not saying I'm against this altogether, but we need to consider where we draw the line. Sharing data arrays seems attractive and we can draw parallels to the notion of views when documenting the behaviour. However, sharing metadata seems a little dangerous. As the code stands you do a shallow copy on the metadata. In practice, property setters and some convoluted logic around the attributes dictionary (my fault) means that the two resulting cubes will not share the majority of metadata. However, they may end up sharing some. For example, I think the cell_methods might be shared (although certain operations such as a collapse might break this link). Mutable objects in the attributes dictionary such as arrays will also be shared. I'd need convincing that this is a good idea.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 15, 2014

@esc24 Thanks for taking a look.

I agree, it's probably a good idea to deepcopy the metadata for both coordinates and cubes -- there should be little overhead in doing that, even if most (but not all) metadata is already immutable. I am inclined to draw the line at using views for cube data, coordinate points and coordinate bounds, which I think include all the data intensive cube attributes. I will add another commit with these changes.

I'm not quite sure about data managers (I don't think they are mutated and the tests passed when I removed the copy), but they are going away with biggus and anyways copying them is (probably?) not a performance limiter.

My goal with this change is to get Iris's performance for basic operations to within epsilon of numpy's for large arrays. At that point Iris becomes a viable option for high performance computing (at least in cases where numpy would suffice).

I still don't understand why Iris has gone to so much trouble (e.g., #929) to ensure that cube data arrays are always contiguously ordered :).

@esc24
Copy link
Member

esc24 commented Jan 15, 2014

Regarding the contiguous issue, I believe this is only being done in the the unit tests to avoid being sensitive to numpy version 1.6 vs 1.7 when running the test suite. I don't believe any effort is made to ensure cubes data arrays are contiguous in Iris itself. This begs the question of why we output the array ordering in the CML. Part of the answer is to help diagnose test failures, but perhaps in hindsight we should consider either removing it, making the tests insensitive to this particular attribute, or adding an order keyword to iris.tests.IrisTest.assertCML. Does that make any sense?

@shoyer
Copy link
Contributor Author

shoyer commented Jan 15, 2014

I notice now that xml is a public method exposed on the Cube and CubeList, although I doubt it is used very outside of Iris itself. Thus for backward compatibility, I am now inclined to add order=True as a keyword argument for these xml methods. For now, I'll make assertCML call xml with order=False with no option to check ordering, but that would be easy to add if it ever turns out to be useful.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 16, 2014

OK, these commits should get us closer. The main remaining issue for me is that I haven't been able to regenerate all the .cml files because I haven't gotten around to installing all the optional Iris dependencies (like gribapi). If someone else could do that for me that would be awesome... or I'll eventually figure that out if I devote some time to it.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 17, 2014

Could someone please advise me how to retrigger Travis-CI so it will check this build? I seem to have missed something because it's not starting automatically now...

@rhattersley
Copy link
Member

it's not starting automatically now...

I think it's because this PR can't be merged automatically - i.e. there are merge conflicts which need a rebase to clear up.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 17, 2014

@rhattersley Thanks! A rebase took care of it.

The tests are passing now on Travis-CI, so I would appreciate another look now.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 17, 2014

If this looks good to you guys, I'm thinking the next step is probably updating the documentation to clarify that ndarrays associated with cubes and coordinates should not be assumed to have their own data.

@esc24
Copy link
Member

esc24 commented Jan 17, 2014

Thanks @shoyer. I think this shows a lot of promise, but I need a little time to think about it. We're working hard on getting 1.6 out so I won't have time until next week.

@shoyer
Copy link
Contributor Author

shoyer commented Jan 17, 2014

Not a problem, there's no rush here.

@rhattersley
Copy link
Member

we're exchanging simplicity (at least from a mental model perspective) for improved performance

Iris has always copied data arrays - even way back in its proof-of-concept days over three years ago. I suspect that was indeed because of the "simplicity" argument. But also Iris, and its target audience, had its roots in the IDL language where I don't think the data-sharing concept exists.

Now, given the prevalance of views within numpy I think there's a good case to be made for using views for Cube.data instead. @shoyer - have you already made any measurements of the impact?

I'm less clear on the benefit of using views for coordinates though. It breaks the otherwise simple rule: metadata is not shared.

NB. We might wish to use the iris.FUTURE mechanism to allow any new behaviour to be opt-in for a while and preserve backwards compatibility.

If some code needs to interface with external modules code that requires continguous arrays, it should use np.ascontiguousarray or np.asfortranarray at the immediate level of the wrapper.

👍 This makes a lot of sense.

I'll make assertCML call xml with order=False with no option to check ordering

If we agree that order is unimportant, then perhaps we could pull this out as a separate PR? It's relevant to the biggus PR too so it would be nice to de-couple from the potentially more contentious data-copying issue.

@rhattersley
Copy link
Member

perhaps we could pull this out as a separate PR

I've jumped the gun and done it anyway. Please feel free to complain on #967! 😉

This is a first attempt at removing unnecessary (and very slow) deepcopy operations
with slicing or otherwise manipulating cubes and coordinates. See SciTools#914.

Note: A few of the unit tests are failing, because they insist on checking the
order (Fortran or C) of numpy arrays. I think these checks should be removed,
because it is a waste of computational effort to always ensure arrays are
contiguous. If some code needs to interface with external modules code that
require continguous arrays, it should use np.ascontiguousarray or
np.asfortranarray at the immediate level of the wrapper.
@shoyer
Copy link
Contributor Author

shoyer commented Jan 21, 2014

OK, I have rebased and squashed on the PR removing order checks.

It's a little tricky to figure out the relevant benchmarks for this. Typically, copying data is not the limiting factor (although sometimes it is). Some operations (like Cube.slices) become almost infinitely faster.

Overall, the test suite runs about 2% faster with this patch on my computer:

Ran 1677 tests in 62.946s
vs
Ran 1677 tests in 61.685s

I do recall that removing the copy for coordinate point made a meaningful contribution... let me see if I can find the file where I benchmarked that.

@rhattersley rhattersley mentioned this pull request Jan 23, 2014
@pelson
Copy link
Member

pelson commented Jun 30, 2015

Question: What if we made this behaviour togglable, and the user has to opt-in to it per-session? e.g.

with iris.features.np_views:
     # c2.data is a view of cube.data
     c2 = cube[0, ::2]

# c3.data is a copy of cube.data
c3 = cube[0, ::2]

This would be a super useful addition, as there are known subsets of functionality where this can deliver performance enhancements.

@shoyer
Copy link
Contributor Author

shoyer commented Jun 30, 2015

@pelson this is a nice idea, but what if you call another function inside the context manager? Now the fact that that method uses slicing becomes part of its public API.

This is part of the more general hazard of global flags that change functionality.

@djkirkham
Copy link
Contributor

Replaced by #2549

@djkirkham djkirkham closed this Jul 3, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants