Remove deepcopies when slicing or manipulating cubes #939
Conversation
OK, I think I have fixed the Travis failures, except for the data array order checks. It looks like removing those would require removing order from …
@shoyer - this is clearly going to have knock-on effects as we're exchanging simplicity (at least from a mental model perspective) for improved performance (as you describe in the associated issue). My biggest concern is that cubes are more likely to become coupled through shared objects, leading to unexpected behaviour. I'm not saying I'm against this altogether, but we need to consider where we draw the line. Sharing data arrays seems attractive and we can draw parallels to the notion of views when documenting the behaviour. However, sharing metadata seems a little dangerous. As the code stands, you do a shallow copy on the metadata. In practice, property setters and some convoluted logic around the attributes dictionary (my fault) mean that the two resulting cubes will not share the majority of metadata. However, they may end up sharing some. For example, I think the cell_methods might be shared (although certain operations such as a collapse might break this link). Mutable objects in the attributes dictionary, such as arrays, will also be shared. I'd need convincing that this is a good idea.
@esc24 Thanks for taking a look. I agree, it's probably a good idea to deepcopy the metadata for both coordinates and cubes -- there should be little overhead in doing that, even if most (but not all) metadata is already immutable. I am inclined to draw the line at using views for cube data, coordinate points and coordinate bounds, which I think include all the data-intensive cube attributes. I will add another commit with these changes. I'm not quite sure about data managers (I don't think they are mutated, and the tests passed when I removed the copy), but they are going away with biggus and anyway, copying them is (probably?) not a performance limiter. My goal with this change is to get Iris's performance for basic operations to within epsilon of numpy's for large arrays. At that point Iris becomes a viable option for high performance computing (at least in cases where numpy would suffice). I still don't understand why Iris has gone to so much trouble (e.g., #929) to ensure that cube data arrays are always contiguously ordered :).
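For context, here is a minimal plain-numpy sketch (not Iris code) of the view-versus-copy trade-off being discussed: a slice shares memory with its parent, so mutations are visible through both, whereas a deep copy is independent but pays the full copying cost.

```python
import copy

import numpy as np

data = np.arange(12.0).reshape(3, 4)

view = data[0]             # a view: no data is copied
view[0] = -1.0
print(data[0, 0])          # -1.0 -- the change is visible through the parent

dup = copy.deepcopy(data)  # an independent copy, at the cost of copying everything
dup[0, 0] = 99.0
print(data[0, 0])          # still -1.0 -- the deep copy is decoupled
```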
Regarding the contiguous issue, I believe this is only being done in the unit tests to avoid being sensitive to numpy version 1.6 vs 1.7 when running the test suite. I don't believe any effort is made to ensure cube data arrays are contiguous in Iris itself. This raises the question of why we output the array ordering in the CML. Part of the answer is to help diagnose test failures, but perhaps in hindsight we should consider either removing it, making the tests insensitive to this particular attribute, or adding an …
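As an illustration of why fixed order checks become fragile once slicing returns views: a strided view of a C-contiguous array is itself neither C- nor Fortran-contiguous.

```python
import numpy as np

a = np.zeros((3, 4))                # freshly allocated: C-contiguous
print(a.flags['C_CONTIGUOUS'])      # True

b = a[:, ::2]                       # a strided view -- no copy is made
print(b.flags['C_CONTIGUOUS'])      # False
print(b.flags['F_CONTIGUOUS'])      # False
```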
I notice now that …
OK, these commits should get us closer. The main remaining issue for me is that I haven't been able to regenerate all the .cml files because I haven't gotten around to installing all the optional Iris dependencies (like gribapi). If someone else could do that for me that would be awesome... or I'll eventually figure that out if I devote some time to it.
Could someone please advise me how to retrigger Travis-CI so it will check this build? I seem to have missed something because it's not starting automatically now...
I think it's because this PR can't be merged automatically - i.e. there are merge conflicts which need a rebase to clear up.
@rhattersley Thanks! A rebase took care of it. The tests are passing on Travis-CI now, so I would appreciate another look.
If this looks good to you guys, I'm thinking the next step is probably updating the documentation to clarify that ndarrays associated with cubes and coordinates should not be assumed to have their own data.
Thanks @shoyer. I think this shows a lot of promise, but I need a little time to think about it. We're working hard on getting 1.6 out so I won't have time until next week.
Not a problem, there's no rush here.
Iris has always copied data arrays - even way back in its proof-of-concept days over three years ago. I suspect that was indeed because of the "simplicity" argument. But also Iris, and its target audience, had its roots in the IDL language, where I don't think the data-sharing concept exists. Now, given the prevalence of views within numpy, I think there's a good case to be made for using views for Cube.data instead. @shoyer - have you already made any measurements of the impact? I'm less clear on the benefit of using views for coordinates, though. It breaks the otherwise simple rule: metadata is not shared. NB. We might wish to use the …
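The sentence above is truncated, so the intended suggestion is unclear; one plausible candidate (an assumption, not a reconstruction) is numpy's writeable flag, which would let shared arrays be handed out as read-only views so that accidental mutation raises instead of silently coupling cubes:

```python
import numpy as np

data = np.arange(6.0)

shared = data.view()
shared.flags.writeable = False  # consumers get a read-only alias

try:
    shared[0] = 42.0
except ValueError as err:
    print(err)                  # "assignment destination is read-only"
```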
👍 This makes a lot of sense.
If we agree that order is unimportant, then perhaps we could pull this out as a separate PR? It's relevant to the biggus PR too, so it would be nice to decouple it from the potentially more contentious data-copying issue.
I've jumped the gun and done it anyway. Please feel free to complain on #967! 😉
This is a first attempt at removing unnecessary (and very slow) deepcopy operations when slicing or otherwise manipulating cubes and coordinates. See SciTools#914. Note: A few of the unit tests are failing, because they insist on checking the order (Fortran or C) of numpy arrays. I think these checks should be removed, because it is a waste of computational effort to always ensure arrays are contiguous. If some code needs to interface with external code that requires contiguous arrays, it should use np.ascontiguousarray or np.asfortranarray at the immediate level of the wrapper.
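A short sketch of the wrapper-level approach described in the note; `external_routine` is a hypothetical stand-in for any external code that requires a contiguous buffer:

```python
import numpy as np

def external_routine(buf):
    # Stand-in for external (e.g. C) code that assumes a C-contiguous buffer.
    assert buf.flags['C_CONTIGUOUS']
    return buf.tobytes()

def call_external(data):
    # The (possible) copy happens once, at the wrapper boundary, rather than
    # on every slice.  np.ascontiguousarray is a no-op for arrays that are
    # already C-contiguous.
    return external_routine(np.ascontiguousarray(data))

a = np.zeros((3, 4))
print(len(call_external(a[:, ::2])))  # works even though the slice is non-contiguous
```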
OK, I have rebased and squashed on the PR removing order checks. It's a little tricky to figure out the relevant benchmarks for this. Typically, copying data is not the limiting factor (although sometimes it is). Some operations (like …). Overall, the test suite runs about 2% faster with this patch on my computer: Ran 1677 tests in 62.946s. I do recall that removing the copy for coordinate points made a meaningful contribution... let me see if I can find the file where I benchmarked that.
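A rough sketch (not the original benchmark) of the kind of per-operation cost being removed, comparing a deep copy of a large array against a view-returning slice:

```python
import copy
import timeit

import numpy as np

data = np.zeros((1000, 1000))  # ~8 MB of float64

t_copy = timeit.timeit(lambda: copy.deepcopy(data), number=100)
t_view = timeit.timeit(lambda: data[:500], number=100)
print(f"deepcopy: {t_copy:.4f}s   view slice: {t_view:.6f}s")
```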
Question: What if we made this behaviour togglable, and the user had to opt in to it per session? e.g.
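The example code from this comment did not survive extraction; a hypothetical sketch of such an opt-in toggle (all names here are illustrative, not an actual Iris API) might look like:

```python
import contextlib

# Hypothetical module-level flag controlling whether slicing returns views.
SHARE_DATA = False

@contextlib.contextmanager
def share_data():
    # Opt in to view-based (non-copying) slicing for the duration of the block.
    global SHARE_DATA
    previous, SHARE_DATA = SHARE_DATA, True
    try:
        yield
    finally:
        SHARE_DATA = previous

# Usage (illustrative):
# with share_data():
#     sub_cube = cube[0]   # would return a view rather than a copy
```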
This would be a super useful addition, as there are known subsets of functionality where this can deliver performance enhancements.
@pelson this is a nice idea, but what if you call another function inside the context manager? Now the fact that that method uses slicing becomes part of its public API. This is part of the more general hazard of global flags that change functionality.
Replaced by #2549