Remove deepcopies when slicing or manipulating cubes #939
Conversation
OK, I think I have fixed the Travis failures, except for the data array order checks. It looks like removing those would require removing order from …
@shoyer - this is clearly going to have knock-on effects as we're exchanging simplicity (at least from a mental model perspective) for improved performance (as you describe in the associated issue). My biggest concern is that cubes are more likely to become coupled through shared objects, leading to unexpected behaviour. I'm not saying I'm against this altogether, but we need to consider where we draw the line. Sharing data arrays seems attractive and we can draw parallels to the notion of views when documenting the behaviour. However, sharing metadata seems a little dangerous. As the code stands, you do a shallow copy on the metadata. In practice, property setters and some convoluted logic around the attributes dictionary (my fault) mean that the two resulting cubes will not share the majority of metadata. However, they may end up sharing some. For example, I think the cell_methods might be shared (although certain operations such as a collapse might break this link). Mutable objects in the attributes dictionary, such as arrays, will also be shared. I'd need convincing that this is a good idea.
@esc24 Thanks for taking a look. I agree, it's probably a good idea to deepcopy the metadata for both coordinates and cubes -- there should be little overhead in doing that, even if most (but not all) metadata is already immutable. I am inclined to draw the line at using views for cube data, coordinate points and coordinate bounds, which I think include all the data-intensive cube attributes. I will add another commit with these changes. I'm not quite sure about data managers (I don't think they are mutated, and the tests passed when I removed the copy), but they are going away with biggus and anyway, copying them is (probably?) not a performance limiter. My goal with this change is to get Iris's performance for basic operations to within epsilon of numpy's for large arrays. At that point Iris becomes a viable option for high performance computing (at least in cases where numpy would suffice). I still don't understand why Iris has gone to so much trouble (e.g., #929) to ensure that cube data arrays are always contiguously ordered :).
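For context, here is a minimal plain-numpy sketch (not Iris code) of the view-versus-copy trade-off being discussed: a slice shares memory with its parent, so mutations are visible through both, whereas a deep copy is independent but pays the full copying cost.

```python
import copy

import numpy as np

data = np.arange(12.0).reshape(3, 4)

view = data[0]             # a view: no data is copied
view[0] = -1.0
print(data[0, 0])          # -1.0 -- the change is visible through the parent

dup = copy.deepcopy(data)  # an independent copy, at the cost of copying everything
dup[0, 0] = 99.0
print(data[0, 0])          # still -1.0 -- the deep copy is decoupled
```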
Regarding the contiguous issue, I believe this is only being done in the unit tests to avoid being sensitive to numpy version 1.6 vs 1.7 when running the test suite. I don't believe any effort is made to ensure cube data arrays are contiguous in Iris itself. This raises the question of why we output the array ordering in the CML. Part of the answer is to help diagnose test failures, but perhaps in hindsight we should consider either removing it, making the tests insensitive to this particular attribute, or adding an …
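As an illustration of why fixed order checks become fragile once slicing returns views: a strided view of a C-contiguous array is itself neither C- nor Fortran-contiguous.

```python
import numpy as np

a = np.zeros((3, 4))                # freshly allocated: C-contiguous
print(a.flags['C_CONTIGUOUS'])      # True

b = a[:, ::2]                       # a strided view -- no copy is made
print(b.flags['C_CONTIGUOUS'])      # False
print(b.flags['F_CONTIGUOUS'])      # False
```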
I notice now that …
OK, these commits should get us closer. The main remaining issue for me is that I haven't been able to regenerate all the .cml files because I haven't gotten around to installing all the optional Iris dependencies (like gribapi). If someone else could do that for me that would be awesome... or I'll eventually figure that out if I devote some time to it.
Could someone please advise me how to retrigger Travis-CI so it will check this build? I seem to have missed something because it's not starting automatically now...
I think it's because this PR can't be merged automatically - i.e. there are merge conflicts which need a rebase to clear up.
@rhattersley Thanks! A rebase took care of it. The tests are passing on Travis-CI now, so I would appreciate another look.
If this looks good to you guys, I'm thinking the next step is probably updating the documentation to clarify that ndarrays associated with cubes and coordinates should not be assumed to have their own data.
Thanks @shoyer. I think this shows a lot of promise, but I need a little time to think about it. We're working hard on getting 1.6 out so I won't have time until next week.
Not a problem, there's no rush here.
Iris has always copied data arrays - even way back in its proof-of-concept days over three years ago. I suspect that was indeed because of the "simplicity" argument. But also Iris, and its target audience, had its roots in the IDL language, where I don't think the data-sharing concept exists. Now, given the prevalence of views within numpy, I think there's a good case to be made for using views for Cube.data instead. @shoyer - have you already made any measurements of the impact? I'm less clear on the benefit of using views for coordinates, though. It breaks the otherwise simple rule: metadata is not shared. NB. We might wish to use the …
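The sentence above is truncated, so the intended suggestion is unclear; one plausible candidate (an assumption, not a reconstruction) is numpy's writeable flag, which would let shared arrays be handed out as read-only views so that accidental mutation raises instead of silently coupling cubes:

```python
import numpy as np

data = np.arange(6.0)

shared = data.view()
shared.flags.writeable = False  # consumers get a read-only alias

try:
    shared[0] = 42.0
except ValueError as err:
    print(err)                  # "assignment destination is read-only"
```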
👍 This makes a lot of sense.
If we agree that order is unimportant, then perhaps we could pull this out as a separate PR? It's relevant to the biggus PR too, so it would be nice to decouple it from the potentially more contentious data-copying issue.
I've jumped the gun and done it anyway. Please feel free to complain on #967! 😉
This is a first attempt at removing unnecessary (and very slow) deepcopy operations when slicing or otherwise manipulating cubes and coordinates. See SciTools#914. Note: A few of the unit tests are failing, because they insist on checking the order (Fortran or C) of numpy arrays. I think these checks should be removed, because it is a waste of computational effort to always ensure arrays are contiguous. If some code needs to interface with external code that requires contiguous arrays, it should use np.ascontiguousarray or np.asfortranarray at the immediate level of the wrapper.
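A short sketch of the wrapper-level approach described in the note; `external_routine` is a hypothetical stand-in for any external code that requires a contiguous buffer:

```python
import numpy as np

def external_routine(buf):
    # Stand-in for external (e.g. C) code that assumes a C-contiguous buffer.
    assert buf.flags['C_CONTIGUOUS']
    return buf.tobytes()

def call_external(data):
    # The (possible) copy happens once, at the wrapper boundary, rather than
    # on every slice.  np.ascontiguousarray is a no-op for arrays that are
    # already C-contiguous.
    return external_routine(np.ascontiguousarray(data))

a = np.zeros((3, 4))
print(len(call_external(a[:, ::2])))  # works even though the slice is non-contiguous
```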
OK, I have rebased and squashed on the PR removing order checks. It's a little tricky to figure out the relevant benchmarks for this. Typically, copying data is not the limiting factor (although sometimes it is). Some operations (like …). Overall, the test suite runs about 2% faster with this patch on my computer: Ran 1677 tests in 62.946s. I do recall that removing the copy for coordinate points made a meaningful contribution... let me see if I can find the file where I benchmarked that.
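A rough sketch (not the original benchmark) of the kind of per-operation cost being removed, comparing a deep copy of a large array against a view-returning slice:

```python
import copy
import timeit

import numpy as np

data = np.zeros((1000, 1000))  # ~8 MB of float64

t_copy = timeit.timeit(lambda: copy.deepcopy(data), number=100)
t_view = timeit.timeit(lambda: data[:500], number=100)
print(f"deepcopy: {t_copy:.4f}s   view slice: {t_view:.6f}s")
```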
Question: What if we made this behaviour togglable, and the user had to opt in to it per session? e.g.
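The example code from this comment did not survive extraction; a hypothetical sketch of such an opt-in toggle (all names here are illustrative, not an actual Iris API) might look like:

```python
import contextlib

# Hypothetical module-level flag controlling whether slicing returns views.
SHARE_DATA = False

@contextlib.contextmanager
def share_data():
    # Opt in to view-based (non-copying) slicing for the duration of the block.
    global SHARE_DATA
    previous, SHARE_DATA = SHARE_DATA, True
    try:
        yield
    finally:
        SHARE_DATA = previous

# Usage (illustrative):
# with share_data():
#     sub_cube = cube[0]   # would return a view rather than a copy
```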
This would be a super useful addition, as there are known subsets of functionality where this can deliver performance enhancements.
@pelson this is a nice idea, but what if you call another function inside the context manager? Now the fact that that method uses slicing becomes part of its public API. This is part of the more general hazard of global flags that change functionality.
Replaced by #2549