
Picking up #1118: Do not convert subclasses of ndarray unless required #2956

Closed
wants to merge 42 commits

Conversation

keewis (Collaborator) commented May 12, 2019

This picks up the work done in #1118 and converts the unittest-based
tests to pytest (since that is what the other tests use).

closes #1118

Some of the suggestions there are not implemented yet:

  • the disabling of bottleneck if an ndarray subclass is used
  • testing for operations like xarray.concat. This does not work with any of the unit libraries (at least, pint, astropy.units, unyt and quantities don't support it yet).
  • testing with other unit libraries that use subclasses (e.g. astropy.units or unyt)
  • renaming the test file to test_units_subclass.py
  • fix indexing (NumpyIndexingAdapter)

The CI requirements files have changed since then, so more files than
the two updated right now may need the quantities entry.

pep8speaks commented May 12, 2019

Hello @keewis! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-26 13:40:47 UTC

max-sixty (Collaborator)

Looks great! Thanks @keewis

I don't know this area that well, so I'll defer to someone else on a proper review

keewis (Collaborator, Author) commented May 20, 2019

Most of the items above are implemented in unit-support, including support for array wrappers with units like pint (if my tests are sufficient to decide that: pint still complains/warns about stripped units). The calls to asarray and asanyarray are skipped when the object defines any of the __array_*__ methods together with a unit or units attribute.

I did not push these commits to this PR because most of it does not have anything to do with subclasses. I did not open a new PR for the branch either because it builds on this one and because I'd like to receive feedback on any of this first.

Combining this with dask arrays should still fail, though.

shoyer (Member) commented May 20, 2019

This looks like a good start to me. So we can start making progress on this, I think it would make sense to hide this behavior behind an option, e.g., xarray.set_options(enable_experimental_ndarray_subclass_support=True).

I would consider writing some tests with two of the ndarray subclasses that ship with NumPy:

  • np.matrix
  • np.ma.MaskedArray

Both of these subclasses violate some important principles of subclassing, namely they aren't substitutable with objects of their parent class. Thus, I think it would be reasonable to explicitly "blacklist" them for use inside xarray.DataArray. I think we already explicitly convert MaskedArray objects, but perhaps we should also add a check for np.matrix objects.
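The substitutability problem with np.matrix can be demonstrated directly (behavior of current NumPy, where np.matrix is pending deprecation):

```python
import numpy as np

# np.matrix is not substitutable for ndarray: operations that reduce
# dimensionality on an ndarray keep matrices stubbornly 2-d, and *
# means matrix multiplication instead of elementwise multiplication.
m = np.matrix([[1, 2], [3, 4]])
a = np.array([[1, 2], [3, 4]])

print(a[0].ndim)      # 1 -- indexing an ndarray row drops a dimension
print(m[0].ndim)      # 2 -- a matrix row is still a 1x2 matrix
print((a * a)[0, 0])  # 1 -- elementwise multiplication
print((m * m)[0, 0])  # 7 -- matrix multiplication!
```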

In terms of operations, one big class that comes to mind are aggregations, e.g., .mean(). It would be good to make sure that these work.
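As a baseline, well-behaved ndarray subclasses do survive aggregations, since NumPy reductions wrap their results in the subclass. A minimal sketch with a hypothetical Quantity subclass (not pint's actual class):

```python
import numpy as np

class Quantity(np.ndarray):
    # minimal hypothetical ndarray subclass carrying a ``units`` attribute
    def __new__(cls, values, units=""):
        obj = np.asarray(values, dtype=float).view(cls)
        obj.units = units
        return obj

    def __array_finalize__(self, obj):
        # propagate units through views and wrapped results
        self.units = getattr(obj, "units", "")

q = Quantity([1.0, 2.0, 3.0], units="m")
result = q.mean()
print(float(result))          # 2.0
print(type(result).__name__)  # Quantity -- the subclass survives the reduction
```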



@pytest.fixture
def data():
Member:

This is a minor point, but I would recommend against using pytest fixtures for test data. (Though we do use this style currently in xarray in a few places.)

Fixtures are great for setup/teardown (e.g., cleaning up temporary files), but when your test depends on data values, it's best if those values can be found in the test itself. Otherwise, the logic in a test is not self-contained, and you have to understand a much bigger context of code.

All this is true even for helper functions, but pytest's fixtures are even more magical.

So in summary:

  • Prefer creating test data in a test itself, if your test relies on any aspect of the data values.
  • If you do want to re-use constructors, prefer using normal helper functions to fixtures. These are easier to understand.
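The helper-function style recommended above might look like this (a hypothetical test for illustration, not one from the PR):

```python
import numpy as np

def create_data():
    # A plain helper instead of a fixture: the explicit call at the test
    # site makes it obvious where the data comes from, and you can jump
    # straight to the definition.
    return np.arange(10.0)

def test_mean():
    data = create_data()
    assert data.mean() == 4.5
```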

keewis (Author):

right, so I used helper functions instead

keewis force-pushed the member-arrays-with-units branch from 44d796a to 5aee870 on August 20, 2019, 15:28
keewis (Collaborator, Author) commented Aug 20, 2019

I'm not too sure about putting _asarray into npcompat.py, but apart from that I would say it is ready to be merged.

However, the question of whether or not this PR is obsolete now still remains: if one were to replace quantities with pint, the test checking that this does not work without the proposed changes enabled is the only one that fails.

return x, y, xp


def create_coords(x=None, y=None, xp=None):
Collaborator:

You write these in whichever format works best for you; don't take this as a dictum to change. But:

You might find that writing these using pytest fixtures makes the code nicer, and is easier to write

Here, you'd

  • add @pytest.fixture to this function,
  • rename it to coords,
  • and then in any test function that needs its result, add its name as a parameter, like def test_units_in_data_and_coords(coords):

You can have fixtures taking data from other fixtures, so you can still have the chain you're creating here, and you get parameterization for free!
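The fixture version described above would look roughly like this (hypothetical names, for illustration):

```python
import numpy as np
import pytest

@pytest.fixture
def x():
    return np.arange(5.0)

@pytest.fixture
def coords(x):
    # fixtures can request other fixtures, so a chain of helpers
    # translates directly into chained fixtures
    return {"x": x}

def test_units_in_data_and_coords(coords):
    assert coords["x"].shape == (5,)
```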

keewis (Author):

well, that was what I did before, but when I created this PR, I was told in #2956 (comment) that data fixtures were adding complexity to the tests so I should either create the data in the tests themselves or use helper functions.

Member:

I don't think it particularly matters whether you write fixtures or helper functions. My point was more about trying to keep test-specific inputs and assertions as close to each other as possible, so you can read a test without constantly referring to a helper function that defines the inputs.

Collaborator:

I hadn't seen that, sorry you're caught in the middle there @keewis.

For this case, you can do it how you wish; I think @shoyer's emphasis was on the locality rather than the format, so there's no benefit to create_x over an x fixture

(then we can all have a separate discussion about locality vs generality without holding this work up)

keewis (Author):

should I leave it as is, then, or rewrite it so that the data is created in each individual test? None of these tests actually need all of data, dimensions and coordinates to be subclass instances. They are almost the same as the ones in the original PR. I'd be happy either way, though I do think rewriting may make the tests a little easier to understand.

Collaborator:

Assuming it's a straight swap between create_X and a fixture X (i.e. no change to the locality), I think they would be better as pytest fixtures. They're acceptable as either, though.

@shoyer what are your thoughts about the case at hand?

keewis (Author), Aug 20, 2019:

What I meant was that at the moment I have a single 2d array subclass as data, and both dimensions and an extra coordinate are also subclass instances. The dimensions (and the coordinate) only matter in the tests that use them, so most of the data arrays could be 1d arrays created in place.

keewis (Author):

I modified the tests to create the data in place.

Collaborator:

OK great! Cheers @keewis

shoyer (Member) commented Aug 20, 2019

However, the question of whether or not this PR is obsolete now still remains: if one were to replace quantities with pint, the test checking that this does not work without the proposed changes enabled is the only one that fails.

Does pint do __array_function__ yet? It looks like the answer is "not yet" but there's a branch with experimental support: hgrecco/pint#764

We could add the experimental option for allowing subclasses for testing purposes, but ultimately I think we want to insist that objects we use inside data implement __array_function__, regardless of whether they are a numpy array subclass or duck array. Otherwise there's no way complicated operations like concat are going to work properly.
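For reference, a duck array opts into NumPy functions like np.concatenate through the protocol roughly like this (a hypothetical UnitArray, not pint's actual implementation; requires NumPy >= 1.17):

```python
import numpy as np

class UnitArray:
    # hypothetical duck array: wraps raw values plus a unit string
    def __init__(self, values, units):
        self.values = np.asarray(values)
        self.units = units

    def __array_function__(self, func, types, args, kwargs):
        # only handle the functions we explicitly support; returning
        # NotImplemented makes NumPy raise a TypeError for the rest
        if func is np.concatenate:
            arrays = args[0]
            if any(arr.units != self.units for arr in arrays):
                raise ValueError("cannot concatenate mismatched units")
            return UnitArray(
                np.concatenate([arr.values for arr in arrays], **kwargs),
                self.units,
            )
        return NotImplemented

a = UnitArray([1, 2], "m")
b = UnitArray([3, 4], "m")
c = np.concatenate([a, b])   # dispatches to UnitArray.__array_function__
print(c.values, c.units)     # [1 2 3 4] m
```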

I'd love to add tests that verify that xarray can properly wrap arrays with units as soon as possible. We could even merge tests that are all marked "xfail" and that currently only pass when using the experimental version of pint from hgrecco/pint#764.

As we saw with sparse array support, this will be useful for two reasons:

  1. It will help us identify remaining issues that need to be solved either in xarray or in an array-with-units library like pint.
  2. There are also likely some subtle tweaks we can do inside xarray to better support arrays with units, e.g., displaying units in the xarray.Dataset repr.

dopplershift (Contributor)

Yeah, pint doesn't do __array_function__ yet. It'd be great to get this in, if only because it lowers the barrier to making sure xarray works with pint in the future.

keewis (Collaborator, Author) commented Aug 20, 2019

Do you have a rough idea of which operations these tests should check? The ones currently implemented are mean() and an xfailing concat. There is also sel(), but as it only tries to use values with the matching unit, it works even though it should probably fail.

jthielen (Contributor)

I'd love to add tests that verify that xarray can properly wrap arrays with units as soon as possible. We could even merge tests that are all marked "xfail" and that currently only pass when using the experimental version of pint from hgrecco/pint#764.

This is actually something I've been working towards recently, but I've run into some delays. I'm still working on the pint side of things, since there are a fair number of numpy functions not yet implemented and tested in hgrecco/pint#764.

@keewis Would you want to take the lead on these xarray + pint with __array_function__ tests, since you've been working on related tests here, or would you want me to see what I can come up with based on what I've been doing so far?

keewis (Collaborator, Author) commented Aug 20, 2019

I think I would like to tackle this, but I definitely will need some help with finding examples for some of the more advanced operations.

@shoyer should I add these in a new PR or should I repurpose this one?

shoyer (Member) commented Aug 20, 2019

You are welcome to continue here or in a new PR, whichever you prefer. I don't have a strong opinion, please do whichever seems easier for you.

I think the best examples of complex operations to check for compatibility can be found in xarray/tests/test_sparse.py and xarray/tests/test_dask.py

keewis (Collaborator, Author) commented Aug 21, 2019

then I would like to finish this one and modify the tests in a new one.

keewis mentioned this pull request on Aug 22, 2019
keewis (Collaborator, Author) commented Aug 26, 2019

We could add the experimental option for allowing subclasses for testing purposes, but ultimately I think we want to insist that objects we use inside data implement __array_function__, regardless of whether they are a numpy array subclass or duck array. Otherwise there's no way complicated operations like concat are going to work properly.

@shoyer: What kind of testing would that be? If this is something that would be removed again in the future and has no actual value, feel free to close (but don't forget the original PR). Otherwise, I'd be happy to continue working on this.

keewis (Collaborator, Author) commented Dec 25, 2019

Since we have been making a lot of progress using __array_function__ (which does not exclude subclasses but is much more flexible), this approach is outdated and I'm closing this and #1118.

@keewis keewis closed this Dec 25, 2019
@keewis keewis deleted the member-arrays-with-units branch December 25, 2019 14:14
Labels
topic-arrays related to flexible array support