-
-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New alignment option: join='strict'
#8698
New alignment option: join='strict'
#8698
Conversation
Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient. |
CI Failed because of multiple timeouts being > 180s: https://github.com/pydata/xarray/actions/runs/7768517575/job/21186540956?pr=8698 Does it happen sometimes? If so, is it possible to re-trigger the failed pipeline? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry this didn't get an earlier review...
I left one question.
Others know more about this area so they should feel free to comment; and it's a big enough change that I'll ask others to approve before merging
xarray/tests/test_dataarray.py
Outdated
def test_align_exact_vs_strict(self) -> None: | ||
xda_1 = xr.DataArray([1], dims="x1") | ||
xda_2 = xr.DataArray([1], dims="x2") | ||
|
||
# join='exact' passes | ||
aligned_1, aligned_2 = xr.align(xda_1, xda_2, join="exact") | ||
assert aligned_1 == xda_1 | ||
assert aligned_2 == xda_2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[edited from earlier incorrect response]
The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?
Another option would be refining exact
. It sounds like you tried this but many tests failed. It might be worth pushing that PR if you still have it.
I'd want to ask others why we don't enforce identical dimension names...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?
I added a test with 2D arrays, they are concerned too
What I can do as a simple test is to systematically transform 'exact' to 'strict' when entering the alignment logic and pushing this logic as a separate draft PR (#8729) to see the failing tests.
Edit: Results on the "bare-minimum" CI job: 96 Failed. It seems that in many cases, the leniency is desirable (eg many tests containing the string broadcast
in them).
Example: test_cat_broadcast_left
.
Stacktrace
/home/me/dev/xarray/xarray/tests/test_accessor_str.py::test_cat_broadcast_left[bytes] failed: dtype = <class 'numpy.bytes_'>
def test_cat_broadcast_left(dtype) -> None:
values_1 = xr.DataArray(
["a", "bb", "cccc"],
dims=["Y"],
).astype(dtype)
values_2 = xr.DataArray(
[["11111", "222", "33"], ["4", "5555", "66"]],
dims=["X", "Y"],
)
targ_blank = (
xr.DataArray(
[["a11111", "bb222", "cccc33"], ["a4", "bb5555", "cccc66"]],
dims=["X", "Y"],
)
.astype(dtype)
.T
)
targ_space = (
xr.DataArray(
[["a 11111", "bb 222", "cccc 33"], ["a 4", "bb 5555", "cccc 66"]],
dims=["X", "Y"],
)
.astype(dtype)
.T
)
targ_bars = (
xr.DataArray(
[["a||11111", "bb||222", "cccc||33"], ["a||4", "bb||5555", "cccc||66"]],
dims=["X", "Y"],
)
.astype(dtype)
.T
)
targ_comma = (
xr.DataArray(
[["a, 11111", "bb, 222", "cccc, 33"], ["a, 4", "bb, 5555", "cccc, 66"]],
dims=["X", "Y"],
)
.astype(dtype)
.T
)
> res_blank = values_1.str.cat(values_2)
/home/me/dev/xarray/xarray/tests/test_accessor_str.py:3319:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/home/me/dev/xarray/xarray/core/accessor_str.py:508: in cat
return self._apply(
/home/me/dev/xarray/xarray/core/accessor_str.py:232: in _apply
return _apply_str_ufunc(
/home/me/dev/xarray/xarray/core/accessor_str.py:130: in _apply_str_ufunc
return apply_ufunc(
/home/me/dev/xarray/xarray/core/computation.py:1270: in apply_ufunc
return apply_dataarray_vfunc(
/home/me/dev/xarray/xarray/core/computation.py:295: in apply_dataarray_vfunc
deep_align(
/home/me/dev/xarray/xarray/core/alignment.py:977: in deep_align
aligned = align(
/home/me/dev/xarray/xarray/core/alignment.py:913: in align
aligner.align()
/home/me/dev/xarray/xarray/core/alignment.py:594: in align
self.assert_equal_dimension_names()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <xarray.core.alignment.Aligner object at 0x7f1e824e00a0>
def assert_equal_dimension_names(self) -> None:
# Strict mode only allows objects having the exact same dimensions' names.
if not self.join == "strict":
return
unique_dims = set(tuple(o.sizes) for o in self.objects)
all_objects_have_same_dims = len(unique_dims) == 1
if not all_objects_have_same_dims:
> raise ValueError(
f"cannot align objects with join='strict' "
f"because given objects do not share the same dimension names ({[tuple(o.sizes) for o in self.objects]!r}); "
f"try using join='exact' if you only care about equal indexes"
)
E ValueError: cannot align objects with join='strict' because given objects do not share the same dimension names ([('Y',), ('X', 'Y')]); try using join='exact' if you only care about equal indexes
/home/me/dev/xarray/xarray/core/alignment.py:498: ValueError
OK, I think I see what's going on now! Sorry I was slow. I appreciate you doing so much to make it easy to engage with the issue. So when we say "'exact' allows shape-compatible DataArrays to be aligned with differing dimension names, whereas 'strict' forbids.", the existing behavior doesn't align dimensions with different names — it just ignores them. My mental model has that as correct behavior — I haven't observed much demand for "only align these arrays if all the dimensions match" Is that consistent with your view? The change I would support is #6806, which would make Does that make sense? Lmk if I'm still missing something. |
Hello, Thank your for your time giving me feedback on this PR! ⁂
Yes, they are not considered. It is like manipulating raw numpy arrays: the typing is structural and not nominal, I would say. This situation reminds me of the structural and nominal terminology used to describe type systems. So even if the two DataArrays are structurally equivalent, the 'strict' mode would fail because of the difference in names. Definitely the Python community is more leaning towards structural typing: operations on numpy arrays rely on the structure of arrays, PEP 544 – Protocols: Structural subtyping (static duck typing), etc. But sometimes, having the safety of nominal typing can be useful. In a way, this is what xarray seems to aim to address by naming dimensions. Indeed, it is common to get confused with I found this interesting quote on the Wikipedia page of Structural type system:
This issue is also well-known in the TypeScript world. See How do I prevent two types from being structurally compatible? So to summarize, my mental model is the following:
⁂
I understood that the issue described in #6806 (comment) was related to arrays having the same structure but different dimension names being successfully aligned, not related to dimension sizes. Also, I understand that implementing an option to use 'strict' by default would solve the scenario described in #6806 (comment) (not implemented in this PR) I added new test to experiment. Here is a test matrix result of the current implementation:
(E1): Structural error
(E2): Nominal error (check happens before the structural error has a chance to happen)
Sorry, this starts to become very long! |
(Great! I asked a question at #6806 to clarify the request) |
…array into eschalk/issue-8231-align
xarray/core/computation.py
Outdated
@@ -969,6 +974,8 @@ def apply_ufunc( | |||
dimensions as input and vectorize it automatically with | |||
:py:func:`numpy.vectorize`. This option exists for convenience, but is | |||
almost always slower than supplying a pre-vectorized function. | |||
broadcast : bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should skip apply_ufunc
for now (and just ignore arithmetic_broadcast
in dot
)
Seems like there are some confusing implications to implementing it. See #1618 for an alternative approach.
@@ -91,26 +94,35 @@ def _positive_integer(value: int) -> bool: | |||
return isinstance(value, int) and value > 0 | |||
|
|||
|
|||
def _is_boolean(value: Any) -> bool: | |||
return isinstance(value, bool) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor comment: This is a matter of taste but this feels like a fairly tiny benefit for the added indirection. The diff would be a lot smaller too, if you just followed the existing copy-paste approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried to apply "opportunistic refactoring", since I was updating this zone of the code. Currently, there is a mixed approach: lambdas are used, but a function _positive_integer
already exists. So I tried to align all the existing checks to _positive_integer
. It looks at least to me a little bit cleaner to refer multiple times one function rather than define multiple duplicated lambdas.
I understand however the main downside of it adding unrelated noise to the diff, sorry for that. But now that the change is made, should I revert it or keep it?
) | ||
|
||
|
||
@pytest.mark.parametrize("join", ("outer", "exact")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets merge this with the previous test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are not exactly the same, the previous test test_concat_join_coordinate_variables_non_asked_dims
tests coordinate variables while this one test_concat_join_non_coordinate_variables
tests non-coordinate variables, also the parametrization is not possible with the previous test as the behaviour differs when using join=outer vs join=exact for coordinate variables
Co-authored-by: Deepak Cherian <[email protected]>
…array into eschalk/issue-8231-align
@@ -282,6 +282,7 @@ def apply_dataarray_vfunc( | |||
*args, | |||
signature: _UFuncSignature, | |||
join: JoinOptions = "inner", | |||
broadcast: bool = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being unclear (again). I don't think we need this in align
at all.
We should simply be checking OPTIONS["arithmetic_broadcast"]
in Variable._binary_op
. The whole business with align
was a misunderstanding of mine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will close this PR as it diverged too much from the original wanted behavior
Title: New alignment option:
join='strict'
whats-new.rst
api.rst
Motive
This PR is motivated by solving of the following issues:
The current PR does not solve the unexpected issue described in #8231 without a change in user-code. Indeed, in the tests written, it is shown that to get the said expected behavior, the user would have to use the new
join='strict'
mode suggested in #6806 for the concatenation operation. Only in that case, the uniqueness of the indexed dimensions' names will be checked, re-using the same logic that was already applied forjoin='override'
inAligner.find_matching_indexes
This may not be enough to fix #8231. If that isn't, I can split the PR into two, first one for adding the
join='strict'
for #6806 and later on one for #8321.Technical Details
I try to detail here my thought process. Please correct me if there is anything wrong. This is my first time digging into this core logic!
Here is my understanding of the terms:
Input data for Scenario 1, tested in
test_concat_join_coordinate_variables_non_asked_dims
Input data for Scenario 2, tested in
test_concat_join_non_coordinate_variables
Logic for non-indexed dimensions logic was working "as expected", as it relies on
Aligner.assert_unindexed_dim_sizes_equal
, checking that unindexed dimension sizes are equal as its name suggests. (Scenario 1)However, the logic for indexed dimensions was surprising as such an expected check on dimensions' sizes was not performed. A check exists in
Aligner.find_matching_indexes
but was only applied tojoin='override'
. Applying it forjoin='strict'
too is suggested in this Pull Request.