Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New alignment option: join='strict' #8698

Closed

Conversation

etienneschalk
Copy link
Contributor

@etienneschalk etienneschalk commented Feb 3, 2024

Title: New alignment option: join='strict'

Motive

This PR is motivated by solving of the following issues:

The current PR does not solve the unexpected issue described in #8231 without a change in user-code. Indeed, in the tests written, it is shown that to get the said expected behavior, the user would have to use the new join='strict' mode suggested in #6806 for the concatenation operation. Only in that case, the uniqueness of the indexed dimensions' names will be checked, re-using the same logic that was already applied for join='override' in Aligner.find_matching_indexes

This may not be enough to fix #8231. If that isn't, I can split the PR into two, first one for adding the join='strict' for #6806 and later on one for #8321.

Technical Details

I try to detail here my thought process. Please correct me if there is anything wrong. This is my first time digging into this core logic!

Here is my understanding of the terms:

  • An indexed dimension is attached to a coordinate variable
  • An unindexed dimension is not attached to a coordinate variable ("Dimensions without coordinates")

Input data for Scenario 1, tested in test_concat_join_coordinate_variables_non_asked_dims

    ds1 = Dataset(
        coords={
            "x_center": ("x_center", [1, 2, 3]),
            "x_outer": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
        },
    )

    ds2 = Dataset(
        coords={
            "x_center": ("x_center", [4, 5, 6]),
            "x_outer": ("x_outer", [4.5, 5.5, 6.5]),
        },
    )

Input data for Scenario 2, tested in test_concat_join_non_coordinate_variables

    ds1 = Dataset(
        data_vars={
            "a": ("x_center", [1, 2, 3]),
            "b": ("x_outer", [0.5, 1.5, 2.5, 3.5]),
        },
    )

    ds2 = Dataset(
        data_vars={
            "a": ("x_center", [4, 5, 6]),
            "b": ("x_outer", [4.5, 5.5, 6.5]),
        },
    )

Logic for non-indexed dimensions logic was working "as expected", as it relies on Aligner.assert_unindexed_dim_sizes_equal, checking that unindexed dimension sizes are equal as its name suggests. (Scenario 1)

However, the logic for indexed dimensions was surprising as such an expected check on dimensions' sizes was not performed. A check exists in Aligner.find_matching_indexes but was only applied to join='override'. Applying it for join='strict' too is suggested in this Pull Request.

Copy link

welcome bot commented Feb 3, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

@etienneschalk
Copy link
Contributor Author

CI Failed because of multiple timeouts being > 180s:

https://github.com/pydata/xarray/actions/runs/7768517575/job/21186540956?pr=8698

Does it happen sometimes? If so, is it possible to re-trigger the failed pipeline?

Thanks!

Copy link
Collaborator

@max-sixty max-sixty left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry this didn't get an earlier review...

I left one question.

Others know more about this area so they should feel free to comment; and it's a big enough change that I'll ask others to approve before merging

xarray/tests/test_concat.py Outdated Show resolved Hide resolved
Comment on lines 3197 to 3204
def test_align_exact_vs_strict(self) -> None:
xda_1 = xr.DataArray([1], dims="x1")
xda_2 = xr.DataArray([1], dims="x2")

# join='exact' passes
aligned_1, aligned_2 = xr.align(xda_1, xda_2, join="exact")
assert aligned_1 == xda_1
assert aligned_2 == xda_2
Copy link
Collaborator

@max-sixty max-sixty Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[edited from earlier incorrect response]

The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?

Another option would be refining exact. It sounds like you tried this but many tests failed. It might be worth pushing that PR if you still have it.

I'd want to ask others why we don't enforce identical dimension names...

Copy link
Contributor Author

@etienneschalk etienneschalk Feb 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The existing behavior does seem quite surprising. Is it only an issue with 1D arrays?

I added a test with 2D arrays, they are concerned too

What I can do as a simple test is to systematically transform 'exact' to 'strict' when entering the alignment logic and pushing this logic as a separate draft PR (#8729) to see the failing tests.

Edit: Results on the "bare-minimum" CI job: 96 Failed. It seems that in many cases, the leniency is desirable (eg many tests containing the string broadcast in them).

Example: test_cat_broadcast_left.

Stacktrace
/home/me/dev/xarray/xarray/tests/test_accessor_str.py::test_cat_broadcast_left[bytes] failed: dtype = <class 'numpy.bytes_'>

    def test_cat_broadcast_left(dtype) -> None:
        values_1 = xr.DataArray(
            ["a", "bb", "cccc"],
            dims=["Y"],
        ).astype(dtype)
        values_2 = xr.DataArray(
            [["11111", "222", "33"], ["4", "5555", "66"]],
            dims=["X", "Y"],
        )
    
        targ_blank = (
            xr.DataArray(
                [["a11111", "bb222", "cccc33"], ["a4", "bb5555", "cccc66"]],
                dims=["X", "Y"],
            )
            .astype(dtype)
            .T
        )
    
        targ_space = (
            xr.DataArray(
                [["a 11111", "bb 222", "cccc 33"], ["a 4", "bb 5555", "cccc 66"]],
                dims=["X", "Y"],
            )
            .astype(dtype)
            .T
        )
    
        targ_bars = (
            xr.DataArray(
                [["a||11111", "bb||222", "cccc||33"], ["a||4", "bb||5555", "cccc||66"]],
                dims=["X", "Y"],
            )
            .astype(dtype)
            .T
        )
    
        targ_comma = (
            xr.DataArray(
                [["a, 11111", "bb, 222", "cccc, 33"], ["a, 4", "bb, 5555", "cccc, 66"]],
                dims=["X", "Y"],
            )
            .astype(dtype)
            .T
        )
    
>       res_blank = values_1.str.cat(values_2)

/home/me/dev/xarray/xarray/tests/test_accessor_str.py:3319: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
/home/me/dev/xarray/xarray/core/accessor_str.py:508: in cat
    return self._apply(
/home/me/dev/xarray/xarray/core/accessor_str.py:232: in _apply
    return _apply_str_ufunc(
/home/me/dev/xarray/xarray/core/accessor_str.py:130: in _apply_str_ufunc
    return apply_ufunc(
/home/me/dev/xarray/xarray/core/computation.py:1270: in apply_ufunc
    return apply_dataarray_vfunc(
/home/me/dev/xarray/xarray/core/computation.py:295: in apply_dataarray_vfunc
    deep_align(
/home/me/dev/xarray/xarray/core/alignment.py:977: in deep_align
    aligned = align(
/home/me/dev/xarray/xarray/core/alignment.py:913: in align
    aligner.align()
/home/me/dev/xarray/xarray/core/alignment.py:594: in align
    self.assert_equal_dimension_names()
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <xarray.core.alignment.Aligner object at 0x7f1e824e00a0>

    def assert_equal_dimension_names(self) -> None:
        # Strict mode only allows objects having the exact same dimensions' names.
        if not self.join == "strict":
            return
    
        unique_dims = set(tuple(o.sizes) for o in self.objects)
        all_objects_have_same_dims = len(unique_dims) == 1
        if not all_objects_have_same_dims:
>           raise ValueError(
                f"cannot align objects with join='strict' "
                f"because given objects do not share the same dimension names ({[tuple(o.sizes) for o in self.objects]!r}); "
                f"try using join='exact' if you only care about equal indexes"
            )
E           ValueError: cannot align objects with join='strict' because given objects do not share the same dimension names ([('Y',), ('X', 'Y')]); try using join='exact' if you only care about equal indexes

/home/me/dev/xarray/xarray/core/alignment.py:498: ValueError

@max-sixty
Copy link
Collaborator

OK, I think I see what's going on now! Sorry I was slow. I appreciate you doing so much to make it easy to engage with the issue.

So when we say "'exact' allows shape-compatible DataArrays to be aligned with differing dimension names, whereas 'strict' forbids.", the existing behavior doesn't align dimensions with different names — it just ignores them.

My mental model has that as correct behavior — I haven't observed much demand for "only align these arrays if all the dimensions match"

Is that consistent with your view?

The change I would support is #6806, which would make strict raise an error if one object has a dimension with size 1 and another has a dimension with size n. Currently exact allows that.

Does that make sense? Lmk if I'm still missing something.

@etienneschalk
Copy link
Contributor Author

Hello,

Thank your for your time giving me feedback on this PR!

the existing behavior doesn't align dimensions with different names — it just ignores them.

Yes, they are not considered. It is like manipulating raw numpy arrays: the typing is structural and not nominal, I would say. This situation reminds me of the structural and nominal terminology used to describe type systems. So even if the two DataArrays are structurally equivalent, the 'strict' mode would fail because of the difference in names. Definitely the Python community is more leaning towards structural typing: operations on numpy arrays rely on the structure of arrays, PEP 544 – Protocols: Structural subtyping (static duck typing), etc.

But sometimes, having the safety of nominal typing can be useful. In a way, this is what xarray seems to aim to address by naming dimensions. Indeed, it is common to get confused with x, y, row, col, latitude, longitude conventions when working with georeferenced rasters, for instance. Structure alone is in such cases insufficient to work confidently with data, and xarray helps with that.

I found this interesting quote on the Wikipedia page of Structural type system:

A pitfall of structural typing versus nominative typing is that two separately defined types intended for different purposes, but accidentally holding the same properties (e.g. both composed of a pair of integers), could be considered the same type by the type system, simply because they happen to have identical structure. One way this can be avoided is by creating one algebraic data type for each use.

This issue is also well-known in the TypeScript world. See How do I prevent two types from being structurally compatible?

So to summarize, my mental model is the following:

  • 'exact' is structural
  • 'strict' is nominal

The change I would support is #6806, which would make strict raise an error if one object has a dimension with size 1 and another has a dimension with size n. Currently exact allows that.

I understood that the issue described in #6806 (comment) was related to arrays having the same structure but different dimension names being successfully aligned, not related to dimension sizes.

Also, I understand that implementing an option to use 'strict' by default would solve the scenario described in #6806 (comment) (not implemented in this PR)

I added new test to experiment. Here is a test matrix result of the current implementation:

test_align_exact_vs_strict_* dim names Nominally equivalent (same dim names) dim sizes Structurally equivalent (same dim sizes) exact strict Error message
*_same_dim_same_size x, x Yes 1, 1 Yes 🆗 🆗 -
*_same_dim_differing_sizes x, x Yes 1, 2 No (E1)
*_differing_dims_same_sizes x1, x2 No 1, 1 Yes 🆗 (E2)
*_differing_dims_differing_sizes x1, x2 No 1, 2 No 🆗 (E2)

(E1): Structural error

"cannot reindex or align along dimension 'x' because of "
"conflicting dimension sizes: {1, 2}"

(E2): Nominal error (check happens before the structural error has a chance to happen)

"cannot align objects with join='strict' "
"because given objects do not share the same dimension names "
"([('x1',), ('x2',)])"

Sorry, this starts to become very long!

@max-sixty
Copy link
Collaborator

(Great! I asked a question at #6806 to clarify the request)

doc/whats-new.rst Outdated Show resolved Hide resolved
xarray/core/alignment.py Outdated Show resolved Hide resolved
xarray/core/alignment.py Outdated Show resolved Hide resolved
xarray/core/alignment.py Outdated Show resolved Hide resolved
xarray/core/alignment.py Outdated Show resolved Hide resolved
xarray/core/combine.py Outdated Show resolved Hide resolved
@@ -969,6 +974,8 @@ def apply_ufunc(
dimensions as input and vectorize it automatically with
:py:func:`numpy.vectorize`. This option exists for convenience, but is
almost always slower than supplying a pre-vectorized function.
broadcast : bool
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should skip apply_ufunc for now (and just ignore arithmetic_broadcast in dot)

Seems like there are some confusing implications to implementing it. See #1618 for an alternative approach.

@@ -91,26 +94,35 @@ def _positive_integer(value: int) -> bool:
return isinstance(value, int) and value > 0


def _is_boolean(value: Any) -> bool:
return isinstance(value, bool)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comment: This is a matter of taste but this feels like a fairly tiny benefit for the added indirection. The diff would be a lot smaller too, if you just followed the existing copy-paste approach.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to apply "opportunistic refactoring", since I was updating this zone of the code. Currently, there is a mixed approach: lambdas are used, but a function _positive_integer already exists. So I tried to align all the existing checks to _positive_integer. It looks at least to me a little bit cleaner to refer multiple times one function rather than define multiple duplicated lambdas.
I understand however the main downside of it adding unrelated noise to the diff, sorry for that. But now that the change is made, should I revert it or keep it?

)


@pytest.mark.parametrize("join", ("outer", "exact"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets merge this with the previous test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are not exactly the same, the previous test test_concat_join_coordinate_variables_non_asked_dims tests coordinate variables while this one test_concat_join_non_coordinate_variables tests non-coordinate variables, also the parametrization is not possible with the previous test as the behaviour differs when using join=outer vs join=exact for coordinate variables

@@ -282,6 +282,7 @@ def apply_dataarray_vfunc(
*args,
signature: _UFuncSignature,
join: JoinOptions = "inner",
broadcast: bool = True,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for being unclear (again). I don't think we need this in align at all.

We should simply be checking OPTIONS["arithmetic_broadcast"] in Variable._binary_op . The whole business with align was a misunderstanding of mine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will close this PR as it diverged too much from the original wanted behavior

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants