Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add weight threshold option for spatial averaging #672

Open
wants to merge 8 commits into
base: main
Choose a base branch
from

Conversation

pochedls
Copy link
Collaborator

@pochedls pochedls commented Jun 28, 2024

Description

This PR adds an optional argument that requires a minimum fraction of data be available to perform a spatial average. The initial PR is for spatial averaging only (it would need to be expanded to handle temporal averaging).

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

@github-actions github-actions bot added the type: enhancement New enhancement request label Jun 28, 2024
Copy link

codecov bot commented Jun 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 100.00%. Comparing base (27396e5) to head (9331a02).

Additional details and impacted files
@@            Coverage Diff            @@
##              main      #672   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           15        15           
  Lines         1609      1634   +25     
=========================================
+ Hits          1609      1634   +25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

xcdat/spatial.py Outdated
# zero out cells with missing values in data_var
weights = xr.where(~np.isnan(data_var), weights, 0)
# sum all weights (including zero for missing values)
weight_sum_masked = weights.sum(dim=dim) # type: ignore
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needed to add type: ignore for unclear reasons...

Copy link
Collaborator

@tomvothecoder tomvothecoder Jul 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think dim is expected to be a string but we pass a Hashable (dict key).

# type: ignore is fine here or set dim=str(dim) to remove that comment

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a string. The loop above creates a list of strings (e.g., ["lat", "lon"]), which is supposedly okay according to xarray docs.

From mypy:

error: Argument "dim" to "sum" of "DataArrayAggregations" has incompatible type "list[str | list[str]]"; expected "str | Collection[Hashable] | ellipsis | None" [arg-type]

I tried tuple(dims), but that didn't work either. I'll leave this alone for now (but if you know how to fix the mypy complaint, please do!).

Copy link
Collaborator

@tomvothecoder tomvothecoder Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just keep # type: ignore. The docstring for dim says:

Name of dimension[s] along which to apply sum. For e.g. dim="x" or dim=["x", "y"]. If “…” or None, will reduce over all dimensions.

Although the type annotation for dim is Dims = Union[str, Collection[Hashable], "ellipsis", None], the example includes a list (dim=["x", "y"]).

xcdat/spatial.py Outdated
@@ -716,6 +725,9 @@ def _averager(self, data_var: xr.DataArray, axis: List[SpatialAxis]):
Data variable inside a Dataset.
axis : List[SpatialAxis]
List of axis dimensions to average over.
required_weight : optional, float
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is required_weight a good parameter name? What should the default be (currently zero to make default behavior backwards compatible).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe required_weight_pct? required_weight sounds like a boolean parameter.

I think 0 is the appropriate default value.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I like None as a default value to indicate the arg is not set by default.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is okay with me (given my comment above). I prefer required_weight (since this isn't a percentage, shorter is better, and this doesn't come across as a boolean to me – I think require_weight sounds more Boolean-like than required_weight).

How about the alternative: minimum_weight?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how required_weight doesn't sound like a boolean. I like minimum_weight, it sounds clearer. How about min_weight?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

min_weight is good with me!

xcdat/spatial.py Show resolved Hide resolved
Comment on lines 291 to 309
def test_spatial_average_with_required_weight_as_None(self):
ds = self.ds.copy()

result = ds.spatial.average(
"ts",
axis=["X", "Y"],
lat_bounds=(-5.0, 5),
lon_bounds=(-170, -120.1),
required_weight=None,
)

expected = self.ds.copy()
expected["ts"] = xr.DataArray(
data=np.array([2.25, 1.0, 1.0]),
coords={"time": expected.time},
dims="time",
)

xr.testing.assert_allclose(result, expected)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test is related to the code chunk in _averager that deals with the case where required_weight=None. I don't know why it would ever be None (it is supposed to be Optional[float] = 0., so I don't know how it could be None). If you don't have those conditionals you get an issue (maybe from mypy) about comparing None to int | float with the lines that check if required_weight > 0..

So if we could ensure that required_weight is always of type float we could remove this test and other conditionals in _averager. That would be ideal...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Optional indicates that the argument can also be None. You can just specify float as the type annotation if None is not expected.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like None as the default value

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. From the docs:

Note that this is not the same concept as an optional argument, which is one that has a default.

None is converted to 0. later. I guess None might feel different to the user than 0., so I'll leave this alone.

@tomvothecoder
Copy link
Collaborator

I'm going to open a separate PR to address this enhancement with the temporal APIs because those require more work and we can merge this earlier when it is ready.

@tomvothecoder tomvothecoder changed the title Set (optional) weight threshold for averaging operations Add weight threshold option for spatial averaging Jul 30, 2024
xcdat/spatial.py Outdated Show resolved Hide resolved
@tomvothecoder
Copy link
Collaborator

I started PR #683 for temporal operations. I used some of your code and split them up into reusable functions. We can think about making these functions generalizable across the spatial and temporal classes.

Check them out here: https://github.com/xCDAT/xcdat/pull/683/files

Copy link
Collaborator Author

@pochedls pochedls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review review and apply suggestions.

Comment on lines 291 to 309
def test_spatial_average_with_required_weight_as_None(self):
ds = self.ds.copy()

result = ds.spatial.average(
"ts",
axis=["X", "Y"],
lat_bounds=(-5.0, 5),
lon_bounds=(-170, -120.1),
required_weight=None,
)

expected = self.ds.copy()
expected["ts"] = xr.DataArray(
data=np.array([2.25, 1.0, 1.0]),
coords={"time": expected.time},
dims="time",
)

xr.testing.assert_allclose(result, expected)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. From the docs:

Note that this is not the same concept as an optional argument, which is one that has a default.

None is converted to 0. later. I guess None might feel different to the user than 0., so I'll leave this alone.

xcdat/spatial.py Outdated
@@ -716,6 +725,9 @@ def _averager(self, data_var: xr.DataArray, axis: List[SpatialAxis]):
Data variable inside a Dataset.
axis : List[SpatialAxis]
List of axis dimensions to average over.
required_weight : optional, float
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None is okay with me (given my comment above). I prefer required_weight (since this isn't a percentage, shorter is better, and this doesn't come across as a boolean to me – I think require_weight sounds more Boolean-like than required_weight).

How about the alternative: minimum_weight?

xcdat/spatial.py Show resolved Hide resolved
xcdat/spatial.py Outdated
# zero out cells with missing values in data_var
weights = xr.where(~np.isnan(data_var), weights, 0)
# sum all weights (including zero for missing values)
weight_sum_masked = weights.sum(dim=dim) # type: ignore
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this should be a string. The loop above creates a list of strings (e.g., ["lat", "lon"]), which is supposedly okay according to xarray docs.

From mypy:

error: Argument "dim" to "sum" of "DataArrayAggregations" has incompatible type "list[str | list[str]]"; expected "str | Collection[Hashable] | ellipsis | None" [arg-type]

I tried tuple(dims), but that didn't work either. I'll leave this alone for now (but if you know how to fix the mypy complaint, please do!).

@tomvothecoder
Copy link
Collaborator

I've created general functions for weight thresholds here. You can copy them over to utils.py in your branch and then use them in the SpatialAccessor class as needed.

xcdat/spatial.py Outdated
Comment on lines 760 to 761
# need weights to match data_var dimensionality
if minimum_weight > 0.0:
weights, data_var = xr.broadcast(weights, data_var)
Copy link
Collaborator

@tomvothecoder tomvothecoder Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can remove this line.

Per docs for xr.broadcast:

xarray objects automatically broadcast against each other in arithmetic operations, so this function should not be necessary for normal use.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the weights are probably of dimensions [lat, lon] and this casts them to [time, lat, lon]. That way when you sum them below you get [time] (instead of a dimensionless scalar) which is then used to figure out which values do not meet the weight threshold and need to be masked.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Xarray should be handling the automatic broadcasting when using the xr.where() API to mask the data using weights. Otherwise, it would probably break with an error that the dimensions don't align.

According to the docs for xr.where():

Performs xarray-like broadcasting across input arguments.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The unit tests pass with this line commented out. I also didn't implement xr.broadcast() explicitly in PR #683.

@tomvothecoder tomvothecoder force-pushed the feature/531_min_weight_for_average branch from b56befe to 34b570d Compare September 5, 2024 17:47
@tomvothecoder
Copy link
Collaborator

tomvothecoder commented Sep 5, 2024

Rebased this branch on latest main and pushed commit 34b570d (#672) with code review updates (refer to commit message for list of changes).

pochedls and others added 7 commits November 21, 2024 12:30
- Rename arg `minimum_weight` to `min_weight`
- Add `_get_masked_weights()` and `_validate_min_weight()` to `utils.py`
- Update `SpatialAccessor` to use `_get_masked_weights()` and `_validate_min_weight()`
- Replace type annotation `Optional` with `|`
- Extract `_mask_var_with_with_threshold()` from `_averager()` for readability
@tomvothecoder tomvothecoder force-pushed the feature/531_min_weight_for_average branch from 7c27e6f to 82cbfe2 Compare November 21, 2024 20:32
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pochedls Your PR is ready for review again.

I actually think it is okay to keep this PR and #683 as separate PRs. I've ported the utilities over from #683 and updated spatial.py to use them.

Comment on lines +757 to +761
# TODO: This conditional might not be needed because Xarray will
# automatically broadcast the weights to the data variable for
# operations such as .mean() and .where().
if min_weight > 0.0:
weights, data_var = xr.broadcast(weights, data_var)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# TODO: This conditional might not be needed because Xarray will
# automatically broadcast the weights to the data variable for
# operations such as .mean() and .where().
if min_weight > 0.0:
weights, data_var = xr.broadcast(weights, data_var)

Xarray will automatically broadcast to align the shapes of weights and data_var.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

[Enhancement]: Add weight threshold option for averaging operations
2 participants