
Extend padding functionalities #9353

Merged
merged 16 commits into from
Aug 21, 2024

Conversation

tsanona
Contributor

@tsanona tsanona commented Aug 13, 2024

While working on a project I realized that in the dataset padding function one can set different constant_values for each dimension but not for each variable. I've added some simple code to allow this, one can now also define padding differently for each data variable. I also extended the tests to include these examples.

Let me know if I can do any changes to better fit your standards.

  • Tests added


welcome bot commented Aug 13, 2024

Thank you for opening this pull request! It may take us a few days to respond here, so thank you for being patient.
If you have questions, some answers may be found in our contributing guidelines.

Collaborator

@max-sixty max-sixty left a comment


Thanks @tsanona, this is good. I added some comments to try to make it simpler to understand, but it's good functionality!

| Mapping[
Any, float | tuple[float, float] | Mapping[Any, tuple[float, float]]
]
| None
Collaborator

This is a pretty gnarly type signature. Same with the docstring. But I don't think there's much we can do to simplify it.

We possibly could make float | tuple[float, float] | Mapping[Any, tuple[float, float]] into an alias?

Contributor Author

@tsanona tsanona Aug 14, 2024


Yeah that's true, I'll do that.

Contributor Author

@tsanona tsanona Aug 14, 2024


So, I changed it slightly. I created these aliases:

T_PadConstantValues = float | tuple[float, float]
T_VarPadConstantValues = T_PadConstantValues | Mapping[Any, T_PadConstantValues]
T_DatasetPadConstantValues = T_VarPadConstantValues | Mapping[Any, T_VarPadConstantValues]

This way it is a bit more organized. Also changed the implementation of pad for variables so that it can make use of Mapping[Any, float] to keep consistency with the pad function for datasets.

What do you think?
(Also let me know if the location in types.py and import in dataset.py is correct. I don't fully understand when to import types under the if TYPE_CHECKING: block or above in dataset.py)

Collaborator

Looks good, thanks!

(others may have refinements...)

constant_values : scalar, tuple, mapping of hashable to tuple or
    mapping of hashable to mapping of hashable to tuple, default: 0
    Used in 'constant'. The values to set the padded values for each
    axis.
Collaborator

Not necessarily on this PR but do you happen to know why the default is listed as 0 here but is None in the func signature?

Contributor Author

@tsanona tsanona Aug 14, 2024


Hm, not sure what happened there, but it is a good point, since if left at None the arrays are padded with np.nan. Should I change the default to NaN or None?

Collaborator

Or to 0?

Contributor Author

Ah you mean changing the default and not the docstring?

Collaborator

Ah sorry, so what's the actual value that goes onto the dataset, by default? I had originally thought it was 0, but does None / NaN actually get placed on the dataset? (I realize it's listed as the default in the signature)

Contributor Author

I also wasn't sure when you first asked, so I created a test for the default state, and it does indeed populate the padding with np.nan when constant_values is left at None. So it is confusing to have default: 0 in the docstring. We can either change the docstring to None or NaN, or change the default in the function signature to 0 rather than None. I'm not sure which would be better; what do you think?
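A quick way to see this behaviour (a sketch; assumes a recent xarray where constant_values defaults to None, as in the signature under discussion):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"a": ("x", [1.0, 2.0])})

# Pad one element on each side of "x", leaving constant_values at its default.
padded = ds.pad(x=(1, 1))

# The new cells are filled with NaN, not 0:
print(padded["a"].values)
```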

Collaborator

Please change the docstrings, otherwise that's a breaking change!

Contributor Author

@tsanona tsanona Aug 16, 2024


Ah of course, I'll change it to default: None in the docstring.
I'm also changing it for end_values, which seems to have the same problem.
I also just realized it must have been copied from the docstring of np.pad, since it looks quite similar and np.pad's actual default is 0.
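For comparison, np.pad really does default to 0 for constant padding, which is likely where the copied docstring wording came from:

```python
import numpy as np

a = np.array([1, 2, 3])

# np.pad's constant_values defaults to 0, unlike the None/NaN behaviour
# of the xarray signature discussed above.
print(np.pad(a, (1, 1), mode="constant"))  # → [0 1 2 3 0]
```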

Collaborator

Great, thank you!

tsanona and others added 3 commits August 14, 2024 16:18
…taset pad func. enforce number values to be converted to tuples for all in `_pad_options_dim_to_index`. make variable pad funtion consistent with dataset. extend tests
@tsanona tsanona requested a review from max-sixty August 15, 2024 08:40
Collaborator

@headtr1ck headtr1ck left a comment

Very useful, thanks. I needed that already a few times and ended up writing my own wrapper.

@@ -9293,6 +9294,12 @@ def pad(
if not pad_dims.intersection(xindexes.get_all_dims(k)):
indexes[k] = idx

per_data_var_constant_values = {}
if isinstance(constant_values, dict):
Collaborator

The typing claims that any mapping works, but here you are only checking dicts.
I would propose to change to utils.is_dict_like and find an alternative to pop (Mapping does not define pop because it is read-only). The simplest way would be to transform it into a dict first.
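A sketch of the suggested approach: check for any dict-like input and build new dicts instead of calling pop, so the caller's Mapping is never mutated. The helper name and its arguments are hypothetical, not the PR's actual code; isinstance(…, Mapping) stands in for utils.is_dict_like.

```python
from collections.abc import Mapping


def split_per_variable(constant_values, var_names):
    """Separate per-variable entries from the rest without mutating the input."""
    if not isinstance(constant_values, Mapping):  # stand-in for utils.is_dict_like
        return {}, constant_values
    per_var = {k: v for k, v in constant_values.items() if k in var_names}
    rest = {k: v for k, v in constant_values.items() if k not in var_names}
    return per_var, rest


cv = {"var1": 9.0, "x": (0.0, 1.0)}
per_var, rest = split_per_variable(cv, var_names={"var1", "var2"})
print(per_var, rest)  # {'var1': 9.0} {'x': (0.0, 1.0)}
print(cv)             # the original mapping is unchanged
```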

Contributor Author

That's a fair point, to be honest I just followed the way it is done in the implementation of pad for the variables in variable.py. Maybe I should just change all these type hints to dict rather than Mapping. What do you think?

Collaborator

From a typing perspective dict is always problematic because it is invariant.
I would try to stick with Mapping instead.
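A small illustration of the variance point: dict is invariant in both of its type parameters, while Mapping is covariant in its value type, so Mapping annotations accept more callers. The function names here are hypothetical:

```python
from collections.abc import Mapping


def accepts_mapping(m: Mapping[str, float]) -> float:
    return sum(m.values())


def accepts_dict(d: dict[str, float]) -> float:
    return sum(d.values())


ints: dict[str, int] = {"a": 1, "b": 2}

# A static checker accepts this call: Mapping is covariant in its value type.
print(accepts_mapping(ints))

# But it flags accepts_dict(ints): dict[str, int] is not dict[str, float],
# because dict is invariant (a callee could insert a float into our int dict).
```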

Contributor Author

@tsanona tsanona Aug 16, 2024


Ah okay, so I'll just change this one isinstance(constant_values, dict) to utils.is_dict_like(constant_values). Thanks for the clarifications!

Collaborator

Yes and no.
For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable).
Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).
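To the parenthetical: the Mapping ABC does define get (along with keys, items, values, and __getitem__); it only omits the mutating methods such as pop, which live on MutableMapping. A quick check with a read-only mapping:

```python
from collections.abc import Mapping
from types import MappingProxyType

# A read-only mapping: supports the Mapping ABC but none of the mutators.
m: Mapping[str, int] = MappingProxyType({"x": 1})

print(m.get("x"))         # 1 — get is part of the Mapping ABC
print(m.get("y", 0))      # 0 — get with a default
print(hasattr(m, "pop"))  # False — pop belongs to MutableMapping
```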

Collaborator

@keewis keewis Aug 16, 2024

I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:

xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")

so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.


In general, I believe we need to be thorough in defining the new feature:

  • should we also allow padding coordinate variables?
  • if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?
  • what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

Contributor Author

> Yes and no. For the check it should suffice, but the general Mapping class has no pop defined because it is read-only (non-mutable). Anyway, I guess the user would find it weird that this input is changed in-place, so remove the pop method and replace it with a get item call (not sure if the Mapping ABC defines a get method).

That's a very fair point; I always forget that about dicts 🙃. I'll figure out a different way to do it.

Contributor Author

Alrighty, changed, now the input constant_values is not mutated in place.

Contributor Author

> I wonder if the check needs to be more thorough? It is totally possible to create Dataset objects with a data variable that has the same name as a dimension:
>
> xr.Dataset({"x": ("x", [1, 2])}).drop_indexes("x").reset_coords("x")
>
> so we could potentially break the original use case if we remove all data variables from the dictionary, because we could also remove a dimension name.
>
> In general, I believe we need to be thorough in defining the new feature:
>
> • should we also allow padding coordinate variables?
> • if we use it, should it still be possible to specify blanket dimension padding? I.e. should it be possible to mix the two? If so, how do we figure out which is which? Check if the value is a dict and decide it's the padding for a specific variable?
> • what should happen if we specify padding values only for a subset of the variables? Dimensions need to have the same size for all variables in a Dataset, so either we need to specify a fallback (by dimension names), or we need to require specifying the padding for all affected variables (and raise otherwise).

It is a fair point; I have rewritten this section so the original constant_values mapping is not mutated.
Regarding the bullet points:

  • As I understand it, it was already possible to pad coordinate variables.
  • As it is coded now, if a data var has a dim with the same name and the user sets a value under that name in constant_values, the data var will be padded with those values (regardless of the dim being padded), and all other data vars that contain that dim will be padded with that value, provided it is the dim being padded (a.k.a. the dim provided in pad_width). In the tests I also show that one can mix data var and dim names in the same dict. In that case the value for the data var takes priority over the value for the dim.
  • I realized it was already implemented that if one doesn't set values for all dims being padded, they are padded with 0, so I kept that. For example, if one sets a constant_value for var1 and pads a dim that is also contained in other vars, var1 will be padded with that value along the dim while all other vars will be padded with 0 along the same dim.

I hope I didn't make it too confusing. Let me know if it answers your questions and if the logic makes sense :D

@@ -6689,17 +6689,45 @@ def test_polyfit_warnings(self) -> None:
ds.var1.polyfit("dim2", 10, full=True)
assert len(ws) == 1

def test_pad(self) -> None:
@pytest.mark.parametrize(
Collaborator

Does this work if you want to pad along a dimension coordinate (a.k.a. a variable with the same name as its dimension)?

Contributor Author

Good question, I'll investigate :D

Contributor Author

So, as I understand it, most dims in the test dataset are dimension coordinates and they pad correctly, so I believe it does. In any case I've extended the tests to pad all dimensions just to be sure nothing is behaving incorrectly. Let me know if I missed any case.

@max-sixty
Collaborator

Any other changes pending or is this ready to go?

(thanks a lot for the work @tsanona !)

k: v for k, v in constant_values.items() if k in var.dims
}
else:
filtered_constant_values = 0
Collaborator

Suggested change
filtered_constant_values = 0
filtered_constant_values = None

If I'm not mistaken

Probably a test should cover this as well?

Contributor Author

If we change that 0 to None, there is a discrepancy between what happens when we don't set all dims in constant_values (which defaults to padding with 0, coded in the pad function in variable.py) and when we don't set all vars (the code I added to the pad function in dataset.py, which then pads dimensions with np.nan).
I can either change it everywhere, which I think might be a breaking change (?), or just leave everything as is.

Collaborator

I think this would be worth changing but doesn't need to be in this PR (sorry if I slowed things down by introducing this). Assuming it isn't trivial, adding a TODO with a link to this would be very sufficient

Contributor Author

No worries, added the TODO 🚀
Lemme know if there's anything else left to do and thanks for all the comments!

@max-sixty max-sixty added the plan to merge Final call for comments label Aug 21, 2024
@max-sixty
Collaborator

If we can add a whatsnew entry, that would be great!

@max-sixty max-sixty merged commit ed5900b into pydata:main Aug 21, 2024
28 checks passed

welcome bot commented Aug 21, 2024

Congratulations on completing your first pull request! Welcome to Xarray! We are proud of you, and hope to see you again! [celebration gif]

@max-sixty
Collaborator

Thank you very much @tsanona !

@tsanona tsanona deleted the extend-padding-functionalities branch August 21, 2024 21:19
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 21, 2024
* main:
  Extend padding functionalities (pydata#9353)
dcherian added a commit to TomNicholas/xarray that referenced this pull request Aug 22, 2024
* main: (214 commits)
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)
  Improve error message for missing coordinate index (pydata#9370)
  Add flaky to TestNetCDF4ViaDaskData (pydata#9373)
  Make chunk manager an option in `set_options` (pydata#9362)
  Revise (pydata#9371)
  Remove duplicate word from docs (pydata#9367)
  Adding open_groups to BackendEntryPointEngine, NetCDF4BackendEntrypoint, and H5netcdfBackendEntrypoint (pydata#9243)
  Revise (pydata#9366)
  Fix rechunking to a frequency with empty bins. (pydata#9364)
  whats-new entry for dropping python 3.9 (pydata#9359)
  drop support for `python=3.9` (pydata#8937)
  Revise (pydata#9357)
  ...
dcherian added a commit to dcherian/xarray that referenced this pull request Aug 30, 2024
* main:
  Adds copy parameter to __array__ for numpy 2.0 (pydata#9393)
  `numpy 2` compatibility in the `pydap` backend (pydata#9391)
  pyarrow dependency added to doc environment (pydata#9394)
  Extend padding functionalities (pydata#9353)
  refactor GroupBy internals (pydata#9389)
  Combine `UnsignedIntegerCoder` and `CFMaskCoder` (pydata#9274)
  passing missing parameters to ZarrStore.open_store when opening a datatree (pydata#9377)
  Fix tests on big-endian systems (pydata#9380)
  Improve error message on `ds['x', 'y']` (pydata#9375)