Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] support for xarray DataArray & mtypes #3255

Merged
merged 13 commits into from
Aug 26, 2022

Conversation

benHeid
Copy link
Contributor

@benHeid benHeid commented Aug 13, 2022

Reference Issues/PRs

resolves #2655

What does this implement/fix? Explain your changes.

Does your contribution introduce a new dependency? If yes, which one?

What should a reviewer concentrate their feedback on?

Any other comments?

PR checklist

For all contributions
  • I've added myself to the list of contributors.
  • Optionally, I've updated sktime's CODEOWNERS to receive notifications about future changes to these files.
  • I've added unit tests and made sure they pass locally.
  • The PR title starts with either [ENH], [MNT], [DOC], or [BUG] indicating whether the PR topic is related to enhancement, maintenance, documentation, or bug.
For new estimators
  • I've added the estimator to the online documentation.
  • I've updated the existing example notebooks or provided a new one to showcase how my estimator works.

@benHeid
Copy link
Contributor Author

benHeid commented Aug 13, 2022

TODOs

  • make xarray to a softdependency (conditional import)
  • resolve TODOs in the code
  • Discuss if
    • DataSets should also be supported
    • Further scitypes should support xarray

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 13, 2022

for conditional execution (based on presence of package in the env), you can use:

if _check_soft_dependencies("xarray", severity="none"):
    import xarray
    etc

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 14, 2022

let me know if you think you are stuck - I can look at this is more detail, we can do pair programming, or I can give it a try.

@fkiraly fkiraly added the module:datatypes datatypes module: data containers, checkers & converters label Aug 14, 2022
@benHeid
Copy link
Contributor Author

benHeid commented Aug 15, 2022

let me know if you think you are stuck - I can look at this is more detail, we can do pair programming, or I can give it a try.

Thank you, I let you know if I am stuck

@benHeid
Copy link
Contributor Author

benHeid commented Aug 15, 2022

I have two questions:

  • I added the extend_conversion function from tables to series. (In panel, there exists something similar). Should I move this method to another file in the datatypes file, or should an extra PR do this?
  • Should I add xarray as a soft dependency somewhere that the automatic tests can install xarray?

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 15, 2022

I added the extend_conversion function from tables to series. (In panel, there exists something similar). Should I move this method to another file in the datatypes file, or should an extra PR do this?

Depends what you need it for. If needed in this PR, we can leave it in its location and refactor.
If you want it as a more general utility and not needed for the xarray topic, it is better putting in a separate PR.

Should I add xarray as a soft dependency somewhere that the automatic tests can install xarray?

Yes, pyproject.toml, all_extras dependency set.

* Use store dict for xarray DataArray conversion
* Fix linting errors in newly added code.
@benHeid
Copy link
Contributor Author

benHeid commented Aug 15, 2022

I added the extend_conversion function from tables to series. (In panel, there exists something similar). Should I move this method to another file in the datatypes file, or should an extra PR do this?

Depends what you need it for. If needed in this PR, we can leave it in its location and refactor. If you want it as a more general utility and not needed for the xarray topic, it is better putting in a separate PR.

Ok, it is not needed for the xarray topic.

@benHeid benHeid changed the title Enable xarray DataArray mtype as sctpye series. [ENH] Enable xarray DataArray mtype as sctpye series. Aug 15, 2022
@benHeid benHeid force-pushed the feature/2655_enable_xarray branch 2 times, most recently from 7eb4ad4 to 81c4aec Compare August 15, 2022 10:55
@benHeid benHeid force-pushed the feature/2655_enable_xarray branch from 81c4aec to 4654248 Compare August 15, 2022 10:59
@benHeid
Copy link
Contributor Author

benHeid commented Aug 15, 2022

Update:

I am currently working on probably transforming the index from pandas to xarray and vice versa. I assume that this is caused by the transpose which I apply to have the same shape after the conversion chain: pd.DataFrame -> xr.DataArray -> pd.DataFrame.

A solution for this will probably also solve also the problem with the tests in python 3.7 (it seems that in python 3.7 the index is a BaseIndex after the conversion, while for 3.8 and 3.9 it is a PeriodIndex)

…ons which swap the dimensions..

* Append deep_equals for xarray testing.
@fkiraly
Copy link
Collaborator

fkiraly commented Aug 15, 2022

A solution for this will probably also solve also the problem with the tests in python 3.7 (it seems that in python 3.7 the index is a BaseIndex after the conversion, while for 3.8 and 3.9 it is a PeriodIndex)

Ouch... how odd. I suppose this is one of those non-obvious problems that occur...

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 16, 2022

👏 nice! Looks like it passes now!

sktime/utils/_testing/deep_equals.py Outdated Show resolved Hide resolved
sktime/datatypes/_series/_check.py Show resolved Hide resolved
return concat_fun


def _extend_conversions(mtype, anchor_mtype, convert_dict):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If datatypes is refactored, then this method should be placed somewhere else. Currently similar methods are implemented in panels and tables.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, indeed - too much copy-paste!
Though I would go with incurring technical debt due to priorities...

Re refactor directions:

@chrisholder started sth, but abandoned it.
#1460
Haven´t had the time to salvage it.

This would be another option:
dylan-profiler/visions#194

sktime/datatypes/_series/_convert.py Show resolved Hide resolved
@benHeid
Copy link
Contributor Author

benHeid commented Aug 16, 2022

👏 nice! Looks like it passes now!

Yeah 🥳
I think you can now take a more detailed look at this PR. I added already some comments, and I would appreciate your feedback there.

Should we add the support for other mtypes and the dataset in this PR or should I create additional PRs for the other?

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 16, 2022

Should we add the support for other mtypes and the dataset in this PR or should I create additional PRs for the other?

I would go with the "minimal modular change" principle and separate PR, also finalizing this one first.
Which other mtypes do you have in mind?

Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really nice, thanks!

if this were not a draft, I would have only one blocking change, that's the coupling introduced in deep_equals, can we avoid it?

* Enable xarray with only one dimension (Explicit transformation to pandas DataFrame instead of using to_pandas)
* Add identity transformation from xr.DataArray to xr.DataArray
@benHeid
Copy link
Contributor Author

benHeid commented Aug 19, 2022

Update

  • I changed the check in deep_equals.
  • I manually create a dataframe instead of using to_pandas. This ensures that xr.DataArrays are always transformed to DataFrames and not to pd.Series if the xr.DataArray contains only one dimension. Note, to_dataframe does not work well, since it does some multi-indexing stuff in case of two dimension which needs to be unstacked.

@fkiraly fkiraly changed the title [ENH] Enable xarray DataArray mtype as sctpye series. [ENH] support for xarray DataArray & mtypes Aug 19, 2022
@benHeid benHeid marked this pull request as ready for review August 25, 2022 08:13
@benHeid benHeid requested a review from aiwalter as a code owner August 25, 2022 08:13
@benHeid benHeid requested review from fkiraly and removed request for aiwalter August 25, 2022 08:14
Copy link
Collaborator

@fkiraly fkiraly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Works!

Do you want me to merge this, or wait for panel?

@benHeid
Copy link
Contributor Author

benHeid commented Aug 26, 2022

Works!

Do you want me to merge this, or wait for panel?

I would merge it and add the panel with a second PR if I need this. Currently, I can test and go on with the sktime-pyWATTS integration without implementing Panels. Perhaps Issue #2655 should stay open in that case.

@benHeid
Copy link
Contributor Author

benHeid commented Aug 26, 2022

@fkiraly Do we have to adapt utils.validation.series? In that file, something like VALID_DATA_TYPES is defined. Do I have to add xr.DataArray?

@fkiraly
Copy link
Collaborator

fkiraly commented Aug 26, 2022

@fkiraly Do we have to adapt utils.validation.series? In that file, something like VALID_DATA_TYPES is defined. Do I have to add xr.DataArray?

I think that's used only in legacy code, so I would recommend to not touch it (who knows what will break).

The "critical touch point" is check_series, which is still linked to a few estimators that have not been fully refactored, such as ColumnwiseTransformer - but mostly decoupled from the vast majority of the code base.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
module:datatypes datatypes module: data containers, checkers & converters
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] xarray support
2 participants