Representing & checking Dataset schemas #1900

max-sixty · 2018-02-09T18:06:08Z

What would be the best way to canonically describe a dataset, which could be read by both humans and machines?

For example, frequently in our code we have docstrings which look something like:

def get_returns(security_ids):
    """
    Returns mega-dimensional dataset which gives recent returns for a set of
    securities by:
    - Date
    - Return (raw / economic / smoothed / etc)
    - Scaling (constant / risk_scaled)
    - Span
    - Hedged vs Unhedged

    Dataset keys are security ids. All dimensions have coords.
    """

This helps when attempting to understand what code is doing while only reading it.
But this isn't consistent between docstrings and can't be read or checked by a machine.
Has anyone solved this problem / have any suggestions for resources out there?

Tangentially related to python/typing#513 (but our issues are less about the type or dimension sizes, and more about the arrays within a dataset, their dimensions, and their names).
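To make the question concrete, here is a minimal sketch of what a machine-checkable version of the docstring above might look like. Everything in it is an assumption for illustration: `RETURNS_SCHEMA`, `check_schema`, the lower-case dimension names, and the `sec_1` / `sec_2` keys are hypothetical and not part of xarray. The check is duck-typed and relies only on `ds.data_vars`, `.dims`, and `ds.coords[dim].values`, which an `xarray.Dataset` also exposes, so a stand-in object is used to keep the example runnable without xarray.

```python
from types import SimpleNamespace

# Hypothetical schema for the docstring above: every variable (one per
# security id) must have exactly these dimensions, and the listed
# dimensions may only use the allowed coordinate labels.
RETURNS_SCHEMA = {
    "dims": ("date", "return_type", "scaling", "span", "hedged"),
    "coords": {
        "scaling": {"constant", "risk_scaled"},
        "hedged": {"hedged", "unhedged"},
    },
}


def check_schema(ds, schema):
    """Return a list of human-readable schema violations (empty if valid)."""
    errors = []
    # Every data variable must carry the same set of dimensions.
    for name, var in ds.data_vars.items():
        if set(var.dims) != set(schema["dims"]):
            errors.append(f"{name}: dims {tuple(var.dims)} != {schema['dims']}")
    # Constrained dimensions must have coords drawn from the allowed labels.
    for dim, allowed in schema["coords"].items():
        if dim not in ds.coords:
            errors.append(f"{dim}: missing coordinate")
            continue
        unexpected = {str(v) for v in ds.coords[dim].values} - allowed
        if unexpected:
            errors.append(f"{dim}: unexpected labels {sorted(unexpected)}")
    return errors


# Stand-in for a Dataset (illustrative data only): sec_2 lacks the
# "hedged" dimension and the "hedged" coordinate has an unknown label.
dims = RETURNS_SCHEMA["dims"]
fake_ds = SimpleNamespace(
    data_vars={
        "sec_1": SimpleNamespace(dims=dims),
        "sec_2": SimpleNamespace(dims=dims[:-1]),
    },
    coords={
        "scaling": SimpleNamespace(values=["constant", "risk_scaled"]),
        "hedged": SimpleNamespace(values=["hedged", "maybe"]),
    },
)
problems = check_schema(fake_ds, RETURNS_SCHEMA)
print(problems)
```

Because the schema is plain data, the same dict could in principle be rendered into the docstring for humans and fed to the checker for machines, which is the property the docstring alone lacks.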


shoyer · 2018-02-09T19:16:26Z

I think the right word for this may be "schema". For applications and models (rather than data analysis), these sorts of conventions can be super-valuable. I like the idea of a declarative spec that can be validated.

Just googling around, I came up with pandas-validator: https://github.com/c-data/pandas-validator

max-sixty · 2018-02-09T22:01:39Z

I think the right word for this may be "schema"

Right! 🤦‍♂️

Just googling around, I came up with pandas-validator

Interesting, thanks.

Do you think this fits into a 'function which validates', rather than a Mypy-like type annotation? I think ideally there would be a representation of the schema that could work with both, so maybe this isn't the important question atm.
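As a rough illustration of one representation serving both uses, the sketch below attaches a schema object to a return annotation with `typing.Annotated` and also lets it be called directly as a validator. `DimsSchema` and `checked` are hypothetical names invented for this example, not an existing library API; `Any` stands in for `xarray.DataArray`, and a `SimpleNamespace` with a `.dims` attribute stands in for a real array so the block runs with the standard library alone (Python 3.9+).

```python
import functools
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Annotated, Any, get_type_hints


@dataclass(frozen=True)
class DimsSchema:
    """Hypothetical schema: the set of dimension names an array must have."""

    dims: tuple

    def validate(self, obj):
        # Usable on its own as a plain "function which validates".
        if set(obj.dims) != set(self.dims):
            raise ValueError(f"expected dims {self.dims}, got {tuple(obj.dims)}")
        return obj


def checked(func):
    """Validate the return value against DimsSchema metadata in the annotation."""
    hints = get_type_hints(func, include_extras=True)
    metadata = getattr(hints.get("return"), "__metadata__", ())
    schemas = [m for m in metadata if isinstance(m, DimsSchema)]

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        for schema in schemas:
            schema.validate(result)
        return result

    return wrapper


# The same schema object appears in the annotation (readable by tools)
# and is enforced at runtime by the decorator.
Returns = Annotated[Any, DimsSchema(("date", "span"))]


@checked
def get_returns(security_ids) -> Returns:
    return SimpleNamespace(dims=("date", "span"))


@checked
def get_broken(security_ids) -> Returns:
    return SimpleNamespace(dims=("date",))  # missing "span"


print(get_returns(["sec_1"]).dims)
```

A static checker such as Mypy would only see the base type here and ignore the extra metadata, so this covers runtime validation and documentation rather than static checking of dimensions.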

max-sixty · 2018-02-09T23:31:44Z

And let me know if there are already textual schema definitions from other libraries that you think are good, before we go and build our own (we don't work with any netCDF-like files so don't have that context)

shoyer · 2018-02-09T23:53:49Z

ncdump -h (xarray.Dataset.info()) is one existing schema of sorts, but it's hardly machine readable.

benbovy · 2018-02-22T23:03:27Z

Somewhat related to this issue, I have implemented in xarray-simlab some logic to validate xarray.Variable objects (dimensions, dtype, etc.). See this base class and some sub-classes here. I use that in a way which is quite similar to pandas-validator (i.e., using class attributes).

I'm currently in the process of refactoring this using attrs, which supports both validator functions and type annotations. Not sure how to use the latter for xarray objects, though (BTW I wasn't aware of python/typing#513, good to know!!).

I agree that it would be nice to have a more generic way to describe xarray objects that can be reused in many contexts.

max-sixty · 2018-02-22T23:13:55Z

@benbovy That looks v interesting.
I think at the moment it would require a bit of work to validate normal xarray objects, is that right? (I'm looking at the __init__, which doesn't take the traditional args supplied to a Variable - tell me if I'm misreading it)

Separately - I didn't know about the project but looks awesome. Do we have a list of projects that integrate xarray? Let's start one somewhere if not @pydata/xarray ?

benbovy · 2018-02-22T23:25:11Z

@maxim-lian you're right. In this case xsimlab.Variable is a different concept from xarray.Variable, even though they share the same name. The former is tied to the modelling framework while the latter is only used for simulation inputs and outputs in xarray-simlab.

Do we have a list of projects that integrate xarray?

There is an ongoing discussion in #1850 about having something like xarray-contrib (likely a github organization).

max-sixty · 2018-03-28T19:46:33Z

The commentary in python/typing#513, and @shoyer 's doc https://docs.google.com/document/d/1vpMse4c6DrWH5rq2tQSx3qwP_m_0lyn-Ij4WHqQqRHY/edit#heading=h.rkj7d39awayl are good & growing

I'll close this as I think riding on those coattails - with the addition of names and Datasets as containers - makes the most sense.

(though reopen if we think there's something we could productively do separately)

JackKelly · 2021-10-08T07:04:51Z

I'm really interested in a machine-readable schema for xarray!

Pandera provides machine-readable schemas for Pandas and, as of version 0.7, pandera has decoupled pandera and pandas types to make pandera more useful for things like xarray. I haven't tried pandera yet but I plan to do some experiments soon.

shoyer · 2021-10-08T07:25:44Z

Pandera provides machine-readable schemas for Pandas and, as of version 0.7, pandera has decoupled pandera and pandas types to make pandera more useful for things like xarray. I haven't tried pandera yet but I plan to do some experiments soon.

Awesome -- would love to hear how this goes!

JackKelly · 2021-10-08T14:32:44Z

OK, I think pandera isn't the way forwards because it appears very tightly coupled to Pandas (so, for example, I don't think it's possible to use pandera with n-dimensional arrays).

But Pydantic looks promising. Here's a very quick coding experiment showing one way to use pydantic with xarray... it validates a few things; but it's not super-useful as a human-readable specification for what's going on inside a DataArray or Dataset.

rabernat · 2021-10-08T15:41:29Z

But Pydantic looks promising

Big 👍 to this.

andersy005 · 2022-01-09T03:56:40Z

xref the more recent issue: Xarray integration unionai-oss/pandera#705 which aims to implement a pandera.xarray module within pandera

jhamman · 2022-01-10T02:06:59Z

Related to the Pandera integration, we are prototyping the xarray schema validation functionality in the xarray-schema project.

kubaraczkowski · 2022-07-14T11:28:37Z

Does this project do (part of?) what's needed?
+1 on making xarrays with explicit 'structure'!

max-sixty mentioned this issue Feb 22, 2018

xarray contrib module #1850

Closed

max-sixty closed this as completed Mar 28, 2018

shoyer mentioned this issue Aug 3, 2018

Support non-string dimension/variable names #2292

Closed

JackKelly mentioned this issue Oct 8, 2021

Machine-readable schema & validator for xarray.Dataset openclimatefix/nowcasting_dataset#211

Closed

shoyer changed the title ~~Representing & checking Dataset metadata~~ Representing & checking Dataset schemas Oct 8, 2021

shoyer reopened this Oct 8, 2021

hammer mentioned this issue Oct 8, 2021

Explore typed/schema-based Dataset options sgkit-dev/sgkit#43

Closed

dcherian added the topic-typing label May 6, 2022

andersy005 mentioned this issue Jun 9, 2022

Datatype for a 'shape specification' of a Dataset / DataArray #6680

Open

ivirshup mentioned this issue Aug 18, 2022

Schemas? scverse/spatialdata#24

Closed
