Import frictionless data table schemas and json-schemas #420

e-lo · 2021-02-16T21:52:18Z

Is your feature request related to a problem? Please describe.
Many common data standards are codified in json files. Using the same data schema file without having to translate it to yaml or into classes reduces inconsistency and errors, and greatly speeds up validating that a dataframe is, for example, a valid GTFS Trips Table.

Describe the solution you'd like.

Add functions in io.py to deserialize frictionless table schemas and json-schemas.

Overridable default mappings between checks in pandera and json schema/frictionless
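As a rough illustration of what "overridable default mappings" could look like, here is a minimal sketch. The names `DEFAULT_CONSTRAINT_TO_CHECK` and `resolve_check_mapping` are hypothetical and not part of pandera; the string values are names of existing `pandera.Check` factory methods, and the frictionless constraint names on the left are the only part taken from the Table Schema spec.

```python
# Hypothetical default mapping from frictionless constraint names to the
# names of pandera.Check factory methods. Only the pairing is illustrated.
DEFAULT_CONSTRAINT_TO_CHECK = {
    "minimum": "greater_than_or_equal_to",
    "maximum": "less_than_or_equal_to",
    "enum": "isin",
    "pattern": "str_matches",
}


def resolve_check_mapping(overrides=None):
    """Return the default mapping with user overrides applied on top.

    An override value of None drops a constraint from the mapping entirely.
    """
    mapping = dict(DEFAULT_CONSTRAINT_TO_CHECK)
    for constraint, check_name in (overrides or {}).items():
        if check_name is None:
            mapping.pop(constraint, None)
        else:
            mapping[constraint] = check_name
    return mapping


# Example: make "minimum" strict and ignore "pattern" constraints.
custom = resolve_check_mapping({"minimum": "greater_than", "pattern": None})
```

Copying the defaults before applying overrides keeps the module-level mapping untouched, so two callers with different overrides would not affect each other.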

Describe alternatives you've considered

hand-transferring and updating schemas from frictionless and json schema to classes or yaml files
automated transferring of frictionless and json schemas to intermediate yaml files to be read in


cosmicBboy · 2021-02-17T17:55:46Z

thanks for this feature request @e-lo! Been keeping an eye out on the frictionless data ecosystem and was waiting for someone to have this use case :)

Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow? It might help in fleshing out the solution to this issue.

Just to riff off of your described solution, here are some initial thoughts:

MVP implementation of frictionless and json schema parser

There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.

Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.

Looking at the constraints, I do believe there's a 1-1 mapping from frictionless data to pandera, so that should be fairly straightforward.

So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.
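One way to make the "intersection of features" concrete is to partition the keys a schema actually uses into those an MVP would translate and those it would skip. The sketch below assumes a frictionless table schema loaded as a plain dict; `SUPPORTED_FIELD_KEYS` and `split_supported` are hypothetical names, and the choice of which keys count as supported is an assumption for illustration only.

```python
# Assumed, illustrative set of field-level keys an MVP importer would handle.
SUPPORTED_FIELD_KEYS = {"name", "type", "constraints"}


def split_supported(schema):
    """Partition the keys used by a frictionless table schema (a dict)
    into those the MVP would translate and those it would skip."""
    supported, unsupported = set(), set()
    for field in schema.get("fields", []):
        for key in field:
            (supported if key in SUPPORTED_FIELD_KEYS else unsupported).add(key)
    # Schema-level keys such as primaryKey/foreignKeys are deferred for now.
    unsupported.update(k for k in schema if k != "fields")
    return supported, unsupported


schema = {
    "fields": [
        {"name": "trip_id", "type": "string", "title": "Trip ID",
         "constraints": {"required": True}},
    ],
    "primaryKey": "trip_id",
}
supported, unsupported = split_supported(schema)
```

Reporting the skipped keys (here `title` and `primaryKey`) rather than silently dropping them would let users see what the import did not carry over.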

e-lo · 2021-02-17T19:36:17Z

> Out of curiosity, how do you use frictionless tables schema, json schema, and pandera in your workflow?

Desired use case: when working with data whose expected structure is codified in a schema file that defines a data standard, be able to validate the data as it is read in using a pandera decorator that directly references that [potentially external] schema file rather than hard-coding the schema.
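To illustrate the shape of that use case without depending on pandera or pandas, here is a standard-library sketch. The decorator `check_columns_from_schema` is hypothetical, it only checks that each row carries the field names listed in the schema file, and it uses a list of dicts as a stand-in for a dataframe.

```python
import functools
import json
import os
import tempfile


def check_columns_from_schema(schema_path):
    """Hypothetical decorator: check the first argument (a list of dict rows)
    against the field names listed in a frictionless-style schema file."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(rows, *args, **kwargs):
            # Read the schema at call time so edits to the file are picked up.
            with open(schema_path) as f:
                expected = {field["name"] for field in json.load(f)["fields"]}
            for i, row in enumerate(rows):
                missing = expected - set(row)
                if missing:
                    raise ValueError(f"row {i} is missing fields: {sorted(missing)}")
            return func(rows, *args, **kwargs)
        return wrapper
    return decorator


# Demo with a throwaway schema file standing in for an external spec.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as tmp:
    json.dump({"fields": [{"name": "trip_id"}, {"name": "route_id"}]}, tmp)


@check_columns_from_schema(tmp.name)
def count_trips(rows):
    return len(rows)


try:
    n_trips = count_trips([{"trip_id": "t1", "route_id": "r1"}])
    try:
        count_trips([{"trip_id": "t2"}])  # route_id is missing
        error_message = None
    except ValueError as exc:
        error_message = str(exc)
finally:
    os.remove(tmp.name)
```

The point of the sketch is that the function body never mentions the field names; they live only in the schema file.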

Right now I'm not using pandera, but have been watching it to see if/when it would solve my use case because it comes packed with a bunch of features that I think would alleviate the need for a multitude of external validators that have sprung up, including the older but very sluggish ones from Frictionless themselves (they don't even use pandas!).

I'm guilty of having written one of those validators myself for validating if data is compatible with the "General Modeling Network Specification" for travel demand modeling:

Note that we extended the frictionless schema slightly to add "warnings" in addition to "constraints".

The other big use case I'm thinking of right now is for GTFS (General Transit Feed Specification) as mentioned in the issue above. Ideally you shouldn't need to run a large validator (see the canonical one) to do some basic validation based on the official spec file, and useful tools for processing GTFS, like partridge, don't have any real validation other than field names and/or have the spec hard-coded in them rather than pointing to the "official" specification.

e-lo · 2021-02-17T19:37:23Z

> There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this #331, but I don't think completion of that issue should block this one.

Agree. The titles and descriptions are great for documentation but not necessary for usage.

e-lo · 2021-02-17T19:39:56Z

> Another abstraction is the concept of primary and foreign keys. I'm not sure yet whether pandera should support this, so we can kick this can down the road, at least for the initial MVP.

Agree that this is tricky, because it relies on a structure of data - not just a single df. Original frictionless validator didn't do this either...but if this is the only thing I have to implement myself then that's fine ;-)

MVP implementation could just be a uniqueness check if it isn't already explicitly specified as a constraint?
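A uniqueness check driven by `primaryKey` could be as small as the sketch below. It assumes the schema is already loaded as a dict and uses a list of dicts in place of a dataframe; `primary_key_duplicates` is a hypothetical helper name. In the Table Schema spec, `primaryKey` may be either a single field name or a list of field names, so both are handled.

```python
def primary_key_duplicates(schema, rows):
    """Return the primary-key tuples that occur more than once in rows.

    `schema` is a frictionless table schema as a dict; `primaryKey` may be a
    single field name or a list of field names. `rows` is a list of dicts.
    """
    key = schema.get("primaryKey")
    if key is None:
        return []
    fields = [key] if isinstance(key, str) else list(key)
    seen, duplicates = set(), []
    for row in rows:
        value = tuple(row[f] for f in fields)
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


schema = {
    "fields": [{"name": "trip_id"}, {"name": "stop_sequence"}],
    "primaryKey": ["trip_id", "stop_sequence"],
}
rows = [
    {"trip_id": "t1", "stop_sequence": 1},
    {"trip_id": "t1", "stop_sequence": 2},
    {"trip_id": "t1", "stop_sequence": 1},
]
duplicates = primary_key_duplicates(schema, rows)
```

This covers only uniqueness within one table; foreign keys would still need access to more than one dataframe.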

e-lo · 2021-02-17T19:41:43Z

> So as an initial approach, we should identify the intersection of features between (i) frictionless and pandera and (ii) json-schema and pandera and implement mappings from one system to the other.

I can take a hack at this if it is helpful, starting with frictionless (because I think it maps easier to dfs) - if you point me to the best list for pandera.

jeffzi · 2021-02-18T20:50:50Z

One particular case related to json schema is mapping OpenAPI data models to a pandera schema. The pandera schema that validates data received from a REST API could then be kept in sync with the API definition itself.

OpenAPI data models are based on an extended subset of JSON schema.
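As a sketch of what such a mapping might involve, the snippet below summarises an OpenAPI object schema as per-column dtype and nullability. The type mapping, the helper name `columns_from_openapi_model`, and the output shape are all assumptions for illustration, and treating non-required properties as nullable is a simplification (in JSON Schema, `required` is about key presence, not null values).

```python
# Assumed mapping from OpenAPI/JSON Schema primitive types to pandas dtype names.
OPENAPI_TYPE_TO_DTYPE = {
    "string": "object",
    "integer": "int64",
    "number": "float64",
    "boolean": "bool",
}


def columns_from_openapi_model(model):
    """Summarise an OpenAPI object schema as {column: {"dtype", "nullable"}}."""
    required = set(model.get("required", []))
    columns = {}
    for name, prop in model.get("properties", {}).items():
        columns[name] = {
            # Unknown or missing types fall back to None, meaning "not checked".
            "dtype": OPENAPI_TYPE_TO_DTYPE.get(prop.get("type")),
            # Simplification: treat non-required properties as nullable.
            "nullable": name not in required,
        }
    return columns


trip_model = {
    "type": "object",
    "required": ["trip_id"],
    "properties": {
        "trip_id": {"type": "string"},
        "direction_id": {"type": "integer"},
    },
}
columns = columns_from_openapi_model(trip_model)
```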

cosmicBboy · 2021-02-19T21:58:52Z

> I can take a hack at this if it is helpful, starting with frictionless (because I think it maps easier to dfs) - if you point me to the best list for pandera.

Thanks @e-lo! Things are getting busy with pandera and I need to turn my attention to some other aspects of the project over the next few weeks, so your contribution would be much appreciated 🎉

I just added this issue to the 0.8.0 release milestone. I think we can tackle the json schema and OpenAPI specifications in separate issues.

A good place to start would be the contributing page to get your dev environment all set up. Let me know if you hit any snags in the process.

Re: supporting frictionless, I think a nice UX would be something like:

```python
import pandera as pa

schema = pa.from_frictionless_schema("path/to/schema.json")

@pa.check_input(schema)
def function(dataframe):
    ...
    # do stuff
```
For implementation, there are three modules to be aware of:

- schema_statistics.py: this extracts schema statistics (fields, their data types, and checks and their sufficient statistics, e.g. min and max values) from a dataframe. It also defines functions for extracting the schema specification from a pandera schema (e.g. get_dataframe_statistics). This is probably where the heavy lifting of extracting the statistics from a frictionless data schema should occur, in a function like get_frictionless_schema_statistics.
  - note that infer_dataframe_statistics, infer_series_statistics, and infer_index_statistics in this module are misnomers... it should probably be parse_* instead of infer_*.
- schema_inference.py: this basically wraps the functions in schema_statistics and exposes the function infer_schema to the end user.
- io.py: where the logic for serialization/deserialization of yaml and serialization to python script lives. This would define from_frictionless_schema and call schema_statistics.get_frictionless_schema_statistics to generate a DataFrameSchema.
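To give a feel for the statistics-extraction step, here is a standard-library sketch of what a function like get_frictionless_schema_statistics might return. The output dict shape, the type mapping, and the choice to pass remaining constraints through untouched are assumptions, not pandera's actual internal format; the default of "string" for a missing `type` follows the Table Schema spec.

```python
import json

# Assumed type mapping; a real implementation would need to cover more
# frictionless types (date, datetime, year, ...) and decide on exact dtypes.
FRICTIONLESS_TYPE_TO_DTYPE = {
    "string": "object",
    "integer": "int64",
    "number": "float64",
    "boolean": "bool",
}


def get_frictionless_schema_statistics(schema):
    """Sketch: turn a frictionless table schema (dict or JSON string) into
    a plain dict of per-column statistics."""
    if isinstance(schema, str):
        schema = json.loads(schema)
    columns = {}
    for field in schema["fields"]:
        constraints = field.get("constraints", {})
        columns[field["name"]] = {
            "dtype": FRICTIONLESS_TYPE_TO_DTYPE.get(field.get("type", "string")),
            "nullable": not constraints.get("required", False),
            "allow_duplicates": not constraints.get("unique", False),
            # Remaining constraints are passed through for a later
            # check-building step.
            "constraints": {k: v for k, v in constraints.items()
                            if k not in ("required", "unique")},
        }
    return {"columns": columns}


stats = get_frictionless_schema_statistics(
    '{"fields": [{"name": "stop_sequence", "type": "integer",'
    ' "constraints": {"required": true, "minimum": 0}}]}'
)
```

A from_frictionless_schema function in io.py would then only need to read the file and hand this intermediate dict to whatever builds the DataFrameSchema.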

We might want a to_frictionless_schema in the future, but we can save that for later :)

Let me know if you have any questions!

TColl · 2021-04-03T18:55:00Z

@e-lo and @cosmicBboy - I've had a quick go at building out frictionless compatibility in PR above - I'd appreciate any feedback if/when you have a minute!

cosmicBboy · 2021-04-03T19:41:28Z

thanks @TColl! let's go ahead and merge this into the release/0.7.0 branch so we can make it available to users sooner

cosmicBboy · 2021-05-08T15:24:57Z

fixed by #454

e-lo added the enhancement label Feb 16, 2021

cosmicBboy added this to the 0.8.0 release milestone Feb 19, 2021

cosmicBboy mentioned this issue Feb 19, 2021

Import/export json-schemas and OpenAPI schemas #421

Open

cosmicBboy added the help wanted label Mar 24, 2021

TColl mentioned this issue Apr 3, 2021

Feature/420 #454

Merged

e-lo mentioned this issue Apr 5, 2021

create statewide analysis tables in Google BigQuery cal-itp/data-infra#31

Closed

cosmicBboy closed this as completed May 8, 2021

cosmicBboy removed this from the 0.8.0 release milestone May 8, 2021
