Import frictionless data table schemas and json-schemas #420
Comments
thanks for this feature request @e-lo! I've been keeping an eye on the frictionless data ecosystem and was waiting for someone to have this use case :) Out of curiosity, how do you use frictionless table schemas, json schema, and pandera in your workflow? It might help in fleshing out the solution to this issue. Just to riff off of your described solution, here are some initial thoughts:

MVP implementation of frictionless and json schema parsers

There are abstractions in either system that are currently not supported by pandera, for example field titles and descriptions. There's an issue for this (#331), but I don't think completion of that issue should block this one.

Another abstraction is the concept of primary and foreign keys; I'm not sure yet whether those should be supported in pandera.

Looking at the constraints, I do believe there's a 1-1 mapping from frictionless data to pandera, so that should be fairly straightforward. So as an initial approach, we should identify the intersection of features between (i) frictionless/json schema and (ii) pandera.
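To make the "1-1 mapping" claim above concrete, here is one way the frictionless field constraints could line up with the names of pandera's built-in checks. This is an illustrative sketch, not code from either library: the mapping table and the `mapped_checks` helper are hypothetical, though the pandera check names on the right-hand side (`pa.Check.greater_than_or_equal_to`, `pa.Check.isin`, etc.) are real.

```python
# Illustrative mapping from frictionless field constraints to the names
# of built-in pandera checks (pa.Check.<name>). "required" and "unique"
# map to Column-level settings rather than to a Check, so they are None.
FRICTIONLESS_TO_PANDERA = {
    "minimum": "greater_than_or_equal_to",
    "maximum": "less_than_or_equal_to",
    "pattern": "str_matches",
    "enum": "isin",
    "minLength": "str_length",
    "maxLength": "str_length",
    "required": None,
    "unique": None,
}


def mapped_checks(constraints: dict) -> list:
    """Return (pandera_check_name, argument) pairs for the constraints
    that translate directly to a pandera Check."""
    return [
        (FRICTIONLESS_TO_PANDERA[name], value)
        for name, value in constraints.items()
        if FRICTIONLESS_TO_PANDERA.get(name) is not None
    ]


print(mapped_checks({"minimum": 0, "enum": [0, 1], "required": True}))
# → [('greater_than_or_equal_to', 0), ('isin', [0, 1])]
```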
Desired use case: when working with data whose expected structure is codified in a schema file defining a data standard, be able to validate the data that is read in using a pandera decorator that directly references that [potentially external] schema file, rather than hard-coding the schema. Right now I'm not using pandera, but I have been watching it to see if/when it would solve my use case, because it comes packed with features that I think would alleviate the need for the multitude of external validators that have sprung up, including the older but very sluggish ones from Frictionless themselves (they don't even use pandas!). I'm guilty of having written one of those validators myself, for validating whether data is compatible with the "General Modeling Network Specification" for travel demand modeling.
The other big use case I'm thinking of right now is GTFS (General Transit Feed Specification), as mentioned in the issue above. Ideally you shouldn't need to run a large validator (see the canonical one) to do basic validation against the official spec file, and useful tools for processing GTFS, like partridge, either don't have any real validation beyond field names or have the spec hard-coded in them rather than pointing to the "official" specification.
Agree. The titles and descriptions are great for documentation but not necessary for usage.
Agree that this is tricky, because it relies on a structure of data, not just a single df. The original frictionless validator didn't do this either... but if this is the only thing I have to implement myself then that's fine ;-) An MVP implementation could just be a uniqueness check, if it isn't already explicitly specified as a constraint?
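The MVP described above (treat the primary key as a uniqueness check on a single dataframe) can be sketched in plain pandas. The function name and the example frame below are hypothetical, not part of pandera or frictionless:

```python
import pandas as pd


def primary_key_check(df: pd.DataFrame, key_fields: list) -> bool:
    """MVP frictionless primaryKey semantics within one table: the key
    columns, taken together, must be fully populated and unique."""
    key = df[key_fields]
    return bool(not key.isna().any().any() and not key.duplicated().any())


# Hypothetical GTFS-trips-like frame for illustration.
trips = pd.DataFrame({"trip_id": ["t1", "t2", "t3"],
                      "route_id": ["r1", "r1", "r2"]})
print(primary_key_check(trips, ["trip_id"]))   # True: unique, no nulls
print(primary_key_check(trips, ["route_id"]))  # False: r1 repeats
```

Foreign keys across tables would still need the multi-dataframe structure discussed above; this sketch only covers the single-table uniqueness part.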
I can take a hack at this if it is helpful, starting with frictionless (because I think it maps more easily to dfs) - if you point me to the best list for pandera.
One particular case related to json schema is mapping OpenAPI data models to a pandera schema: a pandera schema that validates data received from the REST API could then be kept in sync with the API definition itself. OpenAPI data models are based on an extended subset of JSON Schema.
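To illustrate why the same parsing path could serve both formats: an OpenAPI component schema is (an extended subset of) JSON Schema, and its `properties`/`required` keys carry the same information a column spec needs. The "Trip" model and its fields below are illustrative, not from any real API:

```python
# Minimal OpenAPI-style component schema. "properties" and "required"
# are standard OpenAPI/JSON Schema keywords; the "Trip" model itself
# is a made-up example.
openapi_components = {
    "schemas": {
        "Trip": {
            "type": "object",
            "required": ["trip_id"],
            "properties": {
                "trip_id": {"type": "string"},
                "direction_id": {"type": "integer", "enum": [0, 1]},
            },
        }
    }
}

model = openapi_components["schemas"]["Trip"]
# Each property is a candidate pandera Column; "required" drives nullability.
columns = {
    name: {"dtype": spec["type"], "nullable": name not in model["required"]}
    for name, spec in model["properties"].items()
}
print(columns)
# → {'trip_id': {'dtype': 'string', 'nullable': False},
#    'direction_id': {'dtype': 'integer', 'nullable': True}}
```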
Thanks @e-lo! Things are getting busy with pandera and I need to turn my attention to some other aspects of the project over the next few weeks, so your contribution would be much appreciated 🎉 I just added this issue to the 0.8.0 release milestone; I think we can tackle the json schema and OpenAPI specifications in separate issues. A good place to start would be the contributing page to get your dev environment all set up - let me know if you hit any snags in the process.

Re: supporting frictionless, I think a nice UX would be something like:

```python
import pandera as pa

schema = pa.from_frictionless_schema("path/to/schema.json")

@pa.check_input(schema)
def function(dataframe):
    ...
    # do stuff
```

For implementation, there are three modules to be aware of: … We might want a … Let me know if you have any questions!
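For context, a frictionless table schema is itself a small JSON document, so the parser sketched above would be consuming something like the following. The field names here are illustrative (loosely modeled on a GTFS trips table), while the `fields`/`constraints`/`primaryKey` keys are part of the frictionless table-schema spec:

```python
import json

# Minimal frictionless table schema; field names are illustrative,
# loosely modeled on a GTFS trips table.
schema_doc = """
{
  "fields": [
    {"name": "trip_id", "type": "string",
     "constraints": {"required": true, "unique": true}},
    {"name": "route_id", "type": "string",
     "constraints": {"required": true}},
    {"name": "direction_id", "type": "integer",
     "constraints": {"enum": [0, 1]}}
  ],
  "primaryKey": "trip_id"
}
"""

schema = json.loads(schema_doc)
print([f["name"] for f in schema["fields"]])
# → ['trip_id', 'route_id', 'direction_id']
```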
@e-lo and @cosmicBboy - I've had a quick go at building out frictionless compatibility in the PR above - I'd appreciate any feedback if/when you have a minute!
thanks @TColl! let's go ahead and merge this into the …
fixed by #454
Is your feature request related to a problem? Please describe.
Many common data standards are codified in json files. Using the same data schema file, without having to translate it to yaml or into classes, reduces inconsistency and errors and greatly speeds up the ability to validate that a dataframe is, for example, a valid GTFS Trips Table.

Describe the solution you'd like.
io.py to deserialize: …

Describe alternatives you've considered