Create `schema` command #60
Comments
Hi @mhuang74, I'm assigning this to you as you're on a roll! 😉 For the first cut, we don't need to use |
Thanks @jqnatividad. Actually, infers-jsonschema requires a properly typed JSON in the first place, the construction of which still requires type inference based on parsing and counting. |
Hi @jqnatividad, thanks for merging the POC. The first cut only supports one out of the many validation vocabularies that JSON Schema supports. To support more in a manner that would be useful and practical, perhaps each could be turned into an option that requires specific column selection, and schema command can fill in the validation constraints based on data values.
Thoughts? 6.1. Validation Keywords for Any Instance Type
6.2. Validation Keywords for Numeric Instances (number and integer)
6.3. Validation Keywords for Strings
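To make the idea of filling in validation constraints from data values concrete, here is a minimal sketch in Python. It is not how qsv's `schema` command is implemented; the function name `infer_constraints` and the `enum_threshold` parameter are hypothetical, and only integer and string columns are handled. It derives `minimum`/`maximum` for integers, `minLength`/`maxLength` for strings, and an `enum` list when a column has few distinct values.

```python
def infer_constraints(values, enum_threshold=5):
    """Derive JSON Schema validation keywords from one column's raw CSV strings.

    Hypothetical helper for illustration only; not part of qsv.
    """
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        # Nothing to learn from an all-empty column.
        return {"type": "string"}

    try:
        typed = [int(v) for v in non_empty]
    except ValueError:
        typed = None  # at least one value is not an integer

    if typed is not None:
        # Numeric instances: range constraints.
        schema = {"type": "integer", "minimum": min(typed), "maximum": max(typed)}
        distinct = sorted(set(typed))
    else:
        # String instances: length constraints.
        lengths = [len(v) for v in non_empty]
        schema = {"type": "string", "minLength": min(lengths), "maxLength": max(lengths)}
        distinct = sorted(set(non_empty))

    # Any instance type: enumerate the values only when there are few of them.
    if len(distinct) <= enum_threshold:
        schema["enum"] = distinct
    return schema
```

Per-column opt-in, as suggested above, would amount to calling a helper like this only for the columns the user selects.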
|
@jqnatividad for the |
and We can even have pre-baked, optimized regexes for certain date formats (ISO 8601/RFC 3339) and support the date defined formats. This really makes qsv's schema/validate combo very compelling! And yes, within reason, having options to fine-tune these settings would be great, but at the end of the day, the user can always manually tweak the resulting jsonschema, so having sensible defaults would be OK. Then again, docopt makes it easy to expose these defaults to the user, so I'll leave it to you. Awesome work as usual @mhuang74! Really excited by the POC even in its present state. |
Even For an integer data type, if |
@jqnatividad Here's another iteration of schema which leverages cmd::stats. Please let me know if the integration approach looks okay. Also, here's the current docopt. Please note:
|
Excellent work as usual @mhuang74 ! I like how you do full round-tripping test with One thing that just occurred to me, given how large and verbose the JSON schema and CSV files are for testing As for docopt, I agree with your choices, though you may want to also set Regardless, I'm merging #158 just in time for the weekly release. 😄 |
`schema`: add value constraints via stats. Continue iterating on #60 ...
Thanks @jqnatividad. One little PR before heading to bed. Wanted to make command description clearer. |
Hi @mhuang74 , I managed to get it to validate by removing Also, all the fields are added as |
Thanks @jqnatividad for catching this bug. All fields are required because all columns are expected in CSV. But required columns may still have no value (denoted by "null" in Type list). I am running on an old machine, but it's pretty clear that validate could use some performance boost.
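The distinction between a required column and a column that may hold no value can be sketched as follows. This is an illustrative Python example, not qsv's code: `build_schema` is a hypothetical name, and every column is typed as a string to keep the focus on `required` and `"null"`.

```python
import csv
import io


def build_schema(csv_text):
    """Build a minimal JSON Schema in which every CSV column is required,
    and columns with empty cells also allow "null" in their type list.

    Hypothetical helper for illustration only; not part of qsv.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = list(reader.fieldnames or [])

    # Track which columns contain at least one empty cell.
    has_empty = {name: False for name in columns}
    for row in reader:
        for name in columns:
            if not row[name]:
                has_empty[name] = True

    properties = {}
    for name in columns:
        types = ["string"]
        if has_empty[name]:
            types.append("null")  # the column is present but may have no value
        properties[name] = {"type": types}

    # Every column is expected in the CSV, so every field is required.
    return {"type": "object", "properties": properties, "required": columns}
```

For the input `"id,zip\n1,10001\n2,\n"`, both `id` and `zip` end up in `required`, but only `zip` gets `["string", "null"]` as its type list.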
|
Thanks for the quick fix @mhuang74 ... just merged #163. And thanks for clarifying Speaking of which, I haven't found a good, performant implementation of JSON Table Schema - hopefully, your work here can be the foundation of a de facto jsonschema.rs based implementation, which we can later adapt as per Stranger6667/jsonschema#339. As for performance, on my fairly recent Ryzen 7 4800H 2.9 GHz laptop with 32 GB, it's much faster even if I'm running it on an Oracle VirtualBox VM (Ubuntu 20.04 LTS allocated with 12 GB):
IMHO, But |
Not sure what is going on, but performance can be significantly different from my laptop (Intel Core i3 M 370 @ 2.40GHz w/ 2 cores but Hyper-threads to 4; 8 GB RAM). Now I am getting decent performance. Still using debug build. Opened #164 to improve validate performance. I will take this chance to play with the performance test suite and add validate to it.
|
Even though And yes, my numbers are using the release build. |
@jqnatividad I ran schema without
Release build was used for below. without
with
|
I suggest to make As for missing values, can't we leverage Running For Incident Zip, we get the following stats:
|
Thanks @jqnatividad. btw, I came across interesting documentation for doing the describe-extract-validate-transform steps with the tool from frictionless. |
OK @mhuang74 , so missing values is handled with And yes, big fan of the frictionless data project - as I'm heavily involved in open data and the CKAN project in particular :). However, most of their tooling is in python and not as performant... but if we can learn anything from the project, we should shamelessly borrow :) |
@jqnatividad Integrated grex with PR #168. Probably helpful in some scenarios, though it lacks the smarts to recognize common formats. Let me know what you think. |
Great @mhuang74! As for lack of smarts, it's certainly not for lack of great code :) I'm sure we can iterate and integrate some lightweight heuristics to discern common formats - perhaps a combination of column name rules (columns ending with date, dt, mail, etc.), precompiled regexes, etc. Will do some testing on my end and let you know... |
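A rough sketch of that kind of heuristic, combining column name rules with precompiled regexes, might look like the following. Everything here is an assumption made for illustration: the names `NAME_HINTS`, `FORMAT_PATTERNS` and `guess_format` are hypothetical, the patterns are deliberately simplistic, and none of it reflects what qsv actually ships.

```python
import re

# Column-name suffixes that hint at a format (assumed rules, for illustration).
NAME_HINTS = {
    "date": ("date", "dt"),
    "email": ("mail",),
}

# Precompiled, deliberately simple patterns for each hinted format.
FORMAT_PATTERNS = {
    "date": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
}


def guess_format(column_name, values):
    """Return a format name when the column name hints at it AND every
    non-empty value matches the corresponding pattern; otherwise None."""
    lowered = column_name.lower()
    non_empty = [v for v in values if v]
    for fmt, suffixes in NAME_HINTS.items():
        if lowered.endswith(suffixes):
            pattern = FORMAT_PATTERNS[fmt]
            if non_empty and all(pattern.fullmatch(v) for v in non_empty):
                return fmt
    return None
```

Requiring both the name hint and a full match on every value keeps the heuristic conservative: a column named `ExtractDate` holding `08/03/2022` is not labelled as a date.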
`schema` (#60): pattern constraint for string types. BTW, good call on the `enum` heuristic,
@mhuang74 we helped with a Smart City pilot last year, where there were a lot of data quality issues coming from the IoT sensors. I ended up using jsonschema there as well to validate the input files, but it was in Python, so it was nowhere near as performant as Interestingly, a lot of the validation errors were on the timestamp field, as different sensors were sending them in varying formats, with the format sometimes even changing for the same vendor/sensor as they tweaked the configuration. As you know, I used I'm thinking of extending In this way, WDYT? |
@jqnatividad I think this is definitely worth looking into. Maybe the adur public toilets dataset is not representative, but it has date stamps that fail the "date" constraint of JSON Schema. So in this case, dateparser may infer that the ExtractDate column is a date and add a date constraint to the schema, but the entire file would then fail validation? Adding format="date" to validate the adur public toilets dataset results in an error
expected date stamp format
schema used
|
@mhuang74 Yes. It SHOULD fail per the jsonschema spec, as the date is not in RFC 3339 format. But in the real world, folks often use their locale date format. Perhaps we should add a
so the adur public toilets dataset (which is representative of the real world) will not fail validation. One reason why I added the Still, some users may prefer to not transform dates to RFC3339 and this is where |
@jqnatividad Yes, I agree a switch to specify strictness of date format would be quite useful. My preference is to flip it around and have a I prefer to have the schema command do what's hard for me to do manually: scan all the values and derive something helpful (e.g. min, max, enum list, etc.). But if I want to enforce some rules that may or may not fit existing values, I would probably just hand-edit the schema file. But dates are so common that it may be nice to have a flag to simplify the process. If the If not, then only add And similarly for "datetime" and "time" types. But that's just my preference. I think either way would work well. |
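One way to picture the strictness trade-off discussed here is the sketch below. It is an assumption-laden illustration rather than either proposal as written: `is_rfc3339_date`, `date_property` and the `strict` parameter are hypothetical names, not qsv options. Without `strict`, the `"format": "date"` constraint is only emitted when every existing value already conforms to the RFC 3339 full-date form; with `strict`, it is emitted regardless, so non-conforming files would fail validation.

```python
import re
from datetime import date

# RFC 3339 full-date shape: YYYY-MM-DD with ASCII digits.
FULL_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def is_rfc3339_date(value):
    """True only for a well-formed, real calendar date in YYYY-MM-DD form."""
    if not FULL_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)  # rejects impossible dates such as 2022-02-30
    except ValueError:
        return False
    return True


def date_property(values, strict):
    """Build the schema property for a column already inferred to hold dates.

    `strict` is a hypothetical switch for illustration, not a qsv flag.
    """
    prop = {"type": "string"}
    non_empty = [v for v in values if v]
    conforms = bool(non_empty) and all(is_rfc3339_date(v) for v in non_empty)
    if strict or conforms:
        prop["format"] = "date"
    return prop
```

With locale-style values such as `08/03/2022`, the non-strict call leaves the column as a plain string, which is the behaviour that would let a real-world dataset pass validation.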
Hi @mhuang74 , I added support for inferring I decided to skip inferring Do you want to take on implementing |
Thanks @jqnatividad for merging PR #177. I noticed funkiness with WorkDir:
notice two sets of timestamps in this test dir
|
Thanks @mhuang74 for quickly implementing Just in time for the NYC Open Data Week presentation this Saturday 😄 Will be sure to send you a draft of the preso before then... https://nycsodata22.sched.com/ BTW, I noticed a big performance regression with Will fix it next week... |
and with Thanks heaps @mhuang74! |
Conference talk on qsv sounds exciting. Looking forward to your presentation @jqnatividad. Thanks for closing issue. |
Hi @mhuang74 , didn't get a chance to send it to you before the talk (was working on it till the last minute), but I linked it on the README... 😄 |
Presentation looks engaging. Too bad I can't see the actual demo!
Appreciate the mention.
|
`stats` does a great job of not only getting descriptive stats about a CSV, it also infers the data type. `frequency` compiles a frequency table.

The `schema` command will use the output of `stats`, and optionally `frequency` (to specify the valid range of a field), to create a json schema file that can be used with the `validate` command (#46) to validate a CSV against the generated schema.

With the combo addition of `schema` and `validate`, qsv can be used in a more bullet-proof automated data pipeline that can fail gracefully when there are data quality issues:

- `schema` to create a json schema from a representative CSV file for a feed
- `validate` at the beginning of a data pipeline and fail gracefully when `validate` fails
- `sample` to validate against a sample
- `partition` the CSV to break down the pipeline into smaller jobs