-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add validate
command
#46
Comments
Some good ideas for reference: |
@jqnatividad Please let me know if you have some thoughts on the primary usecases.
Here's 1st pass at docopt
|
Hi @mhuang74, Awesome! I'm keen to see The primary usecases are the ones you listed, but I think we should break it up into three commands.
Thoughts? |
As for the docopt format, it looks good, but we can tackle that once we agree on the general approach. However, with that said, I think this is where multi-threading can really make a big difference, as we can break up the validation into chunks as each row can be validated independently, so we should include the |
@jqnatividad here's my POC for Example run documented in Validate.md. One problems that I need help with. Progress bar would panic if csv file has unequal number of columns (utils.rs:117). I am not sure how to refactor this. Any thoughts?
works fine when progress bar is quieted
|
@mhuang74 thanks for this! The problem is the progress bar does a count ala The fix is relatively simple, just set And while we're on the progress bar, you may want to use the new I'll add comments to the PR for both changes... |
thanks @jqnatividad. Above problem resolved. Simple csv check integration tests passing now. I need to add another test to properly check for output files. Any suggestions on how to check for output files? |
Great! As for testing the output files, having a sample csv and corresponding, relatively complex jsonschema (the PublicToilets schema on csvlint seems interesting and relatively complex), with expected valid and invalid rows should be good enough. |
@jqnatividad Need some help understanding standards around json schema, and whether choice of jsonschema crate is the right approach. Looks like it is working for dependent project cjval validating CityJSON files. But it doesn't work for the Public Toilet schema. Looks like the Public Toilets json schema conforms to JSON Table Schema rather than the JSONSchema which jsonschema-rs crate assumes. I attempted to run the public toilet schema through jsonschema-rs for validation, and it did not pick up and apply the constraints. Specifically, the "keywordLocation" and "instanceLocation" are all blank. what jsonschema-rs outputs with example People schema from JSON Schema site
snippet what jsonschema-rs outputs with Public Toilet schema
|
@mhuang74 let's just create a test schema & a corresponding CSV we can use. jsonschema seems to be the most mature library we can leverage so let's stick with it (not to mention the same author wrote the infers-jsonschema crate that we can leverage for the Although JSON Table Schema is a specialized standard for tabular data (read CSVs), there seems to be no other crate we can leverage for the non-trivial validation task. But since JSON Table Schema is a subset of the IETF JSON Schema standard, it should be good enough. What about creating a sample schema using the sample NYC 311 data that's used in qsv's benchmark script? It has different data types and opportunities for exercising validation tasks that jsonschema can do; it's a good showcase example; and we can baseline |
Thanks @jqnatividad. They all require hand crafting schema file, so I stuck with public toilet dataset. Integration test uses real data and illustrates invalid row showing up in output.invalid file. |
Public toilet it is! :) Lemme know when you want me to merge the PR. Convert it to draft in the meantime? |
Thanks @jqnatividad. I think it's ready for beta release. What do you think? |
Yep! Prepping for 0.3.0 release... :) Did some minor tweaks though - #148 |
Thanks for the fixups @jqnatividad . Good catch on progress bar. And just realized that named parameters are supported. |
@mhuang74 and 0.3.0 featuring However, I had to limit cross-compilation for now, so binaries for certain platforms are not published (c54a2f2). Good news is that I managed to trace the problem to jsonschema, and the fix has been merged (Stranger6667/jsonschema#336), and we'll be able to publish binaries for those platforms again in the next release. Thanks heaps again for your contributions!!! Are you up to taking on |
Thanks for all your help with tidying up the code @jqnatividad. One more bug fix PR for validate. PR #151 Sure, this has been great for my Rust learning. Happy to pick up schema (#60) next. |
Checks a CSV against a jsonschema file.
The jsonschema file can be located on the filesystem or a URL.
The text was updated successfully, but these errors were encountered: