Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC]validators: add yaml parsing based on suffix #754

Closed
wants to merge 1 commit into from

Conversation

aparcar
Copy link

@aparcar aparcar commented Oct 24, 2020

"JSON Schema" schemas are not only written in JSON but also in YAML, a
popular example is the Linux Kernel0. While it is possible to convert
any input file from YAML to JSON on the fly1, the automatic resolving
of schema references is not possible.

To allow YAML files for the CLI tool of jsonschema, check the file
suffix and parse the referenced file as YAML if it ends on yaml or
yml. In case the Python library pyyaml is not installed or any other
suffix is found, parsing defaults to JSON as before.

Signed-off-by: Paul Spooren [email protected]

@aparcar aparcar force-pushed the yaml branch 4 times, most recently from 18b6e58 to b039d29 Compare October 24, 2020 22:05
"JSON Schema" schemas are not only written in JSON but also in YAML, a
popular example is the Linux Kernel[0]. While it is possible to convert
any input file from YAML to JSON on the fly[1], the automatic resolving
of schema references is not possible.

To allow YAML files for the CLI tool of jsonschema, check the file
suffix and parse the referenced file as YAML if it ends on `yaml` or
`yml`. In case the Python library `pyyaml` is not installed or any other
suffix is found, parsing defaults to JSON as before.

[0]: https://github.com/torvalds/linux/tree/master/Documentation/devicetree/bindings
[1]: python-jsonschema#582 (comment)

Signed-off-by: Paul Spooren <[email protected]>
@Julian
Copy link
Member

Julian commented Oct 24, 2020

Hi! Thanks! Have you seen python-jsonschema/referencing#4? If not have a look.

I do recognize this is likely reasonably common enough definitely -- think there should be a generic solution here though.

(Especially since as-is today you could do just YAML support with a requests handler I believe.)

@Julian
Copy link
Member

Julian commented Oct 24, 2020

Also, to be clear, are you trying to add support just to the jsonschema CLI? Or to all $refs?

@aparcar
Copy link
Author

aparcar commented Oct 24, 2020

Hi! Thanks! Have you seen python-jsonschema/referencing#4? If not have a look.

Thanks, I skimmed through it and various of the links. I couldn't come up with a clear message from this thread. Neither yaml+ nor yaml:// looked appealing to me. I though the lowest hanging fruit is to trust suffixes. Whoever decides to add arbitrary suffixes to schema files should use JSON or create their own handlers. Whoever uses a yaml suffix but puts JSON in it, actively requests trouble.

Using the suffix should work for both local and remote files. I'd just expect that all files ending on yaml and yml are actually contain YAML.

Also, to be clear, are you trying to add support just to the jsonschema CLI? Or to all $refs?

I want to run jsonschema input.json root-schema.yaml and make jsonschema resolve all references automatically. Adding support for input.yaml would be another PR.

@Julian
Copy link
Member

Julian commented Oct 24, 2020

Neither yaml+ nor yaml:// looked appealing to me. I though the lowest hanging fruit is to trust suffixes. Whoever decides to add arbitrary suffixes to schema files should use JSON or create their own handlers. Whoever uses a yaml suffix but puts JSON in it, actively requests trouble.

I don't feel comfortable with this -- there are many reasons why trusting file suffixes isn't a good idea. The shortest summary of them is simply "In the face of ambiguity refuse the temptation to guess".

Someone may have a URL they don't want to break which served some JSON content previously and now serves YAML (and doing so would be backwards compatible for them say if they always used YAML to deserialize its contents). That's a crazy example I can think of in 3 minutes, which makes there be 20 other reasons I can't think of why someone may do this :)

So if/when this feature is added it definitely should be via explicit declaration of what file format is to be parsed. On the command line I'm still not very convinced that adding support for YAML is any better than saying "use yaml2json on your yaml first". I think someone mentioned "Windows" as a reason that wasn't easy for them, but yeah I don't want to introduce non-determinism in places that are currently deterministic.

@aparcar
Copy link
Author

aparcar commented Oct 25, 2020

First of, you're the expert here and jsonschema seems to be used by 5.000 others (based on GitHub), so if you say things will go bad, I believe it. Do you have a solution in mind or would you say this is a downstream issue?

Looking at dt-schema, they seem to have reinvented some skills of jsonparser to cope with YAML in the Kernel.

I don't feel comfortable with this -- there are many reasons why trusting file suffixes isn't a good idea. The shortest summary of them is simply "In the face of ambiguity refuse the temptation to guess".

True, a suffix like schema is really bad to parse, especially as the YAML parser even accepts JSON input to some degree:

>>> input = '{ "foo": "bar" }'
>>> import yaml
>>> import json
>>> yaml.safe_load(input)
{'foo': 'bar'}
>>> json.loads(input)
{'foo': 'bar'}
>>> 

So fallback/guessing seems in-fact unfeasible.

Adding a URL prefix to determine the file format however seems like reinventing suffixes? Also this would cause some jsonschema specific format. Would you suggest to raise this issue to the JSON Schema draft people?

So if/when this feature is added it definitely should be via explicit declaration of what file format is to be parsed. On the command line I'm still not very convinced that adding support for YAML is any better than saying "use yaml2json on your yaml first". I think someone mentioned "Windows" as a reason that wasn't easy for them, but yeah I don't want to introduce non-determinism in places that are currently deterministic.

It's not about a single input file but about automatic reference resolving. Say I have a $ref pointing to file:bootloader.yaml, then this file needs to go through some yaml2json pipe as well, which doesn't work. My workaround is to blindly load all YAML files of a folder and add them to a store, however on demand like jsonschema does it internally seems cleaner.

@Julian
Copy link
Member

Julian commented Oct 26, 2020

It's not about a single input file but about automatic reference resolving. Say I have a $ref pointing to file:bootloader.yaml, then this file needs to go through some yaml2json pipe as well,

OK cool that's exactly what I was asking by whether you just wanted it for the CLI or whether you wanted RefResolvers to support YAML -- sounds like you do indeed want the latter, which as you say is more involved.

especially as the YAML parser even accepts JSON input to some degree:

YAML is a superset of JSON :) -- so all JSON is valid YAML.

Adding a URL prefix to determine the file format however seems like reinventing suffixes?

The thing is that suffixes are already not meaningful -- in a URL (or a file path for that matter) they don't carry any required semantic meaning, particularly on *nix OSes. Heuristically sure, to a human, a file extension might mean something, but someone is free to add some "wrong" suffix to something for any reason they need/want to. Someone who knows more about URLs than I do would probably say something like that paths conflate where a resource lives with what kind of content it contains -- if you have a path with suffix, the suffix cannot really mean anything, otherwise you couldn't change what kind of content it contained without being unable to preserve where it lived. So you need some separate location than the path itself to put metadata about what it contains. On the web that's like the Content-Type header.

Adding a URL prefix to determine the file format however seems like reinventing suffixes? Also this would cause some jsonschema specific format. Would you suggest to raise this issue to the JSON Schema draft people?

I do indeed think this is a general problem, one that someone should have solved before (I think that was the gist of the comment I made on python-jsonschema/referencing#4). The general form of the solution I'd expect personally is a way to bundle together an HTTP client (say requests) with a way to specify how to load objects from URLs -- and you could do so by say, knowing how to deal with Content-Type headers from the server (so if the Content-Type was text/yaml you send it through the YAML parser).

Or you allow the developer (in our case schema author) to explicitly tell the HTTP client what deserialization method to use from within the URL (that was the yaml+https://... brainstorming). Because now you the person fetching the URL can explicitly indicate which deserializer to use. And of course if one doesn't exist for the format you presumably want a way to mount handlers for it, etc.

So yeah the above is the "structure" I'd assume a solution would come in, the issue is finding something that does this already because yeah we definitely aren't the first to want to do this I suspect...

@@ -837,18 +837,29 @@ def resolve_remote(self, uri):
except ImportError:
requests = None

try:
import yaml
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be delayed even later so we don't add overhead if yaml is not being used?

Perhaps json importing could do the same and save some startup time.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not think this is needed. A failed import is very fast, and even a successful one.

@robherring
Copy link
Contributor

I maintain the above mentioned dt-schema. While not having to implement the YAML refs myself would be nice, I do worry about the performance. Once you get to sufficiently large schemas, you may find yourself moving back to JSON as YAML parsing is slow. We did just that recently for some intermediate files. Module import time also is a bottleneck. Also, the schema are processed before being applied, so there's more to our RefResolver than just parsing YAML files.

@aparcar
Copy link
Author

aparcar commented Oct 27, 2020

While not having to implement the YAML refs myself would be nice, I do worry about the performance

From my understanding jsonschema uses a store for references to keep references from being parsed twice. Does dt-schema the same?

Once you get to sufficiently large schemas, you may find yourself moving back to JSON as YAML parsing is slow

I saw you're using ruamel.yaml rather than pyyaml, would you recommend that for this PR?

Module import time also is a bottleneck.

What about importing the file once at the beginning and storing the state? That way it's not "retried" on each appearing reference.

Also, the schema are processed before being applied, so there's more to our RefResolver than just parsing YAML files.

Could you please elaborate on that?

@robherring
Copy link
Contributor

While not having to implement the YAML refs myself would be nice, I do worry about the performance

From my understanding jsonschema uses a store for references to keep references from being parsed twice. Does dt-schema the same?

Yes, all the processed schema are loaded up front. It was a 3-4x speed up switching to json.

Once you get to sufficiently large schemas, you may find yourself moving back to JSON as YAML parsing is slow

I saw you're using ruamel.yaml rather than pyyaml, would you recommend that for this PR?

The primary reason for using ruamel.yaml was having line numbers.

If you pick one and a user picks the other one, then you're doing 2 YAML module imports rather than 1.

Module import time also is a bottleneck.

What about importing the file once at the beginning and storing the state? That way it's not "retried" on each appearing reference.

I was referring to python module import. Typically that's startup time, but can be on demand as you did. The only way to really improve that not doing 1 python call per data file.

Also, the schema are processed before being applied, so there's more to our RefResolver than just parsing YAML files.

Could you please elaborate on that?

It really stems from details of Devicetree in that the properties are essentially untyped and unsized in the instance data. Everything ends up as an array or matrix. The only type information is in the schemas. A schema may say 'foo: { const: 123 }', but the data will be 'foo: [[123]]'. In order to not have a ton of boillerplate, the processing transforms the schema into 'foo: { items: [ { items: [ {const: 123} ] } ] }. That's a simplified example. It's pretty ugly...

@tomkralidis
Copy link

Hi: any update on this one? We have the same requirements in numerous OGC API standards. Testing this PR locally seems to work as expected.

@Julian
Copy link
Member

Julian commented Jun 21, 2021

I'm not comfortable with a heuristic based approach here.

Happy to review a PR where deserealization formats are explicitly indicated, or to brainstorm what such an interface would look like.

@Julian Julian closed this Jun 21, 2021
@tomkralidis
Copy link

Not sure if this issue has since been resolved? fwiw I've done an updated/local patch of this PR at https://github.com/python-jsonschema/jsonschema/compare/main...tomkralidis:ref-yaml?expand=1 based on current main, which attempts through try/except as opposed to filename extension, if there's any interest.

@Julian
Copy link
Member

Julian commented Apr 13, 2022

I'd still want something explicit, rather than implicit, including via a fallback. I don't think I've spent any additional thinking myself on a design, but I'm still happy to review one if you have a proposal -- just as I say, explicit.

In other words, the author of a schema must be specifying in some explicit way that the document at that URL is a YAML document, it cannot be done in an implicitly guessed way.

Any design meeting that though I'm open to discuss!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants