[RFC]validators: add yaml parsing based on suffix #754
Conversation
(force-pushed from 18b6e58 to b039d29)
"JSON Schema" schemas are not only written in JSON but also in YAML; a popular example is the Linux kernel[0]. While it is possible to convert any input file from YAML to JSON on the fly[1], the automatic resolving of schema references is not. To allow YAML files in the jsonschema CLI tool, check the file suffix and parse the referenced file as YAML if it ends in `yaml` or `yml`. If the Python library `pyyaml` is not installed or any other suffix is found, parsing defaults to JSON as before. [0]: https://github.com/torvalds/linux/tree/master/Documentation/devicetree/bindings [1]: python-jsonschema#582 (comment) Signed-off-by: Paul Spooren <[email protected]>
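The suffix check described in the commit message can be sketched roughly like this. This is a hypothetical helper for illustration, not the actual patch; the `load_schema` name and signature are assumptions:

```python
import json
from urllib.parse import urlparse

try:
    import yaml  # PyYAML; optional dependency, as in the PR description
except ImportError:
    yaml = None


def load_schema(location, text):
    """Parse *text* as YAML when *location* ends in .yaml/.yml and PyYAML
    is available; otherwise fall back to JSON, as before.

    Hypothetical sketch of the PR's suffix-based dispatch.
    """
    suffix = urlparse(location).path.rsplit(".", 1)[-1].lower()
    if yaml is not None and suffix in ("yaml", "yml"):
        return yaml.safe_load(text)
    return json.loads(text)
```

Because every JSON document is also valid YAML, the JSON fallback keeps `.yml` references working even when PyYAML is missing, as long as their content is actually JSON.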
Hi! Thanks! Have you seen python-jsonschema/referencing#4? If not, have a look. I do recognize this is likely a reasonably common need -- I think there should be a generic solution here though. (Especially since, as-is today, you could add just YAML support with a requests handler, I believe.)
Also, to be clear, are you trying to add support just to the jsonschema CLI? Or to all $refs?
Thanks, I skimmed through it and several of the links, but I couldn't extract a clear conclusion from that thread. Using the suffix should work for both local and remote files. I'd just expect that all files ending in `.yaml` or `.yml` are parsed as YAML.
I want to run
I don't feel comfortable with this -- there are many reasons why trusting file suffixes isn't a good idea. The shortest summary of them is simply "In the face of ambiguity, refuse the temptation to guess". Someone may have a URL they don't want to break which previously served JSON content and now serves YAML (and doing so would be backwards compatible for them, say, if they always used YAML to deserialize its contents). That's a crazy example I came up with in 3 minutes, which means there are probably 20 other reasons I can't think of why someone might do this :) So if/when this feature is added, it should definitely be via explicit declaration of the file format to be parsed. On the command line I'm still not very convinced that adding YAML support is any better than saying "use yaml2json on your YAML first". I think someone mentioned "Windows" as a reason that wasn't easy for them, but yeah, I don't want to introduce non-determinism in places that are currently deterministic.
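The "explicit declaration" the maintainer asks for could take the shape of a CLI flag that names the format outright instead of guessing from the suffix. The `--format` option below is purely hypothetical and is not part of the actual jsonschema CLI:

```python
import argparse


def parse_args(argv):
    """Sketch of an explicit format declaration on the command line.

    The user states the deserialization format; the tool never guesses
    from the file suffix. Hypothetical interface, for discussion only.
    """
    parser = argparse.ArgumentParser(prog="jsonschema")
    parser.add_argument("schema", help="path or URL of the schema")
    parser.add_argument(
        "--format",
        choices=("json", "yaml"),
        default="json",
        help="how to deserialize the schema (default: json)",
    )
    return parser.parse_args(argv)
```

With this shape, `jsonschema schema.yml` would still be parsed as JSON unless the user passes `--format yaml` explicitly, keeping current behavior deterministic.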
First off, you're the expert here. Looking at dt-schema, they seem to have reinvented some of this functionality themselves.
True, a suffix alone doesn't guarantee the content, and any JSON document also parses as YAML:

```python
>>> input = '{ "foo": "bar" }'
>>> import yaml
>>> import json
>>> yaml.safe_load(input)
{'foo': 'bar'}
>>> json.loads(input)
{'foo': 'bar'}
```

So fallback/guessing seems in fact infeasible. Adding a URL prefix to determine the file format, however, seems like reinventing suffixes? Also this would cause some
It's not about a single input file but about automatic reference resolving. Say I have a
OK cool, that's exactly what I was asking about: whether you just wanted it for the CLI or whether you wanted it for all $refs.
YAML is a superset of JSON :) -- so all JSON is valid YAML.
The thing is that suffixes are already not meaningful -- in a URL (or a file path, for that matter) they don't carry any required semantic meaning, particularly on *nix OSes. Heuristically, sure, a file extension might mean something to a human, but someone is free to attach a "wrong" suffix to something for any reason they need or want. Someone who knows more about URLs than I do would probably say that paths conflate where a resource lives with what kind of content it contains -- if a path has a suffix, the suffix cannot really mean anything, otherwise you couldn't change the kind of content without also changing where the resource lives. So you need some location separate from the path itself to put metadata about what it contains. On the web that's the Content-Type header.
I do indeed think this is a general problem, one that someone should have solved before (I think that was the gist of the comment I made on python-jsonschema/referencing#4). The general form of the solution I'd personally expect is a way to bundle together an HTTP client (say requests) with a way to specify how to load objects from URLs -- say by knowing how to deal with Content-Type headers from the server (so if the Content-Type indicated YAML, the body would be parsed as YAML). Or you allow the developer (in our case the schema author) to explicitly tell the HTTP client what deserialization method to use from within the URL. So yeah, the above is the "structure" I'd assume a solution would come in; the issue is finding something that does this already, because we definitely aren't the first to want this, I suspect.
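The Content-Type-driven dispatch sketched in that comment could look like the following. The registry and `deserialize` helper are illustrative assumptions, not an existing jsonschema or referencing API:

```python
import json

# Map of media types to deserializers. An application could register
# additional entries explicitly, e.g. a YAML loader when PyYAML is
# available -- nothing is ever guessed from the URL.
DESERIALIZERS = {
    "application/json": json.loads,
    # "application/yaml": yaml.safe_load,  # registered explicitly if desired
}


def deserialize(content_type, body):
    """Pick a parser from the response's Content-Type header.

    Hypothetical sketch of bundling an HTTP client with explicit
    per-media-type loaders; raises rather than guessing on unknown types.
    """
    media_type = content_type.split(";")[0].strip().lower()
    try:
        loader = DESERIALIZERS[media_type]
    except KeyError:
        raise ValueError("no deserializer registered for %r" % media_type)
    return loader(body)
```

A retrieval hook would call `deserialize(response.headers["Content-Type"], response.text)` after fetching, keeping format selection explicit and server-declared.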
```
@@ -837,18 +837,29 @@ def resolve_remote(self, uri):
    except ImportError:
        requests = None

    try:
        import yaml
```
Can this be delayed even later so we don't add overhead if yaml is not being used?
Perhaps the `json` import could do the same and save some startup time.
I do not think this is needed. A failed import is very fast, and so is a successful one.
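For reference, the deferred import under discussion could be pushed down into the code path that actually needs it; since Python caches modules in `sys.modules`, the cost is paid only once. This is a hypothetical placement sketch, not the PR's actual diff:

```python
import json


def parse_remote_text(uri, text):
    """Parse a fetched document, importing PyYAML only when a YAML
    document is actually encountered.

    Hypothetical helper mirroring the review suggestion; the suffix
    check and JSON fallback follow the PR description.
    """
    if uri.endswith((".yaml", ".yml")):
        try:
            import yaml  # deferred: no overhead for JSON-only runs
        except ImportError:
            return json.loads(text)  # pyyaml missing: fall back to JSON
        return yaml.safe_load(text)
    return json.loads(text)
```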
I maintain the above-mentioned dt-schema. While not having to implement the YAML refs myself would be nice, I do worry about performance. Once you get to sufficiently large schemas, you may find yourself moving back to JSON, as YAML parsing is slow. We did just that recently for some intermediate files. Module import time is also a bottleneck. Also, the schemas are processed before being applied, so there's more to our RefResolver than just parsing YAML files.
From my understanding
I saw you're using
What about importing the file once at the beginning and storing the state? That way it's not "retried" for each reference that appears.
Could you please elaborate on that?
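The "parse once and store the state" idea could be sketched as a cached loader, so every `$ref` pointing at the same file reuses one parsed result. This is purely illustrative (shown with JSON so the sketch is self-contained; a YAML loader would slot in the same way):

```python
import json
from functools import lru_cache


@lru_cache(maxsize=None)
def load_document(path):
    """Parse each referenced file at most once per process.

    Repeated references to the same path return the cached object
    instead of re-reading and re-parsing the file. Hypothetical cache
    layer, not jsonschema's actual resolver.
    """
    with open(path) as handle:
        return json.load(handle)
```

One caveat with this shape: the cached value is a shared mutable object, so callers must treat it as read-only (or deep-copy before modifying).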
Yes, all the processed schemas are loaded up front. It was a 3-4x speed-up switching to JSON.
The primary reason for using ruamel.yaml was having line numbers. If you pick one and a user picks the other, then you're doing two YAML module imports rather than one.
I was referring to Python module import. Typically that's startup time, but it can happen on demand, as you did. The only way to really improve that is to avoid one Python invocation per data file.
It really stems from details of Devicetree, in that the properties are essentially untyped and unsized in the instance data. Everything ends up as an array or matrix. The only type information is in the schemas. A schema may say 'foo: { const: 123 }', but the data will be 'foo: [[123]]'. In order to avoid a ton of boilerplate, the processing transforms the schema into 'foo: { items: [ { items: [ { const: 123 } ] } ] }'. That's a simplified example. It's pretty ugly...
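The transformation described above (wrapping a scalar constraint in nested `items` so it matches matrix-shaped instance data) can be illustrated with a toy helper. This is a simplified reimplementation of the simplified example, not dt-schema's actual code:

```python
def wrap_constraint(constraint, depth=2):
    """Wrap a scalar schema constraint in *depth* levels of 'items'.

    Turns {'const': 123} into
    {'items': [{'items': [{'const': 123}]}]}, so a schema written
    against a scalar validates data shaped like [[123]].
    Illustrative sketch only.
    """
    for _ in range(depth):
        constraint = {"items": [constraint]}
    return constraint
```

With draft-07 array-form `items`, each level validates one dimension of the matrix, which is why a two-dimensional `[[123]]` needs two wrapping levels.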
Hi: any update on this one? We have the same requirement in numerous OGC API standards. Testing this PR locally, it seems to work as expected.
I'm not comfortable with a heuristic-based approach here. Happy to review a PR where deserialization formats are explicitly indicated, or to brainstorm what such an interface would look like.
Not sure if this issue has since been resolved? FWIW I've done an updated/local patch of this PR at https://github.com/python-jsonschema/jsonschema/compare/main...tomkralidis:ref-yaml?expand=1 based on current main, which attempts through
I'd still want something explicit rather than implicit, including via a fallback. I haven't spent any additional time thinking about a design myself, but I'm still happy to review one if you have a proposal -- just, as I say, explicit. In other words, the author of a schema must specify in some explicit way that the document at a given URL is a YAML document; it cannot be guessed implicitly. Any design meeting that, though, I'm open to discuss!