Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RefResolver (or its successor) should allow customizing deserialization, not only scheme / retrieval #4

Closed
martingkelly opened this issue Jun 4, 2018 · 26 comments
Labels
enhancement New feature or request

Comments

@martingkelly
Copy link

I'm using currently handling schemas written in YAML (for human-readability) but validated by jsonschema. I parse the schema in YAML (and make sure I subset my YAML to stay JSON-compatible) and then run the parsed dictionary through jsonschema. It is working nicely with one issue: When I want to use file:// includes, the files are implicitly assumed to be json.

@Julian I would like to add support for these files being YAML and am wondering if this is something you would support. If so, what is your recommendation for the cleanest way to add it? If it's not something you support, what would you recommend as the cleanest way to add the support for my own code without hacking/forking jsonschema? I thought about adding custom resolver as is done in python-jsonschema/jsonschema#225, but it sounds like RefResolver is not intended to be API-stable, so subclassing is liable to break in the future. Is there another way?

Thanks,
Martin

@martingkelly
Copy link
Author

Upon further code reading, I found the handlers argument to RefResolver. If I attach a yaml:// handler (or similar), I can make this work. That said, it feels hacky, in that yaml:// is not really a location (like file:// or http://) and it would be better to decouple the file format (YAML/JSON) from the location (including the logic for handling base URIs and $id).

@robherring
Copy link

@martingkelly I'm doing the same thing with YAML and have $ref's working. See here: https://github.com/robherring/yaml-bindings/blob/master/dtschema.py#L212

@martingkelly
Copy link
Author

Thanks @robherring, I ended up doing the same thing. I think it's not ideal in that by implementing your own handler, you lose the more general caching logic, and you lose the ability to do JSON http... ideally the document format and transport would be better decoupled.

Btw, I really like your device-tree YAML project; I think it's an excellent idea :).

@handrews
Copy link

handrews commented Jun 5, 2018

You can see one approach to this problem in JavaScript in the AutoExtensionDereferencer function of JSON Schema Tools. Although the "dereferencer" part is not necessarily relevant here, just the automatic file extension search approach, which also understands overriding the URI scheme to load from a file as if it were a file:// URI.

@martingkelly
Copy link
Author

Yes, something like that would work. @Julian any thoughts?

@Julian
Copy link
Member

Julian commented Jun 7, 2018

Have to respond in a bit more length -- something a bit more general than that (that doesn't use file extensions, just fully allows someone to inject in what to deserialize to what) is what I'd lean to, though this also strikes me as the sort of thing I wish existed in an external spec -- i.e. within HTTP, if you were able to say yaml+<URL> and that'd mean "what lives there at that URL should be treated as YAML".

To some extent that's Accept / Content-Type, but it's not used that way by client libraries right, so something that says "I'm JSON" -- there's no HTTP client (for Python) that will then say "OK I'll automatically know to call json.loads on that body" that I know of (and to allow you to basically hook into any possible Content-Type value).

@martingkelly
Copy link
Author

I agree with that ideally being in spec but don't know enough to say where that would fit in, if at all. @handrews any opinion on integrating this sort of thing into spec?

@Julian
Copy link
Member

Julian commented Jun 7, 2018 via email

@Julian
Copy link
Member

Julian commented Jun 7, 2018 via email

@martingkelly
Copy link
Author

I realize that would be more ideal, but I'm guessing it's out of scope and would take years to modify HTTP or URI specs. I'm wondering if there's any place for jsonschema to graft something in, even if not ideal.

@handrews
Copy link

handrews commented Jun 9, 2018

@martingkelly it really does not fit in the JSON Schema spec. For one thing, it's much better to allow implementations flexibility in how they manage this. Some implementations will be able to support rather complex loading/parsing features, others will be resource- or performance-constrained and will only want to support JSON.

Additionally, this is a pretty easy problem to solve outside of implementations- write your schemas in whatever you want, and have a build step that converts them to JSON.

@martingkelly
Copy link
Author

Additionally, this is a pretty easy problem to solve outside of implementations- write your schemas in whatever you want, and have a build step that converts them to JSON.

Although a build step that converts JSON to YAML is technically possible, I don't think it's a great solution, as all the file references have to somehow be faked out and turned into JSON files temporarily, or you have to copy the build directory somewhere else and convert the files into JSON (but keep the file extension of .yaml even though the file is JSON). If would much prefer the code itself natively supporting YAML. I can add this as a jsonschema extension using a handler, but it has the drawbacks I mentioned above.

@Julian
Copy link
Member

Julian commented Jun 11, 2018

Still haven't gotten a chance to elaborate here, but just on that last message -- just in case you weren't aware, all JSON is essentially valid YAML.

@martingkelly
Copy link
Author

I do know that :). In this case, I'm writing schemas in YAML (json-compatible subset) and parsing them with jsonschema. I like this approach because I find YAML much more readable than JSON, but I find jsonschema to be a useful tool.

@handrews
Copy link

@martingkelly why would you try to do anything other than edit the files in YAML? The build step would happen before you try to load anything, so you shouldn't need to refer directly to the YAML paths.

@martingkelly
Copy link
Author

My point is just that I would then have to (non-intuitively) write schema.json in my file instead of schema.yaml, even though I am referring to schema.yaml. This is a "gotcha" for schema usability that I'm hesitant to introduce. In addition, I'm not fond of requiring a build step just to get around jsonschema not parsing YAML... when possible, I like to be able to use libraries as they are without adding my own required wrappers. Requiring a build step is another "gotcha" for those dealing with schemas I write.

@handrews
Copy link

My point is just that I would then have to (non-intuitively) write schema.json in my file instead of schema.yaml, even though I am referring to schema.yaml

No, you're referring to the post-build-step schema.json, so it makes sense. Or you can just not use extensions, which is how it should be if you're doing hypermedia anyway because content negotiation.

Although obviously I agree with you on supporting loading from other formats as that's why I wrote that part of @cloudflare/json-schema-transform and packaged it in @cloudflare/json-schema-ref-loader (although those packages have weird restrictions around $ref for historical reasons, which I hope to fix one day).

@martingkelly
Copy link
Author

No, you're referring to the post-build-step schema.json, so it makes sense. Or you can just not use extensions, which is how it should be if you're doing hypermedia anyway because content negotiation.

This is one of those "it makes sense if you think about it the right way" things :). Yes, I understand your rationale, but I think this would be confusing for newcomers to my project, and I prefer to keep it as simple as I can.

It sounds like we all agree that it would be nice to support more underlying formats than just json... the question is where to add that support, since it sounds like it's out of spec for the jsonschema spec itself, and trying to change something in the URI spec would take forever and likely fail. It seems to me we are stuck with unofficial extensions to this particular library.

@handrews
Copy link

The spec already makes a point of saying that it interprets instances according to a data model rather than as JSON text. This holds true when the schema is an instance (with respect to the meta-schema).

So the spec already does not care how you get something into the data model, although it only defines the mapping for JSON.

What the spec avoids doing is placing requirements on such support. There are many reasons that an implementation may not want to support other formats. JSON was designed to be simple and fast to parse (which, as I understand it, is why there are some weird corner cases like duplicate properties when parsing- leaving them undefined allows a faster implementation than requiring detecting the error and taking action on it).

For that matter, an Internet of Things implementation might not even load JSON, it might only load CBOR.

@martingkelly
Copy link
Author

I understand JSON schema is not unique to JSON (despite the name), but since it's not specified how a parser should interpret a given $ref, each library will invent their own syntax for specifying content type. That seems non-ideal to me.

@martingkelly
Copy link
Author

I stupidly didn't see contentMediaType before in the spec, but what if we check for that key and, if specified, assume that a given $ref is not JSON but is instead the specified media type?

@handrews
Copy link

contentMediaType is for media types encoded as JSON strings, so I don't think it helps here?

but since it's not specified how a parser should interpret a given $ref, each library will invent their own syntax for specifying content type. That seems non-ideal to me.

Per section 8.3.1, $ref values are identifiers, and do not say anything at all about how the identified schema is loaded.

If it is being loaded over a network, then the network protocol (probably HTTP or CoAP) should be used to understand and handle the available media type(s). There is no need for JSON Schema to say anything about that process- it is how the web works already. For something like file:// URIs, while it is less clear how to detect and manage media types, that problem is not at all specific to JSON Schema.

If it has been pre-loaded (which is a common approach, as it separates parsing concerns from referencing concerns), then presumably it was parsed into an in-memory representation already and exactly how that happens and what formats are supported is intentionally left to implementations. Different implementations serve different needs, and can balance flexibility vs performance/resource consumption.

@handrews
Copy link

The idea of treating $ref as an RFC 8288 web link and allowing target hints has come up at least once before (with OpenAPI), but in a somewhat different context. There is https://github.com/json-schema-org/json-schema-spec/issues/141, which gets into talking about the type of reference/pointer targets (initially just for JSON Pointer, but the discussion expanded along the way). That might be a place to look if you want to advocate for something like this in the spec. We shouldn't hijack @Julian's project for spec proposals :-)

@martingkelly
Copy link
Author

Yes, in my case I'm dealing with only local file references, no network at all. I think having some way to specify a mime-type or similar would be useful... I frankly don't care if it's in the spec or not as long as it solves my problem :). It's perhaps hacky, but having an optional $mime-type key in parallel with $ref would solve the problem. You could then pair a parser to a mime type.

@Julian Julian changed the title Handling YAML $refs RefResolver 9 Jul 6, 2018
@Julian Julian changed the title RefResolver 9 RefResolver (or its successor) should allow customizing deserialization, not only scheme / retrieval Jul 6, 2018
@ghost
Copy link

ghost commented Aug 6, 2018

A very much useful step forward would be to pass the referrer document to the handler. Right now nothing is happening with it (PyCharm shows no usage beyond it being stored on the object).

But if you resolve relative file references, as one does in for instance OpenAPI specifications, then you need a source document to translate relativity. Right now, I either need a pre-process step that resolves all the references to absolute file URLs or patch jsonschema. Triggered by this bug.

Julian referenced this issue in python-jsonschema/jsonschema Aug 13, 2020
86f52b87 Fix a clear copy-paste error in the case names for tests from #394.
ec18a7d0 Merge pull request #360 from notEthan/duplicate_uris
cd9a4b9d change schemas with duplicate URIs http://localhost:1234/folder/
43e190e0 Merge pull request #394 from rjmill/rjmill/bools-and-1s-and-0s-inside-objects-and-arrays
85f0d459 Merge pull request #419 from ChALkeR/chalker/format-uri
54436216 Merge pull request #420 from ChALkeR/chalker/format/ip6
ad47b726 Add long valid and invalid ipv6
b2ab70ec More optional ipv6 tests
37189ae6 Test that uri format scheme is validated for allowed chars
2106ed17 backport uniqueItems cases to draft3
49fc2a95 backport const test cases to draft6
79fe60c0 more tests for true != 1 and false != 0 inside objects/arrays

git-subtree-dir: json
git-subtree-split: 86f52b87e3d572b8f808dd074a6b95c3695ba992
@Julian Julian transferred this issue from python-jsonschema/jsonschema Dec 5, 2022
@Julian
Copy link
Member

Julian commented Feb 23, 2023

Hello there!

This, along with many many other $ref-related issues, is now finally being handled in python-jsonschema/jsonschema#1049 with the introduction of this new referencing library which is fully compliant and has APIs which I hope are a lot easier to understand and customize. (At some point recently I'd transferred this issue to the issue tracker for referencing, in case that wasn't clear).

The next release of jsonschema (v4.18.0) will contain a merged version of that PR, and should be released shortly in beta, and followed quickly by a regular release, assuming no critical issues are reported.

Dereferencing YAML, TOML, or whatever else is definitely doable using referencing. If you still care to, I'd love it if you tried out the beta once it is released, or certainly it'd be hugely helpful to immediately install the branch containing this work (https://github.com/python-jsonschema/jsonschema/tree/referencing) and confirm. You can in the interim find documentation for the change in a preview page here. A section specifically on resolving YAML is here and can be combined of course with the example shown there on retrieving remote resources over HTTP if so desired.

I'm going to close this given it indeed seems like it is addressed by python-jsonschema/jsonschema#1049, but feel free to follow up with any comments. Sorry for the delay in getting to these, but hopefully this new release will bring lots of benefit!

@Julian Julian closed this as completed Feb 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants