
Could my "Son" project be useful to JSON Schema? #274

Closed
seagreen opened this issue Mar 18, 2017 · 14 comments

@seagreen
Collaborator

One of the issues with JSON is that it places basically no restrictions on parsers and generators. I first learned about this from a Reddit comment, of all things: https://www.reddit.com/r/programming/comments/59htn7/parsing_json_is_a_minefield/d98qxtj/

This puts projects like JSON Schema in an awkward position where they have to decide which details of JSON are insignificant and which aren't.

Clearly, whitespace outside of JSON Strings is insignificant. Clearly, the difference between 1 and 2 is significant. But in between lies a gray area: Should JSON Schema be able to specify that certain control characters must be escaped? For instance, the spec doesn't require U+007F (DEL) to be escaped, but that might cause a problem in some circumstances. What about numbers? The difference between 10e2 and 10E2 is insignificant, but what about 100 and 1.00e2?

I'm working on a subset of JSON called Son that I hope can answer these questions. The goal is to eliminate redundancies in JSON so that actual restrictions can then be placed on parsers, such as "there should be a bijection from serialized Son to the parsed representation of Son". Then projects like JSON Schema could concern themselves only with the subset of JSON represented by Son, instead of each project trying to figure out on its own what part of JSON it wants to cover.
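
As a rough illustration of the problem (this sketch just uses Python's standard json module, not Son itself), several distinct JSON texts already collapse to a single parsed value, so there is no bijection for a spec to lean on:

```python
# Sketch: ordinary JSON parsing is many-to-one, not a bijection.
# Several distinct serialized texts parse to the same in-memory value.
import json

texts = ["1e1", "1E1", "10.0", "  10.0  "]
values = [json.loads(t) for t in texts]

print(values)                       # [10.0, 10.0, 10.0, 10.0]
print(len(set(map(repr, values))))  # 1 -- all four texts are indistinguishable after parsing
```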

If this doesn't seem like a helpful thing to base JSON Schema on, feel free to just hit the close button; I don't want to clog up the issues with promoting my own project. If this does seem interesting, but you're not happy with the specific decisions Son made, please let me know either here or in Son's issues so I can look into it.

@handrews
Contributor

Interesting! I think that such a thing is useful, but perhaps orthogonal to JSON Schema? Meaning that I think that requiring JSON Schema to use or operate on a restricted format of JSON (for any restriction) would reduce its applicability and therefore adoption. But enabling such things is very desirable (not entirely unlike enabling use with a media type such as CBOR that can map to JSON).

Also, have you seen I-JSON? I'm only vaguely aware of it so not sure how similar the projects are.

@seagreen
Collaborator Author

I hadn't seen I-JSON! Very glad to know about it, I added it to my collection of JSON subsets: https://housejeffries.com/page/7. Let me know if you find more.

I actually agree that we don't want JSON Schema to only operate on serialized Son. We want it to operate on all JSON, not just a subset!

What I'm actually proposing is a little subtler. The argument goes like this:

  • JSON Schema places no restrictions on parsers at all, so it's not clear what parts of JSON should be in scope for tools like JSON Schema

  • To solve this JSON Schema currently just kind of guesses at what people might want to schema. For instance, it has the definition:

number - An arbitrary-precision, base-10 decimal number value, from the JSON "number" production

Which means that we don't distinguish between exponential and non-exponential notation (this makes sense, as letting someone say "you have to use exponential notation here" is probably outside the scope of JSON Schema). But we're still trying to decide whether JSON Schema should distinguish between 1.0 and 1: #152.

  • We could do a clearer job than we're currently doing in the "Instance" part of the spec by picking a subset of JSON, say one that didn't have exponential notation or trailing zeros in fractions, and then saying "JSON Schema only operates on values that can be distinguished within this subset. So we don't care about the difference between 1 and 1.0, because this subset of JSON doesn't distinguish them."

Son is such a subset, though of course the particular decisions it makes might not be suitable for JSON Schema; we could always make another.
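
As a quick sketch of the kind of gray area #152 is about (this uses Python's stdlib json; other parsers behave differently):

```python
# Some parsers keep a type distinction between 1 and 1.0, but the values
# still compare equal, and trailing zeros / exponent spelling are not
# preserved at all.
import json

a = json.loads("1")       # int 1
b = json.loads("1.0")     # float 1.0
c = json.loads("1.00e2")  # float 100.0 -- the original spelling is gone

print(type(a), type(b))   # <class 'int'> <class 'float'>
print(a == b)             # True
print(c)                  # 100.0
```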

@seagreen
Collaborator Author

(One thing I really need to do is start working on the Son Parser Specification. Right now the only thing I've written is the data format. This may take a while; I want to get it exactly right. It's going to be something like "There must be a bijection between Son JSON and the parsed representation of the Son JSON", but I might be able to improve on that.)

@handrews
Contributor

JSON Schema places no restrictions on parsers at all, so it's not clear what parts of JSON should be in scope for tools like JSON Schema

That's a fundamental issue with JSON, and not one that I think can/should be addressed by JSON Schema. JSON Schema inherits JSON's ambiguities and must deal with them.

we're still trying to decide whether JSON Schema should distinguish between 1.0 and 1

Because JSON treats them ambiguously, so JSON Schema has to support the ambiguity. In the case of #152, we're deciding whether to support the common model from many languages of 1 being an integer while 1.0 is a float. Or perhaps to allow making that distinction through validation.

For exponential notation, I would consider a "format" value (they currently all apply to strings, but the way they are specified it is clear that "format" can apply to any type).

I may still be misunderstanding your point, but I just don't see that as a problem with JSON Schema. Your Son project sounds very interesting, and useful, but I think JSON Schema needs to be and describe regular JSON instances.

@seagreen
Collaborator Author

I'll try to explain what I'm saying better. It's relevant to this part of the spec:

JSON Schema interprets documents according to a data model [...] null: A JSON "null" production [...] number: An arbitrary-precision, base-10 decimal number value, from the JSON "number" production [...] Whitespace and formatting concerns are thus outside the scope of JSON Schema.

This is saying that JSON Schema can't actually distinguish all of JSON. In the case of insignificant whitespace it's said explicitly, but consider numbers as well. Once 1e1 and 1E1 are converted to "arbitrary-precision, base-10 decimal number values" they become equal and indistinguishable.

Let's forget Son for now (at the moment I don't actually think Son is right for JSON Schema, except as a thought experiment). Are we sure we want JSON Schema only concerning itself with part of the details of JSON? This is the way we're doing it now, if I read the spec correctly. And if so, are we happy about how we've defined this subset? I'm not sure about this; there are things in the current language that seem ambiguous.

@handrews
Contributor

@seagreen I think I see what you're getting at but I'm not sure I follow to the same conclusion. As far as I can tell JSON Schema accepts all possible JSON values (conforming implementations do not need to handle repeated object keys in any particular way, so saying that the effect is undefined is just acknowledging that JSON parsing libraries are inconsistent).

The data model is just how a validator is supposed to process things. Implementing "minimum" in terms of JSON representation strings would be needlessly complicated, so the data model says to treat numbers as numbers. The "arbitrary-precision base 10" part is how numbers are described in the most recent JSON RFC. There is no difference in how the validation keywords would handle 1e1 vs 1E1, so this data model has sufficient precision to make all validation functions work.
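
To make that concrete, here is a rough sketch of a "minimum" check written against parsed values rather than JSON text, so "1e1" and "1E1" are necessarily treated the same. (check_minimum is an illustrative helper, not part of any JSON Schema implementation.)

```python
import json

def check_minimum(instance, minimum):
    # Per the validation spec, "minimum" only applies to numbers;
    # other types pass. bool is excluded because it is an int subclass
    # in Python, but booleans are not numbers in the JSON data model.
    if isinstance(instance, bool) or not isinstance(instance, (int, float)):
        return True
    return instance >= minimum

print(check_minimum(json.loads("1e1"), 5))  # True
print(check_minimum(json.loads("1E1"), 5))  # True -- same parsed value as "1e1"
print(check_minimum(json.loads("3"), 5))    # False
```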

What are we losing by not being able to distinguish 1e1 and 1E1? JSON Schema isn't responsible for re-encoding the data into JSON, so we don't actually need to know how it was originally represented.

As for part of the details... there are many limits to what you can detect or enforce with validation. I don't see this limit as any more or less significant. If we wanted to offer validation for that, then we might need to update the data model. But is there a compelling use case?

@seagreen
Collaborator Author

As far as I can tell JSON Schema accepts all possible JSON values

Absolutely! The fact that you had to say this means I've been explaining myself badly.

JSON Schema should definitely be able to validate all JSON values. The question is what it can distinguish. I think the current description of what it can distinguish can be improved. Take strings for instance (which are nice and simple). It currently says: string: A string of Unicode code points, from the JSON "string" production. Is this before or after escaping? It's not clear, you have to read the rest of the spec to find out.

@handrews
Contributor

Absolutely! The fact that you had to say this means I've been explaining myself badly.

eh, to be fair, I was super-tired when responding last night and probably should have just left it until the morning :-P

It currently says: string: A string of Unicode code points, from the JSON "string" production. Is this before or after escaping? It's not clear, you have to read the rest of the spec to find out.

I'm still not sure how this matters. This is just identifying the part of the JSON spec (the "string" production) that fits into the validation data model (Unicode code points, which map to abstract characters). That has nothing to do with how those code points are or are not escaped in the JSON document representation. In fact, it is specifically there to avoid that problem: the JSON spec and the Unicode spec determine how the representation is parsed into code points, and the implementation language determines how that is represented in memory.
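
A quick check of that point (Python stdlib json): escaping is a representation detail that disappears during parsing, so the data model's "string of Unicode code points" is the same either way.

```python
import json

# "\u0041" is an escaped capital A in the JSON text; after parsing it is
# just the code point U+0041.
print(json.loads('"\\u0041BC"'))                         # 'ABC'
print(json.loads('"\\u0041BC"') == json.loads('"ABC"'))  # True
```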

If we could distinguish between representations before and after escaping, what would we do with that information? Just like the distinction between the two representations "1e1" and "1E1", I understand that you are talking about those distinctions, but I cannot come up with a single way in which we would want to use that information. The entire point of the data model is to make it absolutely clear which distinctions are meaningful and which are not. I can't think of any distinctions that the data model excludes as meaningless that would be of any use to us.

@seagreen
Collaborator Author

If we could distinguish between representations before and after escaping, what would we do with that information?

You could write a schema saying U+007f (the DELETE character) isn't allowed in this document. I could see a use for that.

But!

I'm not saying we should do that. I personally like that JSON Schema doesn't know the difference between the single U+007F Unicode character and the sequence \u007f. What I am saying is that the current data model section isn't very clear.

My concrete suggestions are twofold: add "unescaped" to the definition of JSON Schema string, and think hard about #152 because almost no JSON parsers in the wild preserve the number of significant digits, so we might not want JSON Schema to require that.

@handrews
Contributor

You could write a schema saying U+007f (the DELETE character) isn't allowed in this document. I could see a use for that.

Can't you do that already? The escaping doesn't matter, both the schema and the instance are parsed, then JSON Schema rules become relevant.
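
A sketch of "you can do that already": once the instance is parsed, an escaped \u007f and a literal DEL are the same code point, so a single pattern-style check catches both. (This uses Python's re module to stand in for a JSON Schema "pattern" keyword; it is an illustration, not a validator.)

```python
import json
import re

forbid_del = re.compile('\u007f')

escaped = json.loads('"abc\\u007fdef"')  # DEL written with a JSON escape
literal = json.loads('"abc\u007fdef"')   # DEL as a raw code point in the JSON text

print(forbid_del.search(escaped) is not None)  # True
print(forbid_del.search(literal) is not None)  # True
print(escaped == literal)                      # True -- indistinguishable after parsing
```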

I think we're going to have to wait for other folks to chime in because I'm just not getting this :-/

As far as #152, I have no opinion on it specifically, but I would oppose validating anything that is lost during RFC-conforming parsing so that would seem to mean opposing #152. If nothing else, this discussion has clarified that for me so thanks for persisting!

@seagreen
Collaborator Author

Sounds good, we can wait for other commenters. If it turns out no one else is interested we can just close this.

I would oppose validating anything that is lost during RFC-conforming parsing

Did you take a look at the reddit discussion I linked to? (https://www.reddit.com/r/programming/comments/59htn7/parsing_json_is_a_minefield/d98qxtj/) I really don't think there's such a thing as RFC-compliant parsing. JSON is a specification for certain sequences of codepoints, not for parsers or generators.

@handrews
Contributor

@seagreen could you put a more descriptive title on this? I keep having to read it again to figure out why it's still open :-P

I would change the title myself but I still really do not understand what you're trying to do here.

With respect to "RFC-conforming parsing" I just mean any parser that is considered to correctly map RFC-conforming codepoint sequences into a given language's data model. If "correctly" is impossible to define precisely, I still do not think it is JSON Schema's responsibility to "fix" that. JSON Schema works with the resulting data model, not the encoding.

@seagreen
Collaborator Author

Let's close this, I don't think there's any interest in this from JSON Schema side. I do appreciate your patience while I babbled away here though. ❤️

In case anyone who comes along later is wondering what I was saying (because re-reading the thread I don't think I explained myself well):

  1. Every spec building on JSON has to make decisions about what parts of JSON are "significant" (e.g. 1e0 and 1E0, ["foo"] and [ "foo" ], 1.0 and 1, escaped and unescaped code points, and so on).

  2. It's a waste for every spec building on JSON to make these decisions separately.

  3. We should try to build a common spec for this (which would go in the hierarchy of standards between JSON and JSON Schema). Son might be a good starting point for thinking about this.

@handrews
Contributor

@seagreen thanks! and you're welcome :-)
Agreed on the idea of addressing these topics in a separate spec. That makes a lot of sense to me, I was just struggling to figure out how it fit here. But I do see the point of the work in general.
