Keyword for identifying bootstrapping rules #217

handrews · 2022-08-02T01:33:06Z

handrews
Aug 2, 2022

This proposal is related to #198 Media type registration, but I think it will be easiest if we can debate it separately here and integrate it there if it turns out to be suitable.

Bootstrapping rules are the things an implementation needs to know in order to figure out how to even begin processing a schema.

Knowing to look in $schema for a value that might directly indicate the draft is a bootstrapping rule.
Knowing to look for $vocabulary in the meta-schema, and that the "core" vocabulary is the most important one for identifying the draft, would be another.
Knowing to look in $defs for schemas that have $id so that you can resolve an embedded meta-schema is arguably a bootstrapping rule.
An example of an obsolete bootstrapping rule would be using pathStart to disambiguate which schema applies to different parts of a compound document (up through draft-03, or draft-04 if using Hyper-Schema).

The bootstrapping rules are smaller than the full processing rules or the core vocabulary.

Whether $ref operates by replacing its context or as a normal by-reference applicator is a processing rule, but not a bootstrapping rule.
Whether $comment is defined or not is part of the core vocabulary, but not a bootstrapping rule.

As of 2020-12, the full bootstrapping process involves following $schema references until reaching a schema that has its own $id as its $schema, at which point you can check its $vocabulary to determine its own processing rules. Then you must walk back out, with each meta-schema determining the processing rules of the previous one, until you reach the processing rules of the (non-meta) schema.
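
A minimal sketch of that walk, assuming meta-schemas can be fetched by URI through a caller-supplied `fetch` function. The name `metaschema_chain`, the in-memory `REGISTRY`, and the example.com URI are all made up for illustration, and the 2020-12 meta-schema entry is a trimmed stand-in rather than the real document:

```python
from typing import Callable


def metaschema_chain(schema: dict, fetch: Callable[[str], dict]) -> list[dict]:
    """Follow $schema until a meta-schema names its own $id as its $schema."""
    chain = [schema]
    seen = set()
    current = schema
    while True:
        meta_uri = current.get("$schema")
        if meta_uri is None:
            raise ValueError("no $schema: this sketch cannot bootstrap")
        if current.get("$id") == meta_uri:
            return chain  # reached a self-describing meta-schema
        if meta_uri in seen:
            raise ValueError("cycle in $schema chain with no self-describing meta-schema")
        seen.add(meta_uri)
        current = fetch(meta_uri)  # a second (third, ...) resource is needed here
        chain.append(current)


# Illustrative in-memory documents standing in for retrievable resources.
REGISTRY = {
    "https://example.com/meta/custom": {
        "$id": "https://example.com/meta/custom",
        "$schema": "https://json-schema.org/draft/2020-12/schema",
    },
    "https://json-schema.org/draft/2020-12/schema": {
        "$id": "https://json-schema.org/draft/2020-12/schema",
        "$schema": "https://json-schema.org/draft/2020-12/schema",
        "$vocabulary": {"https://json-schema.org/draft/2020-12/vocab/core": True},
    },
}

schema = {"$schema": "https://example.com/meta/custom", "type": "object"}
chain = metaschema_chain(schema, REGISTRY.__getitem__)

# Walk back out: the last entry describes itself, and each earlier entry
# is governed by the one that follows it in the chain.
for described, governing in zip(reversed(chain[:-1]), reversed(chain[1:])):
    print(described.get("$id", "<schema>"), "is governed by", governing["$id"])
```

Even in this toy case, processing one schema touches three resources, which is the cost being discussed here.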

In practice, this can usually be short-circuited, but the general case is expensive and requires access to at least two resources. Furthermore, $schema and $vocabulary do much more complex things than just identify the bootstrapping rules, which means that they are subject to change for a wide variety of reasons (particularly the relatively recent $vocabulary). Specifically, they signal:

the valid syntax for the schema (validation against the meta-schema)
the keyword semantics and support requirements for processing the instance with the schema
the overall processing rules, including the bootstrapping rules but also things like whether $ref is a separate specification or not

If we register a media type now that is intended to continue working for the future, and rely on $schema (with a URI value identifying the meta-schema) for bootstrapping, then we are committing to one of the following:

always needing to look inside the meta-schema to determine the bootstrapping rules, with the inefficiencies outlined above
defining some sort of internal syntax/structure for $schema URIs that conveys the information, which violates the concept of URIs as opaque and introduces opportunities for custom meta-schema URIs to get the internal syntax wrong

While current standard meta-schema URIs encode the draft in the URI, this is primarily for human convenience. It is also not a universal internal syntax, as custom meta-schema URIs are not expected to have a draft identifier embedded within them. So while standard meta-schema URIs have a human-friendly internal convention, in technical terms they remain opaque identifiers, as they should.

Another drawback to defining an internal syntax such as query parameters is that we would have to get the syntax right the first time.

None of this positions us well for the future.

A better approach would be to introduce a new keyword, which I'll call $jsonSchema (exact name TBD if we go this route), that appears in the schema (not the meta-schema) and identifies the bootstrapping rules (and only the bootstrapping rules).

This would allow determining the bootstrapping rules within the single schema resource. Should those rules indicate looking through $schema, the meta-schema could also declare its own bootstrapping rules and not require further examination of meta-meta-meta-...schemas.

$jsonSchema's value should be an opaque identifier, probably a URI. It should not be a semantic versioning number, as the guarantees that semver attempts to make do not fit our situation well.

Even if we were to continue the current style of draft publication, this identifier would not change with every draft. While this is not intended to address past drafts, as examples, there are no changes in bootstrapping rules between draft-06 and draft-07, or between 2019-09 and 2020-12.

This would preserve the intent of $schema as a freely customizable value. It would decouple the processing rules from the core vocabulary or any question of draft vs version vs whatever. Aside from the keywords involved in the bootstrapping process, most vocabularies will not care about that process, and should function with any bootstrapping rules.

All bootstrapping keywords should be part of the standard core vocabulary, making it the only one with a direct dependency. But they are not the same thing: $recursive* was replaced by $dynamic* between 2019-09 and 2020-12, yet the bootstrapping rules are the same, because by the time you have bootstrapped, you know which of them to expect. You do not need to fully evaluate the meta-schema in order to bootstrap.

If we wanted to move vocabulary or even keyword declaration into the schema (rather than meta-schema), this could be done as an updated set of bootstrapping rules. Implementations that do not recognize the new rules MUST refuse to process the schema.

An analogous media type parameter would be added, allowing schema content negotiation based on processing rules. This could allow some sort of graceful degradation when outdated implementations are all that is available.

A possible drawback would be needing to specify both $jsonSchema and $schema. This is worth more thought than I've been able to give it, but in the short term, we could define default bootstrapping rules in the absence of $jsonSchema, allowing things to continue to function more-or-less as they do now. But once we figure out a way to avoid this duplication, then a new bootstrapping rules identifier would be used.
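
To make the idea concrete, here is a rough sketch of how an implementation might read the keyword, fall back to default rules when it is absent, and refuse schemas whose rules it does not recognize. Everything named here is hypothetical: `$jsonSchema` is only the placeholder name from this proposal, and the identifier URIs, `KNOWN_RULES`, and `bootstrap_rules` are invented for the example.

```python
# Hypothetical opaque identifiers for sets of bootstrapping rules.
DEFAULT_RULES = "https://example.com/bootstrap/default"
KNOWN_RULES = {DEFAULT_RULES, "https://example.com/bootstrap/next"}


class UnsupportedBootstrapRules(Exception):
    """Raised when a schema names bootstrapping rules this implementation lacks."""


def bootstrap_rules(schema: dict) -> str:
    """Determine the bootstrapping rules from the schema resource alone."""
    # Absent keyword: fall back to default rules so existing schemas keep working.
    rules = schema.get("$jsonSchema", DEFAULT_RULES)
    # The identifier is compared as an opaque string; nothing is parsed out of it.
    if rules not in KNOWN_RULES:
        raise UnsupportedBootstrapRules(rules)
    return rules


print(bootstrap_rules({"type": "string"}))
print(bootstrap_rules({"$jsonSchema": "https://example.com/bootstrap/next"}))
```

Note that no meta-schema is fetched: the decision to proceed or refuse is made from the single schema resource.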

Another possible way to future-proof this would be to make $jsonSchema an object with one permanently defined key, e.g. $rules, that actually defines the bootstrapping rules.

There are no doubt many possibilities. We should explore those rather than dismissing this proposal simply because of the possible duplication (although it may be dismissed for other reasons).

pinging @jdesrosiers in particular

gregsdennis · 2022-08-02T06:29:40Z

gregsdennis
Aug 2, 2022
Maintainer

Currently $vocabulary only has any function in a schema that's being processed as a meta-schema. Could we have $jsonSchema (or whatever) work the same way?

We had discussed before how having core as a vocab didn't make sense because its purpose was to define processing rules. This was the first I had heard you mention having a separate keyword for processing rules, which I expect this proposal is intended to formalize. Since we're effectively pulling "core" out of $vocabulary (which only functions inside meta-schemas) into $jsonSchema, I think it makes sense that the new keyword also only functions inside a meta-schema.

I think this would resolve the duplication issue, since the keyword would be located in the document that $schema identifies. Implementations would still only need to look at $schema, and the "$schema rabbit hole" would be reduced to a single step for meta-schemas that declare it, which is ideal (we could even require it in meta-schemas if we really wanted).

12 replies

handrews Aug 2, 2022
Author

I'm not going to be convinced that the meta-schema walking is not a problem. It's not a good thing that you need a second resource, which you may not have any way of retrieving, in order to even get started processing the schema. Consider a bundled custom meta-schema: without bootstrapping rules, you don't know where to look for the embedded meta-schema (definitions or $defs or some future location?). In that case, technically you do have the schema, but can't find it. (Yes, I realize you could just ignore the question of keywords and scan the whole json document for objects with an $id, but that would not be following the spec, it would be hacking around it).
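
As a small illustration of the bundled case, here is a sketch in which the lookup location is a parameter supplied by the bootstrapping rules. The function name and the example.com URIs are made up, and $id values are compared as plain strings (no relative URI resolution), which is a simplification:

```python
from typing import Optional


def find_embedded_resource(document: dict, target_id: str, container: str) -> Optional[dict]:
    """Look for an embedded schema with the given $id under one container keyword."""
    for subschema in document.get(container, {}).values():
        if isinstance(subschema, dict) and subschema.get("$id") == target_id:
            return subschema
    return None


# A schema bundled together with its own custom meta-schema.
bundled = {
    "$id": "https://example.com/schemas/widget",
    "$schema": "https://example.com/meta/custom",
    "$defs": {
        "custom-meta": {
            "$id": "https://example.com/meta/custom",
            "$schema": "https://example.com/meta/custom",
        }
    },
}

# The right container depends on the rules in force.
print(find_embedded_resource(bundled, bundled["$schema"], "$defs") is not None)
print(find_embedded_resource(bundled, bundled["$schema"], "definitions") is not None)
```

The meta-schema is in the document either way, but only the call that uses the right container finds it.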

@gregsdennis

If you're maintaining a cache of known meta-schemas, then I really don't think the lookup chain is much of a problem.

But the entire point of the vocabulary system is that you will frequently encounter unknown meta-schemas, even if all of the vocabularies involved are known.

Unknown meta-schema with known vocabularies is expected to be a common case. If it weren't, there wouldn't be a point to any of this because you wouldn't be able to assemble custom sets of known vocabularies.

And note that the unknown meta-schema may well have itself as its own $schema.

gregsdennis Aug 2, 2022
Maintainer

I don't agree that a single running instance of an implementation will frequently encounter unknown meta-schemas. Different running instances of an implementation may encounter different meta-schemas, yes, but a single running instance should only encounter a few, certainly few enough to cache for the lifetime scope of the host application.

Once a meta-schema is known, it can be recognized immediately by its identifier. Meta-schema assessment should be a once-per-lifetime-scope activity.
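
Roughly what I have in mind, as a sketch only. `DialectCache` and the `assess` callable are hypothetical names; `assess` stands in for whatever expensive work an implementation does to analyze a meta-schema the first time it sees it:

```python
from typing import Callable


class DialectCache:
    """Remember the result of assessing each meta-schema, keyed by its URI."""

    def __init__(self, assess: Callable[[str], dict]):
        self._assess = assess  # expensive: may fetch and analyze the meta-schema
        self._known: dict[str, dict] = {}

    def rules_for(self, metaschema_uri: str) -> dict:
        # Only the first lookup for a given URI pays the assessment cost.
        if metaschema_uri not in self._known:
            self._known[metaschema_uri] = self._assess(metaschema_uri)
        return self._known[metaschema_uri]


calls = []


def fake_assess(uri: str) -> dict:
    calls.append(uri)
    return {"metaschema": uri}


cache = DialectCache(fake_assess)
cache.rules_for("https://example.com/meta/custom")
cache.rules_for("https://example.com/meta/custom")
print(len(calls))  # the assessment ran once for two lookups
```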

And I'm not fully convinced this is something that the specification needs to solve. It seems more of an application concern to me.

jdesrosiers Aug 2, 2022
Maintainer

Consider a bundled custom meta-schema

Good point. I think it would be possible with an induction proof, but that's not something we want people to have to implement.

Are there any other use-cases anyone can think of that this approach would restrict? It would help to weigh the pros and cons. I completely agree that having to look at a second resource is not ideal, but I think requiring people to have two version/dialect identifying keywords rather than the one we have now weighs pretty heavy in the cons column.

handrews Aug 2, 2022
Author

I think requiring people to have two version/dialect identifying keywords rather than the one we have now weighs pretty heavy in the cons column.

I think we should be working through that issue and trying to find the optimal solution for it before tallying up the cons. I wrote this up quickly because I promised I'd unblock you, which means that I haven't had the time to sort out how we can remove or mitigate that problem. I was hoping we could all work collaboratively to explore the options.

handrews Aug 3, 2022
Author

@gregsdennis

I don't agree that a single running instance of an implementation will frequently encounter unknown meta-schemas. Different running instances of an implementation may encounter different meta-schemas, yes, but a single running instance should only encounter a few, certainly few enough to cache for the lifetime scope of the host application.

You're making a tremendous number of assumptions here. It's not a good idea to just dismiss a swath of use cases as unlikely based on any one person's experience.

A generic hypermedia client (such as Ketting, if it were to have JSON Schema support added, or sleepwalker, which I wrote a decade ago), would encounter arbitrarily many meta-schemas depending on how many APIs you access during a session, and how customized their JSON Schema usage is.

And again, the point is not to look at what seems likely now, but to look long-range. The point of the vocabulary system was to make vocabularies dependable and dialects extremely customizable. Making design decisions that assume rarely seen custom meta-schemas, or assume a well-known small set, is not based on any real long-term data and will preclude such an ecosystem from ever developing, undercutting the vocabulary and dialect work substantially.

That is why I am so adamant about solving this correctly. Solving this based on the assumptions that certain valid, intended uses of the system won't happen is a sure-fire way to ensure that they never happen. And then what was the point of all of this?

jdesrosiers · 2022-08-02T20:46:33Z

jdesrosiers
Aug 2, 2022
Maintainer

I agree with almost all of this. However, I would propose one change. Instead of introducing a new keyword for the processing rules, we can introduce a new variable for declaring the dialect schema. So, $schema would declare the processing rules and a new keyword, $dialect, would point to the meta-schema that declares the vocabularies for the dialect.

Both this and the original proposal split the functionality of $schema into two keywords, they just refactor them differently. I think this is better because it allows $schema to be the go-to to determine processing rules no matter what version is in use and can remain so in the future.
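
A sketch of how a schema might look under this refactoring. The `$dialect` keyword is only the name floated above, the URIs and `read_identifiers` are invented for the example, and treating both keywords as required is an assumption of the sketch rather than something settled:

```python
def read_identifiers(schema: dict) -> tuple[str, str]:
    """Return (processing-rules identifier, dialect meta-schema URI)."""
    missing = [keyword for keyword in ("$schema", "$dialect") if keyword not in schema]
    if missing:
        raise ValueError(f"missing keyword(s): {', '.join(missing)}")
    # $schema names the processing rules; $dialect points at the meta-schema
    # that declares the vocabularies in use.
    return schema["$schema"], schema["$dialect"]


example = {
    "$schema": "https://example.com/processing-rules/1",  # hypothetical identifier
    "$dialect": "https://example.com/meta/custom",        # hypothetical meta-schema
    "type": "object",
}

rules, dialect = read_identifiers(example)
print(rules)
print(dialect)
```

The processing rules are known as soon as $schema has been read; the meta-schema only needs to be retrieved afterwards, to learn the vocabularies.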

5 replies

gregsdennis Aug 2, 2022
Maintainer

Would it be reasonable to also say that if $dialect is absent then it is assumed to be the same value as $schema? This way, if you're just using one of the draft meta-schemas you only need $schema, but if you're using a custom meta-schema in order to get new keywords, you can put that in $dialect (instead of the current practice of putting it in $schema).

jdesrosiers Aug 2, 2022
Maintainer

The idea is that $schema (or the originally proposed $jsonSchema) wouldn't change from release to release. Roughly speaking, it only needs to change when the core vocabulary changes. So, you would always need both.

handrews Aug 2, 2022
Author

I don't think it's a good idea to radically change the meaning of $schema, which is a very well-established keyword at this point.

jdesrosiers Aug 2, 2022
Maintainer

I don't see it as a radical change at all. $schema has always been the indicator of how to bootstrap and process the schema. Even in 2019-09 and 2020-12 I would argue that it still does that. The first bootstrapping rule is to look at $schema and retrieve the meta-schema to determine the rest of the bootstrapping rules. It just changed from a one-step to a two-step process. This would move the bootstrapping rule identification back to a one-step process and leave the processing rules as a second step. The major departure from how $schema has worked in the past is that it would no longer identify a meta-schema. In my opinion it's more consistent for $schema to maintain its purpose of identifying bootstrap rules than it is to maintain its meta-schema identification purpose.

handrews Aug 3, 2022
Author

@jdesrosiers OK, I can definitely follow your argument here. I'm still a bit skeptical, but will keep an open mind — if other folks are sold on tightening the focus of $schema and moving the rest of its functionality to a $dialect, then I don't think I'd consider it a deal-breaker.

I would certainly prefer this over only using $schema.

jdesrosiers · 2022-08-03T17:14:18Z

jdesrosiers
Aug 3, 2022
Maintainer

I think requiring people to have two version/dialect identifying keywords rather than the one we have now weighs pretty heavy in the cons column.

I haven't had the time to sort out how we can remove or mitigate that problem. I was hoping we could all work collaboratively to explore the options.

(I thought this deserved its own thread). I'm definitely willing to explore the options. The only option that comes to mind so far is to allow the bootstrap identifying keyword to default to the latest version, but that's not great for compatibility. I'm continuing to think about it and I look forward to hearing other ideas.

0 replies

awwright · 2022-08-04T00:37:06Z

awwright
Aug 4, 2022
Maintainer

I'm still researching this, but I'm coming to a finding that if we tighten up interoperability rules around forwards compatibility, then such a keyword should not be necessary yet; it should only be necessary in the future, and only to do certain things. Here is why:

If a schema has no $schema, then there's a limited number of interpretations that can possibly be applied; no extra keyword should be necessary.
If there is a $schema, within a certain range of built-in values, then again, there's a limited number of interpretations that can possibly be applied; no extra keyword is necessary.
If there is a $schema, and that meta-schema is machine-readable, then there must be something that made it machine-readable. There's an infinite number of machine-readable meta-schemas, but only a couple mechanisms that can make a meta-schema machine readable. And actually, I think $vocabulary is the only one.

If we define a new "bootstrapping" process in the future, then it would either:

use $schema with a value that's not defined, or
$schema would point to a meta-schema with an unknown but required $vocabulary, or
the meta-schema would use a keyword (not $vocabulary) that's not understood, and so the schema would not be machine readable (if the new process is yet-unsupported by the validator).

If we make all three of these conditions an error (currently only (2) seems to be an error), then I think this would solve the problem you're identifying.
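
A rough sketch of what treating all three as errors could look like. The names are made up, the known sets are deliberately tiny, and two readings are assumptions of the sketch: that (1) means a $schema value that is neither built in nor available, and that (3) means an unrecognized "$" keyword at the top level of the meta-schema.

```python
class BootstrapError(Exception):
    pass


BUILT_IN_SCHEMA_URIS = {"https://json-schema.org/draft/2020-12/schema"}
KNOWN_VOCABULARIES = {"https://json-schema.org/draft/2020-12/vocab/core"}
KNOWN_DOLLAR_KEYWORDS = {
    "$schema", "$id", "$vocabulary", "$defs", "$ref",
    "$anchor", "$dynamicRef", "$dynamicAnchor", "$comment",
}


def check_metaschema(schema: dict, registry: dict) -> None:
    """Raise BootstrapError if the schema might rely on an unsupported process."""
    uri = schema.get("$schema")
    if uri is None or uri in BUILT_IN_SCHEMA_URIS:
        return  # only a limited set of interpretations applies
    if uri not in registry:
        raise BootstrapError(f"(1) undefined $schema value: {uri}")
    meta = registry[uri]
    for vocabulary, required in meta.get("$vocabulary", {}).items():
        if required and vocabulary not in KNOWN_VOCABULARIES:
            raise BootstrapError(f"(2) unknown required vocabulary: {vocabulary}")
    for keyword in meta:
        if keyword.startswith("$") and keyword not in KNOWN_DOLLAR_KEYWORDS:
            raise BootstrapError(f"(3) unrecognized keyword in meta-schema: {keyword}")


# A meta-schema that uses some future, not-yet-understood keyword.
registry = {
    "https://example.com/meta/future": {
        "$id": "https://example.com/meta/future",
        "$futureKeyword": True,  # hypothetical keyword for the example
    }
}

try:
    check_metaschema({"$schema": "https://example.com/meta/future"}, registry)
except BootstrapError as error:
    print(error)
```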

13 replies

handrews Aug 4, 2022
Author

Should there be any difference between a hypothetical "$jsonSchema" keyword with an undefined value, versus a new "$" keyword? I don't see why there should be a difference, and if there's no difference, defining new keywords seems like a better solution.

@awwright I don't understand your point here. $jsonSchema would be a new $ keyword, obviously, but what does that mean to you that you would prefer to define other new keywords? What other new keywords?

awwright Aug 5, 2022
Maintainer

Maybe I don't quite understand something, can you elaborate on what happens if the value for this new keyword is unrecognized, and how different values might be processed differently?

handrews Aug 5, 2022
Author

The values identify the processing rules. If the value is recognized, then the implementation will know whether or not it can process the schema and evaluate instances with it. If the value is unrecognized, then it cannot determine that and therefore must assume that it cannot process the schema.

We could attempt to get fancy with indicating somehow that a new value is compatible with certain older values, but that gets into semver problems. Processing rules should not change often, particularly not once we stabilize things (which is why I'm in favor of retaining the "draft" label until that stabilization, so that it's understood that changes may still be relatively frequent, like once a year, but that this is not the expected rate of change long-term).

awwright Aug 5, 2022
Maintainer

Ok. So above, I made the argument that once we have a feature, we can't really remove it or change it. Our options are we add features, or we deprecate something in such a way it can't be used at the same time as a new feature. $jsonSchema is only useful in the latter situation, and even then, that's complicated and I think there are better ways to solve those problems.

One of the complexities is we have to determine how "deeply" this impacts processing. Do the effects of "$jsonSchema" cross document boundaries? Can a meta-schema mean two different things depending on the $jsonSchema in the referring document?

It seems to me that anything that we could potentially need from a $jsonSchema keyword, we can do with some moderate error requirements (error on all unrecognized "$" keywords). Then if some dramatically different processing ever becomes necessary, that can simply be indicated by defining and using a new "$" keyword.

handrews Aug 10, 2022
Author

@awwright

Do the effects of "$jsonSchema" cross document boundaries? Can a meta-schema mean two different things depending on the $jsonSchema in the referring document?

$jsonSchema, like $schema, is resource-scoped. It does not cross resource boundaries, whether that's an embedded resource or a separate document. The schema resource is the unit of schema processing, as it has been since 2019-09.
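
A small sketch of what resource scoping means in practice. The names and URIs are hypothetical, and for brevity the walk only descends through $defs and treats any subschema with an $id as the start of a new resource:

```python
def rules_by_resource(document: dict, default: str) -> dict[str, str]:
    """Map each schema resource in a document to its own bootstrapping rules."""
    result: dict[str, str] = {}

    def visit(schema, resource_id):
        if not isinstance(schema, dict):
            return
        if "$id" in schema or resource_id is None:
            # A new resource starts here; it does not inherit the enclosing
            # resource's $jsonSchema.
            resource_id = schema.get("$id", "<root>")
            result[resource_id] = schema.get("$jsonSchema", default)
        for subschema in schema.get("$defs", {}).values():
            visit(subschema, resource_id)

    visit(document, None)
    return result


document = {
    "$id": "https://example.com/a",
    "$jsonSchema": "https://example.com/bootstrap/next",
    "$defs": {
        "embedded": {"$id": "https://example.com/b"},  # separate resource
        "plain": {"type": "string"},                   # part of resource a
    },
}

for resource, rules in rules_by_resource(document, "https://example.com/bootstrap/default").items():
    print(resource, "->", rules)
```

The embedded resource falls back to the default rules rather than picking up the value declared by the resource that contains it.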

Your alternative to $jsonSchema is too vague for me to be able to tell what I think of it. I also don't understand what you think the drawbacks are (other than the scoping thing which hopefully I have clarified now). This is probably related to the fact that I still don't really understand your compatibility ideas, although I am looking forward to seeing some concrete examples sometime as we discussed in last week's call. Perhaps your concerns will make more sense to me then.

handrews · 2022-09-29T20:44:13Z

handrews
Sep 29, 2022
Author

This proposal will be re-evaluated in the context of the ongoing SDLC discussion. I'm going to lock this discussion, and most likely start a new one once I have an updated concept.

0 replies
