Basic vocabulary support #561

Closed
handrews opened this issue Mar 8, 2018 · 23 comments

@handrews
Contributor

handrews commented Mar 8, 2018

Proposal

  • Vocabularies

    • Group keyword definitions and their semantics (the proposal does not add any way to formally define semantics in JSON, they're defined in the specs just as they are now)
    • Are identified by URIs, which (like meta-schema URIs) should not automatically be retrieved
    • The URI SHOULD point to a meta-schema describing only that vocabulary's keywords
    • SHOULD NOT overlap with or redefine keywords from other vocabularies
      • The hyper-schema meta-schema includes validation, but the hyper-schema vocabulary is only base and links, and the keywords in the LDOs
    • Vocabularies are file-scoped; keywords MUST have the same semantics throughout a single schema document
  • Meta-Schemas

    • In a meta-schema, $vocabularies takes a list of URIs identifying the vocabularies described by the meta-schema
    • Like $schema in a schema, $vocabularies must be in the root object of the meta-schema
    • Meta-schemas SHOULD validate the combination of vocabularies they declare
    • Combining vocabularies is facilitated by $recurse (Recursive schema composition #558)
    • Meta-schemas MAY further constrain that vocabulary combination
    • Meta-schemas MAY describe keywords that are not in any declared vocabulary
    • Meta-schemas that do not declare a vocabulary, or that declare additional keywords, create an anonymous vocabulary (this just fits existing meta-schemas into the vocabulary concept)
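
As a rough illustration of the root-only rule above, here is a minimal Python sketch of how an implementation might read the proposed keyword. It assumes $vocabularies is a JSON array of URI strings; the helper name declared_vocabularies is hypothetical, and nothing here retrieves the URIs, which are treated purely as identifiers.

```python
def declared_vocabularies(meta_schema):
    """Return the vocabulary URIs declared in the root of a meta-schema.

    Hypothetical helper: only the root object is consulted, mirroring how
    $schema is only meaningful in the root of a schema.
    """
    if not isinstance(meta_schema, dict):
        # Boolean meta-schemas cannot declare anything.
        return []
    vocabularies = meta_schema.get("$vocabularies", [])
    if not isinstance(vocabularies, list) or not all(
        isinstance(uri, str) for uri in vocabularies
    ):
        raise ValueError("$vocabularies must be a list of URI strings")
    return list(vocabularies)


hyper = {
    "$id": "http://json-schema.org/draft-08/hyper-schema",
    "$vocabularies": [
        "http://json-schema.org/draft-08/vocabularies/core-applicators",
        "http://json-schema.org/draft-08/vocabularies/hyper-schema",
    ],
    # A nested occurrence is not in the root object, so it is never consulted.
    "properties": {"links": {"$vocabularies": ["http://example.com/ignored"]}},
}
print(declared_vocabularies(hyper))
```

A meta-schema with no $vocabularies yields an empty list here, which corresponds to the "anonymous vocabulary" case.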

Examples:

NOTE: "core-applicators" (stuff moved by #513) and "validation-assertions" (stuff left behind by #513) are not final names or vocabulary boundaries, I literally made them up while typing, please do not complain about whether they are "correct".

The applicators (per #513) as a vocabulary

This is where $recurse ( #558 ) would primarily be used. This assumes that dependencies has been split per #528, and the applicator version is still called dependencies, while the string form is re-named and left in the validation vocabulary.

{
    "$id": "http://json-schema.org/draft-08/vocabularies/core-applicators",
    "$schema": "http://json-schema.org/draft-08",
    "type": ["object", "boolean"],
    "properties": {
        "allOf": {"type": "array", "items": {"$recurse": true}},
        "anyOf": {"type": "array", "items": {"$recurse": true}},
        "oneOf": {"type": "array", "items": {"$recurse": true}},
        "not": {"$recurse": true},
        "if": {"$recurse": true},
        "then": {"$recurse": true},
        "else": {"$recurse": true},
        "items": {
            "oneOf": [
                {"type": "array", "items": {"$recurse": true}},
                {"$recurse": true}
            ]
        },
        "additionalItems": {"$recurse": true},
        "contains": {"$recurse": true},
        "properties": {"type": "object", "additionalProperties": {"$recurse": true}},
        "patternProperties": {"type": "object", "additionalProperties": {"$recurse": true}},
        "additionalProperties": {"$recurse": true},
        "propertyNames": {"$recurse": true},
        "dependencies": {"type": "object", "additionalProperties": {"$recurse": true}}
    }
}

Hyper-Schema as a vocabulary

This only shows some of the LDO fields, and ignores that we actually distribute the links schema as a separate file

{
    "$id": "http://json-schema.org/draft-08/vocabularies/hyper-schema",
    "$schema": "http://json-schema.org/draft-08/hyper-schema#",
    "type": ["object", "boolean"],
    "properties": {
        "base": {"type": "string", "format": "uri-template"},
        "links": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["rel", "href"],
                "properties": {
                    "rel": {"type": "string"},
                    "href": {"type": "string", "format": "uri-template"},
                    "hrefSchema": {"$recurse": true},
                    "targetSchema": {"$recurse": true},
                    "targetMediaType": {"type": "string"},
                    "targetHints": {"type": "object", "additionalProperties": true},
                    "headerSchema": {"$recurse": true},
                    "submissionSchema": {"$recurse": true},
                    "submissionMediaType": {"type": "string"},
                    ...
                }
            }
        }
    },
    "links": [
        {
            "rel": "self",
            "href": "{%24id}",
            "templateRequired": ["$id"]
        }
    ]
}

Hyper-Schema meta-schema with vocabularies

This assumes a "validation-assertions" vocabulary for the validation spec, and assumes the core keywords do not need to be declared as a vocabulary (although maybe they should be, I'm not sure). Also, I'm waving my hands when it comes to where the basic annotations (title, default, etc.) live; just pretend that's settled somehow please, as sorting that out is not the point of this issue.

This also assumes that the draft-08 regular schema properly assembles everything except for the hyper-schema vocabulary. So while we declare all of the vocabularies explicitly, to get the meta-schema behavior, we just combine the regular meta-schema and the hyper-schema-vocabulary-only meta-schema (shown above).

{
    "$id": "http://json-schema.org/draft-08/hyper-schema",
    "$schema": "http://json-schema.org/draft-08/hyper-schema#",
    "$vocabularies": [
        "http://json-schema.org/draft-08/vocabularies/core-applicators",
        "http://json-schema.org/draft-08/vocabularies/validation-assertions",
        "http://json-schema.org/draft-08/vocabularies/hyper-schema"
    ],
    "allOf": [
        {"$ref": "http://json-schema.org/draft-08/schema"},
        {"$ref": "http://json-schema.org/draft-08/vocabularies/hyper-schema"}
    ]
}

OpenAPI 3.0's superset/subset problem

Using a meta-schema to constrain or add lightweight extensions helps discourage creating many similar vocabularies. For example, consider a meta-schema for OpenAPI's schema object, which does not allow the "null" type and instead has a boolean "nullable" keyword, and also does not allow patternProperties. @philsturgeon has referred to this mismatch as a "superset/subset".

Also, they require extension keywords to begin with "x-" and forbid other keywords that are not defined in the spec. Note the use of unevaluatedProperties (#556) for this.

This example explicitly allOfs the vocabulary schemas. A variation on the proposal is for $vocabularies to also do that implicitly. Needs a bit more thought on whether you'd ever not allOf them, and why. See #558 for why just allOf works without redefining recursive keywords (the core-applicators vocabulary would be written with "$recurse": true instead of "$ref": "#").

{
    "$id": "https://www.openapis.org/schema-object-metaschema",
    "$schema": "http://json-schema.org/draft-08",
    "$vocabularies": [
        "http://json-schema.org/draft-08/vocabularies/core-applicators",
        "http://json-schema.org/draft-08/vocabularies/validation-assertions"
    ],
    "allOf": [
        {"$ref": "http://json-schema.org/draft-08/vocabularies/core-applicators"},
        {"$ref": "http://json-schema.org/draft-08/vocabularies/validation-assertions"}
    ],
    "properties": {
        "type": {
            "type": "string",
            "not": {"const": "null"}
        },
        "nullable": {
            "type": "boolean",
            "default": false
        },
        "patternProperties": false
    },
    "patternProperties": {
        "^x-": true
    },
    "unevaluatedProperties": false
}

What's going on here is:

  • Anywhere an implementation sees a keyword from the core-applicators or validation-assertions vocabularies, it knows that the semantics are as defined by those vocabularies
  • It knows this even if it has never previously encountered this OpenAPI meta-schema, because $vocabularies makes that clear while just having an allOf is ambiguous.
  • While all OpenAPI schema objects are valid against the given vocabularies, the reverse is not necessarily true:
    • The validation-assertions vocabulary defines semantics for a "null" value for type, but the meta-schema prevents that value from appearing
    • It also prevents the array form of type – in the normal meta-schema, the type of type is "type": ["string", "array"]
    • The core-applicators vocabulary defines semantics for patternProperties, but the meta-schema prevents that keyword from being used
  • The meta-schema defines additional keywords, which form a small anonymous vocabulary of sorts:
    • nullable is an extension keyword
    • ^x- is an extension keyword pattern
    • These are defined just as they would be without $vocabularies
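
To make the syntactic restrictions above concrete, here is a hand-written Python sketch of what the OpenAPI meta-schema would reject. It is only an approximation: KNOWN_KEYWORDS is a small illustrative subset rather than the real keyword lists of the two vocabularies, the function name openapi_problems is hypothetical, and only a single schema object is checked (no recursion into subschemas).

```python
# Illustrative subset only; the real vocabularies define many more keywords.
KNOWN_KEYWORDS = {"type", "properties", "items", "allOf", "required", "nullable"}


def openapi_problems(schema_object):
    """Return a list of ways a single schema object breaks the restrictions."""
    problems = []
    for keyword, value in schema_object.items():
        if keyword == "patternProperties":
            problems.append("patternProperties is forbidden")
        elif keyword.startswith("x-"):
            continue  # extension keywords are always allowed
        elif keyword not in KNOWN_KEYWORDS:
            problems.append(f"unknown keyword: {keyword}")
        elif keyword == "type" and (not isinstance(value, str) or value == "null"):
            problems.append('type must be a single string other than "null"')
        elif keyword == "nullable" and not isinstance(value, bool):
            problems.append("nullable must be a boolean")
    return problems


print(openapi_problems({"type": "string", "nullable": True, "x-internal": 1}))
print(openapi_problems({"type": ["string", "null"], "patternProperties": {}}))
```

The first call reports no problems; the second is flagged both for the array form of type and for patternProperties. Neither check changes what type or patternProperties mean where they are allowed.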

There's more to work out but I think this is enough to start the conversation and find out which parts are particularly confusing.

@handrews
Contributor Author

handrews commented Mar 8, 2018

A key point of this proposal is that since meta-schemas still work the same way as always (although hopefully with #558's $recurse), an implementation is not required to do anything to support vocabularies. $vocabularies allows a more generic, flexible implementation of JSON Schema to be more intelligent about what it can process.

But a validator that either just looks at $schema for the standard meta-schema, or behaves based on caller input or configuration rather than paying attention to meta-schemas, is still just as compliant as always.

Unless we decide that $vocabularies implicitly means that the vocabularies are combined with allOf, there's no mandatory implementation for $vocabularies.

@philsturgeon
Collaborator

So to confirm, by default it's assumed "all vocabularies" are used? What is "all" and where does that come from?

@handrews
Contributor Author

handrews commented Mar 9, 2018

@philsturgeon I'll break the "by default" down a bit:

  • If there is no $schema in the schema, then (as one presumably does now) the implementation either makes a guess, applies its own default, or requires other input telling it how to proceed
  • If there is a $schema but no $vocabularies in the meta-schema, then just as now, an implementation SHOULD behave in accordance with $schema (but I've noticed many do whatever their default is unless you specifically tell them otherwise; honestly I think we should tighten up this requirement, but I'll file that separately)

In the second case, with a $schema but no $vocabularies, the meta-schema is considered to be defining its own anonymous vocabulary. However, there is no practical impact of that concept right now; it just allows us to always say what vocabularies are involved: either an anonymous one, a set of identified ones, or a set of identified ones with an additional anonymous vocabulary. In the OpenAPI example, nullable and ^x- are the terms in the anonymous vocabulary; type is not, as it is defined by one of the $vocabularies and the meta-schema is just adding a syntactical constraint without changing the semantics.

It is still the case that by default, all unrecognized keywords are ignored. Therefore, there is no concept of "all" vocabularies.
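
The cases above could be sketched roughly as follows. This is a Python illustration under assumptions: the function name vocabulary_source and the known_meta_schemas mapping (meta-schema URI to an already-available document) are hypothetical, and the "unrecognized meta-schema" outcome is my own placeholder for a situation this thread does not specify.

```python
def vocabulary_source(schema, known_meta_schemas):
    """Classify where keyword semantics come from for a schema.

    `known_meta_schemas` maps meta-schema URIs to documents that are already
    available (hard-coded or cached); nothing is fetched over the network.
    """
    meta_uri = schema.get("$schema") if isinstance(schema, dict) else None
    if meta_uri is None:
        # No $schema: guess, apply a default, or require caller input.
        return "implementation default"
    meta_schema = known_meta_schemas.get(meta_uri)
    if meta_schema is None:
        return "unrecognized meta-schema"  # assumption, not from the proposal
    declared = meta_schema.get("$vocabularies")
    if not declared:
        # $schema is present but the meta-schema declares no vocabularies.
        return "anonymous vocabulary"
    return "declared: " + ", ".join(declared)


known = {
    "http://example.com/plain": {"type": ["object", "boolean"]},
    "http://example.com/dialect": {"$vocabularies": ["http://example.com/vocab/a"]},
}
print(vocabulary_source({"type": "string"}, known))
print(vocabulary_source({"$schema": "http://example.com/plain"}, known))
print(vocabulary_source({"$schema": "http://example.com/dialect"}, known))
```

Note that there is no branch for "all vocabularies": keywords that no recognized vocabulary defines are simply ignored.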

If you had an actual blank meta-schema, it would allow everything, but would not indicate any semantics. So I wouldn't consider it a vocabulary. You don't have a vocabulary until you constrain that open set of everything into specific syntax constraints (expressed as meta-schemas) and semantics (defined in prose specifications). I have some thoughts on formalizing this, but almost certainly not in draft-08; I'd like to get some feedback and understand use cases with the basic concept first.

@handrews
Contributor Author

I have updated the initial comment to include more examples of how the current specifications would be handled with this proposal, including showing the core applicator and (most of the) hyper-schema vocabulary meta-schemas, and how the hyper-schema meta-schema (the one people reference today) would be built from the vocabularies.

@handrews
Contributor Author

As another test case, I am looking at how to frame JSON-LD as a vocabulary, mostly to allow the two systems to be used side-by-side and ensure that JSON Schema does not conflict with it. I've filed json-ld/json-ld.org#612 asking some questions about their existing JSON Schema for JSON-LD to start with this.

This is related to #309.

@philsturgeon
Collaborator

Makes sense! I love this approach, as it’ll help get folks extending JSON Schema for their own needs, without dumping the discrepancies into a word doc or forcing guesswork onto implementors.

@handrews
Contributor Author

I'm going to start integrating bits of this into PRs. I'll wait on the $vocabularies keyword and the specific details of meta-schemas and vocabularies, but as part of #513 I need to explain the more modular concept of vocabularies. It's necessary to explain why the keywords move into core and why the others are in validation.

This will stay open for more feedback on the details as they are all new with this issue. I'll mark this as Accepted when we have agreement (or conspicuous lack of objections) on the details, and then move on to a PR. Until then, I'll just be referencing the general direction.

@handrews
Contributor Author

Given how much I've talked this up across every project, slack channel, or other forum I can find, I think this has been open for feedback long enough. Moving to PRs now!

@jgonzalezdr
Contributor

I know this ticket has been open for a long time, and I'm sorry for jumping into it while @handrews is already preparing a proposal, but while thinking about #682 I thought of a drawback that most probably has already been debated and discarded, but I could not find such discussion in the issue tickets.

My concern about $vocabulary is that it applies only to meta-schemas, not to schemas.

Until now, the minimum number of documents for validating an instance was just one (the schema); for implementations the actual meta-schema document was not really needed, since the schema grammar checks could be hardcoded following specs.

However, if the vocabularies that a schema is compatible with are declared in the meta-schema, the minimum number of documents for validating an instance becomes two (the schema and the meta-schema); implementations will have to get access to the meta-schema document just to check the vocabularies, and will most likely just ignore the grammar declared in the meta-schema and apply their own hardcoded one.

Moreover, should implementations validate the schema against the meta-schema? What happens if the meta-schema is in fact incompatible with a declared vocabulary (e.g. declaring that the keyword $schema is not allowed or that properties is of type array)?

I wonder whether the possibility of declaring the vocabularies in the schema instead of in the meta-schema (maybe optionally in addition to doing it in the meta-schema) has been taken into account.

@gregsdennis
Member

@jgonzalezdr the minimum number of documents for validating an instance becomes two (the schema and the meta-schema)

Technically, they've always required the schema and the meta-schema. It just happens that the meta-schema has been hard-coded. This change will simply require a "softer coding" of the meta-schemas in that they'll need to be extensible to accept other sets of keywords.

What happens if the meta-schema is in fact incompatible with a declared vocabulary?

Just like with any other schema, it remains the author's responsibility to ensure that a schema is valid, whether through compatible keywords or compatible vocabularies. For example, there's nothing currently stopping an author from writing this schema:

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "integer",
  "minimum": 10,
  "maximum": 9
}

Instances will always fail against this schema. As such, you may say that it's "invalid." These are semantics that JSON Schema does not protect itself against. It assumes that the authors can identify such conflicts and avoid (or correct) them. The same goes for vocabularies.
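
A tiny Python sketch shows why such a schema passes meta-schema validation yet accepts nothing. The satisfies function is a hypothetical, hand-rolled check of only the three keywords in the example; it is not a JSON Schema validator, and it simplifies "integer" to Python ints.

```python
def satisfies(instance, schema):
    """Check only the three keywords used in the example schema."""
    if schema.get("type") == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(instance, int) or isinstance(instance, bool):
            return False
    if "minimum" in schema and instance < schema["minimum"]:
        return False
    if "maximum" in schema and instance > schema["maximum"]:
        return False
    return True


schema = {"type": "integer", "minimum": 10, "maximum": 9}
# Each keyword is individually well-formed, but no integer can be both
# at least 10 and at most 9, so every candidate is rejected.
print(any(satisfies(n, schema) for n in range(-100, 101)))  # prints False
```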

Additionally, I believe @handrews's PR does indicate that implementations SHOULD compare declared vocabularies and attempt to identify compatibility issues.

@handrews
Contributor Author

@jgonzalezdr I am thrilled to see more people engaging the topic of vocabularies! It is a major complex change, and while folks such as @gregsdennis have provided valuable feedback and ideas I have been worried that it has been too poorly communicated (by me) to attract sufficient feedback.

Also, we have tickets that took five years to resolve, so the concept of "a long time" is relative, and I'm more worried that I'm rushing this concept.

I'll add a few more things to @gregsdennis's excellent reply:

Recognizing a URI is enough

It is a key (and perhaps poorly explained) principle of JSON Schema that implementations may determine behaviors simply based on the URI of a schema or meta-schema. There are many ways that this can work, the two most common simply being that the behavior for a given URI is hard-coded, or that the document associated with the URI is available from some sort of local cache.

With meta-schemas, most implementations either let you choose the draft version (the only interoperable use for meta-schemas until now) when you instantiate or call the code, and then completely ignore $schema if it is present, or they distribute the meta-schema as part of the software package, and load and validate it from that local copy. Most implementations that support validating against the meta-schema also support disabling it, or only perform the validation when specifically asked.

Side note: we've caused problems with pre-packaged meta-schemas by bug-fixing the meta-schemas in place (under the same URI), as @gregsdennis has reminded me from time to time. 😄 While I've been unapologetic about this in the past, I think #612 will provide a better way to manage that in the future, which I need to write up in that issue.

From the point of view of how schemas and meta schemas currently are (or aren't) processed, vocabularies and $vocabulary are intended to be an implementation detail. If you recognize the URI in a schema's $schema, and already "know" the keyword semantics involved (whether they are declared with $vocabulary or not, you really only need a declared $vocabulary for interoperability), then you can still just hardcode the behavior. It will be just as valid (and technically just as risky) then as it is now.

I'd be interested in any suggestions on how to make this more clear. Preferably using fewer words, which is not my strong point 😛

Meta-schemas that are more restrictive than their vocabularies are a feature

Regarding:

Moreover, should implementations validate the schema against the meta-schema? What happens if the meta-schema is in fact incompatible with a declared vocabulary (e.g. declaring that the keyword $schema is not allowed or that properties is of type array)?

While declaring that a keyword has a different type than that given in its specification would be a problem, a restriction such as not using a particular keyword is fine. Restricting $schema might be problematic as it is part of the bootstrapping sequence for processing schemas, but given how often it's left off in practice I suppose you could do that if you really wanted to. But I wouldn't.

However, forbidding a keyword like uniqueItems, which presents security challenges (passing an enormous array to uniqueItems will likely consume all memory and could be used in a denial-of-service attack), is completely reasonable, and to be expected. This is part of why meta-schemas and vocabularies are distinct concepts, even though that means that a file describing a vocabulary would probably have a vexing amount of overlap with a meta-schema. That overlap is why I want to punt the formal vocabulary description file to the next draft.

Another use case might be restricting keyword combinations, like requiring that items is always present when type is array, or vice versa. There are very good reasons that this restriction is not part of the specification, but particular applications may have good reasons for imposing them.

These use cases explain why vocabularies and meta-schemas are separate. The above restrictions do not change the semantics of the keywords at all. Wherever the meta-schema allows the keywords, their semantics are exactly as identified by the vocabulary. But the meta-schema can be used to restrict cases that are problematic for an application.
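
The two restrictions mentioned above (forbidding uniqueItems, and requiring items whenever type is array) could be sketched in Python like this. The function name restriction_problems is hypothetical, and only properties and the single-schema form of items are walked to keep the example short; in practice these rules would live in a custom meta-schema rather than in code.

```python
def restriction_problems(schema, path="#"):
    """Recursively flag uses that a stricter, application-specific
    meta-schema might forbid. Keyword semantics are untouched."""
    problems = []
    if not isinstance(schema, dict):
        return problems  # boolean schemas carry no keywords
    if "uniqueItems" in schema:
        problems.append(f"{path}: uniqueItems is forbidden")
    if schema.get("type") == "array" and "items" not in schema:
        problems.append(f"{path}: array schemas must declare items")
    # Only a few applicators are walked here, to keep the sketch short.
    for name, subschema in schema.get("properties", {}).items():
        problems += restriction_problems(subschema, f"{path}/properties/{name}")
    if isinstance(schema.get("items"), dict):
        problems += restriction_problems(schema["items"], f"{path}/items")
    return problems


example = {
    "type": "object",
    "properties": {
        "tags": {"type": "array", "uniqueItems": True},
        "ids": {"type": "array", "items": {"type": "integer"}},
    },
}
for problem in restriction_problems(example):
    print(problem)
```

Here only the tags subschema is flagged, once for each restriction, while ids passes.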

Combining the above two concepts

In order to support an open-ended set of meta-schemas that implement custom restrictions, implementations will need to actually perform meta-schema validation. This whole concept encourages people to customize meta-schemas for their applications. Validation against the meta-schema handles this correctly, but implementing keyword semantics still requires a human to read a specification and write code that implements it.

So we expect implementations to simply recognize the URIs in $vocabulary, and either have hardcoded those semantics, or be able to load some sort of plugin extension thingy (technical term 😉 ) that has been provided to it by the user.
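
One way to picture that "plugin extension thingy" is a registry keyed by vocabulary URI. The Python sketch below is purely illustrative: the VocabularyRegistry class, its methods, and the choice to raise on an unrecognized URI are all assumptions rather than anything the proposal mandates.

```python
class VocabularyRegistry:
    """Maps vocabulary URIs to keyword handlers; URIs are identifiers only."""

    def __init__(self):
        self._handlers = {}

    def register(self, uri, keyword_handlers):
        """Register hard-coded or user-supplied semantics for a vocabulary."""
        self._handlers[uri] = dict(keyword_handlers)

    def resolve(self, vocabulary_uris):
        """Merge handlers for the declared vocabularies.

        Raising on an unknown URI is an assumption made for this sketch.
        """
        unknown = [uri for uri in vocabulary_uris if uri not in self._handlers]
        if unknown:
            raise LookupError(f"unrecognized vocabularies: {unknown}")
        merged = {}
        for uri in vocabulary_uris:
            merged.update(self._handlers[uri])
        return merged


VALIDATION = "http://json-schema.org/draft-08/vocabularies/validation-assertions"
registry = VocabularyRegistry()
registry.register(
    VALIDATION,
    {"maxLength": lambda instance, limit: len(instance) <= limit},
)
handlers = registry.resolve([VALIDATION])
print(handlers["maxLength"]("abc", 5))  # prints True
```

Nothing is ever fetched from the vocabulary URI; it only selects code that a human has already written.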

This is another reason that I have hesitated to come up with a file format for vocabularies that would live at the vocabulary's URI. At least to start, I would prefer that implementations just use vocabulary URIs as actual identifiers, not as locators.

Once we have some feedback on how this works in practice, we can talk about what information would be useful to put in a vocabulary file, if any, and how to go about that.

So, to summarize this section, while outright conflicts between a meta-schema and its vocabulary are bad (and produce undefined behavior, at least with respect to whatever the meta-schema author was trying to do), "conflicts" that are syntactically compatible are expected and even encouraged.

Regarding conflicting vocabularies

The PR states:

Meta-schema authors SHOULD NOT use "$vocabulary" to combine multiple vocabularies that define conflicting syntax or semantics for the same keyword. As semantic conflicts are not generally detectable through schema validation, implementations are not expected to detect such conflicts. If conflicting vocabularies are declared, the resulting behavior is undefined.

@gregsdennis I think this is the part you were referring to about detecting conflicting vocabulary semantics? My intention here is that meta-schema authors, who chose what to put in that meta-schema's $vocabulary, be responsible for avoiding conflicts.

I don't think there is any feasible way to expect implementations to do so. Defining vocabularies with compatible semantics for the same keyword (most obviously a vocabulary that adds more values for format) is expected, so you can't just error out because a keyword appears in two vocabularies, if you can even figure that out in the first place.

This definitely falls under the same "yeah, you can write invalid things, but don't" category as your maximum < minimum example.

@jgonzalezdr
Contributor

@gregsdennis Technically, they've always required the schema and the meta-schema. It just happens that the meta-schema has been hard-coded. This change will simply require a "softer coding" of the meta-schemas in that they'll need to be extensible to accept other sets of keywords.

Technically you're right. The problem that I wanted to illustrate is that vocabularies implicitly have a meta-schema associated with them. Therefore, schemas for vocabularies must be compliant with the implicit vocabulary meta-schema, independently of what a "custom" meta-schema defined by $schema says (in other words, there is an implicit "allOf" of both the vocabulary and the custom meta-schemas).

Of course, a custom meta-schema should also be compliant with the vocabulary meta-schema. What is a bit tricky here is that schemas must be compliant with the vocabulary's implicit meta-schema, but it is not mandatory for implementations to check that the schema validates against its "custom" meta-schema, so in most cases the "custom" meta-schema ends up being totally bypassed.

@gregsdennis Just like with any other schema, it remains the author's responsibility to ensure that a schema is valid, whether through compatible keywords or compatible vocabularies. For example, there's nothing currently stopping an author from writing this schema:

It's not exactly the same case. The example that you propose is semantically incorrect (it makes no sense and has no practical use), but it is grammatically correct and is confined to the schema level, so implementations will have no problem processing that schema. The problem that I described is more like having incompatible grammars at the meta-schema level (a bit like expecting a sentence to be grammatically correct in two different languages), and I think it should be very clear how implementations should proceed when they detect such conflicts, instead of delegating to meta-schema writers the responsibility of writing proper meta-schemas.

@gregsdennis
Member

gregsdennis commented Nov 27, 2018

It's not exactly the same case.

You're right. It's not exactly the same. And you properly described why they're not exactly the same. However the responsibility of ensuring a custom meta-schema uses compatible vocabularies still lies on the author. The author must understand the vocabularies they're referencing and how they may interact.

it is not mandatory for implementations to check that the schema validates against its "custom" meta-schema

A requirement to validate against the custom meta-schema (including all of the vocabularies that it uses) is one of the changes being proposed. Implementations will need to update to begin performing this validation. It should be noted, though, that once a schema has been validated it need not be validated again; this is a one-time operation, so performance impact to validation processes is negligible.
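
The "one-time operation" point can be sketched as a small cache in Python. The SchemaChecker class is hypothetical, and the meta-schema validation itself is a caller-supplied callable, since implementing it is out of scope here.

```python
import json


class SchemaChecker:
    """Validates each distinct schema against its meta-schema at most once."""

    def __init__(self, validate_against_meta_schema):
        # `validate_against_meta_schema` is a caller-supplied callable that
        # returns True or False; its implementation is out of scope here.
        self._validate = validate_against_meta_schema
        self._results = {}
        self.calls = 0  # counts how often the expensive check actually ran

    def is_valid(self, schema):
        # A canonical JSON serialization serves as the cache key.
        key = json.dumps(schema, sort_keys=True)
        if key not in self._results:
            self.calls += 1
            self._results[key] = self._validate(schema)
        return self._results[key]


# Stand-in check for illustration: accept any object or boolean schema.
checker = SchemaChecker(lambda schema: isinstance(schema, (dict, bool)))
order_schema = {"type": "object", "required": ["id"]}
for _ in range(1000):
    checker.is_valid(order_schema)  # e.g. once per instance being validated
print(checker.calls)  # prints 1
```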

@jgonzalezdr
Contributor

The problem that I see here is that historically the vocabulary and meta-schema concepts were entangled, but it has proven necessary to separate them. I suppose that initially the $schema keyword was intended to be used just as a URI to identify the vocabulary; then, taking into account that a schema is just JSON that can be validated by another schema, the URI eventually became some sort of URL for the implicit vocabulary meta-schema.

Maybe the problem is aggravated because we are using the meta-schema term to really talk about a different and much narrower concept: vocabulary dialects.

Let me define some nomenclature that may help establish some common ground:

  • A meta-schema is just a schema that validates another schema.
  • As a simple definition, a vocabulary is a set of keywords with associated semantics and minimal grammar. It has an associated implicit meta-schema that validates schemas for that vocabulary according to that minimal grammar.
  • A vocabulary dialect is a redefinition of a vocabulary whose schemas are still compatible with the original vocabulary. This redefinition may restrict the original grammar (e.g. disable keywords) or extend it to add new keywords. Dialects have an associated meta-schema.
  • Redefining a vocabulary changing keyword semantics or relaxing the original grammar is not a dialect, it's an entirely new vocabulary.

I think that we all agree that implementations should only consider the vocabulary of a schema to decide if and how to process it. Dialects that only change the grammar would be transparent for the implementation, since they only impact the schema at the time of writing it, not at the time of using it. Implementations that "know" dialects that add additional keywords can process the additional keywords with their associated semantics (i.e. application-specific implementation), but other implementations should safely just ignore them.

I don't think I have said anything really new here, and the idea behind the work in progress by @handrews is aligned with that. But I see a "design smell" in it: it makes meta-schemas a "first-class citizen", and we now have 3 levels: meta-schema, schema, instance. Moreover, as the vocabularies that have to be used by a schema are declared in the meta-schema, a meta-schema may not be able to validate itself, so we can end up with a meta-meta-schema or even a (meta)^n-schema.

However, if the vocabularies that should be used to process a schema are defined in the schema itself, this infinite regression problem is solved. Additionally, implementations will have at hand the information needed to determine if and how to process the schema, and in the end the meta-schema is not strictly necessary, since the schemas will have to be compatible with the implicit vocabulary meta-schemas anyway (i.e. not ill-formed).

As a conclusion:

  • $vocabulary could be used to identify the vocabularies that a schema uses. Pure URI could be enough.
  • $schema could be used to identify the meta-schema that validates the schema. For schemas not based on dialects that would just be the "official" vocabulary meta-schema (or, for multi-vocabulary schemas, a meta-schema merging them); for dialects, it would be the dialect's meta-schema. Implementations should not depend on the meta-schema for processing, and should just use it to perform optional schema validation and detect application-specific dialects.

@handrews
Contributor Author

@jgonzalezdr This is a great summary, thanks.

There is no infinite regression. You always end up at a self-validating meta-schema; generally speaking, it will be the one that is analogous to the current http://json-schema.org/draft-07/schema one. In the vast majority of cases, all of the URIs involved will be recognizable as "standard" to implementations, so no actual processing will have to take place unless validation of meta-schemas is requested. And even then I expect only one level of validation; there's no need to have the standard meta-schemas validate each other every time you do anything.
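
The termination argument can be sketched as a walk along $schema links. This Python illustration assumes a hypothetical documents mapping from normalized URIs (fragments stripped) to already-available documents, uses made-up example.com URIs, and treats a document without $schema as ending the chain, which is a simplification.

```python
def meta_schema_chain(schema_uri, documents):
    """Follow $schema links until a self-describing meta-schema is reached."""
    chain = [schema_uri]
    while True:
        current = chain[-1]
        # Assumption: a document without $schema ends the chain.
        parent = documents[current].get("$schema", current)
        if parent == current:
            return chain  # self-validating meta-schema: the chain ends here
        if parent in chain:
            raise ValueError(f"cycle in $schema chain via {parent}")
        chain.append(parent)


documents = {
    "https://example.com/my-schema": {"$schema": "https://example.com/my-dialect"},
    "https://example.com/my-dialect": {
        "$schema": "http://json-schema.org/draft-07/schema"
    },
    "http://json-schema.org/draft-07/schema": {
        "$schema": "http://json-schema.org/draft-07/schema"
    },
}
for uri in meta_schema_chain("https://example.com/my-schema", documents):
    print(uri)
```

The chain here is three documents long: the schema, the dialect meta-schema, and the standard meta-schema that describes itself.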

Your conclusion is proposing moving $vocabulary out of the meta-schema and into the schema, correct?

That is actually how a much earlier version of this worked (I'm not sure I actually ever wrote that version up except partially on my laptop- this concept went through a lot of unsuccessful iterations before I even bothered posting anything).

The problem with putting $vocabulary into schemas is that it will be very common for schemas to rely on multiple vocabularies; making things more modular is part of the point of vocabularies. Using four or so will be common, maybe more depending on exactly how we sort out format and the content* keywords. So that would require every single schema author to list out those four, six, ten, or however many vocabularies, get the combination right, and get the associated meta-schema right.

That is far too high of a burden on schema authors, who are a very large population (among people who care about JSON Schema at all :-D ). Schema authors will have a wide variety of skill level and experience, and it needs to be very easy to write a basic schema.

The set of people who will write meta-schemas is much smaller, although probably moderate-sized now that dialects (great term, btw, very helpful) will be easy to create. But I think it is reasonable to ask people who want to create dialects to understand the constituent vocabularies, and how vocabularies in general work.

To me, it is a requirement that writing a schema is not more complicated with vocabularies than it currently is without.

For most schema authors, vocabulary composition is an implementation detail. Most will probably just go on using the new versions of the regular meta-schemas we have right now and just start using new features like unevaluatedProperties.
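
A minimal Python sketch of that division of labor follows. It assumes `$vocabulary` in the meta-schema is an object mapping each vocabulary URI to a boolean "required" flag. All `example.com` URIs and the `vocabularies_for` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical dialect meta-schema; every example.com URI is invented.
# Assumes "$vocabulary" maps each vocabulary URI to a "required" flag.
DIALECT = {
    "$id": "https://example.com/meta/my-dialect",
    "$vocabulary": {
        "https://example.com/vocab/core": True,
        "https://example.com/vocab/validation": True,
        "https://example.com/vocab/annotations": False,
    },
}
META_SCHEMAS = {DIALECT["$id"]: DIALECT}

# What a schema author writes: one "$schema" line and no vocabulary list.
author_schema = {
    "$schema": "https://example.com/meta/my-dialect",
    "type": "object",
}


def vocabularies_for(schema, meta_schemas):
    """Look up the vocabularies a schema uses via its meta-schema."""
    meta = meta_schemas[schema["$schema"]]
    return dict(meta.get("$vocabulary", {}))


vocabs = vocabularies_for(author_schema, META_SCHEMAS)
required = sorted(uri for uri, is_required in vocabs.items() if is_required)
print(required)
```

The schema itself stays as simple as it is today; only the meta-schema author has to get the vocabulary combination right.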

There is a vexing problem where, as you say, vocabularies kind of have an implicit meta-schema, except that we actually need to make it explicit, which runs the risk of annoying duplication. This is, as far as I can tell, a fairly intractable problem, and the main reason why I've gone through several versions of the proposal before posting.

This is why I have avoided defining the resource at the vocabulary URI, a.k.a. the vocabulary description file.

I am hoping that putting this out there in the real world will bring us feedback that will tell us what we need from that file. Originally, I was going to have them be the meta-schemas, maybe with some extra keywords, but that got very complicated very quickly.

If you have an alternative that preserves the simplicity for schema authors, I would love to hear it. Otherwise, I think the current approach of leaving the vocabulary file undefined allows us to avoid actually duplicating things between the meta-schema and vocabulary file (because there is no vocabulary file) and spend more time coming up with a practical solution to the problem.

@xferra

xferra commented Nov 27, 2018

Meh, it took almost all day to read all those related mega-threads on schema and/or data evolution. 😕 I still have no idea what the best practice or final decision is, or even which proposals were rejected (time to create a FAQ?).

Practical Scenario 1:
Let's say I'm gonna release my vocabularies as Node.js packages and distribute them via npm. I'm gonna follow semver and semver ranges, so clients will know when changes are expected to be backward compatible.

  1. How to specify compatible vocabularies in plain JSON Schema? (oh we gonna build schema dependency manager 👍 )
  2. To be updated

@jgonzalezdr
Contributor

jgonzalezdr commented Nov 28, 2018

@handrews: I see your point and I buy into that. My concern is about complex schemas "$ref-ing" other ones in a deep structure, with different vocabularies involved. But anyway, most probably the meta-schema "chain" will end with the validation vocabulary meta-schema, won't it?

I'll give some thought to practical use cases to see if I can find any caveats.

> There is a vexing problem where, as you say, vocabularies kind of have an implicit meta-schema, except that we actually need to make it explicit, which runs the risk of annoying duplication. This is, as far as I can tell, a fairly intractable problem, and the main reason why I've gone through several versions of the proposal before posting.

It may seem unnecessary and superfluous, but to ensure defined behavior of implementations against "ill" meta-schemas, my opinion is that specs should state, in addition to the current rule that a schema MUST be valid against its declared meta-schema, that a schema MUST be valid according to the rules for all vocabularies declared in the meta-schema.

The temptation to rule that a schema must also be valid against vocabularies' implicit meta-schemas should be avoided, since the vocabulary rules can actually be fairly complex, and some may not be formalizable as a meta-schema (at least with the current validation vocabulary). For example, an invented vocabulary could require that, if both the "minimum" and "maximum" keywords are present, the value of the minimum must be lower than the maximum. Or that the number of items in an array is even.
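
The first of these invented rules can be sketched as a plain Python function. This is only an illustration under stated assumptions: `check_min_below_max` is a hypothetical helper, the rule is the made-up one from this comment rather than anything in the standard vocabularies, and the sketch only recurses into "properties" to stay short.

```python
def check_min_below_max(schema):
    """Return error messages for the invented rule 'minimum < maximum'.

    The rule compares the values of two sibling keywords, which is why it
    is written here as code rather than as a meta-schema.
    """
    errors = []
    if not isinstance(schema, dict):
        return errors
    lo, hi = schema.get("minimum"), schema.get("maximum")
    if lo is not None and hi is not None and not lo < hi:
        errors.append(f"minimum ({lo}) must be lower than maximum ({hi})")
    # Only subschemas under "properties" are visited in this sketch.
    props = schema.get("properties")
    if isinstance(props, dict):
        for sub in props.values():
            errors.extend(check_min_below_max(sub))
    return errors


ok = {"properties": {"age": {"minimum": 0, "maximum": 150}}}
bad = {"properties": {"age": {"minimum": 10, "maximum": 5}}}
print(check_min_below_max(ok))
print(check_min_below_max(bad))
```

An implementation of such a vocabulary would run a check like this itself; a generic tool that only knows the meta-schema would not.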

@handrews
Contributor Author

@jgonzalezdr

Don't borrow trouble worrying about incredibly complicated meta-schema chains. I do not think those will be common, and in any event this is why we publish drafts. This is not a set specification.

The worst case scenario here is that we get feedback that it's confusing and slow, and then we improve it.


> my opinion is that specs should state, in addition to the current rule that a schema MUST be valid against its declared meta-schema

This is already done. From section 7 of the current published spec (referring to the $schema URI):

> The current schema MUST be valid against the meta-schema identified by this URI.


> a schema MUST be valid according to the rules for all vocabularies declared in the meta-schema.

There is no way to enforce or even detect this. The meta-schemas take care of the validation that is possible.


> The temptation to rule that a schema must also be valid against vocabularies' implicit meta-schemas should be avoided

No one is proposing anything at all involving implicit meta-schemas.

@handrews
Contributor Author

@xferra great questions!

We will definitely not be implementing any sort of packaging/versioning system!

From the specification perspective, the place to put any versioning string is in the meta-schema URIs, and now also in the vocabulary URIs. You can put semver or whatever else in those.

These are identifiers, so it is not necessary to serve the meta-schema at a retrievable URL (although it is possible, of course). It is (for this one draft, at least) expressly forbidden to serve any document at the vocabulary URI, so those are purely identifiers and there's not really anything to package and distribute for vocabularies.

If you implement a vocabulary with a plugin for a validator (or other tool), you should package and distribute it however the tool is packaged and distributed.

If you want to distribute meta-schemas (particularly if you use a URN or have any other situation where the meta-schema cannot be served from its URI), I would treat that basically like a configuration file, and package it either on its own (in which case your versioning may or may not match the URIs; I could see use cases either way), or package it with whatever uses it (the way many validators actually package the standard meta-schemas).

I'm sure I'm missing some things here, but I think a key point is that right now there's nothing to distribute for vocabularies. You just document them and identify them with a URI.

By the time we know enough to design some sort of vocabulary description file, we should know a lot more about use cases and real-world practices.

Also, thanks for slogging through it all. Believe it or not, this is nowhere near the longest topic we've had! But I know it's a lot of work.

@jgonzalezdr
Contributor

> > a schema MUST be valid according to the rules for all vocabularies declared in the meta-schema.

> There is no way to enforce or even detect this. The meta-schemas take care of the validation that is possible.

In fact, implementations of a vocabulary do enforce or detect whether the schema they are processing is valid.

My point is not that a tool without prior knowledge of the vocabulary should be able to detect a mismatch between the meta-schema and the declared vocabulary automatically.

Let me elaborate a bit more on the problem that I think should be addressed:

Suppose that I write a draft-07 schema that declares "properties": []. I then execute a tool to validate an instance against that schema. The tool will of course detect that the schema is not valid, either by validating the schema against the draft-07 meta-schema before semantic processing, or during hardcoded syntactic and semantic processing. Everything fine.

Suppose now that I write a meta-schema based on draft-07's, that allows properties to be an array. Then I modify my schema such that $schema references that new meta-schema. When I run the tool to validate an instance against it, it will fail complaining that it cannot proceed because $schema is not recognized. I then force the tool to interpret the schema as a draft-07 schema, and the tool can either detect that the schema is not valid for the draft-07 vocabulary, or even just crash. But nobody could complain, because I'm using the tool in a non-compliant way, making it "break" the rules.

Suppose now that I write a meta-schema based on draft-08's that allows properties to be an array, and that declares in $vocabulary that the validation vocabulary is "required". I then modify my schema so that $schema references that new meta-schema. When I run the tool again to validate an instance, it goes ahead because the schema is valid against its meta-schema and it recognizes the validation vocabulary, and finally the tool crashes. I could legitimately complain in this case, because iirc I haven't broken any of draft-08's mandatory requirements.

This is the "weak" point I see in the PR right now. I may be wrong, but I think it allows defining meta-schemas that are not valid for a vocabulary. This didn't happen with previous drafts because the vocabulary was tightly associated with a single meta-schema.
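
The third scenario can be mimicked with a toy Python sketch. Both functions are hypothetical stand-ins, not real validator APIs: `passes_custom_meta_schema` plays the role of the modified meta-schema that accepts an array for `properties`, and `validate_properties` plays the role of a validation-vocabulary implementation that assumes `properties` is an object. The `example.com` URI is invented.

```python
def passes_custom_meta_schema(schema):
    """Stand-in for the modified meta-schema: "properties" may be an
    object or an array (the array case is the non-standard part)."""
    return isinstance(schema.get("properties", {}), (dict, list))


def validate_properties(schema, instance):
    """Stand-in for a vocabulary implementation that assumes "properties"
    is an object, as the validation vocabulary defines it."""
    for name, subschema in schema.get("properties", {}).items():
        if name in instance and subschema.get("type") == "string":
            if not isinstance(instance[name], str):
                return False
    return True


schema = {
    "$schema": "https://example.com/meta/array-properties",  # invented URI
    "properties": [],
}

# The schema is valid against its (custom) meta-schema...
meta_ok = passes_custom_meta_schema(schema)

# ...but the vocabulary implementation was never written for this shape.
try:
    validate_properties(schema, {"a": 1})
    outcome = "validated"
except AttributeError as exc:
    outcome = f"crashed: {exc}"

print(meta_ok, outcome)
```

Whether a real tool crashes, rejects the schema, or does something else in this situation depends on the tool; the sketch only shows that passing the meta-schema check says nothing about what the vocabulary's implementation expects.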

@jgonzalezdr
Contributor

To address this issue, a new paragraph could be added to the "Best Practices for Vocabulary and Meta-Schema Authors", along the lines of:

> Meta-schema authors SHOULD NOT write meta-schemas that define conflicting syntax with the vocabularies declared in "$vocabulary". If conflicting syntax with vocabularies is defined, the resulting behavior is undefined.

This requirement is similar to the "combining conflicting vocabularies" requirement already present in the PR, but complements it by addressing conflicts between the meta-schema and the vocabularies.

@handrews
Contributor Author

@jgonzalezdr OK, the best practices idea makes sense, thanks!

I'm not sure your example quite works out that way, but since we have a more concise recommendation to go with here I'm not going to sort that out.

@handrews
Contributor Author

Thanks all! Merged #671. There will no doubt be more work on vocabularies but I'm calling this and the other closely related issues targeted for draft-08 done!

If there was anything left unresolved from the discussion here, please file a new issue.


5 participants