What is "$ref" and how does it work? #514

Closed
handrews opened this issue Nov 28, 2017 · 69 comments · Fixed by #628

@handrews
Contributor

handrews commented Nov 28, 2017

The question of whether $ref behaves like inclusion or delegation has come up several times.

  • Inclusion would mean that the $ref object can be replaced with its target (in a lazily evaluated process).
  • Delegation means that the target of the $ref is processed against the current instance location, and the "results" (boolean assertion outcome and optionally the collected annotations) of $ref are simply the results of the target schema.

Inclusion

In its original form, $ref in draft-03 and in the separate JSON Reference I-D is explicitly defined as inclusion. Implementations MAY choose to replace the reference object with its target.

There are some subtleties involved in replacement. You definitely need to adjust the $id when you do the replacement or else base URIs get messed up. Tools such as JSON Schema Ref Parser that "dereference" $refs do just that.
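To make the base URI problem concrete, here is a minimal Python sketch. The `inline_ref` helper, the `documents` registry and the example.com URIs are all made up for illustration, and the sketch only covers a $ref that targets a whole document.

```python
import copy
from urllib.parse import urljoin


def inline_ref(referrer_base: str, ref: str, documents: dict) -> dict:
    """Copy a whole-document $ref target, pinning its base URI with "$id"."""
    target_uri = urljoin(referrer_base, ref)
    inlined = copy.deepcopy(documents[target_uri])
    # Resolve any relative "$id" in the target; otherwise fall back to the
    # URI the target was retrieved from.
    inlined["$id"] = urljoin(target_uri, inlined.get("$id", ""))
    return inlined


# Hypothetical registry of already-retrieved schema documents.
documents = {
    "http://example.com/schemas/item.json": {
        "properties": {"tag": {"$ref": "tag.json"}},
    },
}

inlined = inline_ref("http://example.com/root.json", "schemas/item.json", documents)

# The nested relative reference resolves differently with and without "$id".
with_id = urljoin(inlined["$id"], "tag.json")
without_id = urljoin("http://example.com/root.json", "tag.json")
```

Without the adjusted $id, the nested "tag.json" reference would resolve against the referring document and point at the wrong resource.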

You also need to deal with $schema, which (when the target is in a different schema document) can be different in the source and target schemas. The obvious solution is to set $schema in the copied-over replacement, however @epoberezkin has observed that this conflicts with how non-schema instances work with meta-schemas.

We cannot make any assumption about instance documents. A key feature of JSON Schema is that it works with plain old application/json documents. There is no way for a plain JSON document to change what schema is used to process it. Changing $schema in the middle of validating a schema against its meta-schema introduces behavior that is not possible with other instance documents.

Delegation

With delegation, these problems do not exist. Each subschema is evaluated in the context of its containing schema document, regardless of whether processing reached it from elsewhere in that same document or from a $ref in a separate document. Since the processing is done per-document, each document can use a different $schema.

Results are returned in the form of a single overall boolean assertion outcome (so it doesn't matter to the referencing document what the assertions were or how they were processed) and optionally a set of annotation data (which is a set of name-value pairs of some sort).

The only subtlety is in combining data for the same annotation that appeared in both a local document subschema and a remote $ref'd document subschema. However, this is easily addressed: once the annotation data is "returned" across the $ref, it is combined with other annotation data by the rules of the schema containing the $ref. This keeps all processing consistent within each schema document, such that the rules can change independently on each side (for instance if they are upgraded to a new draft at different times).

Another nice property of delegation (also from @epoberezkin) is that $ref can become a "normal" keyword that has assertion and/or annotation results that are combined with adjacent schema keywords just like everything else.
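A toy Python sketch of delegation may help here. It is not any real validator's API: `evaluate`, `resolve_pointer` and the tiny keyword subset are assumptions made for the example, as is the rule that annotations are dropped when assertions fail.

```python
from typing import Any, Dict, List, Tuple

Annotations = Dict[str, List[Any]]
TYPES = {"string": str, "object": dict, "array": list}


def resolve_pointer(document: Any, ref: str) -> Any:
    """Resolve a same-document reference such as "#/definitions/name"."""
    if not ref.startswith("#"):
        raise ValueError("this sketch only handles same-document references")
    node = document
    for token in ref[1:].split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def evaluate(schema: dict, instance: Any, root: dict) -> Tuple[bool, Annotations]:
    """Return (assertion outcome, collected annotations) for one schema object."""
    valid = True
    annotations: Annotations = {}
    for keyword, value in schema.items():
        if keyword == "$ref":
            # Delegation: run the target against the same instance location
            # and take over its results; the referrer never inspects the target.
            ok, notes = evaluate(resolve_pointer(root, value), instance, root)
            valid = valid and ok
            for name, values in notes.items():
                annotations.setdefault(name, []).extend(values)
        elif keyword == "type":
            valid = valid and isinstance(instance, TYPES[value])
        elif keyword == "minLength" and isinstance(instance, str):
            valid = valid and len(instance) >= value
        elif keyword == "title":
            annotations.setdefault("title", []).append(value)
    # Assumption of this sketch: annotations are reported only on success.
    return valid, (annotations if valid else {})


root = {
    "definitions": {"name": {"type": "string", "title": "Name"}},
    "$ref": "#/definitions/name",
    "minLength": 2,
    "title": "Short name",
}
```

Here $ref is handled in the same loop as its sibling keywords: its boolean result is ANDed with theirs, and its annotations are merged by the referring schema's own rules.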

Alternate approaches

NOTE: This section is about showing that it is possible to handle the limitations in other ways. Neither of these approaches is a serious proposal for recommendation!

An alternate approach to inclusion

The one use case that is not well-handled by delegation is that of packing multiple schema documents into a single distribution unit (file, resource, whatever). There's some debate as to how valid or important this use case is, but it does come up. This is only done by replacing specific non-cyclic $refs, and does not involve trying to "dereference" all $refs in the system.

For those who really want to do this, it occurs to me that there is another way to handle it: data: URIs.

Let's say that this is our reference target:

{
    "$schema": "http://json-schema.org/draft-06/schema#",
    "propertyNames": {"pattern": "^foo"}
}

Here we see how it can be inlined into a draft-04 schema using a data: URI:

{
    "$schema": "http://json-schema.org/draft-04/schema#",
    "allOf": [{
        "$ref": "data:application/schema+json,%7B%22%24schema%22%3A%20%22http%3A//json-schema.org/draft-06/schema%23%22%2C%20%22propertyNames%22%3A%20%7B%22pattern%22%3A%20%22%5Efoo%22%7D%7D"
    }]
}

Of course this looks horrible and would never be suitable for human consumption. But it would work.
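For what it's worth, such a URI does not have to be written by hand. The following Python sketch uses only the standard library; the helper names `to_data_ref` and `from_data_ref` are made up for illustration.

```python
import json
from urllib.parse import quote, unquote

target = {
    "$schema": "http://json-schema.org/draft-06/schema#",
    "propertyNames": {"pattern": "^foo"},
}


def to_data_ref(schema: dict) -> str:
    """Percent-encode a schema into a data: URI usable as a $ref value."""
    return "data:application/schema+json," + quote(json.dumps(schema))


def from_data_ref(ref: str) -> dict:
    """Decode the schema embedded in a data: URI produced by to_data_ref."""
    _, _, payload = ref.partition(",")
    return json.loads(unquote(payload))


ref = to_data_ref(target)

# The draft-04 wrapper delegates to the embedded draft-06 schema.
wrapper = {
    "$schema": "http://json-schema.org/draft-04/schema#",
    "allOf": [{"$ref": ref}],
}
```

An implementation that chose to support data: URIs would decode the payload and process it as a separate schema document with its own $schema.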

I'm not seriously advocating this as a recommendation, but it does illustrate that the problem is solvable in the delegation model if you are willing to make some tradeoffs.

An alternate way to change processing rules

The limitation of replacement is that changing $schema in the middle of a file makes validating schemas against meta-schemas different from validating other instances against schemas, as we cannot specify a schema change mechanism for plain application/json instance documents.

However, we could extend the instance-to-schema linking process to allow associating an instance JSON Pointer with each linked schema. For schemas-as-instances, this would be equivalent to setting $schema at the location identified by the JSON Pointer.

This would have to be done as a link attribute/media type parameter thing of some sort. Like the data: URI solution, this gets ugly pretty quickly, and makes it more likely to hit HTTP header length limitations.

It's also arguably a lot harder to hold in one's head than simply saying that $schema is scoped to a JSON document rather than an individual schema object. While the data URI alternative given above is ugly, the ugliness comes from a separate standard, and is purely opt-in. Schema authors have to choose to use those URIs, and implementations would choose whether to support data URIs or not.

This approach of targeting different schemas at different portions of the instance would place a significant burden on all implementations.

Conclusions

Based on this write-up, I am inclined to formalize $ref as delegation. It is a more flexible model, and the technique for working around its limitations correctly restricts the burden of implementation to only those who choose to use or support it. Inclusion is less flexible, and the workaround (if we did anything about it at all) is burdensome to all implementations.

@handrews
Contributor Author

Another interesting concept is the one raised by @darrelmiller in OAI/OpenAPI-Specification#556 (comment), which classifies $ref as a link serialization.

This would make it somewhat like <a href="..."> in HTML, as an untyped (or implicitly typed, as it serves exactly one purpose) hyperlink. That is consistent with the delegation approach, including the use of data URIs.

In the OpenAPI issue, there's some discussion of title and description as link attributes to distinguish them from any such fields in the target schema, with two different syntax possibilities. I'm still intrigued by that idea, but I'm not as sure that it is necessary. In particular, deferred keywords as proposed by #515 may provide a more general way to "override" such annotations across references.

@erayd

erayd commented Nov 28, 2017

Regarding $ref, I feel strongly that this should be a delegation. This allows behaviors such as validating part of a document against an external schema, without needing to care which version of the spec that remote schema uses - as long as the validator supports it, it will work.

For the particular case of schema transformations, I think that there should be no visibility whatsoever inside the target schema for the purpose of transformation. It's essentially a function call. Schemas wishing to transform the target should use a different keyword for that purpose (e.g. $include, or something else that indicates the target is a source for patching).

@handrews
Contributor Author

As a note for readers here, if you want to discuss @erayd's points about schema transformations, that is going on in #515, so please join the discussion over there.

I'd like to keep this issue focused on delegation vs inclusion (and in particular, see if anyone wants to advocate for inclusion as the sentiment so far is definitely trending towards delegation).

@epoberezkin
Member

Two additional arguments for delegation:

  1. "Lazy evaluation" of inclusion is counter to users' expectations. The inclusion model implies that you can produce a "final", "resolved" schema document. In the general case this is either impossible (if you limit the data representation to JSON) or requires self-referencing data structures (which are not universally supported and are more difficult to process when they are supported; also, a situation where you cannot represent a "resolved" JSON Schema in JSON feels very wrong).
  2. The delegation model has better extension potential for the future, e.g. for parametrised schemas (https://github.com/json-schema-org/json-schema-spec/issues/322).
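The first point is easy to reproduce. In this small Python illustration (the schema is made up), a recursive {"$ref": "#"} is "resolved" by pointing the property at the root object itself, and the result can no longer be written back out as JSON.

```python
import json

schema = {"type": "object", "properties": {}}

# "Resolve" a recursive {"$ref": "#"} by replacing it with the root object.
schema["properties"]["child"] = schema

try:
    json.dumps(schema)
    serializable = True
except ValueError:
    # The standard json module refuses to serialize circular structures.
    serializable = False
```

The in-memory structure is usable, but it has no JSON representation, which is exactly the awkwardness described above.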

@handrews
Contributor Author

handrews commented Dec 2, 2017

@epoberezkin thanks for the comment. I agree on the extensibility thing and #322. I'm still not necessarily sold on that (or $data) but I think that sorting out this issue and #515 will help us decide on both of those.

@sgpinkus

sgpinkus commented Dec 4, 2017

Why is "lazy evaluation" associated to inclusion and not delegation? Couldn't you do either lazily?

@sgpinkus

sgpinkus commented Dec 4, 2017

I also think lazily evaluating a schema is a stupid idea and an edge case that no one really needs or wants. I.e. just compile the schema first instead of lazily evaluating.

Alternate Alternate Approach

JSON schema should be completely independent of JSON Ref, except to possibly require that JSON Refs shall be resolved prior to evaluation.

@erayd

erayd commented Dec 4, 2017

@sam-at-github Lazy evaluation makes implementation of recursive schemas easy. That's not an edge case; recursive schemas are common.

Also, non-lazy evaluation necessarily implies the evaluation of lots of schema paths which may not be applicable to the document instance at hand. This is a significant waste of resources, particularly for complex schema definitions being run against a simple document.

@sgpinkus

sgpinkus commented Dec 4, 2017

That's not an edge case; recursive schemas are common.

Resolve / compile it to a native reference.

Also, non-lazy evaluation necessarily implies the evaluation of lots of schema paths which may not be applicable to the document instance at hand. This is a significant waste of resources,

That's why you do it once - compile.

Just makes things so much simpler if JSON Ref is an independent standard and JSON Schema stacks on top, just like TCP/IP etc.

Last time I checked in with JSON Schema spec was over a year ago. They were arguing about $ref then!

@erayd

erayd commented Dec 4, 2017

@sam-at-github I think we may have been talking about two different things then. But my point still stands - whether you're evaluating it or compiling it, the end result is you're still following lots of schema branches that may not make sense to follow, particularly in the case of dynamically generated schema which may not be able to be compiled in advance, or non-compiling implementations.

Assuming compilation happens where practical in either case, what advantages does non-lazy handling of $ref bring, that makes it better than lazy?

@sgpinkus

sgpinkus commented Dec 4, 2017

Assuming compilation happens where practical in either case, what advantages does non-lazy handling of $ref bring, that makes it better than lazy?

  1. You verify that your refs actually resolve up front, instead of your validator blowing up when it tries to resolve a certain unusual doc hitting a broken ref.
  2. It makes this spec simpler. JSON Schema doesn't have to care about refs or how or when they are resolved. It just assumes they are, and JSON Schema authors can stop arguing about it ;).

@handrews
Contributor Author

handrews commented Dec 4, 2017

@sam-at-github

Resolve / compile it to a native reference.

This is a JSON Specification. There are no native references. For some implementations that will make sense, for some it will not. But it's all irrelevant to the specification: We work with JSON, and with the underlying data model of JSON. That is all.

You verify that your refs actually resolve up front, instead of your validator blowing up when it tries to resolve a certain unusual doc hitting a broken ref.

It's easy to statically pre-check that all $ref targets have been seen. That has nothing to do with runtime lazy evaluation.
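As a rough Python sketch of such a static pre-check: the helpers `iter_refs`, `pointer_exists` and `broken_refs` are made up for illustration, only same-document references are checked, and any key named "$ref" is naively treated as a reference.

```python
from typing import Any, Iterator, List


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every "$ref" string found while walking the JSON tree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def pointer_exists(document: Any, ref: str) -> bool:
    """Check that a same-document reference points at an existing location."""
    node = document
    for token in ref[1:].split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, dict) and token in node:
            node = node[token]
        elif isinstance(node, list) and token.isdigit() and int(token) < len(node):
            node = node[int(token)]
        else:
            return False
    return True


def broken_refs(document: Any) -> List[str]:
    """List same-document references whose targets are missing."""
    return [
        ref for ref in iter_refs(document)
        if ref.startswith("#") and not pointer_exists(document, ref)
    ]


schema = {
    "definitions": {
        "node": {"properties": {"next": {"$ref": "#/definitions/node"}}},
    },
    "properties": {
        "head": {"$ref": "#/definitions/node"},
        "tail": {"$ref": "#/definitions/missing"},
    },
}
```

The walk never follows a reference, so the cyclic "node" definition is no obstacle; only the existence of each target is checked.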

It makes this spec simpler. JSON Schema doesn't have to care about refs or how or when they are resolved.

Putting complexity somewhere else doesn't make it go away. Also, even when it was a separate specification, you still needed to perform lazy evaluation in the context of schemas. Validating the meta-schema against itself requires this. You don't even have to look far.

Even when JSON Reference was separate, this was always a concern that had to be dealt with (from draft-pbryan-zyp-json-ref-03, the last separate I-D for JSON Reference):

Documents containing JSON References can be structured to resolve
cyclically. Implementations SHOULD include appropriate checks to
prevent such structures from resulting in infinite recursion or
iteration.

This has always been done by resolving in the context of the instance. Your schema references may cycle, but given a finite instance document you will always find an end to the recursion.
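To illustrate, here is a toy Python validator, not a real implementation. It supports only "type", "properties", "items" and the reference "#", all of which are simplifying assumptions. The schema refers to itself, yet validation always terminates because each recursive step consumes part of the instance.

```python
from typing import Any

TYPES = {"object": dict, "array": list, "number": (int, float), "string": str}

# A recursive tree schema: every child must again match the root schema.
tree_schema = {
    "type": "object",
    "properties": {
        "value": {"type": "number"},
        "children": {"type": "array", "items": {"$ref": "#"}},
    },
}


def validate(schema: dict, instance: Any, root: dict) -> bool:
    if "$ref" in schema:
        if schema["$ref"] != "#":
            raise ValueError("this sketch only supports the reference '#'")
        # The reference is followed only when an instance value reaches it.
        if not validate(root, instance, root):
            return False
    if "type" in schema and not isinstance(instance, TYPES[schema["type"]]):
        return False
    if isinstance(instance, dict):
        for name, subschema in schema.get("properties", {}).items():
            if name in instance and not validate(subschema, instance[name], root):
                return False
    if isinstance(instance, list) and "items" in schema:
        if not all(validate(schema["items"], item, root) for item in instance):
            return False
    return True


good = {"value": 1, "children": [{"value": 2, "children": []}, {"value": 3}]}
bad = {"value": 1, "children": [{"value": "x"}]}
```

The depth of recursion is bounded by the depth of the instance, not by the schema.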


You haven't made a single compelling statement for your position. You've just kind of dropped in here and called us all stupid, and pointed out that it's easier to implement if it's less powerful. Which is no surprise, but does absolutely nothing to address the many real-world uses of cyclic references and lazy evaluation.

I've worked with very large sets of schemas; there is no way that project could have possibly worked without lazily evaluated $ref.

I've also inherited a project that forbade schema authors from using any recursive $refs so that they could dereference everything as a preprocessor. It really didn't end well. They ended up introducing another $ref-ish keyword that was evaluated after $ref. So instead of a uniform flexible lazy evaluation system, they had a weird custom two-phase mess.


Removing the very long-standing (pre-draft-04) lazy evaluation nature of $ref is absolutely not on the table as an option.

@sgpinkus

sgpinkus commented Dec 4, 2017

Didn't call you stupid ...

I'm arguing for inclusion as it makes dealing with refs outside of JSON schema and refs possible. I still think lazy evaluation is stupid. As long as it's not mandatory.

Putting complexity somewhere else doesn't make it go away.

No, but it can certainly make things less complex.

@handrews
Contributor Author

handrews commented Dec 4, 2017

You still haven't made a case for how your approach solves all use cases. You've just decided many of our use cases are unimportant. You've asserted that things that you care about are easier (probably true) but haven't offered any alternatives for what the community as a whole needs. I do not see this line of reasoning going anywhere.

@sgpinkus

sgpinkus commented Dec 4, 2017

Not claiming to solve all use cases. Let's put it this way. You asked:

The question of whether $ref behaves like inclusion or delegation has come up several times.

Inclusion would mean that the $ref object can be replaced with its target (in a lazily evaluated process).
Delegation means that the target of the $ref is processed against the current instance location, and the "results" (boolean assertion outcome and optionally the collected annotations) of $ref are simply the results of the target schema.

In response to this is this.

@handrews
Contributor Author

handrews commented Dec 4, 2017

@sam-at-github any serious proposal needs to solve all use cases. This thread seems done.

@sgpinkus

sgpinkus commented Dec 4, 2017

OK Fine ... but just to summarize my position:

  1. lazy is stupid but that is beside the point of this issue. Sorry to start a war.
  2. I prefer inclusion.

Tools such as JSON Schema Ref Parser that "dereference" $refs do just that.

This is exactly the crux of my argument. Modularity, and separation of concerns. An author of a JSON Schema implementation can re-use "JSON Schema Ref" or whatever dereferencing lib they want.

@Relequestual
Member

@sam-at-github if you prefer inclusion, then you should consider the fact that dereferencing to native references is not something that's possible in all languages. This is a general specification that isn't language specific, and as such shouldn't create something that will be an issue for some languages.

@Relequestual
Member

Some libraries allow you to opt in to inclusion for a few simple, known schemas, where it will be faster and you know they will be 100% resolvable. I don't see any issue with this being something implementors can allow as non-default behaviour. It's useful in my use case where I might have MANY JSON instances to validate, and I don't want an HTTP call for every validation call, and I know they fully resolve (because the schemas are mine).

Could we specify that implementors MAY do this, as long as they document the potential issues with doing so, and it MUST NOT be default behaviour?

@handrews
Contributor Author

handrews commented Dec 4, 2017

@Relequestual depending on exactly where we go with vocabularies and $schema, dereferencing ends up subject to quite a few caveats. I have no objection to documenting what is and isn't a legal dereference somewhere, but I think just specifying the behavior and not the implementation is the right approach.

If an implementation realizes that it can produce the correct behavior through inlining, that's fine. The point of this decision is that we've had proposals come up that will work with one behavior but not the other, so we need to pick which behavior we endorse.

This is similar to "flattening" allOf. We don't make a point in the spec that you can do that, but it's actually a pretty safe transformation most of the time and there are tools that offer it. So I don't think we should confuse the issue beyond making sure that our wording is about behavior and not implementation.

@handrews
Contributor Author

handrews commented Dec 5, 2017

@Relequestual actually, allOf flattening is pretty relevant here.

If the $ref target shares the same meta-schema, and if $id is adjusted accordingly, then treating $ref as delegation (meaning that it simply produces the same result, i.e. validation outcome and annotation values, as its target) means that

{"pattern": "^foo", "$ref": "anotherfile"}
can be "dereferenced" to
{"pattern" : "^foo", "allOf": [{contents of "anotherfile" with appropriate "$id"}]}

As in my last comment, I think this is more of a tutorial guide sort of thing rather than an "implementations MAY do this" sort of thing. Informally, implementations can do any transformations that they want as long as the outcome is computably the same.
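A minimal Python sketch of that rewrite, assuming the preconditions above hold. The function name `dereference_as_allof` and the example.com URI are made up for illustration, and the caller is responsible for supplying the correct $id.

```python
import copy


def dereference_as_allof(schema: dict, target: dict, target_id: str) -> dict:
    """Replace "$ref" with an allOf branch holding a copy of the target."""
    # Keep every sibling keyword except the reference itself.
    result = {k: copy.deepcopy(v) for k, v in schema.items() if k != "$ref"}
    inlined = copy.deepcopy(target)
    inlined["$id"] = target_id  # preserve the target's base URI
    # Append rather than overwrite, in case the schema already has an allOf.
    result.setdefault("allOf", []).append(inlined)
    return result


schema = {"pattern": "^foo", "$ref": "anotherfile"}
target = {"maxLength": 5}  # hypothetical contents of "anotherfile"

flattened = dereference_as_allof(schema, target, "http://example.com/anotherfile")
```

Because allOf simply requires every branch to pass, the rewritten schema produces the same validation outcome as delegating through $ref.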

@gregsdennis
Member

@Relequestual

This is a general specification that isn't language specific, and as such shouldn't create something that will be an issue for some languages.

It seems contradictory that JSON Schema can be language agnostic while considering the languages in which it's implemented.

@gregsdennis
Member

gregsdennis commented Dec 5, 2017

Just to put in my vote, I am also in favor of inclusion, however I understand the necessity to define how $id and $schema are included/transformed with the ref object.

Edit: I don't know what I'm talking about here. Keep reading.

@sgpinkus

sgpinkus commented Dec 5, 2017

@Relequestual

you should consider the fact that dereferencing to native references is not something that's possible in all languages.

I know some languages don't support a native ref type, but I still struggle to see the barrier. Python comes to mind, sort of, but I'm not convinced there isn't a simple workaround in any language that supports deserializing a JSON string into some native object type in the first place.

By "native ref" I'm not saying literal built-in ref type. For example consider:

json_str = '''{
  "a": { "$ref": "#/q" },
  "b": { "$ref": "#/r" },
  "q": "blah",
  "r": {}
}'''
myDoc = deref_my_doc(json_str)  # deref_my_doc is a hypothetical helper

Now myDoc.a or myDoc['a'] or myDoc.get('a') or whatever has to give "blah". There doesn't have to be a literal native reference. Granted there is an ambiguity of whether myDoc.a actually is the same storage location as myDoc.q, but I think there are solutions to that. I won't work through it here (haven't thought about it too hard ..), but basically yeah I think myDoc.b and myDoc.r must ref the same object (storage location). Tentatively, whether myDoc.a and myDoc.q do can be implementation dependent IMO. Deal breaker?
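One possible shape for the hypothetical deref_my_doc, sketched in Python. It assumes same-document references only, and it does not chase a reference whose target is itself an unresolved reference object.

```python
import json
from typing import Any


def deref_my_doc(json_str: str) -> Any:
    """Parse JSON and replace each {"$ref": ...} object with its target."""
    doc = json.loads(json_str)

    def resolve(ref: str) -> Any:
        node = doc
        for token in ref[1:].split("/")[1:]:
            token = token.replace("~1", "/").replace("~0", "~")
            node = node[int(token)] if isinstance(node, list) else node[token]
        return node

    def is_ref(value: Any) -> bool:
        return isinstance(value, dict) and set(value) == {"$ref"}

    def visit(node: Any) -> None:
        if isinstance(node, dict):
            children = list(node.items())
        elif isinstance(node, list):
            children = list(enumerate(node))
        else:
            return
        for key, value in children:
            if is_ref(value):
                # Share the target object instead of copying it.
                node[key] = resolve(value["$ref"])
            else:
                visit(value)

    visit(doc)
    return doc


my_doc = deref_my_doc('''{
  "a": { "$ref": "#/q" },
  "b": { "$ref": "#/r" },
  "q": "blah",
  "r": {}
}''')

looped = deref_my_doc('{"self": {"$ref": "#"}}')
```

In this sketch my_doc["b"] and my_doc["r"] are the same object, and a self-reference simply becomes a cycle in memory, which relies on the language supporting object references.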

This is a general specification that isn't language specific, and as such shouldn't create something that will be an issue for some languages.

"all languages must be supported" gets raised a bit. But what are the entry level requirements of a language to support JSON and JSON Schema exactly? There are two requirements for the above:

  1. The language must have a native object representation, such that it can deserialize a JSON string into this native format in the first place.
  2. Must support object references.

The above sort of hinges on the second requirement because it is literally impossible to de-ref a document with loops if it doesn't. But I don't know of a language that supports requirement 1 and not 2.

@Relequestual
Member

Relequestual commented Dec 13, 2017

I've re-read this, skipping over my own comments, and reached the exact same conclusion.

I feel, we should specify:

  1. Delegation as the default and expected behaviour.
  2. Implementors MAY provide an interface to use inclusion, but MUST make people aware of the pitfalls / issues as per #514 (comment), and have a mechanism to detect where a schema is not fully dereferenceable (like cyclic references) and throw errors in such situations.

As I said, this allows people with small schemas, that are fully known to them, to save time when doing batch validation of JSON.
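The detection mechanism in point 2 could look roughly like this Python sketch. The helper names are made up, only same-document references are handled, and every reference is assumed to resolve.

```python
from typing import Any, Iterator, List


def refs_in(node: Any) -> Iterator[str]:
    """Yield every "$ref" string found inside a JSON subtree."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from refs_in(value)
    elif isinstance(node, list):
        for item in node:
            yield from refs_in(item)


def resolve(document: Any, ref: str) -> Any:
    """Resolve a same-document reference such as "#/definitions/node"."""
    node = document
    for token in ref[1:].split("/")[1:]:
        token = token.replace("~1", "/").replace("~0", "~")
        node = node[int(token)] if isinstance(node, list) else node[token]
    return node


def assert_dereferenceable(document: Any) -> None:
    """Raise ValueError if following references would never terminate."""

    def visit(ref: str, stack: List[str]) -> None:
        if ref in stack:
            raise ValueError("cyclic $ref: " + " -> ".join(stack + [ref]))
        for inner in refs_in(resolve(document, ref)):
            visit(inner, stack + [ref])

    for ref in refs_in(document):
        visit(ref, [])


cyclic = {
    "definitions": {
        "node": {"properties": {"next": {"$ref": "#/definitions/node"}}},
    },
    "$ref": "#/definitions/node",
}
acyclic = {
    "definitions": {"name": {"type": "string"}},
    "properties": {"first": {"$ref": "#/definitions/name"}},
}


def is_dereferenceable(document: Any) -> bool:
    try:
        assert_dereferenceable(document)
        return True
    except ValueError:
        return False
```

An implementation offering an inclusion mode could run a check like this before inlining and refuse schemas that fail it.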

I would suggest no further comments, and a general thumbs up or down on this comment, and move forward to work on phrasing for making this clear.

@Relequestual
Member

=O As a rare event, I'm going to self assign and plan to make a Pull Request to solve this issue (unless anyone objects).

@sebilasse

sebilasse commented Jan 16, 2018

I would also suggest to preserve the decorating properties title and description.
The spec says about these properties

Both of these keywords can be used to decorate a user interface

But isn't the point of a "decorator" that you can reuse references and decorate them differently?
Happens fairly often to me (e.g. producing forms w. JSON Schema).

http://json-schema.org/latest/json-schema-core.html#rfc.section.8 says:

All other properties in a "$ref" object MUST be ignored.

could this become something like
"All properties in a "$ref" object except "$ref", "title" and "description" MUST be ignored"
?

@handrews
Contributor Author

@sebilasse see #523

Relequestual added a commit to Relequestual/json-schema-spec that referenced this issue Jun 28, 2018
@ghost ghost removed the Status: In Progress label Jun 30, 2018