Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

contentSchema has implementation-defined referencing behavior when contentMediaType is not present #1381

Open
Julian opened this issue Feb 8, 2023 · 11 comments · May be fixed by #1564
Open
Assignees

Comments

@Julian
Copy link
Member

Julian commented Feb 8, 2023

The text of 2020-12's contentSchema says:

The value of this property MUST be a valid JSON schema. It SHOULD be ignored if "contentMediaType" is not present.

In #1288 which I've found, there was some back and forth about what the first sentence there means (which I agree with -- i.e. that the current spec seems to treat this schema as any other subschema, by virtue of omission). But the second sentence still has undefined behavior, because "SHOULD" there isn't clear.

Specifically, this schema (or others like it) when validating the instance 12:

{
  "$ref": "#foo",
  "contentSchema": {
      "$anchor": "foo",
      "type": "string"
  }
}

either blows up if one follows the SHOULD or else returns invalid if one ignores the SHOULD about contentMediaType being missing. (FWIW of the implementations Bowtie supports, there appear to be at least 1 implementation with both behaviors, though unclear whether they explicitly intend to do so).

It seems unnecessary to have undefined behavior in this case (one reminiscent of old $ref), so my personal opinion would be to remove that second sentence (and not intimate that it's ok to ignore it), or to just make it invalid to have contentSchema present without contentMediaType (and add a dependentRequired to the metaschema).

(n.b. #1352 already tweaked the language here coming out of #1288, but the second sentence is still present.)

Julian added a commit to python-jsonschema/referencing-suite that referenced this issue Feb 8, 2023
Note we avoid some side question about whether this is true even when
contentMediaType is not present, as that appears undefined in the spec.

See json-schema-org/json-schema-spec#1381 which may fix this for future
spec versions.
@gregsdennis
Copy link
Member

gregsdennis commented Feb 8, 2023

blows up if one follows the SHOULD

I don't follow your logic here. Following the SHOULD means that contentSchema is ignored in your example. I see nothing wrong with that.

It doesn't mean that identifiers aren't still registered, just that the keyword isn't evaluated.

@Julian
Copy link
Member Author

Julian commented Feb 8, 2023

Here's the language from draft 7's $ref definition:

All other properties in a "$ref" object MUST
be ignored.

where it was essentially universally understood that this means identifiers are not registered. As far as a quick skim, no other language is present there, that's how we chose to communicate that behavior.

So I think you're asserting something (that there's only one interpretation for the current language) which isn't consistent with historical wording.

But at worst even if you're right there I'd still think it's unclear rather than really undefined.

@jdesrosiers
Copy link
Member

I think the intention was that that quote is in the context of using the contentSchema keyword as an assertion (which is optional, off-by-default behavior). It doesn't mean that the contentSchema sub-schema should be ignored, it means that it shouldn't apply contentSchema as an assertion on the parsed string data. I think it's a "SHOULD" in order to allow for but discourage heuristic-based parsing of the string value.

So, I think it would be wrong for your example to produce an error. The text is certainly ambiguous and should be clarified in the next release, but I weight intention pretty high when there is ambiguity in the text and I'm pretty confident that was the intention.

@Julian
Copy link
Member Author

Julian commented Feb 9, 2023

Fair enough maybe it's not needed for us to argue or agree on how to interpret the current draft (I suspect we probably all agree it's quite unlikely anyone has implemented anything related to this at all, let alone anything depending on the small nonsense point I'm raising) so perhaps consider this just a "it seems like this line needs tweaking for the next version" ticket.

@gregsdennis
Copy link
Member

@Julian @jdesrosiers is there an action here? Do we need to clarify anything in the current text?

@Julian
Copy link
Member Author

Julian commented Oct 20, 2024

If it were me my recommendation was to remove that second sentence given that we no longer allow contentSchema to be used as an assertion, and/or to make it invalid to have contentSchema without contentMediaType in the text and metaschema, but it's fairly minor given I think nearly no one does anything with these keywords so if others don't agree I'm fine if you close.

@gregsdennis
Copy link
Member

We did receive a question about this in Slack in the past week, so I'll look at how it can be made more explicit, but I don't think any requirements need to change.

@jdesrosiers
Copy link
Member

I think whether or not this needs action depends on the resolution of #1288.

According to Henry, the example given wouldn't be valid. The schema in contentSchema is not a subschema. It's an annotation that happens to be in the shape of a schema. So, the $ref in that example is invalid because it's referencing a location that is known to not be a schema. That's true regardless of whether contentMediaType is present.

However, if it's decided that contentSchema is a subschema of the schema it appears in, then this becomes an issue we should probably address in the next release.

@gregsdennis
Copy link
Member

I agree that it's worth revisiting that conversation. Primarily, I'd like to identify (good) use cases for considering it a subschema.

I can't think of any good reasons someone would want to $ref into that subschema (i.e. creating an $id in that subschema that is then referenced from somewhere else), however it might be handy to be able to $ref out of it, e.g. into $defs.

If it's not a subschema, then even something as simple as

{
  // ...
  "contentSchema": { "$ref": "#/$defs/my-content" },
  "$defs": {
    "my-content": { /* ... */ }
  }
}

wouldn't be possible.

@jdesrosiers
Copy link
Member

jdesrosiers commented Oct 24, 2024

There are other considerations as well. For instance, contentSchema not being a subschema violates the principle of least surprise. People will assume this is a normal subschema and be confused when they find it's not, which will lead to a lot of questions we have to answer. Another consideration is ease of implementation. It will vary given different architectures, but I know one option would be problematic for me to implement while the other would require nothing at all.

@gregsdennis gregsdennis self-assigned this Nov 21, 2024
@gregsdennis gregsdennis moved this from Awaiting PR to In Progress in Stable Release Development Nov 21, 2024
@karenetheridge
Copy link
Member

karenetheridge commented Nov 21, 2024

I suspect we probably all agree it's quite unlikely anyone has implemented anything related to this at all

FWIW, I have fully implemented contentMediaType, contentEncoding and contentSchema evaluation (behaviour defaulting to annotation-only, but when enabled, the data instance is decoded and then evaluated against the schema in the context of the surrounding schema), along with a pluggable interface for adding decoders for media types and encodings, and I made use of it at $lastjob in validating APIs which included a uuencoded string as part of a message payload.

Additionally, the code has a comment saying "since contentSchema should never be assumed to be evaluated in the context of the containing schema, it is not appropriate to gather identifiers found therein". So my interpretation was that identifiers included within should not be available from outside this subschema. However, the way I am doing this also makes it impossible to set the context properly when actually evaluating the schema (as I describe it in the next paragraph), and I think the change proposed in this issue and related PR is better, where we do allow the identifiers to work as normal, so I am making this change now.

FWIW, I also (when contentSchema-evaluation mode is enabled) process the schema in place, rather than in isolation, in order for errors and annotations to appear in the right place in the overall result object. I have been pondering for some time making a proposal that this change be made officially in the next draft, as well as splitting the Content vocabulary into two just as we did for format (content-annotation and content-assertion), but since we're abandoning the vocabulary concept I'm not sure how that proposal would look anymore.

(much of the content of this post has changed since I originally wrote it, so please ignore the email for it if you first read it there.)

karenetheridge added a commit to karenetheridge/JSON-Schema-Modern that referenced this issue Nov 21, 2024
We treat the contentSchema subschema like any other, both traversing it for
contained identifiers (always) and evaluating a decoded data instance against
it (when the validate_content_schemas option is enabled).

See discussion at json-schema-org/json-schema-spec#1381
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Status: In Progress
4 participants