Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disallow even optional "content*" processing #1296

Merged
merged 3 commits into from
Oct 17, 2022

Conversation

handrews
Copy link
Contributor

This feature added substantial complexity and security exposure for essentially no benefit. The functionality could be trivially implemented as a library on top of a JSON Schema implementation that supports annotation collection.

Fixes #1287. Intentionally avoids topics still under discussion in #1288.

This added substantial complexity for essentially no benefit.
The functionality could be trivially implemented as a library
on top of a JSON Schema implementation that supports annotation
collection.
jsonschema-validation.xml Outdated Show resolved Hide resolved
jsonschema-validation.xml Outdated Show resolved Hide resolved
handrews and others added 2 commits September 21, 2022 20:06
Co-authored-by: Jason Desrosiers <[email protected]>
based on review feedback.
@karenetheridge
Copy link
Member

karenetheridge commented Oct 6, 2022

I'm against this change as it is written; I've found it useful in OpenAPI validation. I understand that in some situations it might be a security risk, so it's reasonable to have its support be optional (as it already is), and to default to NOT performing validation. I've been thinking of splitting it into two vocabularies: Annotation and Validation. Like with format, it can be enabled when desired, and left disabled most of the time.

To clarify: I'm fully onboard with changing the spec to clarify that an author of a schema should never be unsure whether content* will be validated or not -- implementations should NOT validate by default, so that leaves either something in $vocabulary to indicate that validation is desired, or an out-of-band configuration variable that defaults to false (which is what I've done in my implementation, so it can be used in certain situations where appropriate decoding algorithms have been provided to the implementation).

@handrews

This comment was marked as outdated.

@handrews

This comment was marked as outdated.

@handrews
Copy link
Contributor Author

handrews commented Oct 10, 2022

@karenetheridge since the content* keywords in the https://json-schema.org/draft/next/content vocabulary are strictly annotations, there are two ways to validate string contents while remaining in compliance with the spec:

  1. take the output of the completed initial evaluation, and for each instance location to which content* annotations are attached
    • decode the string according to contentEncoding (the spec gives no guidance on handling any errors)
    • parse the string according to contentType (the spec gives no guidance on handling any errors)
    • evaluate the parsed string against contentSchema as a new evaluation process
    • recurse if any content* annotations are present in this new output
    • store the output to return alongside but separately from the initial evaluation and any other secondary evaluations
  2. create a content-assertion extension vocabulary analogous the the format-assertion vocabulary, which would perform the encoding/parsing/additional validation as part of the regular (single) evaluation process

Option 1 can be done in two ways: Shoved into the main JSON Schema implementation, creating a completely different set of output requirements from everything else (as currently suggested by the spec). Or it can be done as a separate library, without burdening the JSON Schema implementation's output interface that works for literally everything else in JSON Schema.

There is no functional difference whatsoever between doing it within the implementation or as a separate library. Please let me know if you do see a difference, because I would like to understand that. I am proposing this change because fundamentally, it's not a change. It just moves non-essential functionality to a library, which could even have its own little specification to ensure interoperability of this area. That would be an improvement over the current situation.

Option 2 is not impacted by this change at all. Supporting a contentSchema assertion-applicator is tricky as it is not clear how to represent instance locations within a string, so for now any such feature would not be interoperable, but we could figure that out if we wanted to make it interoperable.

@karenetheridge
Copy link
Member

I don't understand your references to output requirements in Option 1. What does output have to do with anything here?

The issue I have with moving this functionality out of the spec is, if we only talk about annotations in the spec, we lose the information needed to implement these keywords as validations consistently.

But, more importantly, if we "..take the output of the completed initial evaluation..", then we can't use the results of evaluating the content* keyword(s) and use them as inputs to applicator keywords one level up in the document. So I see Option 2 as much more preferrable.

it is not clear how to represent instance locations within a string.. but we could figure that out if we wanted to make it interoperable

The string is decoded (using the specified media type) into something that matches the JSON document model, so I presumed (when implementing this) that the instance location just carries on using json pointers inside the decoded data structure. That's under-specified though, so it would be good to clarify that (assuming the use of validation stays in the spec at all).

@handrews
Copy link
Contributor Author

But, more importantly, if we "..take the output of the completed initial evaluation..", then we can't use the results of evaluating the content* keyword(s) and use them as inputs to applicator keywords one level up in the document

The spec has never allowed this since contentSchema was introduced in 2019-09. It's expressly forbidden:

Implementations MAY offer the ability to decode, parse, and/or
validate the string contents automatically. However, it MUST NOT
perform these operations by default, and MUST provide the validation
result of each string-encoded document separately from the enclosing
document.

You're currently not allowed to use that result within the validation of the enclosing document. It MUST be produced as separate output results, which puts content* assertion behavior into a totally unique processed divorced from normal implementation.

This PR removes that uniqueness. If anything, it makes what you want to do easier by removing this constraint.

So I see Option 2 as much more preferrable.

That's great! This PR actually makes that easier by removing the part that more-or-less forbids that.

I don't hear any actual objection from you regarding what this PR does. If anything, this sounds like support. You're saying "no, don't merge this", but also "my preferred solution is the thing that is easier if we merge this."

so I presumed (when implementing this) that the instance location just carries on using json pointers inside the decoded data structure. That's under-specified though, so it would be good to clarify that (assuming the use of validation stays in the spec at all).

You should file an issue for that, as it is not relevant to this PR. It was underspecified before, and it will be the same level of under-specified afterwards.

@karenetheridge
Copy link
Member

It was underspecified before, and it will be the same level of under-specified afterwards.

Well no, not really - if you take out the descriptions of how the content* keywords would validate, then that leaves the reader to make up their own (inconsistent) rules. ..But on looking again, I see you're not removing those paragraphs.

I don't hear any actual objection from you regarding what this PR does. If anything, this sounds like support. You're saying "no, don't merge this", but also "my preferred solution is the thing that is easier if we merge this."

I guess? I don't see any path forward to allowing in-place evaluation as assertions, other than creating a new vocabulary (akin to the format-annotation and format-validation vocabulary split in 2020-12), but I'd been pondering that myself and it seems reasonable.

So I guess my vote is "not-no" - I'm not terribly happy with the spec staying in the state after merging this, but the parts you are removing are parts that I definitely disagreed with. If that makes sense :)

@handrews
Copy link
Contributor Author

@karenetheridge

the parts you are removing are parts that I definitely disagreed with. If that makes sense :)

Yes, exactly! This PR is about removing a bad solution to the problem you want to solve. It takes no position on other possible solutions, and you are quite welcome to file proposals for other solutions. I would actually suggest a slightly different approach to validation, using more specific keywords. A major problem with these a validation keywords is that's they're so open-ended. But if you specifically want to validate JSON-in-JSON, or or parse HTML-in JSON, etc., you can write a more clear spec for that behavior.

But that's a discussion for another issue (or discussion).

I'll go ahead and merge this, then. Two "yes"es and a "not-no" seem sufficient :-)

@handrews handrews merged commit 27cc759 into json-schema-org:main Oct 17, 2022
@handrews handrews deleted the no-auto-content branch October 17, 2022 05:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Proposal: remove optional automatic processing of "content*" keywords
3 participants