Add keyword id to output unit #1065

jdesrosiers · 2021-01-15T22:53:39Z

A schema using dialect-a can $ref a schema using dialect-b but, there is nothing in the output format that indicates the dialect that was used to determine each output unit result. This can be important for keywords whose semantics have changed over time, but their name hasn't changed. The ones that come to mind immediately are maximum/exclusiveMaximum, minimum/exclusiveMinimum, format, and links.

We could potentially determine the dialect for an output unit by using the absouteKeywordLocation to lookup the schema and inspect it's $schema property, but that has two problems. First, absoluteKeywordLocation is not required. Second, if the output is passed to a third party for processing, it might not have access to the original schema to look up it's dialect.

The simplest approach would be to add a field such as dialect to the output unit. However, It would be more useful to have a unique identifier for each keyword. That way, when we introduce a new draft, if a keyword doesn't change, it's ID doesn't have to change. That would make it easier for tools to do things with annotations where they don't care what dialect they are as long as they have the same semantics (such as title and description).

The text was updated successfully, but these errors were encountered:

gregsdennis · 2021-01-16T01:20:16Z

We should probably properly define a way to identify a dialect first. From 2020-12:

A dialect is defined as a set of vocabularies and their required support identified in a meta-schema.

It has no specific identifier, but perhaps the meta-schema $id (specified by $schema in the schema) could suffice since this effectively describes this set of vocabularies.

First, absoluteKeywordLocation is not required. - @jdesrosiers

The reason it's not required is because the relative location is sufficient when the navigated path contains no references.

That said, maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity because

a dereference takes you to a new dialect (different meta-schema),
a keyword contains multiple semantic definitions due to the specific set of vocabularies include (this also brings up the namespacing concept we had discussed somewhere else).

I do like the idea of including something in the output, though.

jdesrosiers · 2021-01-17T21:37:48Z

I think identifying a dialect is sufficiently defined.

The "$schema" keyword is both used as a JSON Schema dialect identifier and
as the identifier of a resource which is itself a JSON Schema, which describes the
set of valid schemas written for this particular dialect.

Let's not get into the choice to make absoluteKeywordLocation required or not. It's not relevant to this issue and it's just going to get us off on a tangent argument. We can save that discussion for another time. For this issue, it's sufficient to say that even if it is present, it's still not a solution to this problem.

maybe it's sufficient to include the $schema value and/or $vocabulary value in the cases where there is ambiguity

A vocabulary identifier is certainly an equivalent alternative to a dialect identifier, but I'd still prefer a keyword level identifier so we don't have a new version of every keyword for every draft even for the keywords that don't change. That would be a big help for annotation related tooling that I was working on a while ago (and hope to get back to soon).

I would prefer to always include this information rather than only when the dialect changes. Consider the case when I pass off my output to a third party library for processing. That third party library shouldn't need to know anything about the original schema and wouldn't know what dialect to process the schema as if it isn't in the output, even if the dialect doesn't change. You could pass the library a default dialect to use, but it would be easier for everyone involved if that information is just always there.

handrews · 2021-05-08T01:41:01Z

@jdesrosiers just make the keyword name a fragment on the vocabulary URI. We can declare the resources to be of media type application/schema-vocabulary+json and the fragment semantics to be that each keyword is usable as a fragment.

(We should also really register application/schema+json and if we want it application/schema-vocabulary+json)

gregsdennis · 2021-05-08T08:03:05Z

This should be mentioned over in #973.

jdesrosiers · 2021-05-14T18:19:34Z

@handrews

just make the keyword name a fragment on the vocabulary URI

That's exactly how I identify keywords now and it leaves me with the problem I described in my previous comment.

{
  "$id": "https://example.com/contrived-example",
  "allOf": [
    {
      "$id": "/foo",
      "$schema": "https://json-schema.org/draft/2020-12/schema",
      "title": "Foo"
    },
    {
      "$id": "/bar",
      "$schema": "https://json-schema.org/draft/2019-09/schema",
      "title": "Bar"
    },
    {
      "$id": "/baz",
      "$schema": "https://example.com/my-dialect",
      "title": "Baz"
    }
  ]
}

In this example, the title keywords would have the identifiers, https://json-schema.org/draft/2020-12/schema#title, https://json-schema.org/draft/2019-09/schema#title, and https://example.com/my-dialect#title respectively. Let's assume that title in https://example.com/my-dialect has different semantics that the standard JSON Schema semantics. If I have tooling that collects all the title annotations, the only way to do that is to maintain configurations that says when I say "title" I mean https://json-schema.org/draft/2020-12/schema#title or https://json-schema.org/draft/2019-09/schema#title and not https://example.com/my-dialect#title.

If each keyword has an identifier that doesn't change across drafts, we can avoid this problem. The title keywords from the example would be identified as, https://json-schema.org/keywords#title, https://json-schema.org/keywords#title, and https://example.com/my-dialect#title respectively. Now I can ask my tooling for all the https://json-schema.org/keywords#title annotations and not have to worry about if draft 2020-12, draft 2019-09, and my-dialect each have different semantics for title.

Of course it could be difficult to determine when an new keyword identifier should be minted. For example, if we changed description to allow an array of strings instead of just a string, does that deserve an new identifier? If I want to collect all the description information, both keywords express that information so maybe they should have the same identifier. However, tooling would break if it expected a string and suddenly started getting an array of strings, so maybe they should have different identifiers.

gregsdennis · 2021-05-14T22:44:10Z

This sounds like a bigger issue than just in the output. Sure, the output is probably the primary benefactor, but it sounds like this is more about adding a property to vocabularies (maybe in line with @handrews' idea of a machine-readable vocab file) than extending output specifically.

jdesrosiers · 2021-05-16T19:49:44Z

@gregsdennis I think I see what you mean. I'm open to discussing alternatives that also solve this problem.

gregsdennis · 2022-07-26T22:04:14Z

This was also mentioned here.

I think restructuring the output based on subschemas instead of keywords solves this (related comment).

gregsdennis · 2022-09-26T22:57:54Z

@jdesrosiers can you confirm whether your concerns here were covered in #1249?

jdesrosiers · 2022-09-28T20:22:19Z

can you confirm whether your concerns here were covered in #1249?

No, I don't see how those changes address this problem. There's still no indication in the output unit of what dialect or vocab the keyword is defined in.

handrews · 2022-09-28T20:26:20Z

@jdesrosiers @gregsdennis can we for now settle on the dialect IRI as the thing to be added? Or if we want this issue to be about a more comprehensive identification (dialect+vocab+keyword), can we split out an issue for just dialect identification?

gregsdennis · 2022-09-28T21:46:33Z

I question the need for dialect identification in the output.

The schema doesn't declare individual keywords by their vocab; it just uses the keyword. For example, there is no https://json-schema.org/draft/next/vocab/core#$id; it's just $id. A meta-schema that says "these keywords are available," then the schema just includes the keywords without any mention of what vocab that keyword came from. I don't see why it should be any different for the output.

If you really need to know what vocab a keyword came from, that information is traceable by going back to the schema that produced the output and looking at its meta-schema. I don't see the value in having that information in every instance of the output, especially not for every listing of a keyword.

there is nothing in the output format that indicates the dialect that was used to determine each output unit result.

There is: schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

From the examples in the current doc:

    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    }

Properties are given here. Schema location is given here. You can look up the dialect if you want.

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

handrews · 2022-09-28T22:20:25Z

Properties are given here. Schema location is given here. You can look up the dialect if you want.

If I call an API that, somewhere on the back end, uses JSON Schema to evaluate some data and produce annotations, and then sends that annotation data back to me without sending me the schema, no, I can't just look up the dialect.

This is a common problem, and is why resources typically have "self" links even though the self link URI is probably the URI that you just got it from. Because sometimes you end up with a resource representation with no way to tell where it came from. It's also why it's a good idea to have an absolute $id in a schema even if you don't have any relative URI-references in it.

We do not need the dialect URI in every output unit. We just need a map of the absolute URIs (without fragments) in schemaLocation to dialect URIs. It can go at the top level, since the schemaLocation is sufficient to determine which dialect applies to a given output unit.

gregsdennis · 2022-09-28T23:18:19Z

Can someone please post an example of how they want this fit into the output?

jdesrosiers · 2022-09-28T23:47:48Z

You may have identifiers for keywords in your implementaiton, but JSON Schema defines no such identifiers. I see no reason why the output should include identifiers for keywords when no such definition exists.

Part of the proposal was to define URIs to identify each keyword. That would definitely be a requirement for adding a keyword URI in the output.

I still think this is the best solution in the long run because it would allow output format post-processors to know that one keyword has the same semantics as another. For example, someone could write a dialect that has the same semantics as the standard dialect except all the keywords are translated to Spanish. Tools that can process the standard dialect output should be able to process the Spanish dialect output as well because the semantics of the keywords should be the same. Of course this would require changes to the vocabulary system in addition to just defining keyword URIs, but I think this would be a good path to be on.

schemaLocation gives the absolute location of the keyword. From that you can trace back to the schema and its meta-schema to get the dialect in use.

I mention in the issue description why I didn't think that was a good enough solution and @handrews reiterated that view point.

Can someone please post an example of how they want this fit into the output?

Honestly I haven't had a chance to fully think through how this would fit in to the new output format. We could figure out how to incorporate a dialect ID into what we have now, but I think that would be a partial solution. I'd like to explore incorporating the idea of keyword URIs to the vocabulary system first and then see what that means for the output format. We can probably just put this issue on hold for now.

handrews · 2022-09-28T23:48:58Z

This is almost exactly the example in your recent PR, except I changed the output units with an evaluation path prefix of /properties/bar/$ref so that they jumped to a different schema resource (https://json-schema.org/schemas/other) on that $ref, to which I assigned a different dialect/version. Please disregard the fact that the output spec doesn't match the dialects and draft-07 didn't support annotations like this.

{
  "valid": true,
  "dialects": {
    "https://json-schema.org/schemas/example": "https://json-schema.org/draft/2020-12/schema",
    "https://json-schema.org/schemas/other": "https://json-schema.org/draft-07/schema"
  },
  "nested": [
    {
      "valid": true,
      "evaluationPath": "",
      "schemaLocation": "https://json-schema.org/schemas/example#",
      "instanceLocation": "",
      "annotations": {
        "title": "root",
        "properties": [
          "foo",
          "bar"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1",
      "instanceLocation": "/foo",
      "annotations": {
        "title": "foo-title",
        "properties": [
          "foo-prop"
        ],
        "additionalProperties": [
          "unspecified-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar",
      "instanceLocation": "/bar",
      "annotations": {
        "title": "bar-title",
        "properties": [
          "bar-prop"
        ]
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/foo/allOf/1/properties/foo-prop",
      "schemaLocation": "https://json-schema.org/schemas/example#/properties/foo/allOf/1/properties/foo-prop",
      "instanceLocation": "/foo/foo-prop",
      "annotations": {
        "title": "foo-prop-title"
      }
    },
    {
      "valid": true,
      "evaluationPath": "/properties/bar/$ref/properties/bar-prop",
      "schemaLocation": "https://json-schema.org/schemas/other#/$defs/bar/properties/bar-prop",
      "instanceLocation": "/bar/bar-prop",
      "annotations": {
        "title": "bar-prop-title"
      }
    }
  ]
}

handrews · 2022-09-28T23:53:07Z

@gregsdennis the above example is only for the dialect IRI. I'm not currently concerned with anything more granular (hence me asking if we can either agree to just do this, or split this out to its own issue if folks want to keep working on the more granular thing here).

So now if I want to know the dialect for that last output unit, I look at "schemaLocation": "https://json-schema.org/schemas/other#", chop off the fragment, and look it up as something like output["dialects"]["https://json-schema.org/schemas/other"]

handrews · 2022-09-29T00:47:04Z

🤦 I initially copy-pasted the diff instead of the changed file, so the example two comments above was a mess. It's fixed now.

gregsdennis · 2022-09-29T20:40:37Z

I think that's doable, but defining how dialects are declared seem to still be an open topic on Keyword for identifying bootstrapping rules #217. We should nail that down before deciding how to put it in the output.

handrews · 2022-09-29T20:43:24Z

@gregsdennis I am working on a replacement proposal for #217, but that depends on how some things regarding the SDLC play out.

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

gregsdennis · 2022-09-29T20:46:41Z

For now, the important part is to get it into the output format. We can update the source of the dialect IRI as needed as we change that. If we change it.

My point is that we don't have the "dialect IRI" defined as a concept. We seem to be using the meta-schema IRI, but is that really the same thing? It doesn't make sense to include something that isn't defined yet.

handrews · 2022-09-29T20:51:08Z

Yes, it's definitely the same thing. I thought that was made clear in the spec. If not, that's just an oversight.

gregsdennis · 2022-09-30T02:16:13Z

Okay. I still want to hold this open until process is defined. Less to think about.

gregsdennis · 2023-05-07T16:50:25Z

Coming back to this, the idea has warmed on me.

I like the idea of the root output unit not actually being an output unit but a host for other evaluation data with the actual evaluation results in the details property.

From this comment and the one that follows it, I think it makes sense to have keywords identified. This aligns with the ideas to resolve keyword collisions (also from that discussion) in that their output need to be reported somehow.

I imagine output will go through some more refinement soon.

gregsdennis · 2024-02-13T20:54:28Z

@jdesrosiers would it make sense to only include dialect in nodes where the dialect changes, or would you have it in every node?

jdesrosiers · 2024-02-15T17:38:16Z

It could work either way, but I would include it in every node. I feel like if I was writing code to process this output, it would be annoying if I had to look in two places to determine the dialect. It would be easier if it was just always in the same place when it's needed.

gregsdennis · 2024-02-15T20:50:10Z

Yeah, thinking about it again, it would only make any sense at all to omit it in some nodes for the hierarchical format. The list format would still need it in every node.

gregsdennis added output core labels Jan 16, 2021

handrews mentioned this issue May 8, 2021

Restructuring of output formatting #973

Closed

handrews mentioned this issue Sep 21, 2022

Clarify the handling of "contentSchema" #1288

Closed

gregsdennis mentioned this issue May 2, 2023

URI naming scheme for user/one-off/non-standard keywords #1401

Open

gregsdennis mentioned this issue Jun 15, 2023

Complete the new version of the Spec json-schema-org/community#415

Open

5 tasks

This was referenced Jul 6, 2023

Keyword URIs #1422

Open

Machine-readable dialect (not vocabulary) definition document #1423

Open

gregsdennis mentioned this issue Feb 8, 2024

Update output to make the root object an overview with results contained inside #1487

Closed

gregsdennis removed the core label Jun 18, 2024

gregsdennis added this to Feature: Standardized Output Jul 17, 2024

gregsdennis moved this to In Discussion in Feature: Standardized Output Jul 17, 2024

gregsdennis removed the output label Jul 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add keyword id to output unit #1065

Add keyword id to output unit #1065

jdesrosiers commented Jan 15, 2021

gregsdennis commented Jan 16, 2021 •

edited

Loading

jdesrosiers commented Jan 17, 2021

handrews commented May 8, 2021

gregsdennis commented May 8, 2021

jdesrosiers commented May 14, 2021

gregsdennis commented May 14, 2021

jdesrosiers commented May 16, 2021

gregsdennis commented Jul 26, 2022

gregsdennis commented Sep 26, 2022

jdesrosiers commented Sep 28, 2022

handrews commented Sep 28, 2022

gregsdennis commented Sep 28, 2022

handrews commented Sep 28, 2022 •

edited

Loading

gregsdennis commented Sep 28, 2022 •

edited

Loading

jdesrosiers commented Sep 28, 2022

handrews commented Sep 28, 2022 •

edited

Loading

handrews commented Sep 28, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 29, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 29, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 30, 2022

gregsdennis commented May 7, 2023

gregsdennis commented Feb 13, 2024

jdesrosiers commented Feb 15, 2024

gregsdennis commented Feb 15, 2024

Add keyword id to output unit #1065

Add keyword id to output unit #1065

Comments

jdesrosiers commented Jan 15, 2021

gregsdennis commented Jan 16, 2021 • edited Loading

jdesrosiers commented Jan 17, 2021

handrews commented May 8, 2021

gregsdennis commented May 8, 2021

jdesrosiers commented May 14, 2021

gregsdennis commented May 14, 2021

jdesrosiers commented May 16, 2021

gregsdennis commented Jul 26, 2022

gregsdennis commented Sep 26, 2022

jdesrosiers commented Sep 28, 2022

handrews commented Sep 28, 2022

gregsdennis commented Sep 28, 2022

handrews commented Sep 28, 2022 • edited Loading

gregsdennis commented Sep 28, 2022 • edited Loading

jdesrosiers commented Sep 28, 2022

handrews commented Sep 28, 2022 • edited Loading

handrews commented Sep 28, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 29, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 29, 2022

handrews commented Sep 29, 2022

gregsdennis commented Sep 30, 2022

gregsdennis commented May 7, 2023

gregsdennis commented Feb 13, 2024

jdesrosiers commented Feb 15, 2024

gregsdennis commented Feb 15, 2024

gregsdennis commented Jan 16, 2021 •

edited

Loading

handrews commented Sep 28, 2022 •

edited

Loading

gregsdennis commented Sep 28, 2022 •

edited

Loading

handrews commented Sep 28, 2022 •

edited

Loading