Add `schema.fieldsMatch` property; clarified extra/non-specified fields in Table Schema #39
I think I found one more good argument for why the partiality configuration is better placed at the Table Schema level. If we think about it conceptually:
JSON Schema has settings like |
This seems to mix two independent features:
Both should get their own flag. If order is relevant, additional fields can still be allowed after the last defined field, so both flags are independent of each other. |
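To make the independence of the two flags concrete, here is a minimal sketch. The function and the flag names `ordered` and `allow_extra` are hypothetical and only illustrate the comment above; they are not property names from the specification. It also assumes field names are unique.

```python
def check_header(schema_fields, data_fields, ordered, allow_extra):
    """Check a data header against schema field names using two independent flags.

    `ordered` and `allow_extra` are hypothetical names used for illustration only.
    """
    schema_fields = list(schema_fields)
    data_fields = list(data_fields)
    if ordered:
        # Defined fields must be the leading columns, in schema order;
        # anything after the last defined field counts as extra.
        if data_fields[: len(schema_fields)] != schema_fields:
            return False
        extras = data_fields[len(schema_fields):]
    else:
        # Defined fields may appear anywhere, in any order.
        defined = set(schema_fields)
        if not defined <= set(data_fields):
            return False
        extras = [name for name in data_fields if name not in defined]
    # Extra fields are accepted only when explicitly allowed.
    return allow_extra or not extras
```

With schema fields `A, B, C`, the header `A, B, C, D` passes when `ordered=True, allow_extra=True` and fails when `allow_extra=False`, while `C, A, B` passes only when `ordered=False`.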
@nichtich Probably two flags |
schema.partial
property; clarified extra/non-specified fields in Table Schema schema.exact/orderedFields
properties; clarified extra/non-specified fields in Table Schema
I've updated the PR to have two properties that make data source requirements stricter if set to
Selected defaults make it non-breaking in relation to the current v1 definition. |
Hi @peterdesmet, can you please take a look? WDYT? |
Selected defaults make it non-breaking in relation to the current v1 definition.
@roll, can you clarify this comment? The defaults for both fields are `false`, allowing partial matching, while frictionless v1 mandated complete matching.
Irrespective, do you think the default should be the most lenient approach?
A Table Schema descriptor `MAY` contain a property `exactFields` that `MUST` be boolean with default value `false`:
- **false** (default): The number of fields in the data source `MUST` be equal or more than the number of elements in the `fields` array.
What about the (common) use case where you have fewer fields in the data source than in the `fields` array?
@peterdesmet
It's a good question. Won't it make the data source invalid against the given Table Schema? I guess software can still read it, but my feeling is that the missing fields need to be marked as a structural error.
I'm just trying to think here about a Table Schema as a data contract, and missing fields seem to be a violation, but I'm not sure.
In biodiversity informatics, it is common to work with schemas (like this one https://rs.gbif.org/extension/gbif/1.0/distribution_2022-02-02.xml), where a publisher will only use some of the fields, since not all apply to the data they hold. I'd like to be able to use Table Schemas for that. 😄 Fields that are expected to be there can be indicated with the `required` property (and raise a validation error if they are not).
If you agree, would it be possible to update the wording to allow this use case?
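A rough sketch of this use case, under stated assumptions: the schema and its field names are made up for illustration, and reading `constraints.required` as "the column must be present in the data source" follows the interpretation in this comment rather than anything the spec text quoted here confirms.

```python
def check_subset_header(schema, header):
    """Report required fields missing from the header and columns unknown to the schema.

    Assumption: a field with constraints.required set must appear in the header.
    """
    fields = schema["fields"]
    known = {field["name"] for field in fields}
    required = {
        field["name"]
        for field in fields
        if field.get("constraints", {}).get("required")
    }
    present = set(header)
    return {
        "missing_required": sorted(required - present),
        "unknown": sorted(present - known),
    }


# Hypothetical schema: only "id" is expected in every data source.
example_schema = {
    "fields": [
        {"name": "id", "type": "string", "constraints": {"required": True}},
        {"name": "locality", "type": "string"},
        {"name": "countryCode", "type": "string"},
    ]
}
```

A publisher providing only `id` and `locality` would pass this check, while a data source without `id` would be reported under `missing_required`.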
Of course, missing fields can be filled with `null` and then e.g. `required: true` can be used, but that seems a little different, as it operates at the level of individual rows.
@peterdesmet
Makes sense!
I've updated the PR: if a Table Schema author needs strictness here, they can use `exactFields: true`. If there is a use case for requiring an equal or greater number of fields, we can add a separate constraint later.
Co-authored-by: Peter Desmet <[email protected]>
@peterdesmet
It doesn't use
Yeah, I was thinking about it, and I did like the concept of applying constraints similar to JSONSchema e.g. |
This reverts commit f04b7ed.
@roll the current SHOULD wording is a "recent" change though. It was suggested by me in 2020. From 2012 to 2020 it was MUST, so I believe that is what most implementors/users consider v1. |
@peterdesmet BTW even before the change it had been using
|
This reverts commit f670614.
Regarding v1: ok, seems it was SHOULD. But I think a number of data consumers assumed MUST (including frictionless-r) and did offer |
I would vote for any, both I'm curious if there are better names to mirror it and use |
@roll, we have some precedents of
I think the positive properties (orderedFields, exactFields) are easier to understand than the negative ones (unorderedFields, ...) |
@peterdesmet Upd. |
schema.exact/orderedFields
properties; clarified extra/non-specified fields in Table Schema schema.fieldsMatch
property; clarified extra/non-specified fields in Table Schema
With Paul's support expressed in the original issue, we have a quorum. This was one of the most awaited features in the Data Package, and it's really great that we finally made it! Especially in a form that aligns beautifully with set theory, a form that took shape only after a great discussion (the initial proposal was way different). 🎉 |
ACCEPTED by WG (6/9) |
Overview
This is a first attempt at implementing a new `partial`/`syncSchema`/`schemaComplete` feature that also clarifies the default field-column mapping behavior in Table Schema. It also naturally makes `field.name` unique among the fields in a Table Schema, with a backward-compatibility note for implementors.
I propose adding this property at the Table Schema level because it regulates field-column mapping rules, and both the field and column concepts are defined at the Table Schema level (not at the Data Resource level). Also, because a Table Schema can be shared or be part of a catalog, I think the configuration of compliance rules belongs to the abstract Table Schema rather than to concrete Data Resources.
For example, if I publish a partial Table Schema, I declare that any data source having fields A, B, C in any order is valid against this schema. If I publish a non-partial (default) Table Schema, I declare that only a data source having exactly the 3 fields A, B, C in that exact order is valid against this schema. I think this is important for e.g. government regulation Table Schemas.
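The A, B, C example can be sketched as follows. This is only an illustration of the initial proposal: the function name and the `partial` argument are hypothetical, and treating a partial schema as also accepting extra fields beyond A, B, C is an assumption, since the description does not say so explicitly.

```python
def header_is_valid(schema_fields, data_fields, partial=False):
    """Illustrative check of a data header against a partial or non-partial schema."""
    if partial:
        # Assumption: every schema field must be present, in any order,
        # and additional fields in the data source are tolerated.
        return set(schema_fields) <= set(data_fields)
    # Non-partial (default): exactly the schema fields, in the same order.
    return list(data_fields) == list(schema_fields)
```

Under these assumptions, a data source with header `C, A, B` is valid against the partial schema but not against the non-partial one, which accepts only `A, B, C`.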