-
Notifications
You must be signed in to change notification settings - Fork 179
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extend summary to support list of objects / Better describe summary values #1045
Comments
The values in the summary array can be objects, but the summary values need to be compliant with the original types as defined by STAC. So your platform example is invalid, it can only be an array of strings. But a field that defines an object could be summarized with an array of objects, of course. That Stats Object for summaries allows objects, but in this case it's a single object per field and it must contain minimum and maximum values. So that's not so helpful for your use case. Now that 1.0.0 RC1 has been released, we also can't really change the summaries anymore (before STAC 2.0). So you need to put your additional details somewhere else to be STAC-compliant. That could be a new extension or maybe for APIs it would be covered by the queryables endpoint from OGC APIs? |
Ok but in this example (https://github.com/radiantearth/stac-spec/blob/master/examples/collection-only/collection.json), the summaries property contains a "eo:bands" entry that does not fulfill the Stats Object |
eo:bands (defined as object in the spec) is contained in an array there. I don't understand the issue with the example? |
Sorry I misunderstood the first explanation, i.e. that the properties in the summaries "must be compliant with the original types as defined by STAC". This is not clear from the specification (https://github.com/radiantearth/stac-spec/blob/master/collection-spec/collection-spec.md#summaries). Perhaps this should be explicitly added in the summary chapter ? |
Okay, I've planned this to be improved for RC2 and add something like "summaries must be compliant with the original types as defined by STAC". Of course, you can come up with your own field specifications, but then tooling will likely not understand it. It's probably better to figure out a way to make this work in a general-purpose extension. Although I'm not sure myself where this could be placed best. |
I do think we should have some mechanism to report 'counts' - a couple catalogs do it, and it's definitely a valuable thing to see. I'm very open as to how we do it, and seems to make sense to start in an extension, though I'm potentially open to an optional field as we are going to do RC2 so can put in minor changes. |
@jjrom - can you describe the use case for needing this? Counts make sense to me, and I could see an extension just for that, but want to understand your parentId and name bits, so we're sure to suggestion a mechanism that can fully meet your needs. |
To clarify my initial requirement, forget the other properties in my example (parentId, etc.) and just keep the count property. The use case is to provide the count of the number of items in a stac catalog for each property value. For instance I want to provide the number of items for platform "sentinel-2a" and for "sentinel-2b". With the constraint that the "summaries must be compliant with the original types as defined by STAC", it will never be possible to add this information within the summary property, because under "platform" only an array of string is allowed i.e. ["sentinel-2a", "sentinel-2b"]. The additional count information for "sentinel-2a" and "sentinel-2b" should then be provided in another property (e.g. "summary_extension" - forget the dummy name :). This extension would complexify the implementation both server side and client side. Server side, we need to duplicate platform information: the "STAC core" compliant "summary" property with an array of platform name; and an additional "summary_extension" property with an array of platform objects indicating the count attached to each platform value. Client side, the platform information can be retrieved in two places: the core "summary" with limited information (i.e. the array of string) or the "summary_extension" property with richer information. So I propose to make the summary property a bit more generic in the core specification as follows (wording to be refined) : Summaries must be compliant with the original types as defined by STAC. In the case of a basic data type (i.e. string or number), summaries can be expressed either 1) with the type value or 2) with an object where the type value is provided within a "value" property. For instance the summary "platform" could be expressed either as From an implementation point of view this is not a big deal - just add a if/else case on the summary type. But with a huge advantage that we can add additional fields to the object structure through extension directly within the summaries property. |
I'm opposed to change this for 1.0 as this is a rather huge breaking change (at least for my implementations), which we shouldn't do any more in a release candidate phase (rc should be feature freeze). I understand that there is a use case for it, but it seems that right now the only non-breaking option is a new extension, unfortunately. |
I understand your view but before rc2, the "summaries must be compliant with the original types as defined by STAC" was an implicit constraint...so until making it explicit in rc2, at least my implementations were perfectly valid from the specification. And now they are broken :) |
Yes, currently we would end up with two fields, which is unfortunate, I agree. On the other hand at some point we need to stop introducing breaking changes (in this case only for clients, indeed). For me that is the RC stage, but there are other opinions out there. |
@m-mohr "No, sorry, that is not correct. Although the written documentation was incomplete, the schemas (which are part of the spec) were already enforcing the arrays so your implementation would have been reported as invalid in previous versions already" My implementation is correct from the definition (
Since an object is "Any data type", I have the right to write the platform values as an array of objects: Only the "summaries must be compliant with the original types as defined by STAC" sentence enforce that in the case of "platform" the array must be an array of string and not an array of object. Nowhere in the schema this is specified |
Yes, sorry, I missed that you had encapsulated things in an array. I thought somehow it's an object similar to the Stats Object. I guess it would have been helpful to use markdown code formatting above that makes it easier to read. And I should take more time to read. Indeed, we can't enforce the "same datatype" thing in JSON Schema without a huge amount of additional/duplicated code. Still, I'd argue several things in the spec at least strongly imply that you would need to specify just the values, e.g.
Anyway, I'm against yet another breaking change in the RC stage of the spec. But that is just my opinion and others may vote for it. ;-) |
The funny thing is that i exactly did what was written in the specification i.e. "See the examples folder for Collections with summaries to get a sense of how to use them"...and I found the "eo:bands" property which is an array of objects (with the value being in a property called "name" :) So i generalized that to all my properties to be able to add the "count" property. I'm deeply sorry to raise this issue so lately in the process, but I think that is worth a larger discussion. I understand that this would have an impact on client implementation but it would provide a huge benefit to make the summary property easily extendable in the future. |
Hmm, eo:bands follows the specified data type (a merged array of objects), of course. So that is to be expected that it contains an object with a name for example. If we'd have got the feedback earlier, I guess I could have been talked into the object approach although I still like that the arrays as they are are more compact for most use cases. |
@m-mohr - do you have a clean alternative idea for the 'counts' use case? |
What is clean? I only have the solution to add a new field, there's no other solution I think. To be clear: Breaking summaries will make openEO stick to 0.9 for one or two years. We can't adopt the change during the openEO 1.0 release cycle. The type field though will also not be required there, it will be optional. 🤷♂️ I'm not convinced Jerome's implementation is valid. Adding random objects doesn't allow you to get the values because it's undefined where they are stored then. Implementations can only guess which property contains the value. How should that work? "Collections are strongly recommended to provide summaries of the values of fields that they can expect from the properties of STAC Items contained in this Collection. [...] A set of all distinct values in an array...." -> the values in item properties are clearly defined and they should be listed as a set. |
@m-mohr My implementation is a valid interpretation of what is written in the specification and it conforms to the json schema. It does not add random object but gives context to the summaries values. My proposal does not break any existing server implementations nor any validation tools. It only impacts client to check if the values in an array are a list of basic types or a list of objects. And in the the latter case, if the client only supports a list of basic types, it can easily convert the list of objects to a list of basic types values using the "value" property of that object (i.e. in javascript var propertyAsList = propertyAsObject.map(x => x.value);) Concerning the "count" use case : I want to provide users a summary of what is stored in the catalog - not only how many items are stored, but how many items per platform, per location, per date, etc. This allows to display on the landing page of the catalog a comprehensive set of statistics to indicates the repartition of the items per property. This is a very common use case among data provider (at least space agencies) to indicates on the landing page how many products they have. Another example of the array of basic type limitation is that it supposes an implicit understanding of the values provided. The platform property contains obvious values (i.e. "sentinel-2a" and "sentinel-2b"). But this is not the case for proj:epsg property. For instance it is far more convenient to work with named projection (e.g. for France, I use the "Lambert-93" projection - i never remember its epsg code). So I see a strong interest to be able to provide a human readable title to be used in rendered displays of the property. This would be possible with an array of object instead of an array of basic types. As @cholmes proposes, we should discuss this with everyone involved. |
But several clients.
No, not yet, because the value property is not defined anywhere yet. And thus a client currently can't find any of the values. The spec says you should provide a set of values from the item properties, but you don't provide a set of values from properties. You are providing a set of objects which contain the value somewhere, but it's not clear where it is for clients. So I still think the interpretation is not correct and the spec was specifying it without the clarification about the data types that those objects are not allowed. It's implicitly, but I don't think there's really much room for making it some kind of object. Surely, this could be possible to be implemented in the future and I see the value in your use case, but for me it just doesn't work in summaries. Another question: Is your use case more aimed towards APIs? Like updating counts in a server is easy, but updating a collection for every new item in a static catalog would probably be problematic. For APIs, I'm thinking that you could use OGC API Queryables (see https://portal.ogc.org/files/96288#filter-queryables), which allow quite a bit more for your use case (i.e. full JSON Schema). Doesn't that seem more reasonable to re-use this existing vocabulary instead of defining a new vocabulary here? We could also backport the OGC queryables spec to a queryables/schema extension (we could also call it schemas or something else, because queryables may not fit completely) and be added as new field to collections that would just have what APIs also define for search anyway. It also works for static catalogs, of course. Your platform example there could look like this: "platform": {
"oneOf": [
{
"const": "sentinel-2a",
"title": "Sentinel - 2A",
"parentId": "root",
"count": 211199
},
{
"const": "sentinel-2b",
"title": "Sentinel - 2B",
"parentId": "root",
"count": 208632
}
]
} For EPSG codes some of my clients have a pre-defined list of EPSG codes and their titles and then can just use those without having them in the metadata. |
I don't know how many clients are in the field. And I don't either know how many of them support the optional summaries option. I don't know how many client could be affected. This is why we need a larger discussion with other implementors. |
The following clients and libraries implement summaries as defined in RC2 and are affected:
Plus:
|
To give an example, I've migrated the collection-only example given above (https://github.com/radiantearth/stac-spec/blob/master/examples/collection-only/collection.json) to the proposed API with consistently all objects. Before:
After:
|
Matthias would you be available for a meeting in the next days ? I'm sure that this will allow us to reach a mutual understanding that we obviously are unable to achieve here. |
Yes, sure. Monday is a public holiday for us, Wednesday has some meetings, but otherwise, I'm pretty flexible. |
After a nice discussion with @jjrom this morning, we have come up with a potential solution that at least in my STAC implementations and openEO would be the least breaking. It basically leaves the array structure as it is and only allows native data types, but instead it uses the extensible structure of the Stats Object and evolves it a bit further. We could make it a Schema Object instead of a Stats Object and align it with JSON Schema. As JSON Schema allows new keywords to be defined, we'd define a new keyword "values", which is an array of objects with details about every value, see variant 3. The breaking change would be that the Stats Object, would either require "minimum" and "maximum" (to specify an extent) or allow the "values", which is similar to the array, but can only be an array of objects with a "value" and probably at least one additional property (we have not discussed this requirement yet, Jerome, but I think that would help to distinguish it from the array of native typed values). Example: {
"summaries": {
// variant 1: Array with Stac-native data types, including array merging - no change to rc2
"platform": [
"Sentinel-1A",
"Sentinel-1B",
"Sentinel-1C"
],
// variant 2: Schema Object - formerly known as the "Stats" Object (not very meaningful example)
"platform": {
"minimum": "Sentinel-1A",
"maxmimum": "Sentinel-1C"
}
// variant 3: Schema Object - addition in rc3 is that either "values" (array of objects) or "minimum" and "maximum" are required
"platform": {
"type": "string", // You can optionally add whatever JSON Schema keywords that helps you describe things better
"values": [ // New JSON Schema keyword
{
"value": "Sentinel-1A",
"count": 321
},
{
"value": "Sentinel-1B",
"count": 123
},
{
"value": "Sentinel-1C",
"count": 222
}
]
}
}
} One thing to keep in mind is that this would be yet another breaking change in RC stage of the specification with potential to break things and have side effects that we are not aware of yet. The clean alternative without breaking things (but requiring a bit of duplication in collections) is to define a separate "schema" extension, e.g. {
// variant 4: new extension to specify JSON Schema per field
"schemas": {
"platform": {
"type": "string", // You can optionally add whatever JSON Schema keywords that helps you describe things better
"values": [ // New JSON Schema keyword
{
"value": "Sentinel-1A",
"count": 321
},
{
"value": "Sentinel-1B",
"count": 123
},
{
"value": "Sentinel-1C",
"count": 222
}
]
}
}
} So we need feedback on whether another breaking change by adding in variant 3 would be worth it and supported by the STAC community or whether we should be conservative and evolve new things in STAC extensions first (Alternative 4). |
Thanks you again @m-mohr for the great discussion this morning ! It's true that variant 3 introduces a breaking change which is not exactly what we want in a RC release. However this change should have zero to a a limited impact to existing implementations and it will allow future extension to the "summaries" property which brings a huge benefit to the specification. |
Glad you guys had a good discussion and got to some solid ideas. I can get behind option 3 and doing a breaking change, as it is nicely minimized to just the stats object. I do hear you on it potentially introducing things we're not yet sure of. But it does feel a bit duplicative to have to list all the fields twice. Using JSON Schema I do think is fairly elegant. Though also ironic in light of the comment I made when we first added summaries - #413 (comment) But I think I actually like this approach, of using a subset of JSON Schema. It seems to let us pick which parts of JSON Schema to use, instead of us re-inventing something that people want JSON Schema for. I'm +1 on either option, and I think overall I do like option 3 better than 4, since I feel like a new field called 'schema' that is a json schema could lead to confusion and people trying to fully implement the JSON schema for their collection there. So I like that it's in a 'summary', but we're reusing JSON schema to describe a summary. |
I actually quoted you in the call with Jerome. You said in the SF sprint at Planet that you don't want yet another schema language in summaries... ;-)
Limiting JSON Schema to a subset was not discussed yet by @jjrom and me, but probably makes sense. On the other hand, we discussed requiring (
Of course, the name was just an example. We can name it whatever we like to avoid confusion. |
I'm also ok to just have it evolve through use, and just have the recommended fields to use in the spec, with examples. Key to me would just be not expecting tooling to have to fully support JSON schema parsing to be 'compliant' with the spec.
I'm not sure exactly what you mean but I think I had a similar thought, but was hesitant to suggest it. My thought was to do something like make the stats object an extension, and have the 'core' for now just be the array, and evolve the object option in an extension. I think maybe you're saying to keep an object in core, but it minimal and extensible, and evolve the semantics in an extension? Whatever the exact path is I think I do like it. A more extreme option would be to move summaries as a whole to be an extension, in line with the same instinct to push all the extensions out of core, so the core spec is as slim and extensible as possible. But I don't need to push for it, just throwing it out there. |
tbh I don't really understand what was wrong with the original implementation. "Any data type" means any data type, so in Re:breaking changes, I agree with @cholmes that this is the kind of feedback that RCs are for -- the spec was unclear, which made an apparently innocuous change actually a breaking change, so that should be resolved. Refining the type of summaries to me seems like the right direction for resolving it. |
My intention was not to move summaries to an extension! Also, it still needs a way to specify ranges, but we could loosen the requirement for having minimum and maximum as required properties and simply allow JSON Schema in there with a recommendation to use minimum/maximum for ranges? Not sure. Anyway, if most people are fine with breaking changes, then we can simply PR variant 3 IMHO. The PR should also fix the schema to say
instead of
|
While writing up the JSON Schema for this, I realized that we have an inconsistency with JSON Schema in our spec: |
On the other hand, we may get away without defining our own keyword. We can use "oneOf" (or "anyOf") and "const" instead of from JSON Schema. Example:
|
In the Collection "summaries" property, a summary for a field can be specified in two ways - either 1) an array of "all distinct values in an array", or 2) "a Stats Object".
With this definition, the summary for the platform field with two platforms (sentinel-2a and sentinel-2b) would be:
"platform": [
"sentinel-2a",
"sentinel-2b"
]
In resto, we have additional statistical information for each fields (including platform). In particular, the number of items for each platforms can be provided to the user. It sounds natural to provide this information through the "summaries" property. As a consequence in resto, the previous example is written this way:
"platform": [
{
"id": "sentinel-2a",
"name": "Sentinel - 2A",
"parentId": "root",
"count": 211199
},
{
"id": "sentinel-2b",
"name": "Sentinel - 2B",
"parentId": "root",
"count": 208632
}
]
Does the "all distinct values in an array" mean that the "value" is a string or a number or does it also mean that the "value" can be any object ? In the latter case, the resto approach is valid against the specification. Otherwise, could we update the specification to allow this approach ?
The text was updated successfully, but these errors were encountered: