Dataset metadata spec #164

jpoehnelt · 2018-08-15T17:15:12Z

Just getting it down somewhere...

cholmes · 2018-08-21T05:43:06Z

dataset-spec/README.md

+| provider      | [Provider Object]                     | Data Provider                   | The organization that creates the content of the dataset.    |
+| host          | Host Object                           | Storage Provider                | The organization that hosts the dataset.                     |
+| geometry      | [GeoJSON Object](http://geojson.org/) | Spatial extent (required)       | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
+| datetime      | string                                | Temporal extent (required)      | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. Open date ranges are not supported by ISO 8601 and MUST be encoded as proposed by [Dublin Core Collection Description: Open Date Range Format](http://www.ukoln.ac.uk/metadata/dcmi/date-dccd-odrf/2005-08-13/). |


I think it could be good here to not reuse datetime and geometry, as those are item specific, and instead really talk about the extent.

Ideal to me would be to align with WFS3, so we can add the dataset fields directly to their 'collection' metadata. They have it like:

"extent": {
"spatial": [ 7.01, 50.63, 7.22, 50.78 ],
"temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ]
},

See 'example 4' on https://rawgit.com/opengeospatial/WFS_FES/master/docs/17-069.html

I could be up for a different form, like flatten to spatial_extent and temporal_extent - but I don't think we should reuse the item fields, and I don't believe a dataset should need to conform to the Item schema. If we go with something different I'd want to advocate to WFS3 group to align with us. And I'd see a WFS 3 compliant STAC implementation using these dataset fields in their /collections/{collectionId}/ endpoint.

I'm fine with renaming it to spatial extent and temporal extent, we just used those names for consistency within STAC. (Edit: Changed.)
I don't like the way temporal extents are specified in WFS though, as it doesn't allow leaving the end time out (i.e. open date ranges).

cholmes · 2018-08-21T05:44:41Z

dataset-spec/README.md

+| host          | Host Object                           | Storage Provider                | The organization that hosts the dataset.                     |
+| geometry      | [GeoJSON Object](http://geojson.org/) | Spatial extent (required)       | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
+| datetime      | string                                | Temporal extent (required)      | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. Open date ranges are not supported by ISO 8601 and MUST be encoded as proposed by [Dublin Core Collection Description: Open Date Range Format](http://www.ukoln.ac.uk/metadata/dcmi/date-dccd-odrf/2005-08-13/). |
+| process_graph | Process Graph Object                  | Processing chain                | ...                                                          |


What is this? And what's the definition of it? May make sense as an extension?

Process graph was added for provenance, but will definitely be an extension. The same applies for dimensions. Changed that, too.

matthewhanson · 2018-08-21T08:36:58Z

dataset-spec/README.md

+| description    | string  | Description           | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
+| scheme         | string  | Scheme (required)     | Values: S3, GCS, URL, OTHER                                  |
+| id             | string  | Identifier (required) | Host-specific identifier such as an URL or asset id.         |
+| region         | string  | Region                | Provider specific region where the data is stored.           |


Is region primarily an AWS thing or is it general to all cloud providers?

The schema was written before we knew about the idea of storage profiles in #148. I would really like to have profiles instead of having the storage details directly baked into the dataset spec.

region would probably be an AWS specific thing, which should be in a separate profile as proposed in #148. If others have regions, too, then they should have it separately in their profiles as well.

GCS has regions too

Ok, that makes sense though - each cloud provider should define its own storage profile (extension?)

I'd go that route, yes. Is that dataset specific or could that be also catalog or item specific?

matthewhanson · 2018-08-21T08:38:34Z

dataset-spec/README.md

+
+| Element      | Type     | Name               | Description           |
+| ------------ | -------- | ------------------ | --------------------- |
+| unit         | ?        |                    |                       |


If some sort of unit field is provided it should probably be per-band. But I'm not convinced it's needed, what are the possible values here?

Yes, makes sense to have it per band (and per dimension). Moved it to the bands. An example would be °C for temperatures, but could be any unit of measurement. Preferably SI, but could also link to UDUNITS or the dictionary of UoM.

matthewhanson · 2018-08-21T08:40:01Z

dataset-spec/README.md

+| Element   | Type     | Name          | Description                                                  |
+| --------- | -------- | ------------- | ------------------------------------------------------------ |
+| nodata    | [number] | Nodata values | The no data value(s).                                        |
+| data_type | string   | Data Type     | Data type for band values including its bit size. Values: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float32, float64 |


I do think nodata, scale, and offset are useful to have per band, but not sure about data_type. The other fields may not be provided in the actual datafile (although providers should be encouraged to set them), but data_type will always be available.

data_type was a field I ported from openEO and got spec'ed by my predecessor. I agree, I think it doesn't really make sense. Removed it for now.

matthewhanson · 2018-08-21T08:43:05Z

dataset-spec/README.md

+*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).*
+
+## Core
+


In the EO spec we have gsd (Ground Sample Distance) at the top level, and it's also provided per band because resolution may vary by band. At the top level it represents the best resolution to enable searching. I think it makes sense at the Dataset level rather than the Item level.
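A minimal sketch of the top-level gsd idea, assuming a hypothetical `bands` mapping in which each band carries its own `gsd` in metres. The field layout is illustrative only and is not taken from the EO extension; "best resolution" is read here as the smallest ground sample distance.

```python
from typing import Dict


def top_level_gsd(bands: Dict[str, Dict[str, float]]) -> float:
    """Return the best (smallest) ground sample distance across all bands."""
    if not bands:
        raise ValueError("at least one band is required")
    return min(band["gsd"] for band in bands.values())


# Hypothetical band listing; the three values are Sentinel-2 band resolutions.
bands = {
    "B01": {"gsd": 60.0},
    "B02": {"gsd": 10.0},
    "B05": {"gsd": 20.0},
}
print(top_level_gsd(bands))
```

A search over datasets could then filter on this single value without inspecting every band.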

@matthewhanson What do you think? Would it make sense to extend and share the EO extension across datasets and items or to have it separated? Or should there be one extension, which has sections on items and datasets? I think I'd prefer to share the same extension... some definitions probably make sense for items and datasets equally.

I'm not sure I entirely get what you are saying.
I think datasets are a concept which will be used in general with STAC - although I'm still not sure if the intention is that datasets are always present and if they are themselves part of core. I personally think they should be part of core; they include core fields: temporal and spatial extent are the unions of core fields, license, provider...

So the EO extension, or any extension, should define additions to both the Dataset and the Item.

I hope datasets are part of the core and always present, but not sure what others think.

I agree with what you are saying about extensions and that answers basically my question. So I would expect that the additional EO fields we are proposing for datasets will be incorporated into the EO extension. If you are okay with that, I'd already move them to the eo extension in the branch we are currently working on. They are currently a bit badly located in the dataset-spec.

Yes, I think those additional fields belong in the EO extension...although I think that I'd still like to see some of the non-varying asset info in Datasets, but I'll make the case and provide some examples in the EO extension.

Not sure whether we are speaking about the same thing. I am speaking about an EO extension that - whenever meaningful - is shared between Dataset and Item and can be used in both locations! Are you just talking about an EO extension limited to items? Otherwise I don't get the point you make with "I'd still like to see some of the non-varying asset info in Datasets".

I think we are talking about the same thing, EO extension is shared between Dataset and Item.

What I'm saying is that some of the Asset information, such as the list of possible assets and what their types are, can be added at the Dataset level. I think I explained it a bit better elsewhere. Maybe we need to make a new issue for it, this PR is getting a bit hard to follow.

Great, then I'd like to have your support here (regarding global extensions): #186 ;-)

Sure, I think more issues would make sense! I think we are also discussing that in #174. I'd like a proposal on this. I don't think it's simply copy and paste (minus url) from the assets spec in the items?

matthewhanson · 2018-08-21T14:34:23Z

dataset-spec/README.md

+| ------- | -------- | ------------- | ------------------------------------------------------------ |
+| nodata  | [number] | Nodata values | The no data value(s).                                        |
+| offset  | number   | Offset        | Offset to convert band values to the actual measurement scale. |
+| scale   | number   | Scale         | Scale to convert band values to the actual measurement scale. |


Been thinking about this more and scale and offset definitely need to be specified per asset, not per band. Individual items can have different scales and offsets. While Landsat collection 1 uses a consistent scale/offset for all scenes, this was not always the case and a provider may certainly provide them per Item. It is not really a band characteristic, it is an asset characteristic.
I had thought of it because there are other reasons too - for instance I might take the RGB bands and scale them to a Byte and provide that as a visible True Color image, or other bands to make a False Color image, but still want to reference the bands that make up that 3-band image.

In the same vein, units really is per asset as well. I might provide a TOA datafile as well as a Surface reflectance file.

And in those cases I might want to use different nodata values as well, so they really should all be per-asset. Units and nodata can be specified per asset in the Dataset, but scale/offset needs to be specified for each asset in each Item.

@m-mohr @matthewhanson gain (scale) and offsets may have distinct values depending on the date, see the case for LS4 and LS5. How should those cases be represented? I also prefer gain instead of scale, but this may be a Landsat thing.

@matthewhanson : I don't quite follow. What is an asset in this case? My understanding is that by assets STAC people usually mean files on disk. Do you mean a specific file for a specific item? But a gdal file can have two bands that have different gain/offset.

For reference, datasets in EE almost always have the same gain/offset for each band across all items. The two exceptions are Landsat and SeaWiFS OceanColor dataset that has different processing for different items in the same dataset. We did not dare to change Landsat, but for SeaWiFS we applied gain/offset before ingestion to make the data easier to use in ARD-like manner.

So in general, yes, each item can have its own gain/offset. I've never seen nodata values differ across items, though. If they do, it might make more sense to split the dataset in two or more.

@matthewhanson In the normal "STAC way" with Items and Assets that might be feasible, but for openEO (and GEE?), we still need that information - whenever meaningful - per dataset or band, as we don't have any items/assets where this information is available! So we probably need some of these properties in all locations.

@fredliporace

Honestly, it's just a name, but I am more used to scale. Don't really care whether it's gain or scale. It has the same meaning, correct?

Not sure whether we can cover all of that, currently it can't be represented. Maybe just leave it out if it depends on another property? I don't have a proper solution for it yet - but I'm open to proposals.

@fredliporace Right, that's why I'm saying scale and offset need to be provided per asset rather than at some higher level like the dataset.

I actually prefer gain as well, but scale and offset are the terms used by GDAL which make them a bit more universal I think.

@simonff What I'm saying is that the scales and offsets need to be provided for each specific asset, and they need to be arrays because as you say the asset could have multiple bands.

I agree nodata values probably won't be different across items, it would probably be fine to specify them per asset at the Dataset level.

Well, then it's easy to solve, right? If scale/offset are per asset, define it in the asset. If it's per band, define it in the band (and maybe also in the asset, not sure).

And again, openEO can't use it per item, so I'd need both ways anyway.
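A minimal sketch of the "scale and offset as arrays per asset" idea, with one entry per band in the asset. The thread does not spell out the formula, so this assumes the GDAL convention `physical = raw * scale + offset`; the function name and the asset layout are hypothetical.

```python
from typing import List, Sequence


def to_physical(
    raw_bands: Sequence[Sequence[float]],
    scales: Sequence[float],
    offsets: Sequence[float],
) -> List[List[float]]:
    """Convert raw band values to measurement values, band by band."""
    if not (len(raw_bands) == len(scales) == len(offsets)):
        raise ValueError("need exactly one scale and one offset per band")
    return [
        [value * scale + offset for value in band]
        for band, scale, offset in zip(raw_bands, scales, offsets)
    ]


# Hypothetical asset with two bands that use different scale/offset pairs.
asset = {"scale": [0.5, 2.0], "offset": [1.0, -1.0]}
raw = [[2, 4], [10]]
print(to_physical(raw, asset["scale"], asset["offset"]))
```

Keeping the arrays on the asset lets two items of the same dataset carry different values, which is the Landsat case raised above.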

fredliporace · 2018-08-22T02:05:53Z

dataset-spec/README.md

+| license_url     | string                                | Dataset License URL             | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. |
+| provider        | [Provider Object]                     | Data Provider                   | The organizations that created the content of the dataset.   |
+| host            | Host Object                           | Storage Provider                | The organization that hosts the dataset.                     |
+| spatial_extent  | [GeoJSON Object](http://geojson.org/) | Spatial extent (required)       | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |


@m-mohr Is this to be interpreted as 'possible extent' or as 'current extent'? My concern here is that some missions have the capacity to image most of the Earth but do not make systematic acquisitions - CBERS for instance.
The footprint for CBERS-4 MUX scenes may be obtained here, just select CBERS on the upper right panel. You'll see that most of northern Canada is not covered yet by the dataset, but a new scene may be acquired anytime in the future. In that case should the spatial_extent be changed? If it is changed could the dataset version be the same?

I would prefer possible extent to keep search results (over datasets, not over items in a dataset) more consistent.

For openEO we had it defined as current extent and so I had that in mind, but I am open to both. Whoever has the best arguments wins. ;-)

I think I prefer possible extent. That's a lot easier to implement for static providers. And someone who wants it to be 'current extent' can do that if they want - the current extent is certainly within the possible extent.

I don't want catalogs to feel like they have to always be updating this field.

I lean towards 'possible extent'. An implementor could choose to make theirs 'current extent', since the current should be a subset of the possible. But I think it's better to not require static catalogs to keep updating their extent every single time there's new data outside their current.
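The relationship stated here (the current extent is a subset of the possible extent) can be sketched as a simple containment check. This assumes extents are given as WGS84 bounding boxes in `[west, south, east, north]` order, as in the WFS3 example quoted elsewhere in this thread, and that no box crosses the antimeridian; the function name is hypothetical.

```python
from typing import Sequence


def bbox_contains(outer: Sequence[float], inner: Sequence[float]) -> bool:
    """Return True if `inner` lies entirely within `outer`.

    Both boxes are [west, south, east, north] in WGS84 longitude/latitude.
    """
    west_o, south_o, east_o, north_o = outer
    west_i, south_i, east_i, north_i = inner
    return (
        west_o <= west_i
        and south_o <= south_i
        and east_o >= east_i
        and north_o >= north_i
    )


possible = [-180.0, -90.0, 180.0, 90.0]  # what the mission could image
current = [7.01, 50.63, 7.22, 50.78]     # what has been acquired so far
print(bbox_contains(possible, current))
```

A static catalog that publishes only the possible extent never has to update it when a new scene lands inside that box.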

"possible" vs "current" also applies for the temporal_extent as open date ranges would always be "possible". I would like to have them and for consistency and the reasons mentioned above, I now slightly prefer "possible", too. But with open date ranges we are not compatible with WFS, see also opengeospatial/ogcapi-features#155.

Added "potential" to both extents.

fredliporace · 2018-08-22T02:15:42Z

dataset-spec/README.md

+| ------- | -------- | ------------- | ------------------------------------------------------------ |
+| nodata  | [number] | Nodata values | The no data value(s).                                        |
+| offset  | number   | Offset        | Offset to convert band values to the actual measurement scale. |
+| scale   | number   | Scale         | Scale to convert band values to the actual measurement scale. |


@m-mohr @matthewhanson gain (scale) and offsets may have distinct values depending on the date, see the case for LS4 and LS5. How should those cases be represented? I also prefer gain instead of scale, but this may be a Landsat thing.

simonff · 2018-08-22T04:23:50Z

dataset-spec/README.md

+| version         | string                                | Dataset Version                 | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
+| license         | string                                | Dataset License Name (required) | Dataset's license(s) as a [SPDX License identifier or expression](https://spdx.org/licenses/) or `proprietary` if the license is not on the SPDX license list. See `license_url` for more information. |
+| license_url     | string                                | Dataset License URL             | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. |
+| provider        | [Provider Object]                     | Data Provider                   | The organizations that created the content of the dataset.   |


I suggest a single provider here. I added support for multiple providers in the EE catalog, and never found any use for it other than expressing the processing chain, which we want to do more systematically elsewhere.

I am hesitant to only rely on the process chain extension (or whatever we call it). It is still a long way to having a standard that properly defines this processing information apart from maybe a provider and a dataset url. Still, we should have something small in the core, I think. It should be easy for users to get at least some information about the history. I think the processing chain information would be much, much harder to express and then it will just be left out. We also discussed the field derived_from. Can we combine that somehow?

And for me it's also to give proper credit and having them all makes clear what to put here. A single provider again leaves it open whether it's the RAW data provider or the last one processing it.

Maybe we could also just have something like "history" in the core, which has a list of provider name + provider homepage + dataset url (derived_from). As we don't need the dataset url for the last provider (it's that catalog) the last provider would be the provider in the dataset. A process chain extension just extends the History Object and the Dataset Object, so that a process_chain can simply be added to each history element, too.

Example follows...

{
  "id": "sentinel2-processed",
  "description": "Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent": { },
  "temporal_extent": "2015/2018",
  "license": "Apache-2.0",
  "provider": {
    "name": "Final processor, Inc.",
    "url": "http://www.final-corporation.com"
  },
  "pc:process_chain": {
    "process": "max_time_composite"
  },
  "history": [
    {
      "provider": {
        "organization": "Another Processing Comp, Inc.",
        "url": "http://processing.inc"
      },
      "dataset_url": "http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain": {
        "process": "ndvi"
      }
    },
    {
      "provider": {
        "organization": "ESA",
        "url": "http://esa.eu"
      },
      "dataset_url": "http://esa.eu/data/sentinel-2"
    }
  ]
}

We could also put the last provider directly into the history, dataset_url would be to self or omitted. Then we would have no direct provider in the top-level, but that would be okay for me.

{
  "id": "sentinel2-processed",
  "description": "Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent": { },
  "temporal_extent": "2015/2018",
  "license": "Apache-2.0",
  "history": [
    {
      "provider": {
        "name": "Final processor, Inc.",
        "url": "http://www.final-corporation.com"
      },
      "pc:process_chain": {
        "process": "max_time_composite"
      }
    },
    {
      "provider": {
        "organization": "Another Processing Comp, Inc.",
        "url": "http://processing.inc"
      },
      "dataset_url": "http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain": {
        "process": "ndvi"
      }
    },
    {
      "provider": {
        "organization": "ESA",
        "url": "http://esa.eu"
      },
      "dataset_url": "http://esa.eu/data/sentinel-2"
    }
  ]
}
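To illustrate how a client might read either variant of the proposed `history` field, here is a rough sketch that lists providers from the original source to the latest processor. It assumes, as in the two examples, that `history` is ordered most recent first and that a top-level `provider`, when present, is the latest one. The function name is hypothetical, and the lookup of both `name` and `organization` only mirrors the mixed keys used in the examples.

```python
from typing import Any, Dict, List


def provider_chain(dataset: Dict[str, Any]) -> List[str]:
    """Return provider names ordered from original source to latest processor."""
    entries = list(dataset.get("history", []))
    if "provider" in dataset:
        # Variant 1: the latest provider sits at the top level of the dataset.
        entries.insert(0, {"provider": dataset["provider"]})
    names = []
    for entry in entries:
        provider = entry.get("provider", {})
        names.append(provider.get("name") or provider.get("organization"))
    # Entries are most recent first, so reverse for chronological order.
    return names[::-1]


dataset = {
    "id": "sentinel2-processed",
    "provider": {"name": "Final processor, Inc."},
    "history": [
        {"provider": {"organization": "Another Processing Comp, Inc."}},
        {"provider": {"organization": "ESA"}},
    ],
}
print(provider_chain(dataset))
```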

simonff · 2018-08-22T04:31:30Z

dataset-spec/README.md

+| ------------ | -------- | ------------------ | --------------------- |
+| asset_schema | object   | Asset Schema       | TODO                  |
+| nodata       | [number] | Nodata values      | The no data value(s). |
+| pyramid      | object   | Pyramid parameters | TODO                  |


So far I only used this as an enum describing the way averaging is done during pyramiding: MEAN for normal continuous bands like EO observations or temperature/DEM/etc, MODE for categorical values (e.g. landcover classes), SAMPLE for the rest (typically this means bitmask bands). SAMPLE means just take one possibly random value, as nothing else makes sense. (In reality in this case we always take the UL value for each 2x2 grid.)
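The three pyramiding policies described above can be sketched on a plain list-of-lists grid. This is only an illustration of the enum's meaning, not Earth Engine's implementation; the function name is hypothetical, odd trailing rows or columns are dropped, and ties in MODE resolve to the first value encountered in the block.

```python
from collections import Counter
from statistics import mean
from typing import List, Sequence


def downsample_2x2(grid: Sequence[Sequence[float]], policy: str) -> List[list]:
    """Reduce each 2x2 block of `grid` to one value using the given policy."""
    out = []
    for r in range(0, len(grid) - 1, 2):
        row = []
        for c in range(0, len(grid[0]) - 1, 2):
            # Block order: upper-left, upper-right, lower-left, lower-right.
            block = [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]]
            if policy == "MEAN":      # continuous data such as temperature
                row.append(mean(block))
            elif policy == "MODE":    # categorical data such as landcover
                row.append(Counter(block).most_common(1)[0][0])
            elif policy == "SAMPLE":  # bitmasks: keep the upper-left value
                row.append(block[0])
            else:
                raise ValueError(f"unknown pyramiding policy: {policy}")
        out.append(row)
    return out


print(downsample_2x2([[1, 2], [3, 4]], "MEAN"))
print(downsample_2x2([[5, 5], [7, 5]], "MODE"))
print(downsample_2x2([[1, 2], [3, 4]], "SAMPLE"))
```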

@simonff, could you come up with a proposal for this field? And maybe also asset_schema? These initially were your ideas and I think you have a much better idea to define them.

m-mohr · 2018-08-22T07:34:52Z

Added a first draft for the dimension extension, is open for review.
Was partly proposed by a partner of openEO on basis of the OGC Coverage Implementation Schema.

…ents to dataset spec.

…hema. Fixed tables.

m-mohr · 2018-08-24T09:59:24Z

Another issue we should have in mind: #150

cholmes · 2018-08-24T14:35:12Z

dataset-spec/README.md

+| provider        | [Provider Object]                     | Data Provider                   | The organizations that created the content of the dataset.   |
+| host            | Host Object                           | Storage Provider                | The organization that hosts the dataset.                     |
+| spatial_extent  | [GeoJSON Object](http://geojson.org/) | Spatial extent (required)       | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
+| temporal_extent | string                                | Temporal extent (required)      | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. ToDo: Support open date ranges |


It could be really good to try to get compatibility with WFS on the extent fields. They use:

"extent": { "spatial": [ 7.01, 50.63, 7.22, 50.78 ], "temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ] },

If we want we can try to influence them to adopt our convention, but we should have good reasoning.

I'm perfectly fine with the additional level of hierarchy (i.e. putting them both in an object with the key extent).

Spatial is fine, too.

Temporal on the other hand is a problem. Open date ranges are not supported. I already opened an issue on that: Issues with temporal data, including (open) ranges / periods opengeospatial/ogcapi-features#155

What is missing from your example is trs and crs. Do we need to support them?

I just used 'example 4' from the WFS spec. It notes: 'Coordinate reference system information is not provided as the service provides geometries only in the default system (WGS84 longitude/latitude)'. So seems like we could just say for dataset we require default.

Okay, then that's fine. Curious what the WFS crew is coming up with for the other issues mentioned. Will change that in the dataset spec. Temporal extent is still a string for now and we need to find out how they define 3D bboxes (= incl. the z-axis).

Should be mostly WFS compatible now.

fredliporace · 2018-08-26T23:12:53Z

dataset-spec/README.md

+
+| Element     | Type              | Name                            | Description                                                  |
+| ----------- | ----------------- | ------------------------------- | ------------------------------------------------------------ |
+| name        | string            | Identifier (required)           | Identifier for the dataset that is unique across the provider. The identi |


@m-mohr possible accidental text 'cut'

Fixed, thank you for the hint.

…ns, added rel types, ...), updated JSON schema and added an example.

… a separate PR.

m-mohr · 2018-08-27T11:28:28Z

I fleshed out the spec a bit more.
Additionally, I cleaned up this PR to just contain the dimension extension and the dataset spec for now.
I think it is time for a review to get this in so we can develop and discuss better based on this first proposal.

ghost · 2018-08-28T19:39:23Z

This looks pretty good to me. I think reading through this got me a little closer to understanding how the concepts of "catalog", "collection" and "dataset" are being handled.

I'm not sure it's clear where the extensions fit in? Are they just added as additional properties in the root dataset object? Is dimensions a dataset specific extension or does it also apply to Items?

It sounds like any extension (eo in particular was discussed) can be applied at the dataset level? If so, do we need to indicate that item metadata takes precedence over dataset metadata?

cholmes

Looks great. Added a few typo corrections, but I don't see any blocking - would be good to get this in.

cholmes · 2018-08-29T00:27:06Z

dataset-spec/README.md

@@ -0,0 +1,103 @@
+# STAC Dataset Spec
+
+[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset*. Another topic of interest is the search of datasets, instead of within a dataset.  The Dataset Spec is an independent spec that STAC Items are *strongly recommended* to use. Other parties can also independently use this spec to describe datasets in a lightweight way.


Is it that STAC Items 'use' the spec? I guess 'use' is less clear to me, it seems to imply that they should maybe implement the fields? Maybe describe what 'use' means? Like 'STAC Items are recommended to provide a link to a dataset definition'?

Improved that as suggested.

cholmes · 2018-08-29T00:31:43Z

dataset-spec/README.md

+
+[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset*. Another topic of interest is the search of datasets, instead of within a dataset.  The Dataset Spec is an independent spec that STAC Items are *strongly recommended* to use. Other parties can also independently use this spec to describe datasets in a lightweight way.
+
+The Datasets Spec is a superset of the [Catalog Spec](../static-catalog/). I shares the same fields and therefore every Dataset is also a valid Catalog. Datasets can have both parent Catalogs and Datasets and child Items, Catalogs and Datasets. 


I -> It

Also not sure if 'is a superset' is the best way to explain it. I'd say more like 'extends the catalog spec'. Probably would add a bit of color, like 'extends the catalog spec with additional fields to describe the set of items in the catalog'.

Both fixed/added.

cholmes · 2018-08-29T00:34:06Z

dataset-spec/README.md

+
+### Provider Object
+
+The object provides information about a provider. A provider is any of the organizations that created or processed the content of the dataset and therefore influenced the data offered by this dataset.


I think you said somewhere that multiple providers are allowed? If so I'd make it a bit more explicit here. If not then we should provide guidance on which one of the 'any...that created or processed'.

The Provider object provides information about a single provider, but the Dataset object can hold multiple providers in an array. I made that more clear in the provider field.

Makes sense, sounds great.

cholmes · 2018-08-29T00:35:07Z

dataset-spec/README.md

+
+The objects provides information about the storage provider hosting the data. 
+
+**Note:** The idea of storage profiles is currently [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, scheme, id and region may be removed from the final spec once this concept id introduced to STAC.


typo: id -> is

cholmes · 2018-08-29T00:37:50Z

dataset-spec/example-s2.json

+  "name": "COPERNICUS/S2",
+  "title": "Sentinel-2 MSI: MultiSpectral Instrument, Level-1C",
+  "description": "Sentinel-2 is a wide-swath, high-resolution, multi-spectral\nimaging mission supporting Copernicus Land Monitoring studies,\nincluding the monitoring of vegetation, soil and water cover,\nas well as observation of inland waterways and coastal areas.\n\nThe Sentinel-2 data contain 13 UINT16 spectral bands representing\nTOA reflectance scaled by 10000. See the [Sentinel-2 User Handbook](https://sentinel.esa.int/documents/247904/685211/Sentinel-2_User_Handbook)\nfor details. In addition, three QA bands are present where one\n(QA60) is a bitmask band with cloud mask information. For more\ndetails, [see the full explanation of how cloud masks are computed.](https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-1c/cloud-masks)\n\nEach Sentinel-2 product (zip archive) may contain multiple\ngranules. Each granule becomes a separate Earth Engine asset.\nEE asset ids for Sentinel-2 assets have the following format:\nCOPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the\nfirst numeric part represents the sensing date and time, the\nsecond numeric part represents the product generation date and\ntime, and the final 6-character string is a unique granule identifier\nindicating its UTM grid reference (see [MGRS](https://en.wikipedia.org/wiki/Military_Grid_Reference_System)).\n\nFor more details on Sentinel-2 radiometric resoltuon, [see this page](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/radiometric).\n",
+  "license": "proprietary",


Is the S2 license 'proprietary'? If it is in our current definition maybe we should expand the definition a bit. Or ideally find the ideal spdx license. For landsat I think we just used https://spdx.org/licenses/PDDL-1.0.html as getting across the intent of US public domain stuff...

Honestly, I don't really know and don't feel in the position to decide that. That example is taken from GEE and I handled it the way GEE does.

Ok, cool. Super minor issue, so let's just leave it, and following GEE sounds good.

m-mohr · 2018-08-29T08:19:25Z

Mostly fixed the issues mentioned by Chris. One more review required.

@hgs-truthe01 We still need to elaborate more on the extensions and "inheritance".

cholmes · 2018-08-29T15:18:09Z

Looks good to me, and looks like my green review checkmark is still there. Just need one more review - @matthewhanson or @hgs-truthe01 ?

m-mohr · 2018-08-29T15:22:04Z

I just made another compatibility change, which improves the WFS 3 compatibility as we are now using their specification for temporal extents with one exception. We were using ISO8601 date ranges before and WFS3 uses a two element array. We changed to two element arrays, but here it is allowed to set one of the elements to null to support open date ranges. I proposed that also in the WFS issue tracker, maybe we get that into WFS (but it doesn't seem anybody is currently very interested in temporal extents ;-) ).
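A rough sketch of how a client could read the two-element temporal extent with `null` marking an open end. The function names are hypothetical. Rejecting an extent where both elements are `null` is an assumption, since the comment only says one element may be null. The parser handles the timestamp shape used in this thread and is not a full RFC 3339 implementation.

```python
from datetime import datetime
from typing import Optional, Sequence, Tuple


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp such as 2010-02-15T12:34:56Z."""
    # Older Python versions do not accept a trailing "Z" in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def parse_temporal_extent(
    extent: Sequence[Optional[str]],
) -> Tuple[Optional[datetime], Optional[datetime]]:
    """Return (start, end); None on either side means an open date range."""
    if len(extent) != 2:
        raise ValueError("temporal extent must have exactly two elements")
    start = parse_timestamp(extent[0]) if extent[0] is not None else None
    end = parse_timestamp(extent[1]) if extent[1] is not None else None
    if start is None and end is None:
        # Assumption: at least one bound must be given.
        raise ValueError("at least one bound of the extent must be set")
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be later than end")
    return start, end


closed = parse_temporal_extent(["2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z"])
open_ended = parse_temporal_extent(["2010-02-15T12:34:56Z", None])
print(closed, open_ended)
```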

ghost

I guess I will save any extension/inheritance discussion for another thread. At the very least this cleans up the root catalog stuff nicely.

ghost · 2018-08-29T17:11:38Z

One last comment.

We have so far been standardizing on GeoJSON and RFC3339 for space and time. The extent stuff is introducing a new way to specify similar information.

It is a slightly different case, so I could be convinced that the extent definitions should be different, but it is less developer friendly to use two competing formats for representing information. One of the issues I have seen with geospatial APIs is around extents and projections. I would prefer to only represent geospatial coordinates in one projection to keep clients from needing a full projection engine.

I only bring this up because Space and Time are the core parts of the spec. If we are not consistent, it feels weird to me.

m-mohr · 2018-08-29T17:18:11Z

@hgs-truthe01 We are using RFC3339 now. That's what I wanted to say with my latest comment. ;)
GeoJSON is not applicable for temporal extents. For spatial extents we are following WFS3, which is also WGS84, but doesn't allow GeoJSON. So I think we got it combined mostly and can't do much more.
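To make the relationship between the two representations concrete, here is a rough Python sketch that derives a WFS3-style `[west, south, east, north]` array from a GeoJSON geometry. The helper names are hypothetical, and the sketch assumes WGS84 longitude/latitude input with a `coordinates` member; it does not handle geometries crossing the antimeridian or `GeometryCollection` objects.

```python
from typing import Iterator, List, Tuple


def iter_positions(coords) -> Iterator[Tuple[float, float]]:
    # A position is a list that starts with a number; anything else is a
    # nested list of positions (rings, polygons, ...), so recurse into it.
    if isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def bbox_from_geometry(geometry: dict) -> List[float]:
    """Return [min_lon, min_lat, max_lon, max_lat] for a GeoJSON geometry."""
    lons, lats = zip(*iter_positions(geometry["coordinates"]))
    return [min(lons), min(lats), max(lons), max(lats)]


# Example polygon using the extent values quoted earlier in this PR.
polygon = {
    "type": "Polygon",
    "coordinates": [[
        [7.01, 50.63], [7.22, 50.63], [7.22, 50.78], [7.01, 50.78], [7.01, 50.63],
    ]],
}
spatial_extent = bbox_from_geometry(polygon)
```

No projection engine is needed for this step because both forms use the same WGS84 coordinates.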

simonff · 2018-08-29T16:41:06Z

dataset-spec/README.md

+| keywords    | [string]          | List of keywords describing the dataset.                     |
+| version     | string            | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
+| license     | string            | **REQUIRED.** Dataset's license(s) as a SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. |
+| provider    | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. |


I think a single provider is enough for the vast majority of the cases.

Cool. Are you saying we should just mention that this will usually just be one provider? Or that we should shift from array to a single object?

I suggest shifting to a single object to reduce complexity. (In the internal EE catalog, I started with providers as a list and found that this is not necessary in 99.9% of cases, and the remaining 0.1% is better handled by provenance.) I'm open to revisiting this, but I'd like to see more examples first.

I think this needs more discussion, but an issue would be the better place for it, as this PR is already very messy.

simonff · 2018-08-29T16:42:36Z

dataset-spec/README.md

+| version     | string            | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
+| license     | string            | **REQUIRED.** Dataset's license(s) as a SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. |
+| provider    | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. |
+| host        | Host Object       | Storage provider, the organization that hosts the dataset.   |


In addition to the 'host' field, I suggest the 'source' field of the same type Host that points at the canonical source of the data. This is needed for 99% of datasets in the EE catalog, as we are mostly a mirror.

Open to adding that, but I think we can do it in another PR shortly after this one.

simonff · 2018-08-29T16:43:29Z

dataset-spec/README.md

+
+* [EO extension](../extensions/stac-eo-spec.md)
+  Please note that some fields such as `eo:sun_elevation ` or `eo:sun_azimuth` are only meaningful on the item level and MUST not be used in datasets.
+* [Dimensions extension](../extensions/dimensions)


We don't have any examples of data with extra dimensions yet, so I suggest dropping dimensions from this version (and just mentioning it as 'planned' here, like we do for the provenance extension).

We have at least a draft for this extension in contrast to provenance. And I have data for the dimension extension. I'll ask my colleague to publish that. But we can move that to a separate PR, if that's preferred by you or others.

Right, I'd rather reduce this PR and then calmly discuss the change separately.

Okay, will separate when I'm back in the office in 9 hours ;)

Dimensions extension has been moved to PR #227.

simonff · 2018-08-29T16:44:15Z

dataset-spec/README.md

+| parent  | URL to the parent [STAC Catalog](../static-catalog/) or Dataset. |
+| child   | URL to a child [STAC Catalog](../static-catalog/) or Dataset. |
+| item    | URL to a [STAC Item](../json-spec/).                         |
+| license | The license URL for the dataset SHOULD be specified if the `license` field is set to `proprietary`. If there is no public license URL available, it is RECOMMENDED to supplement the STAC catalog with the license text in separate file and link to this file. |


Should we distinguish between links to official license pages vs local copies? Eg, rel="license_copy"

Why should we distinguish them? I don't see a good reason yet, and we don't do it for items either.

Having to create a local license file means that the data provider has not organized their licenses well and may change them later, making local copies obsolete. But this is speculative, so I'm okay with not changing this now.

Okay, great. Please, make sure to open an issue on this one so we don't forget.
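As a rough illustration of the `license` rule quoted in the diff above, a validator could flag datasets that declare `proprietary` without a `license` link. This is only a sketch: the function name is hypothetical, and it assumes links are stored in a `links` array of objects with `rel` and `href` members.

```python
def missing_license_link(dataset: dict) -> bool:
    """Return True if the dataset is 'proprietary' but has no usable license link."""
    # SPDX identifiers and expressions are not covered by this SHOULD rule.
    if dataset.get("license") != "proprietary":
        return False
    links = dataset.get("links", [])
    # Assumption: each link is an object with "rel" and "href" members.
    has_license_link = any(
        link.get("rel") == "license" and link.get("href") for link in links
    )
    return not has_license_link
```

Because the spec text uses SHOULD rather than MUST, a tool would probably report this as a warning, not as a validation error.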

m-mohr · 2018-08-30T08:11:14Z

Moved Dimensions extension as requested. Can be merged now, I think.

jpoehnelt changed the title ~~[WIP] ~Collections~ Datasets~~ [WIP] Collections, Now Datasets Aug 15, 2018

jpoehnelt changed the base branch from master to dev August 15, 2018 22:45

cholmes changed the title ~~[WIP] Collections, Now Datasets~~ [WIP] Dataset metadata spec Aug 18, 2018

m-mohr mentioned this pull request Aug 20, 2018

Storage profiles #148

Closed

cholmes reviewed Aug 21, 2018

matthewhanson reviewed Aug 21, 2018

cholmes added this to the 0.6.0 milestone Aug 21, 2018

fredliporace reviewed Aug 22, 2018

simonff reviewed Aug 22, 2018

jpoehnelt and others added 7 commits August 22, 2018 11:10

dataset schema

4ff110f

Added human-readable specification for datasets.

2980c07

Renamed extents, improved docs and schema, moved extensions.

c4b6e94

Improvements to EO extension, restructured extensions, minor improvements to dataset spec.

8dab2ac

Added first draft of the dimensions extension.

9e9414b

Merge branch 'dev' into feat/dataset

431fe02

Moved EO extension and improved Dataset spec and adapted and fixed schema. Fixed tables.

d32c1e2

cholmes mentioned this pull request Aug 23, 2018

Create a generic 'dataset' / 'collection' specification, and use that to replace 'root catalog' concept. #194

Closed

Merge branch 'dev' into feat/dataset

c31422e

cholmes reviewed Aug 24, 2018

Changes to EO extension, made dataset spec more compliant with WFS3.

6224a83

fredliporace reviewed Aug 26, 2018

Merge branch 'dev' into feat/dataset

4fef59f

m-mohr added 2 commits August 27, 2018 13:09

Improved the dataset spec (descriptions, adopted to current discussions, added rel types, ...), updated JSON schema and added an example.

28d25fc

Reverted changes to EO and collection extension. Will propose them in a separate PR.

f4ccca6

m-mohr requested a review from cholmes August 27, 2018 11:28

m-mohr changed the title ~~[WIP] Dataset metadata spec~~ Dataset metadata spec Aug 27, 2018

m-mohr requested review from jeffnaus and matthewhanson August 27, 2018 11:33

m-mohr and others added 2 commits August 27, 2018 18:03

Changes to formatting (trying to make all specs look the same).

e7a9e5c

Merge branch 'dev' into feat/dataset

032cf97

cholmes approved these changes Aug 29, 2018

m-mohr and others added 2 commits August 29, 2018 10:17

Improved descriptions and fixed several minor issues,

89e35a9

Merge branch 'dev' into feat/dataset

e7d7641

Minor change to the license spec.

6a32aca

Changed temporal extent to be more WFS3 like.

af1b16a

ghost approved these changes Aug 29, 2018

simonff reviewed Aug 29, 2018

Moved dimensions extension to a separate PR

a592be6

cholmes merged commit 7684895 into radiantearth:dev Aug 30, 2018

m-mohr deleted the feat/dataset branch August 31, 2018 15:42

m-mohr mentioned this pull request Sep 7, 2018

[WIP] Adding more fields to the bands + Added per asset EO data. #245

Closed

		\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).

		## Core

		@@ -0,0 +1,103 @@
		# STAC Dataset Spec

		[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset. Another topic of interest is the search of datasets, instead of within a dataset. The Dataset Spec is an independent spec that STAC Items are strongly recommended* to use. Other parties can also independently use this spec to describe datasets in a lightweight way.


		The Datasets Spec is a superset of the [Catalog Spec](../static-catalog/). It shares the same fields and therefore every Dataset is also a valid Catalog. Datasets can have both parent Catalogs and Datasets and child Items, Catalogs and Datasets.


		### Provider Object

		The object provides information about a provider. A provider is any of the organizations that created or processed the content of the dataset and therefore influenced the data offered by this dataset.


		The object provides information about the storage provider hosting the data.

		Note: The idea of storage profiles is currently being [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, scheme, id and region may be removed from the final spec once this concept is introduced to STAC.
