Dataset metadata spec #164

Merged
merged 19 commits into from
Aug 30, 2018

jpoehnelt
Contributor

Just getting it down somewhere...

@jpoehnelt jpoehnelt changed the title [WIP] ~Collections~ Datasets [WIP] Collections, Now Datasets Aug 15, 2018
@jpoehnelt jpoehnelt changed the base branch from master to dev August 15, 2018 22:45
@cholmes cholmes changed the title [WIP] Collections, Now Datasets [WIP] Dataset metadata spec Aug 18, 2018
@m-mohr m-mohr mentioned this pull request Aug 20, 2018
| provider | [Provider Object] | Data Provider | The organization that creates the content of the dataset. |
| host | Host Object | Storage Provider | The organization that hosts the dataset. |
| geometry | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
| datetime | string | Temporal extent (required) | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. Open date ranges are not supported by ISO 8601 and MUST be encoded as proposed by [Dublin Core Collection Description: Open Date Range Format](http://www.ukoln.ac.uk/metadata/dcmi/date-dccd-odrf/2005-08-13/). |
Contributor

I think it could be good here to not reuse datetime and geometry, as those are item specific, and instead really talk about the extent.

Ideal to me would be to align with WFS3, so we can add the dataset fields directly to their 'collection' metadata. They have it like:

"extent": {
  "spatial": [ 7.01, 50.63, 7.22, 50.78 ],
  "temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ]
},

See 'example 4' on https://rawgit.com/opengeospatial/WFS_FES/master/docs/17-069.html

I could be up for a different form, like flatten to spatial_extent and temporal_extent - but I don't think we should reuse the item fields, and I don't believe a dataset should need to conform to the Item schema. If we go with something different I'd want to advocate to WFS3 group to align with us. And I'd see a WFS 3 compliant STAC implementation using these dataset fields in their /collections/{collectionId}/ endpoint.
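A minimal Python sketch of what consuming the WFS3-style `extent` could look like. Only the `extent` layout is taken from the example quoted above; the `id` value, the `check_extent` helper, and the assumption that the four numbers are `[west, south, east, north]` in WGS84 are illustrative, not part of any spec text in this PR.

```python
from datetime import datetime

# Hypothetical dataset record; only the "extent" layout comes from the WFS3 example.
dataset = {
    "id": "example-dataset",  # assumed identifier, for illustration only
    "extent": {
        "spatial": [7.01, 50.63, 7.22, 50.78],
        "temporal": ["2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z"],
    },
}


def check_extent(extent):
    """Return True if the extent has a sane bbox and an ordered time range."""
    west, south, east, north = extent["spatial"]
    # Assumes a bbox that does not cross the antimeridian.
    bbox_ok = -180 <= west <= east <= 180 and -90 <= south <= north <= 90
    start, end = (
        datetime.strptime(t, "%Y-%m-%dT%H:%M:%SZ") for t in extent["temporal"]
    )
    return bbox_ok and start <= end


print(check_extent(dataset["extent"]))
```

Keeping both extents under one `extent` key means the same object could be dropped into a WFS3 collection description unchanged, which is the alignment argued for here.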

Collaborator

@m-mohr m-mohr Aug 21, 2018

I'm fine with renaming it to spatial extent and temporal extent, we just used those names for consistency within STAC. (Edit: Changed.)
I don't like the way temporal extents are specified in WFS though, as it doesn't allow leaving the end time out (i.e. open date ranges).
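To make the open date range concern concrete, here is a small Python sketch of one hypothetical encoding: a two-element list where `None` stands for a missing bound. This encoding and the `covers` helper are assumptions for illustration only; nothing in this thread settles on them.

```python
from datetime import datetime, timezone


def covers(temporal, moment):
    """Check whether `moment` falls inside [start, end]; None means unbounded."""
    start, end = temporal
    return (start is None or start <= moment) and (end is None or moment <= end)


# Hypothetical dataset that is still being extended, so it has no end time.
ongoing = [datetime(2015, 6, 23, tzinfo=timezone.utc), None]
print(covers(ongoing, datetime(2030, 1, 1, tzinfo=timezone.utc)))
```

A fixed pair of timestamps, as in the WFS example, cannot express the `ongoing` case without picking an arbitrary end date.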

| host | Host Object | Storage Provider | The organization that hosts the dataset. |
| geometry | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
| datetime | string | Temporal extent (required) | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. Open date ranges are not supported by ISO 8601 and MUST be encoded as proposed by [Dublin Core Collection Description: Open Date Range Format](http://www.ukoln.ac.uk/metadata/dcmi/date-dccd-odrf/2005-08-13/). |
| process_graph | Process Graph Object | Processing chain | ... |
Contributor

What is this? And what's the definition of it? May make sense as an extension?

Collaborator

@m-mohr m-mohr Aug 21, 2018

Process graph was added for provenance, but will definitely be an extension. The same applies for dimensions. Changed that, too.

| description | string | Description | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| scheme | string | Scheme (required) | Values: S3, GCS, URL, OTHER |
| id | string | Identifier (required) | Host-specific identifier such as an URL or asset id. |
| region | string | Region | Provider specific region where the data is stored. |
Collaborator

Is region primarily an AWS thing or is it general to all cloud providers?

Collaborator

The schema was written before we knew about the idea of storage profiles in #148. I would really like to have profiles instead of having the storage details directly baked into the dataset spec.

region would probably be an AWS-specific thing, which should be in a separate profile as proposed in #148. If others have regions, too, then they should have it separately in their profiles as well.

Copy link


GCS has regions too

Collaborator

Ok, that makes sense though - each cloud provider should define its own storage profile (extension?)

Collaborator

I'd go that route, yes. Is that dataset specific or could that be also catalog or item specific?


| Element | Type | Name | Description |
| ------------ | -------- | ------------------ | --------------------- |
| unit | ? | | |
Collaborator

If some sort of unit field is provided it should probably be per-band. But I'm not convinced it's needed, what are the possible values here?

Collaborator

Yes, makes sense to have it per band (and per dimension). Moved it to the bands. An example would be °C for temperatures, but could be any unit of measurement. Preferably SI, but could also link to UDUNITS or the dictionary of UoM.

| Element | Type | Name | Description |
| --------- | -------- | ------------- | ------------------------------------------------------------ |
| nodata | [number] | Nodata values | The no data value(s). |
| data_type | string | Data Type | Data type for band values including its bit size. Values: uint8, uint16, uint32, uint64, int8, int16, int32, int64, float16, float32, float64 |
Collaborator

@matthewhanson matthewhanson Aug 21, 2018

I do think nodata, scale, and offset are useful to have per band, but not sure about data_type. The other fields may not be provided in the actual datafile (although providers should be encouraged to set them), but data_type will always be available.

Collaborator

data_type was a field I ported from openEO and got spec'ed by my predecessor. I agree, I think it doesn't really make sense. Removed it for now.

*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).*

## Core

Collaborator

In the EO spec we have gsd (Ground Sample Distance) at the top level, and it's also provided per band because resolution may vary by band. At the top level it represents the best resolution to enable searching. I think it makes sense at the Dataset level rather than the Item level.

Collaborator

@m-mohr m-mohr Aug 21, 2018

@matthewhanson What do you think? Would it make sense to extend and share the EO extension across datasets and items or to have it separated? Or should there be one extension, which has sections on items and datasets? I think I'd prefer to share the same extension... some definitions probably make sense for items and datasets equally.

Collaborator

I'm not sure I entirely get what you are saying.
I think datasets are a concept which will be used in general with STAC - although I'm still not sure if the intention is that datasets are always present and if they are themselves part of core. I personally think they should be part of core, as they include core fields: temporal and spatial extent are the unions of core fields, license, provider...

So the EO extension, or any extension, should define additions to both the Dataset and the Item.

Collaborator

@m-mohr m-mohr Aug 21, 2018

I hope datasets are part of the core and always present, but not sure what others think.

I agree with what you are saying about extensions and that answers basically my question. So I would expect that the additional EO fields we are proposing for datasets will be incorporated into the EO extension. If you are okay with that, I'd already move them to the eo extension in the branch we are currently working on. They are currently a bit badly located in the dataset-spec.

Collaborator

Yes, I think those additional fields belong in the EO extension...although I think that I'd still like to see some of the non-varying asset info in Datasets, but I'll make the case and provide some examples in the EO extension.

Collaborator

@m-mohr m-mohr Aug 22, 2018

Not sure whether we are speaking about the same thing. I am speaking about an EO extension that - whenever meaningful - is shared between Dataset and Item and can be used in both locations! Are you just talking about an EO extension limited to items? Otherwise I don't get the point you make with "I'd still like to see some of the non-varying asset info in Datasets".

Collaborator

I think we are talking about the same thing, EO extension is shared between Dataset and Item.

What I'm saying is that some of the Asset information, such as the list of possible assets and what their types are, can be added at the Dataset level. I think I explained it a bit better elsewhere. Maybe need to make a new issue for it, this PR is getting a bit hard to follow.

Collaborator

@m-mohr m-mohr Aug 22, 2018

Great, then I'd like to have your support here (regarding global extensions): #186 ;-)

Sure, I think more issues would make sense! I think we are also discussing that in #174. I'd like a proposal on this. I don't think it's simply copy and paste (minus url) from the assets spec in the items?

| ------- | -------- | ------------- | ------------------------------------------------------------ |
| nodata | [number] | Nodata values | The no data value(s). |
| offset | number | Offset | Offset to convert band values to the actual measurement scale. |
| scale | number | Scale | Scale to convert band values to the actual measurement scale. |
Collaborator

Been thinking about this more and scale and offset definitely need to be specified per asset, not per band. Individual items can have different scales and offsets. While Landsat collection 1 uses a consistent scale/offset for all scenes, this was not always the case and a provider may certainly provide them per Item. It is not really a band characteristic; it is an asset characteristic.
I had thought of it because there are other reasons too - for instance I might take the RGB bands and scale them to a Byte and provide that as a visible True Color image, or other bands to make a False Color image, but still want to reference the bands that make up that 3-band image.

Collaborator

In the same vein, units really are per asset as well. I might provide a TOA datafile as well as a Surface reflectance file.

And in those cases I might want to use different nodata values as well, so they really should all be per-asset. Units and nodata can be specified per asset in the Dataset, but scale/offset needs to be specified for each asset in each Item.

Contributor

@m-mohr @matthewhanson gain (scale) and offsets may have distinct values depending on the date, see the case for LS4 and LS5. How should those cases be represented? I also prefer gain instead of scale, but this may be a Landsat thing.

Copy link


@matthewhanson : I don't quite follow. What is an asset in this case? My understanding is that by assets STAC people usually mean files on disk. Do you mean a specific file for a specific item? But a gdal file can have two bands that have different gain/offset.

For reference, datasets in EE almost always have the same gain/offset for each band across all items. The two exceptions are Landsat and the SeaWiFS OceanColor dataset, which have different processing for different items in the same dataset. We did not dare to change Landsat, but for SeaWiFS we applied gain/offset before ingestion to make the data easier to use in an ARD-like manner.

So in general, yes, each item can have its own gain/offset. I've never seen nodata values differ across items, though. If they do, it might make more sense to split the dataset in two or more.

Collaborator

@m-mohr m-mohr Aug 22, 2018

@matthewhanson In the normal "STAC way" with Items and Assets that might be feasible, but for openEO (and GEE?), we still need that information - whenever meaningful - per dataset or band, as we don't have any items/assets where this information is available! So we probably need some of these properties in all locations.

@fredliporace

  • Honestly, it's just a name, but I am more used to scale. Don't really care whether it's gain or scale. It has the same meaning, correct?
  • Not sure whether we can cover all of that, currently it can't be represented. Maybe just leave it out if it depends on another property? I don't have a proper solution for it yet - but open for proposals.

Collaborator

@fredliporace Right, that's why I'm saying scale and offset need to be provided per asset rather than at some higher level like the dataset.

I actually prefer gain as well, but scale and offset are the terms used by GDAL which make them a bit more universal I think.

Collaborator

@simonff What I'm saying is that the scales and offsets need to be provided for each specific asset, and they need to be arrays because as you say the asset could have multiple bands.

I agree nodata values probably won't be different across items, it would probably be fine to specify them per asset at the Dataset level.
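As a rough illustration of per-asset arrays, the Python sketch below applies one scale and one offset per band. It assumes the GDAL-style linear form `value * scale + offset`; the `to_physical` helper, the `asset` dictionary layout, and all numbers are hypothetical.

```python
def to_physical(raw_bands, scales, offsets, nodata=None):
    """Apply value * scale + offset per band, leaving nodata cells as None."""
    if not (len(raw_bands) == len(scales) == len(offsets)):
        raise ValueError("need one scale and one offset per band")
    converted = []
    for values, scale, offset in zip(raw_bands, scales, offsets):
        converted.append(
            [None if v == nodata else v * scale + offset for v in values]
        )
    return converted


# Hypothetical two-band asset with its own scale/offset arrays.
asset = {"scale": [0.5, 2.0], "offset": [10.0, 0.0]}
raw = [[0, 4, 8], [1, 0, 3]]
print(to_physical(raw, asset["scale"], asset["offset"], nodata=0))
```

Because the arrays travel with the asset, two items in the same dataset can carry different values without touching the dataset-level metadata.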

Collaborator

@m-mohr m-mohr Aug 22, 2018

Well, then it's easy to solve, right? If scale/offset are per asset, define it in the asset. If it's per band, define it in the band (and maybe also in the asset, not sure).

And again, openEO can't use it per item, so I'd need both ways anyway.

@cholmes cholmes added this to the 0.6.0 milestone Aug 21, 2018
| license_url | string | Dataset License URL | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. |
| provider | [Provider Object] | Data Provider | The organizations that created the content of the dataset. |
| host | Host Object | Storage Provider | The organization that hosts the dataset. |
| spatial_extent | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
Contributor

@m-mohr Is this to be interpreted as 'possible extent' or as 'current extent'? My concern here is that some missions have the capacity to image most of the Earth but do not make systematic acquisitions - CBERS for instance.
The footprint for CBERS-4 MUX scenes may be obtained here, just select CBERS on the upper right panel. You'll see that most of northern Canada is not covered yet by the dataset, but a new scene may be acquired anytime in the future. In that case should the spatial_extent be changed? If it is changed, could the dataset version be the same?

Copy link


I would prefer possible extent to keep search results (over datasets, not over items in a dataset) more consistent.

Collaborator

@m-mohr m-mohr Aug 22, 2018

For openEO we had it defined as current extent and so I had that in mind, but I am open to both. Whoever has the best arguments wins. ;-)

Contributor

I think I prefer possible extent. That's a lot easier to implement for static providers. And someone who wants it to be 'current extent' can do that if they want - the current extent is certainly within the possible extent.

I don't want catalogs to feel like they have to always be updating this field.

Contributor

I lean towards 'possible extent'. An implementor could choose to make theirs 'current extent', since the current should be a subset of the possible. But I think it's better to not require static catalogs to keep updating their extent every single time there's new data outside their current.
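The "current is a subset of possible" relationship can be sketched in Python as a simple containment check. For brevity this assumes both extents are bounding boxes in `[west, south, east, north]` order rather than full GeoJSON geometries; the `bbox_within` helper and the coordinates are hypothetical.

```python
def bbox_within(inner, outer):
    """True if bbox `inner` lies inside bbox `outer` ([west, south, east, north])."""
    return (
        outer[0] <= inner[0] <= inner[2] <= outer[2]
        and outer[1] <= inner[1] <= inner[3] <= outer[3]
    )


possible = [-180.0, -90.0, 180.0, 90.0]  # hypothetical: sensor can image anywhere
current = [-82.0, -56.0, -34.0, 13.0]    # hypothetical: scenes acquired so far
print(bbox_within(current, possible))
```

A static catalog that publishes `possible` never has to rewrite the field when a new scene lands, as long as the check above keeps holding.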

Collaborator

"possible" vs "current" also applies for the temporal_extent as open date ranges would always be "possible". I would like to have them and for consistency and the reasons mentioned above, I now slightly prefer "possible", too. But with open date ranges we are not compatible with WFS, see also opengeospatial/ogcapi-features#155.

Collaborator

Added "potential" to both extents.

| version | string | Dataset Version | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
| license | string | Dataset License Name (required) | Dataset's license(s) as a [SPDX License identifier or expression](https://spdx.org/licenses/) or `proprietary` if the license is not on the SPDX license list. See `license_url` for more information. |
| license_url | string | Dataset License URL | Dataset's license URL SHOULD be specified if `license` is set to `proprietary`. |
| provider | [Provider Object] | Data Provider | The organizations that created the content of the dataset. |
Copy link


I suggest a single provider here. I added support for multiple providers in the EE catalog, and never found any use for it other than expressing the processing chain, which we want to do more systematically elsewhere.

Collaborator

@m-mohr m-mohr Aug 22, 2018

I am hesitant to only rely on the process chain extension (or whatever we call it). There is still a long way to go before we have a standard that properly defines this processing information, apart from maybe a provider and a dataset url. Still, we should have something small in the core, I think. It should be easy for users to get at least some information about the history. I think the processing chain information would be much, much harder to express and then it will just be left out. We also discussed the field derived_from. Can we combine that somehow?

And for me it's also to give proper credit and having them all makes clear what to put here. A single provider again leaves it open whether it's the RAW data provider or the last one processing it.

Maybe we could also just have something like "history" in the core, which has a list of provider name + provider homepage + dataset url (derived_from). As we don't need the dataset url for the last provider (it's that catalog) the last provider would be the provider in the dataset. A process chain extension just extends the History Object and the Dataset Object, so that a process_chain can simply be added to each history element, too.

Example follows...

Collaborator

@m-mohr m-mohr Aug 22, 2018

{
  "id":"sentinel2-processed",
  "description":"Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent":{

  },
  "temporal_extent":"2015/2018",
  "license":"Apache-2.0",
  "provider":{
    "name":"Final processor, Inc.",
    "url":"http://www.final-corporation.com"
  },
  "pc:process_chain":{
    "process":"max_time_composite"
  },
  "history":[
    {
      "provider":{
        "organization":"Another Processing Comp, Inc.",
        "url":"http://processing.inc"
      },
      "dataset_url":"http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain":{
        "process":"ndvi"
      }
    },
    {
      "provider":{
        "organization":"ESA",
        "url":"http://esa.eu"
      },
      "dataset_url":"http://esa.eu/data/sentinel-2"
    }
  ]
}

We could also put the last provider directly into the history, dataset_url would be to self or omitted. Then we would have no direct provider in the top-level, but that would be okay for me.

{
  "id":"sentinel2-processed",
  "description":"Sentinel 2 NDVI max composite processed by Final Processor and Another Processing Comp, data originates from ESA.",
  "spatial_extent":{

  },
  "temporal_extent":"2015/2018",
  "license":"Apache-2.0",
  "history":[
    {
      "provider":{
        "name":"Final processor, Inc.",
        "url":"http://www.final-corporation.com"
      },
      "pc:process_chain":{
        "process":"max_time_composite"
      }
    },
    {
      "provider":{
        "organization":"Another Processing Comp, Inc.",
        "url":"http://processing.inc"
      },
      "dataset_url":"http://processing.inc/datasets/sentinel2-processed/catalog.json",
      "pc:process_chain":{
        "process":"ndvi"
      }
    },
    {
      "provider":{
        "organization":"ESA",
        "url":"http://esa.eu"
      },
      "dataset_url":"http://esa.eu/data/sentinel-2"
    }
  ]
}
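A short Python sketch of how a client might read such a `history` list to give credit to every provider. The `provider_chain` helper is hypothetical, and because the draft examples above use both `name` and `organization` as the provider key, the sketch accepts either.

```python
def provider_chain(dataset):
    """List provider names from the most recent processor back to the origin."""
    names = []
    for entry in dataset.get("history", []):
        provider = entry.get("provider", {})
        # The draft examples use both "name" and "organization" as the key.
        names.append(provider.get("name") or provider.get("organization"))
    return names


# Trimmed copy of the second example above (URLs and process chains omitted).
dataset = {
    "id": "sentinel2-processed",
    "history": [
        {"provider": {"name": "Final processor, Inc."}},
        {"provider": {"organization": "Another Processing Comp, Inc."}},
        {"provider": {"organization": "ESA"}},
    ],
}
print(provider_chain(dataset))
```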

| ------------ | -------- | ------------------ | --------------------- |
| asset_schema | object | Asset Schema | TODO |
| nodata | [number] | Nodata values | The no data value(s). |
| pyramid | object | Pyramid parameters | TODO |
Copy link


So far I only used this as an enum describing the way averaging is done during pyramiding: MEAN for normal continuous bands like EO observations or temperature/DEM/etc, MODE for categorical values (e.g., landcover classes), SAMPLE for the rest (typically this means bitmask bands). SAMPLE means just take one possibly random value, as nothing else makes sense. (In reality in this case we always take the UL value for each 2x2 grid.)
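The three policies can be illustrated with a small pure-Python sketch that halves a grid by collapsing each 2x2 block. The function names and the list-of-lists grid are assumptions for illustration, and this is not the Earth Engine implementation; SAMPLE follows the "take the upper-left value" behaviour described above.

```python
from collections import Counter
from statistics import mean


def reduce_block(block, policy):
    """Collapse one 2x2 block (given row-major as 4 values) to a single value."""
    if policy == "MEAN":
        return mean(block)
    if policy == "MODE":
        return Counter(block).most_common(1)[0][0]
    if policy == "SAMPLE":
        return block[0]  # upper-left value
    raise ValueError(f"unknown pyramiding policy: {policy}")


def downsample(grid, policy):
    """Halve a grid with even dimensions by reducing each 2x2 block."""
    return [
        [
            reduce_block(
                [grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1]],
                policy,
            )
            for c in range(0, len(grid[0]), 2)
        ]
        for r in range(0, len(grid), 2)
    ]


grid = [[1, 1, 4, 8], [1, 3, 4, 4]]
for policy in ("MEAN", "MODE", "SAMPLE"):
    print(policy, downsample(grid, policy))
```

Averaging class codes or bitmasks would produce values that mean nothing, which is why the policy has to be chosen per band.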

Collaborator

@simonff, could you come up with a proposal for this field? And maybe also asset_schema? These initially were your ideas and I think you have a much better idea of how to define them.

@m-mohr
Collaborator

m-mohr commented Aug 22, 2018

Added a first draft for the dimension extension, is open for review.
It was partly proposed by a partner of openEO on the basis of the OGC Coverage Implementation Schema.

@m-mohr
Collaborator

m-mohr commented Aug 24, 2018

Another issue we should have in mind: #150

| provider | [Provider Object] | Data Provider | The organizations that created the content of the dataset. |
| host | Host Object | Storage Provider | The organization that hosts the dataset. |
| spatial_extent | [GeoJSON Object](http://geojson.org/) | Spatial extent (required) | The spatial extent covered by the dataset as [GeoJSON](http://geojson.org/) object. |
| temporal_extent | string | Temporal extent (required) | Temporal extent covered by the dataset. Date/time intervals MUST be formatted according to ISO 8601. ToDo: Support open date ranges |
Contributor

It could be really good to try to get compatibility with WFS on the extent fields. They use:

      "extent": {
        "spatial": [ 7.01, 50.63, 7.22, 50.78 ],
        "temporal": [ "2010-02-15T12:34:56Z", "2018-03-18T12:11:00Z" ]
      },

If we want we can try to influence them to adopt our convention, but we should have good reasoning.

Collaborator

@m-mohr m-mohr Aug 24, 2018

Contributor

I just used 'example 4' from the WFS spec. It notes: 'Coordinate reference system information is not provided as the service provides geometries only in the default system (WGS84 longitude/latitude)'. So seems like we could just say for dataset we require default.

Collaborator

Okay, then that's fine. Curious what the WFS crew is coming up with for the other issues mentioned. Will change that in the dataset spec. Temporal extent is still a string for now and I need to find out how they define 3D bboxes (= incl. the z-axis).

Collaborator

Should be mostly WFS compatible now.


| Element | Type | Name | Description |
| ----------- | ----------------- | ------------------------------- | ------------------------------------------------------------ |
| name | string | Identifier (required) | Identifier for the dataset that is unique across the provider. The identi |
Contributor

@m-mohr possible accidental text 'cut'

Collaborator

Fixed, thank you for the hint.

@m-mohr
Collaborator

m-mohr commented Aug 27, 2018

I fleshed out the spec a bit more.
Additionally, I cleaned up this PR to just contain the dimension extension and the dataset spec for now.
I think it is time for a review to get this in so we can develop and discuss better based on this first proposal.

@m-mohr m-mohr requested a review from cholmes August 27, 2018 11:28
@m-mohr m-mohr changed the title [WIP] Dataset metadata spec Dataset metadata spec Aug 27, 2018
@ghost

ghost commented Aug 28, 2018

This looks pretty good to me. Reading through this helped me get a little closer to understanding how the concepts of "catalog", "collection" and "dataset" are being handled.

I'm not sure it's clear where the extensions fit in? Are they just added as additional properties in the root dataset object? Is dimensions a dataset specific extension or does it also apply to Items?

It sounds like any extension (eo in particular was discussed) can be applied at the dataset level? If so, do we need to indicate that item metadata takes precedence over dataset metadata?
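If item metadata were to take precedence, the rule could be as small as the Python sketch below. This is only one possible answer to the question above, not something the spec states; the `effective_properties` helper and the field names are hypothetical.

```python
def effective_properties(dataset_props, item_props):
    """Merge dataset-level defaults with item-level values; the item wins on conflict."""
    return {**dataset_props, **item_props}


dataset_props = {"gsd": 30.0, "platform": "example-sat"}  # hypothetical defaults
item_props = {"gsd": 15.0}                                # hypothetical override
print(effective_properties(dataset_props, item_props))
```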

Contributor

@cholmes cholmes left a comment

Looks great. Added a few typo corrections, but I don't see any blocking - would be good to get this in.

@@ -0,0 +1,103 @@
# STAC Dataset Spec

[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset*. Another topic of interest is the search of datasets, instead of within a dataset. The Dataset Spec is an independent spec that STAC Items are *strongly recommended* to use. Other parties can also independently use this spec to describe datasets in a lightweight way.
Contributor

Is it that STAC Items 'use' the spec? I guess 'use' is less clear to me, it seems to imply that they should maybe implement the fields? Maybe describe what 'use' means? Like 'STAC Items are recommended to provide a link to a dataset definition'?

Collaborator

Improved that as suggested.


[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset*. Another topic of interest is the search of datasets, instead of within a dataset. The Dataset Spec is an independent spec that STAC Items are *strongly recommended* to use. Other parties can also independently use this spec to describe datasets in a lightweight way.

The Datasets Spec is a superset of the [Catalog Spec](../static-catalog/). I shares the same fields and therefore every Dataset is also a valid Catalog. Datasets can have both parent Catalogs and Datasets and child Items, Catalogs and Datasets.
Contributor

I -> It

Also not sure if 'is a superset' is the best way to explain it. I'd say more like 'extends the catalog spec'. Probably would add a bit of color, like 'extends the catalog spec with additional fields to describe the set of items in the catalog'.

Collaborator

Both fixed/added.


### Provider Object

The object provides information about a provider. A provider is any of the organizations that created or processed the content of the dataset and therefore influenced the data offered by this dataset.
Contributor

I think you said somewhere that multiple providers are allowed? If so I'd make it a bit more explicit here. If not then we should provide guidance on which one of the 'any...that created or processed'.

Collaborator

The Provider object provides information about a single provider, but the Dataset object can hold multiple providers in an array. I made that more clear in the provider field.

Contributor

Makes sense, sounds great.


The objects provides information about the storage provider hosting the data.

**Note:** The idea of storage profiles is currently [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, scheme, id and region may be removed from the final spec once this concept id introduced to STAC.
Contributor

typo: id -> is

Collaborator

Fixed.

"name": "COPERNICUS/S2",
"title": "Sentinel-2 MSI: MultiSpectral Instrument, Level-1C",
"description": "Sentinel-2 is a wide-swath, high-resolution, multi-spectral\nimaging mission supporting Copernicus Land Monitoring studies,\nincluding the monitoring of vegetation, soil and water cover,\nas well as observation of inland waterways and coastal areas.\n\nThe Sentinel-2 data contain 13 UINT16 spectral bands representing\nTOA reflectance scaled by 10000. See the [Sentinel-2 User Handbook](https://sentinel.esa.int/documents/247904/685211/Sentinel-2_User_Handbook)\nfor details. In addition, three QA bands are present where one\n(QA60) is a bitmask band with cloud mask information. For more\ndetails, [see the full explanation of how cloud masks are computed.](https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-1c/cloud-masks)\n\nEach Sentinel-2 product (zip archive) may contain multiple\ngranules. Each granule becomes a separate Earth Engine asset.\nEE asset ids for Sentinel-2 assets have the following format:\nCOPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the\nfirst numeric part represents the sensing date and time, the\nsecond numeric part represents the product generation date and\ntime, and the final 6-character string is a unique granule identifier\nindicating its UTM grid reference (see [MGRS](https://en.wikipedia.org/wiki/Military_Grid_Reference_System)).\n\nFor more details on Sentinel-2 radiometric resoltuon, [see this page](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/radiometric).\n",
"license": "proprietary",
Contributor

Is the S2 license 'proprietary'? If it is in our current definition maybe we should expand the definition a bit. Or ideally find the ideal spdx license. For landsat I think we just used https://spdx.org/licenses/PDDL-1.0.html as getting across the intent of US public domain stuff...

Collaborator

Honestly, I don't really know and don't feel in the position to decide that. That example is taken from GEE and I handled it the way GEE does.

Contributor

Ok, cool. Super minor issue, so let's just leave it, and following GEE sounds good.

Collaborator

m-mohr commented Aug 29, 2018

Mostly fixed the issues mentioned by Chris. One more review required.

@hgs-truthe01 We still need to elaborate more on the extensions and "inheritance".

Contributor

cholmes commented Aug 29, 2018

Looks good to me, and looks like my green review checkmark is still there. Just need one more review - @matthewhanson or @hgs-truthe01 ?

Collaborator

m-mohr commented Aug 29, 2018

I just made another compatibility change, which improves WFS 3 compatibility: we now use their specification for temporal extents, with one exception. We were using ISO 8601 date ranges before, whereas WFS 3 uses a two-element array. We changed to two-element arrays, but here it is allowed to set one of the elements to null to support open date ranges. I also proposed that in the WFS issue tracker, so maybe we can get it into WFS (though nobody seems very interested in temporal extents at the moment ;-) ).
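
As a rough illustration of the two-element temporal extent described above, the following sketch parses such an array and treats `null` (`None` in Python) as an open end. It assumes UTC timestamps with a `Z` suffix and whole seconds; `parse_temporal_extent` is a hypothetical helper, and a real client would probably want a full RFC 3339 parser instead.

```python
from datetime import datetime, timezone


def parse_timestamp(value):
    """Parse a UTC timestamp such as '2015-06-23T00:00:00Z' (simplifying assumption)."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def parse_temporal_extent(extent):
    """Return (start, end) from a two-element array; None marks an open end."""
    if len(extent) != 2:
        raise ValueError("temporal extent must have exactly two elements")
    start, end = (parse_timestamp(v) if v is not None else None for v in extent)
    # The thread only mentions setting one element to null; the case of
    # both being null is not rejected here.
    if start is not None and end is not None and start > end:
        raise ValueError("start must not be later than end")
    return start, end


# Hypothetical example: an extent that starts in 2015 and is still ongoing
start, end = parse_temporal_extent(["2015-06-23T00:00:00Z", None])
```

Keeping the open end as `None` instead of a sentinel date lets callers decide how to treat "ongoing" datasets.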

@ghost ghost left a comment

I guess I will save any extension/inheritance discussion for another thread. At the very least this cleans up the root catalog stuff nicely.


ghost commented Aug 29, 2018

One last comment.

We have so far been standardizing on GeoJSON and RFC3339 for space and time. The extent stuff is introducing a new way to specify similar information.

It is a slightly different case, so I could be convinced that the extent definitions should be different, but it is less developer friendly to use two competing formats for representing information. One of the issues I have seen with geospatial APIs is around extents and projections. I would prefer to only represent geospatial coordinates in one projection to keep clients from needing a full projection engine.

I only bring this up because Space and Time are the core parts of the spec. If we are not consistent, it feels weird to me.

Collaborator

m-mohr commented Aug 29, 2018

@hgs-truthe01 We are using RFC3339 now. That's what I wanted to say with my latest comment. ;)
GeoJSON is not applicable for temporal extents. For spatial extents we are following WFS3, which is also WGS84, but doesn't allow GeoJSON. So I think we got it combined mostly and can't do much more.
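
For clients that want to stay with GeoJSON throughout, a bounding-box extent can be converted with a few lines. The sketch below assumes the spatial extent is a four-element `[west, south, east, north]` array in WGS84 and that it does not cross the antimeridian; `bbox_to_polygon` is a hypothetical helper, not part of the spec.

```python
def bbox_to_polygon(bbox):
    """Convert [west, south, east, north] into a GeoJSON Polygon geometry.

    Assumes WGS84 longitude/latitude and a box that does not cross the
    antimeridian. The exterior ring is counter-clockwise and closed.
    """
    west, south, east, north = bbox
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],  # repeat the first position to close the ring
    ]
    return {"type": "Polygon", "coordinates": [ring]}


# Hypothetical extent, for illustration only
polygon = bbox_to_polygon([-180.0, -56.0, 180.0, 83.0])
```

Since both representations use WGS84, no projection engine is needed for this conversion.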

| keywords | [string] | List of keywords describing the dataset. |
| version | string | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
| license | string | **REQUIRED.** Dataset's license(s) as a SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. |
| provider | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. |

I think a single provider is enough for the vast majority of the cases.

Contributor

Cool. Are you saying we should just mention that this will usually just be one provider? Or that we should shift from array to a single object?


I suggest shifting to a single object to reduce complexity. (In the internal EE catalog, I started with providers as a list and found that this is not necessary in 99.9% of cases, and the remaining 0.1% is better handled by provenance.) I'm open to revisiting this, but I'd like to see more examples first.

Collaborator

I think this needs more discussion, but an issue would be the better place for it, as this PR is already very messy.

| version | string | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
| license | string | **REQUIRED.** Dataset's license(s) as a SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. |
| provider | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. |
| host | Host Object | Storage provider, the organization that hosts the dataset. |

In addition to the 'host' field, I suggest adding a 'source' field of the same type (Host) that points at the canonical source of the data. This is needed for 99% of datasets in the EE catalog, as we are mostly a mirror.

Collaborator

Open to add that, but I think we can add that with another PR shortly after this PR.


* [EO extension](../extensions/stac-eo-spec.md)
Please note that some fields such as `eo:sun_elevation ` or `eo:sun_azimuth` are only meaningful on the item level and MUST not be used in datasets.
* [Dimensions extension](../extensions/dimensions)

We don't have any examples of data with extra dimensions yet, so I suggest dropping dimensions from this version (and just mentioning it as "planned" here, like we do for the provenance extension).

Collaborator

We have at least a draft for this extension in contrast to provenance. And I have data for the dimension extension. I'll ask my colleague to publish that. But we can move that to a separate PR, if that's preferred by you or others.


Right, I'd rather reduce this PR and then calmly discuss the change separately.

Collaborator

Okay, will separate when I'm back in the office in 9 hours ;)

Collaborator

Dimensions extension has been moved to PR #227.

| parent | URL to the parent [STAC Catalog](../static-catalog/) or Dataset. |
| child | URL to a child [STAC Catalog](../static-catalog/) or Dataset. |
| item | URL to a [STAC Item](../json-spec/). |
| license | The license URL for the dataset SHOULD be specified if the `license` field is set to `proprietary`. If there is no public license URL available, it is RECOMMENDED to supplement the STAC catalog with the license text in separate file and link to this file. |

Should we distinguish between links to official license pages vs local copies? Eg, rel="license_copy"

Collaborator

Why should we distinguish them? I don't see a good reason yet, and we don't do it for items either.


Having to create a local license file means that the data provider has not organized their licenses well and may change them later, making local copies obsolete. But this is speculative, so I'm okay with not changing this now.

Collaborator

Okay, great. Please, make sure to open an issue on this one so we don't forget.
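
The `license` rule from the table above could be checked along these lines. This is a minimal sketch that assumes `links` is a list of objects with `rel` and `href` keys; `missing_license_link` is a hypothetical helper and the URL is a placeholder.

```python
def missing_license_link(dataset):
    """Return True if a 'proprietary' dataset has no link with rel 'license'.

    Assumes 'links' is a list of objects with 'rel' and 'href' keys.
    Datasets with any other license value are not flagged.
    """
    if dataset.get("license") != "proprietary":
        return False
    links = dataset.get("links", [])
    return not any(link.get("rel") == "license" for link in links)


# Hypothetical dataset, for illustration only
dataset = {
    "name": "COPERNICUS/S2",
    "license": "proprietary",
    "links": [{"rel": "license", "href": "https://example.com/license.txt"}],
}

needs_attention = missing_license_link(dataset)
```

Because the spec words this as SHOULD, a validator would more likely report the result as a warning than as an error.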

Collaborator

m-mohr commented Aug 30, 2018

Moved Dimensions extension as requested. Can be merged now, I think.

@cholmes cholmes merged commit 7684895 into radiantearth:dev Aug 30, 2018
@m-mohr m-mohr deleted the feat/dataset branch August 31, 2018 15:42