Dataset metadata spec #164

Merged
merged 19 commits into from
Aug 30, 2018
103 changes: 103 additions & 0 deletions dataset-spec/README.md
@@ -0,0 +1,103 @@
# STAC Dataset Spec

[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset\*. Another topic of interest is the search for datasets themselves, rather than within a dataset. The Dataset Spec is an independent spec, and STAC Items are *strongly recommended* to provide a link to a dataset definition. Other parties can also use this spec independently to describe datasets in a lightweight way.

The Dataset Spec extends the [Catalog Spec](../static-catalog/) with additional fields that describe the set of items in the catalog. Because it shares all Catalog fields, every Dataset is also a valid Catalog. A Dataset can have Catalogs and Datasets as parents, and Items, Catalogs and Datasets as children.

A Dataset can be represented in JSON format. Any JSON object that contains all the required fields is a valid STAC Dataset and Catalog.

* [Example (Sentinel 2)](example-s2.json)
* [JSON Schema](json-schema/dataset.json)

*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).*

## WARNING

**This is still an early version of the STAC spec. Expect some changes before everything is finalized.**

Implementations are encouraged, however, as good effort will be made to not change anything too drastically. Using the specification now will ensure that needed changes can be made before everything is locked in. So now is an ideal time to implement, as your feedback will be directly incorporated.

## Dataset fields

| Element | Type | Description |
| ----------- | ----------------- | ------------------------------------------------------------ |
| name | string | **REQUIRED.** Identifier for the dataset that is unique across the provider. |
| title | string | A short descriptive one-line title for the dataset. |
| description | string | **REQUIRED.** Detailed multi-line description to fully explain the entity. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| keywords | [string] | List of keywords describing the dataset. |
| version | string | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. |
| license | string | **REQUIRED.** Dataset's license(s) as an SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. |
| provider | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. |

I think a single provider is enough for the vast majority of the cases.

Contributor

Cool. Are you saying we should just mention that this will usually just be one provider? Or that we should shift from array to a single object?


I suggest shifting to a single object to reduce complexity. (In the internal EE catalog, I started with providers as a list and found that this is not necessary in 99.9% of cases, and the remaining 0.1% is better handled by provenance.) I'm open to revisiting this, but I'd like to see more examples first.

Collaborator

I think this needs more discussion, but an issue would be the better place, as this PR is already very messy.

| host | Host Object | Storage provider, the organization that hosts the dataset. |

In addition to the 'host' field, I suggest the 'source' field of the same type Host that points at the canonical source of the data. This is needed for 99% of datasets in the EE catalog, as we are mostly a mirror.

Collaborator

Open to adding that, but I think we can do it in another PR shortly after this one.

| extent | [Extent Object] | **REQUIRED.** Spatial and temporal extents. |
| links | [Link Object] | **REQUIRED.** A list of references to other documents. |
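
As a rough illustration of the table above, the following Python sketch checks that a JSON document carries the required top-level fields. The helper name `missing_dataset_fields` and every value in the sample document (including the `example.com` URL) are hypothetical placeholders, not part of the spec.

```python
import json

# Required top-level fields, taken from the "Dataset fields" table.
REQUIRED_DATASET_FIELDS = ("name", "description", "license", "extent", "links")


def missing_dataset_fields(dataset: dict) -> list:
    """Return the required top-level fields that are absent from `dataset`."""
    return [field for field in REQUIRED_DATASET_FIELDS if field not in dataset]


# A hypothetical minimal Dataset; all values are placeholders for illustration.
minimal_dataset = json.loads("""
{
  "name": "example-dataset",
  "description": "A placeholder dataset used only for illustration.",
  "license": "proprietary",
  "extent": {
    "spatial": [-180.0, 90.0, 180.0, -90.0],
    "temporal": ["2015-06-23T00:00:00Z", null]
  },
  "links": [
    {"rel": "self", "href": "https://example.com/catalog/example-dataset.json"}
  ]
}
""")

print(missing_dataset_fields(minimal_dataset))
```

This only checks for the presence of fields; the [JSON Schema](json-schema/dataset.json) remains the reference for types and nested structure.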

### Extent Object

The object describes the spatio-temporal extents of the dataset. Both spatial and temporal extents are required to be specified.

**Note:** STAC Datasets try to be compliant with [WFS 3.0](https://github.com/opengeospatial/WFS_FES), but there are still issues to be solved. The WFS specification is in draft state and may change, especially regarding [3D support](https://github.com/opengeospatial/WFS_FES/issues/143) for spatial extents or the handling of [open date ranges](https://github.com/opengeospatial/WFS_FES/issues/155) for temporal extents. Therefore, it is also likely that the following fields will change over time.

| Element | Type | Description |
| -------- | -------- | ------------------------------------------------------------ |
| spatial | [number] | **REQUIRED.** Potential *spatial extent* covered by the dataset. West, north, east, south edges of the spatial extent. Only WGS84 longitude/latitude is supported. The list of four numbers can be extended to six numbers to support a 3D spatial extent. |
| temporal | [string\|null] | **REQUIRED.** Potential *temporal extent* covered by the dataset. A list of two timestamps, which MUST be formatted according to [RFC 3339, section 5.6](https://tools.ietf.org/html/rfc3339#section-5.6). Open date ranges are supported by setting either the start or the end time to `null`. Example for data from the beginning of 2009 until now: `["2009-01-01T00:00:00Z", null]`. |
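
The rules in this table could be checked along the following lines. This is a minimal sketch, not a reference validator: the function names are hypothetical, and the timestamp check is a loose approximation of RFC 3339 that only accepts a full date-time with `Z` or a numeric UTC offset.

```python
from datetime import datetime

# Loose approximation of RFC 3339 date-times: with and without fractional seconds.
_TIMESTAMP_FORMATS = ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%S.%f%z")


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp that carries an explicit UTC offset ('Z' or +hh:mm)."""
    for fmt in _TIMESTAMP_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"not an RFC 3339 timestamp: {value!r}")


def check_extent(extent: dict) -> list:
    """Return a list of human-readable problems found in an Extent Object."""
    problems = []

    spatial = extent.get("spatial")
    if not isinstance(spatial, list) or len(spatial) not in (4, 6):
        problems.append("spatial must be a list of four (2D) or six (3D) numbers")
    elif not all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in spatial
    ):
        problems.append("spatial must contain only numbers")

    temporal = extent.get("temporal")
    if not isinstance(temporal, list) or len(temporal) != 2:
        problems.append("temporal must be a list of exactly two entries")
    else:
        for entry in temporal:
            if entry is None:
                continue  # null marks the open end of a date range
            if not isinstance(entry, str):
                problems.append(f"temporal entry must be a string or null: {entry!r}")
                continue
            try:
                parse_timestamp(entry)
            except ValueError as error:
                problems.append(str(error))

    return problems


print(check_extent({
    "spatial": [-180.0, 90.0, 180.0, -90.0],
    "temporal": ["2009-01-01T00:00:00Z", None],
}))
```

Note that a timestamp without a UTC offset, such as `2015-06-23T00:00:00`, is reported as a problem by this sketch, because RFC 3339 requires an offset.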

### Provider Object

The object provides information about a provider. A provider is any of the organizations that created or processed the content of the dataset and therefore influenced the data offered by this dataset.

| Field Name | Type | Description |
| ---------- | ------ | ------------------------------------------------------------ |
| name | string | **REQUIRED.** The name of the organization or the individual. |
| url | string | Homepage of the provider. |

### Host Object

The object provides information about the storage provider hosting the data.

**Note:** The idea of storage profiles is currently being [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, scheme, id and region may be removed from the final spec once this concept is introduced to STAC.

| Field Name | Type | Description |
| -------------- | ------- | ------------------------------------------------------------ |
| name | string | **REQUIRED.** The name of the organization or the individual hosting the data. |
| description | string | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. |
| scheme | string | **REQUIRED.** The protocol/scheme used to access the data. Any of: `S3`, `GCS`, `URL`, `OTHER` |
| id | string | **REQUIRED.** Host-specific identifier such as a URL or asset id. |
| region | string | Provider specific region where the data is stored. |
| requester_pays | boolean | `true` if requester pays, `false` if host pays. Defaults to `false`. |
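
One way a client might apply these rules is sketched below: required fields are checked, `scheme` is compared against the allowed values, and `requester_pays` falls back to its documented default of `false`. The function name `normalize_host` and the sample host values are hypothetical.

```python
# Allowed values and required fields, taken from the Host Object table.
ALLOWED_SCHEMES = {"S3", "GCS", "URL", "OTHER"}
REQUIRED_HOST_FIELDS = ("name", "scheme", "id")


def normalize_host(host: dict) -> dict:
    """Validate a Host Object and fill in the default for `requester_pays`."""
    missing = [field for field in REQUIRED_HOST_FIELDS if field not in host]
    if missing:
        raise ValueError(f"host is missing required fields: {missing}")
    if host["scheme"] not in ALLOWED_SCHEMES:
        raise ValueError(f"unsupported scheme: {host['scheme']!r}")

    normalized = dict(host)  # leave the caller's object untouched
    normalized.setdefault("requester_pays", False)
    return normalized


# Hypothetical host description; name, id and region are placeholders.
example_host = {
    "name": "Example Cloud",
    "scheme": "S3",
    "id": "example-bucket",
    "region": "us-west-2",
}

print(normalize_host(example_host)["requester_pays"])
```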

### Link Object

This object describes a relationship with another entity. Data providers are advised to be liberal with links.

| Field Name | Type | Description |
| ---------- | ------ | ------------------------------------------------------------ |
| href | string | **REQUIRED.** The actual link in the format of a URL. Relative and absolute links are both allowed. |
| rel | string | **REQUIRED.** Relationship between the current document and the linked document. See chapter "Relation types" for more information. |
| type | string | MIME-type of the referenced entity. |
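
Because `href` may be relative, a client usually needs the URL of the document it is reading in order to follow links. The sketch below resolves every `href` against such a base URL using Python's standard `urllib.parse.urljoin`. The helper name `resolve_links` and the `example.com` URLs are assumptions made for illustration.

```python
from urllib.parse import urljoin


def resolve_links(links: list, base_url: str) -> list:
    """Return copies of the Link Objects with `href` resolved against `base_url`."""
    resolved = []
    for link in links:
        if "href" not in link or "rel" not in link:
            raise ValueError(f"link is missing 'href' or 'rel': {link!r}")
        # Absolute hrefs are returned unchanged by urljoin.
        resolved.append({**link, "href": urljoin(base_url, link["href"])})
    return resolved


# Hypothetical location of the document that contains the links.
base = "https://example.com/catalog/dataset.json"
links = [
    {"rel": "parent", "href": "../catalog.json"},
    {"rel": "item", "href": "items/item-1.json"},
    {"rel": "license", "href": "https://example.org/license.pdf"},
]

for link in resolve_links(links, base):
    print(link["rel"], link["href"])
```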

#### Relation types

The following types are commonly used as `rel` types in the Link Object of a Dataset:

| Type | Description |
| ------- | ------------------------------------------------------------ |
| self | **REQUIRED.** *Absolute* URL to the dataset file itself. This is required to record the location where the file can be found online. It is particularly useful in a download package that includes metadata, so that the downstream user knows where the data came from. |
| root | URL to the root [STAC Catalog](../static-catalog/) or Dataset. |
| parent | URL to the parent [STAC Catalog](../static-catalog/) or Dataset. |
| child | URL to a child [STAC Catalog](../static-catalog/) or Dataset. |
| item | URL to a [STAC Item](../json-spec/). |
| license | The license URL for the dataset SHOULD be specified if the `license` field is set to `proprietary`. If there is no public license URL available, it is RECOMMENDED to supplement the STAC catalog with the license text in a separate file and link to this file. |

Should we distinguish between links to official license pages vs local copies? Eg, rel="license_copy"

Collaborator

Why should we distinguish them? I don't see a good reason yet, and we don't do it for items either.


Having to create a local license file means that the data provider has not organized their licenses well and may change them later, making local copies obsolete. But this is speculative, so I'm okay with not changing this now.

Collaborator

Okay, great. Please, make sure to open an issue on this one so we don't forget.


## Extensions

Important related extensions for the dataset spec:

* [EO extension](../extensions/stac-eo-spec.md)
Please note that some fields such as `eo:sun_elevation` or `eo:sun_azimuth` are only meaningful on the item level and MUST NOT be used in datasets.
* Dimensions extension (proposed, see [PR #227](https://github.com/radiantearth/stac-spec/pull/227))
* [Scientific extension](../extensions/scientific)
* Provenance extension (planned, see [issue #179](https://github.com/radiantearth/stac-spec/issues/179))
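
The restriction on item-level EO fields could be enforced with a check like the one sketched below. It is deliberately incomplete: the set contains only the two fields named above, and it assumes, hypothetically, that extension fields appear as top-level keys of the Dataset object.

```python
# Only the two fields named in the text; the full list is not given here.
ITEM_ONLY_FIELDS = {"eo:sun_elevation", "eo:sun_azimuth"}


def item_only_fields_in(dataset: dict) -> list:
    """Return item-level EO fields that were wrongly placed on a dataset."""
    return sorted(ITEM_ONLY_FIELDS.intersection(dataset))


# Hypothetical dataset fragment used only to exercise the check.
suspicious = {"name": "example-dataset", "eo:sun_azimuth": 160.8}
print(item_only_fields_in(suspicious))
```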

The [extensions page](../extensions/) gives a full overview of relevant extensions for STAC Datasets.
50 changes: 50 additions & 0 deletions dataset-spec/example-s2.json
@@ -0,0 +1,50 @@
{
"name": "COPERNICUS/S2",
"title": "Sentinel-2 MSI: MultiSpectral Instrument, Level-1C",
"description": "Sentinel-2 is a wide-swath, high-resolution, multi-spectral\nimaging mission supporting Copernicus Land Monitoring studies,\nincluding the monitoring of vegetation, soil and water cover,\nas well as observation of inland waterways and coastal areas.\n\nThe Sentinel-2 data contain 13 UINT16 spectral bands representing\nTOA reflectance scaled by 10000. See the [Sentinel-2 User Handbook](https://sentinel.esa.int/documents/247904/685211/Sentinel-2_User_Handbook)\nfor details. In addition, three QA bands are present where one\n(QA60) is a bitmask band with cloud mask information. For more\ndetails, [see the full explanation of how cloud masks are computed.](https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-1c/cloud-masks)\n\nEach Sentinel-2 product (zip archive) may contain multiple\ngranules. Each granule becomes a separate Earth Engine asset.\nEE asset ids for Sentinel-2 assets have the following format:\nCOPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the\nfirst numeric part represents the sensing date and time, the\nsecond numeric part represents the product generation date and\ntime, and the final 6-character string is a unique granule identifier\nindicating its UTM grid reference (see [MGRS](https://en.wikipedia.org/wiki/Military_Grid_Reference_System)).\n\nFor more details on Sentinel-2 radiometric resolution, [see this page](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/radiometric).\n",
"license": "proprietary",
Contributor

Is the S2 license 'proprietary'? If it is in our current definition maybe we should expand the definition a bit. Or ideally find the ideal spdx license. For landsat I think we just used https://spdx.org/licenses/PDDL-1.0.html as getting across the intent of US public domain stuff...

Collaborator

Honestly, I don't really know and don't feel in the position to decide that. That example is taken from GEE and I handled it the way GEE does.

Contributor

Ok, cool. Super minor issue, so let's just leave it, and following GEE sounds good.

"keywords": [
"copernicus",
"esa",
"eu",
"msi",
"radiance",
"sentinel"
],
"provider": [
{
"name": "European Union/ESA/Copernicus",
"url": "https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi"
}
],
"extent": {
"spatial": [
180.0,
-56.0,
-180.0,
83.0
],
"temporal": [
"2015-06-23T00:00:00Z",
null
]
},
"links": [
{
"rel": "self",
"href": "https://storage.cloud.google.com/earthengine-test/catalog/COPERNICUS_S2.json"
},
{
"rel": "parent",
"href": "https://storage.cloud.google.com/earthengine-test/catalog/catalog.json"
},
{
"rel": "root",
"href": "https://storage.cloud.google.com/earthengine-test/catalog/catalog.json"
},
{
"rel": "license",
"href": "https://scihub.copernicus.eu/twiki/pub/SciHubWebPortal/TermsConditions/Sentinel_Data_Terms_and_Conditions.pdf"
}
]
}
157 changes: 157 additions & 0 deletions dataset-spec/json-schema/dataset.json
@@ -0,0 +1,157 @@
{
"$schema": "http://json-schema.org/draft-06/schema#",
"id": "dataset.json#",
"title": "Dataset Item",
"description": "This object represents the dataset in a SpatioTemporal Asset Catalog.",
"type": "object",
"required": [
"name",
"description",
"license",
"extent",
"links"
],
"additionalProperties": true,
"properties": {
"name": {
"title": "Identifier",
"type": "string"
},
"title": {
"title": "Title",
"type": "string"
},
"description": {
"title": "Description",
"type": "string"
},
"keywords": {
"title": "Keywords",
"type": "array",
"items": {
"type": "string"
}
},
"version": {
"title": "Dataset Version",
"type": "string"
},
"license": {
"title": "Dataset License Name",
"type": "string"
},
"provider": {
"type": "array",
"items": {
"properties": {
"name": {
"title": "Organization Name",
"type": "string"
},
"url": {
"title": "Organization homepage",
"type": "string",
"format": "uri"
}
}
}
},
"host": {
"required": [
"name",
"scheme",
"id"
],
"properties": {
"name": {
"title": "Organization name",
"type": "string"
},
"description": {
"title": "Description",
"type": "string"
},
"scheme": {
"title": "Scheme",
"type": "string",
"enum": [
"S3",
"GCS",
"URL",
"OTHER"
]
},
"id": {
"title": "Identifier",
"type": "string"
},
"region": {
"title": "Region",
"type": "string"
},
"requester_pays": {
"title": "Requester Pays",
"type": "boolean",
"default": false
}
},
"additionalProperties": true
},
"extent": {
"title": "Extents",
"type": "object",
"required": [
"spatial",
"temporal"
],
"properties": {
"spatial": {
"title": "Spatial extent",
"type": "array",
"items": {
"type": "number"
}
},
"temporal": {
"title": "Temporal extent",
"type": "array",
"minItems": 2,
"maxItems": 2,
"items": {
"type": [
"string",
"null"
],
"format": "date-time"
}
}
},
"additionalProperties": true
},
"links": {
"type": "array",
"items": {
"type": "object",
"required": [
"href",
"rel"
],
"properties": {
"href": {
"title": "Link",
"type": "string"
},
"rel": {
"title": "Relation",
"type": "string"
},
"type": {
"title": "type",
"type": "string"
}
},
"additionalProperties": true
}
}
}
}
14 changes: 7 additions & 7 deletions extensions/README.md
@@ -11,13 +11,13 @@ them they can create a shared extension and include it in the STAC repository.

## List of official extensions

| Extension Name (Prefix) | Scope | Description |
| ------------------------------------------------------------- | ------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| [Collection](stac-collection-spec.md) (`c`) | Item | Provides a way to specify data fields that are common across a collection of STAC Items, so that each does not need to repeat all the same information. |
| [EO](stac-eo-spec.md) (`eo`) | Item | Covers data that represents a snapshot of the earth for a single date and time. It could consist of multiple spectral bands in any part of the electromagnetic spectrum. Examples of EO data include sensors with visible bands, IR bands as well as SAR instruments. The extension provides common fields like bands, cloud cover, off nadir, sun angle + elevation, gsd and more. |
| [Scientific](scientific/) (`sci`) | Catalog | Scientific metadata is considered to be data that indicate from which publication a dataset originates and how the dataset itself should be cited or referenced. |
| [Start end datetime](stac-start-end-datetime-spec.md) (`set`) | Item | An extension to provide start and end datetime stamps in a consistent way. |
| [Transaction](transaction/) | API | Provides an API extension to support the creation, editing, and deleting of items on a specific WFS3 collection. |
| Extension Name (Prefix) | Scope | Description |
| ------------------------------------------------------------ | ---------------- | ------------------------------------------------------------ |
| [Collection](stac-collection-spec.md) (`c`) | Item | Provides a way to specify data fields that are common across a collection of STAC Items, so that each does not need to repeat all the same information. |
| [EO](stac-eo-spec.md) (`eo`) | Item | Covers data that represents a snapshot of the earth for a single date and time. It could consist of multiple spectral bands in any part of the electromagnetic spectrum. Examples of EO data include sensors with visible bands, IR bands as well as SAR instruments. The extension provides common fields like bands, cloud cover, off nadir, sun angle + elevation, gsd and more. |
| [Scientific](scientific/) (`sci`) | Catalog +Dataset | Scientific metadata is considered to be data that indicate from which publication a dataset originates and how the dataset itself should be cited or referenced. |
| [Start end datetime](stac-start-end-datetime-spec.md) (`set`) | Item | An extension to provide start and end datetime stamps in a consistent way. |
| [Transaction](transaction/) | API | Provides an API extension to support the creation, editing, and deleting of items on a specific WFS3 collection. |

## Third-party / vendor extensions
