Dataset metadata spec #164
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,103 @@ | ||
# STAC Dataset Spec | ||
|
||
[STAC Items](https://github.com/radiantearth/stac-spec/json-spec/) are focused on search within a dataset*. Another topic of interest is the search for datasets themselves, instead of within a dataset. The Dataset Spec is an independent spec, and STAC Items are *strongly recommended* to provide a link to a dataset definition. Other parties can also independently use this spec to describe datasets in a lightweight way. | ||
|
||
The Dataset Spec extends the [Catalog Spec](../static-catalog/) with additional fields to describe the set of items in the catalog. It shares the same fields and therefore every Dataset is also a valid Catalog. Datasets can have both parent Catalogs and Datasets and child Items, Catalogs and Datasets. | ||
|
||
A Dataset can be represented in JSON format. Any JSON object that contains all the required fields is a valid STAC Dataset and Catalog. | ||
|
||
* [Example (Sentinel 2)](example-s2.json) | ||
* [JSON Schema](json-schema/dataset.json) | ||
|
||
*\* There is no standardized name for the concept we are describing here. Others called it: dataset series (ISO 19115), collection (CNES, NASA), dataset (JAXA), dataset series (ESA), product (JAXA).* | ||
|
||
## WARNING | ||
|
||
**This is still an early version of the STAC spec; expect some changes before everything is finalized.** | ||
|
||
Implementations are encouraged, however, as every effort will be made not to change anything too drastically. Using the specification now will ensure that needed changes can be made before everything is locked in. So now is an ideal time to implement, as your feedback will be directly incorporated. | ||
|
||
## Dataset fields | ||
|
||
| Element | Type | Description | | ||
| ----------- | ----------------- | ------------------------------------------------------------ | | ||
| name | string | **REQUIRED.** Identifier for the dataset that is unique across the provider. | | ||
| title | string | A short descriptive one-line title for the dataset. | | ||
| description | string | **REQUIRED.** Detailed multi-line description to fully explain the entity. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. | | ||
| keywords | [string] | List of keywords describing the dataset. | | ||
| version | string | Version of the dataset. [Semantic Versioning (SemVer)](https://semver.org/) SHOULD be followed. | | ||
| license     | string            | **REQUIRED.** Dataset's license(s) as an SPDX [License identifier](https://spdx.org/licenses/) or [expression](https://spdx.org/spdx-specification-21-web-version#h.jxpfx0ykyb60) or `proprietary` if the license is not on the SPDX license list. Proprietary licensed data SHOULD add a link to the license text, see the `license` relation type. | | ||
| provider | [Provider Object] | A list of data providers, the organizations which influenced the content of the dataset. Providers should be listed in chronological order with the most recent provider being the last element of the list. | | ||
| host | Host Object | Storage provider, the organization that hosts the dataset. | | ||
In addition to the 'host' field, I suggest the 'source' field of the same type Host that points at the canonical source of the data. This is needed for 99% of datasets in the EE catalog, as we are mostly a mirror.
Open to adding that, but I think we can add that with another PR shortly after this PR.
||
| extent      | Extent Object     | **REQUIRED.** Spatial and temporal extents. | | ||
| links | [Link Object] | **REQUIRED.** A list of references to other documents. | | ||
|
||
### Extent Object | ||
|
||
The object describes the spatio-temporal extents of the dataset. Both spatial and temporal extents are required to be specified. | ||
|
||
**Note:** STAC Datasets try to be compliant with [WFS 3.0](https://github.com/opengeospatial/WFS_FES), but there are still issues to be solved. The WFS specification is in draft state and may change, especially regarding [3D support](https://github.com/opengeospatial/WFS_FES/issues/143) for spatial extents or the handling of [open date ranges](https://github.com/opengeospatial/WFS_FES/issues/155) for temporal extents. Therefore, it is also likely that the following fields will change over time. | ||
|
||
| Element | Type | Description | | ||
| -------- | -------- | ------------------------------------------------------------ | | ||
| spatial | [number] | **REQUIRED.** Potential *spatial extent* covered by the dataset. West, north, east, south edges of the spatial extent. Only WGS84 longitude/latitude is supported. The list of four numbers can be extended to six numbers to support a 3D spatial extent. | | ||
| temporal | [string\|null] | **REQUIRED.** Potential *temporal extent* covered by the dataset. A list of two timestamps, which MUST be formatted according to [RFC 3339, section 5.6](https://tools.ietf.org/html/rfc3339#section-5.6). Open date ranges are supported by setting either the start or the end time to `null`. Example for data from the beginning of 2009 until now: `["2009-01-01T00:00:00Z", null]`. | | ||
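
A minimal Python sketch of how a client might check the `temporal` field described in the table, using only the standard library. The names `parse_rfc3339` and `is_valid_temporal_extent` are hypothetical and not part of the spec. `datetime.fromisoformat` accepts only a subset of RFC 3339, so this is an approximation rather than a complete parser. The sketch also accepts `[null, null]`, which the spec text does not explicitly address.

```python
from datetime import datetime
from typing import List, Optional


def parse_rfc3339(value: str) -> datetime:
    """Parse a timestamp such as 2009-01-01T00:00:00Z (subset of RFC 3339)."""
    if not isinstance(value, str):
        raise TypeError("timestamp must be a string")
    # Older Python versions do not accept a trailing "Z" in fromisoformat().
    if value.endswith(("Z", "z")):
        value = value[:-1] + "+00:00"
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # RFC 3339 section 5.6 requires a time offset.
        raise ValueError("timestamp has no time offset")
    return parsed


def is_valid_temporal_extent(temporal: List[Optional[str]]) -> bool:
    """Return True for a two-element list of RFC 3339 timestamps or None."""
    if not isinstance(temporal, list) or len(temporal) != 2:
        return False
    try:
        start, end = [None if t is None else parse_rfc3339(t) for t in temporal]
    except (TypeError, ValueError):
        return False
    # When both ends are given, the start must not be after the end.
    return start is None or end is None or start <= end
```

With this sketch, `["2009-01-01T00:00:00Z", None]` is accepted, while a timestamp without a time offset is rejected.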
|
||
### Provider Object | ||
|
||
The object provides information about a provider. A provider is any of the organizations that created or processed the content of the dataset and therefore influenced the data offered by this dataset. | ||
|
||
| Field Name | Type | Description | | ||
| ---------- | ------ | ------------------------------------------------------------ | | ||
| name | string | **REQUIRED.** The name of the organization or the individual. | | ||
| url | string | Homepage of the provider. | | ||
|
||
### Host Object | ||
|
||
The object provides information about the storage provider hosting the data. | ||
|
||
**Note:** The idea of storage profiles is currently being [discussed](https://github.com/radiantearth/stac-spec/issues/148). Therefore, `scheme`, `id` and `region` may be removed from the final spec once this concept is introduced to STAC. | ||
|
||
| Field Name | Type | Description | | ||
| -------------- | ------- | ------------------------------------------------------------ | | ||
| name | string | **REQUIRED.** The name of the organization or the individual hosting the data. | | ||
| description | string | Detailed description to explain the hosting details. [CommonMark 0.28](http://commonmark.org/) syntax MAY be used for rich text representation. | | ||
| scheme | string | **REQUIRED.** The protocol/scheme used to access the data. Any of: `S3`, `GCS`, `URL`, `OTHER` | | ||
| id             | string  | **REQUIRED.** Host-specific identifier such as a URL or asset id. | | ||
| region | string | Provider specific region where the data is stored. | | ||
| requester_pays | boolean | `true` if requester pays, `false` if host pays. Defaults to `false`. | | ||
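
The Host Object rules in the table can be sketched as a small check in Python. This is an illustration under assumptions, not a reference implementation: the function name `host_problems` and the sample values are hypothetical, and only the required fields, the `scheme` values and the `requester_pays` type are covered.

```python
from typing import Any, Dict, List

# Values taken from the Host Object table.
ALLOWED_SCHEMES = {"S3", "GCS", "URL", "OTHER"}
REQUIRED_HOST_FIELDS = ("name", "scheme", "id")


def host_problems(host: Dict[str, Any]) -> List[str]:
    """Return human-readable problems found in a Host Object."""
    problems = []
    for field in REQUIRED_HOST_FIELDS:
        if not isinstance(host.get(field), str):
            problems.append(f"'{field}' is required and must be a string")
    scheme = host.get("scheme")
    if isinstance(scheme, str) and scheme not in ALLOWED_SCHEMES:
        problems.append(f"'scheme' must be one of {sorted(ALLOWED_SCHEMES)}")
    # requester_pays is optional and defaults to false.
    if not isinstance(host.get("requester_pays", False), bool):
        problems.append("'requester_pays' must be a boolean")
    return problems


# Hypothetical host, for illustration only.
print(host_problems({"name": "Example Host", "scheme": "GCS", "id": "example-bucket"}))  # []
```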
|
||
### Link Object | ||
|
||
This object describes a relationship with another entity. Data providers are advised to be liberal with links. | ||
|
||
| Field Name | Type | Description | | ||
| ---------- | ------ | ------------------------------------------------------------ | | ||
| href       | string | **REQUIRED.** The actual link in the format of a URL. Relative and absolute links are both allowed. | | ||
| rel | string | **REQUIRED.** Relationship between the current document and the linked document. See chapter "Relation types" for more information. | | ||
| type | string | MIME-type of the referenced entity. | | ||
|
||
#### Relation types | ||
|
||
The following types are commonly used as `rel` types in the Link Object of a Dataset: | ||
|
||
| Type | Description | | ||
| ------- | ------------------------------------------------------------ | | ||
| self    | **REQUIRED.** *Absolute* URL to the dataset file itself. This is required to represent the location where the file can be found online. It is particularly useful in a download package that includes metadata, so that the downstream user can know where the data has come from. | | ||
| root | URL to the root [STAC Catalog](../static-catalog/) or Dataset. | | ||
| parent | URL to the parent [STAC Catalog](../static-catalog/) or Dataset. | | ||
| child | URL to a child [STAC Catalog](../static-catalog/) or Dataset. | | ||
| item | URL to a [STAC Item](../json-spec/). | | ||
| license | The license URL for the dataset SHOULD be specified if the `license` field is set to `proprietary`. If there is no public license URL available, it is RECOMMENDED to supplement the STAC catalog with the license text in a separate file and link to this file. | | ||
Should we distinguish between links to official license pages vs local copies? E.g., rel="license_copy"
Why should we distinguish them? I don't see a good reason yet, and we don't do it for items either.
Having to create a local license file means that the data provider has not organized their licenses well and may change them later, making local copies obsolete. But this is speculative, so I'm okay with not changing this now.
Okay, great. Please make sure to open an issue on this one so we don't forget.
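
The `self` and `license` rules from the relation types table could be checked along these lines. This is a rough Python sketch using only the standard library. The function name `link_problems` is hypothetical, and "absolute URL" is approximated as "has both a scheme and a network location", which is an assumption of this sketch.

```python
from typing import Any, Dict, List
from urllib.parse import urlparse


def link_problems(dataset: Dict[str, Any]) -> List[str]:
    """Check the `self` and `license` link rules of a Dataset."""
    # Group hrefs by relation type.
    by_rel: Dict[Any, List[str]] = {}
    for link in dataset.get("links", []):
        by_rel.setdefault(link.get("rel"), []).append(link.get("href", ""))

    problems = []
    self_hrefs = by_rel.get("self", [])
    if not self_hrefs:
        problems.append("missing required 'self' link")
    elif not all(urlparse(h).scheme and urlparse(h).netloc for h in self_hrefs):
        problems.append("'self' link must be an absolute URL")
    # SHOULD-level rule: proprietary data links to its license text.
    if dataset.get("license") == "proprietary" and "license" not in by_rel:
        problems.append("proprietary license SHOULD come with a 'license' link")
    return problems
```

The last check corresponds to a SHOULD in the table, so a caller may prefer to report it as a warning rather than an error.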
||
|
||
## Extensions | ||
|
||
Important related extensions for the dataset spec: | ||
|
||
* [EO extension](../extensions/stac-eo-spec.md) | ||
Please note that some fields such as `eo:sun_elevation` or `eo:sun_azimuth` are only meaningful on the item level and MUST NOT be used in datasets. | ||
* Dimensions extension (proposed, see [PR #227](https://github.com/radiantearth/stac-spec/pull/227)) | ||
* [Scientific extension](../extensions/scientific) | ||
* Provenance extension (planned, see [issue #179](https://github.com/radiantearth/stac-spec/issues/179)) | ||
|
||
The [extensions page](../extensions/) gives a full overview of relevant extensions for STAC Datasets. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,50 @@ | ||
{ | ||
"name": "COPERNICUS/S2", | ||
"title": "Sentinel-2 MSI: MultiSpectral Instrument, Level-1C", | ||
"description": "Sentinel-2 is a wide-swath, high-resolution, multi-spectral\nimaging mission supporting Copernicus Land Monitoring studies,\nincluding the monitoring of vegetation, soil and water cover,\nas well as observation of inland waterways and coastal areas.\n\nThe Sentinel-2 data contain 13 UINT16 spectral bands representing\nTOA reflectance scaled by 10000. See the [Sentinel-2 User Handbook](https://sentinel.esa.int/documents/247904/685211/Sentinel-2_User_Handbook)\nfor details. In addition, three QA bands are present where one\n(QA60) is a bitmask band with cloud mask information. For more\ndetails, [see the full explanation of how cloud masks are computed.](https://sentinel.esa.int/web/sentinel/technical-guides/sentinel-2-msi/level-1c/cloud-masks)\n\nEach Sentinel-2 product (zip archive) may contain multiple\ngranules. Each granule becomes a separate Earth Engine asset.\nEE asset ids for Sentinel-2 assets have the following format:\nCOPERNICUS/S2/20151128T002653_20151128T102149_T56MNN. Here the\nfirst numeric part represents the sensing date and time, the\nsecond numeric part represents the product generation date and\ntime, and the final 6-character string is a unique granule identifier\nindicating its UTM grid reference (see [MGRS](https://en.wikipedia.org/wiki/Military_Grid_Reference_System)).\n\nFor more details on Sentinel-2 radiometric resolution, [see this page](https://earth.esa.int/web/sentinel/user-guides/sentinel-2-msi/resolutions/radiometric).\n", | ||
"license": "proprietary", | ||
Is the S2 license 'proprietary'? If it is in our current definition maybe we should expand the definition a bit. Or ideally find the ideal SPDX license. For Landsat I think we just used https://spdx.org/licenses/PDDL-1.0.html as getting across the intent of US public domain stuff...
Honestly, I don't really know and don't feel in the position to decide that. That example is taken from GEE and I handled it the way GEE does.
Ok, cool. Super minor issue, so let's just leave it, and following GEE sounds good.
||
"keywords": [ | ||
"copernicus", | ||
"esa", | ||
"eu", | ||
"msi", | ||
"radiance", | ||
"sentinel" | ||
], | ||
"provider": [ | ||
{ | ||
"name": "European Union/ESA/Copernicus", | ||
"url": "https://sentinel.esa.int/web/sentinel/user-guides/sentinel-2-msi" | ||
} | ||
], | ||
"extent": { | ||
"spatial": [ | ||
180.0, | ||
-56.0, | ||
-180.0, | ||
83.0 | ||
], | ||
"temporal": [ | ||
"2015-06-23T00:00:00Z", | ||
null | ||
] | ||
}, | ||
"links": [ | ||
{ | ||
"rel": "self", | ||
"href": "https://storage.cloud.google.com/earthengine-test/catalog/COPERNICUS_S2.json" | ||
}, | ||
{ | ||
"rel": "parent", | ||
"href": "https://storage.cloud.google.com/earthengine-test/catalog/catalog.json" | ||
}, | ||
{ | ||
"rel": "root", | ||
"href": "https://storage.cloud.google.com/earthengine-test/catalog/catalog.json" | ||
}, | ||
{ | ||
"rel": "license", | ||
"href": "https://scihub.copernicus.eu/twiki/pub/SciHubWebPortal/TermsConditions/Sentinel_Data_Terms_and_Conditions.pdf" | ||
} | ||
] | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,157 @@ | ||
{ | ||
"$schema": "http://json-schema.org/draft-06/schema#", | ||
"id": "dataset.json#", | ||
"title": "Dataset Item", | ||
"description": "This object represents the dataset in a SpatioTemporal Asset Catalog.", | ||
"type": "object", | ||
"required": [ | ||
"name", | ||
"description", | ||
"license", | ||
"extent", | ||
"links" | ||
], | ||
"additionalProperties": true, | ||
"properties": { | ||
"name": { | ||
"title": "Identifier", | ||
"type": "string" | ||
}, | ||
"title": { | ||
"title": "Title", | ||
"type": "string" | ||
}, | ||
"description": { | ||
"title": "Description", | ||
"type": "string" | ||
}, | ||
"keywords": { | ||
"title": "Keywords", | ||
"type": "array", | ||
"items": { | ||
"type": "string" | ||
} | ||
}, | ||
"version": { | ||
"title": "Dataset Version", | ||
"type": "string" | ||
}, | ||
"license": { | ||
"title": "Dataset License Name", | ||
"type": "string" | ||
}, | ||
"provider": { | ||
"type": "array", | ||
"items": { | ||
"properties": { | ||
"name": { | ||
"title": "Organization Name", | ||
"type": "string" | ||
}, | ||
"url": { | ||
"title": "Organization homepage", | ||
"type": "string", | ||
"format": "uri" | ||
} | ||
} | ||
} | ||
}, | ||
"host": { | ||
"required": [ | ||
"name", | ||
"scheme", | ||
"id" | ||
], | ||
"properties": { | ||
"name": { | ||
"title": "Organization name", | ||
"type": "string" | ||
}, | ||
"description": { | ||
"title": "Description", | ||
"type": "string" | ||
}, | ||
"scheme": { | ||
"title": "Scheme", | ||
"type": "string", | ||
"enum": [ | ||
"S3", | ||
"GCS", | ||
"URL", | ||
"OTHER" | ||
] | ||
}, | ||
"id": { | ||
"title": "Identifier", | ||
"type": "string" | ||
}, | ||
"region": { | ||
"title": "Region", | ||
"type": "string" | ||
}, | ||
"requester_pays": { | ||
"title": "Requester Pays", | ||
"type": "boolean", | ||
"default": false | ||
} | ||
}, | ||
"additionalProperties": true | ||
}, | ||
"extent": { | ||
"title": "Extents", | ||
"type": "object", | ||
"required": [ | ||
"spatial", | ||
"temporal" | ||
], | ||
"properties": { | ||
"spatial": { | ||
"title": "Spatial extent", | ||
"type": "array", | ||
"items": { | ||
"type": "number" | ||
} | ||
}, | ||
"temporal": { | ||
"title": "Temporal extent", | ||
"type": "array", | ||
"minItems": 2, | ||
"maxItems": 2, | ||
"items": { | ||
"type": [ | ||
"string", | ||
"null" | ||
], | ||
"format": "date-time" | ||
} | ||
} | ||
}, | ||
"additionalProperties": true | ||
}, | ||
"links": { | ||
"type": "array", | ||
"items": { | ||
"type": "object", | ||
"required": [ | ||
"href", | ||
"rel" | ||
], | ||
"properties": { | ||
"href": { | ||
"title": "Link", | ||
"type": "string" | ||
}, | ||
"rel": { | ||
"title": "Relation", | ||
"type": "string" | ||
}, | ||
"type": { | ||
"title": "type", | ||
"type": "string" | ||
} | ||
}, | ||
"additionalProperties": true | ||
} | ||
} | ||
} | ||
} |
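
As a rough illustration of what the `required` lists in this schema demand, the following Python sketch checks for missing keys without a JSON Schema library. The function name `missing_required` is hypothetical, and the sample is trimmed from the Sentinel-2 example in this PR with a shortened description. A real JSON Schema validator would also check types and the `scheme` enumeration, which this sketch does not.

```python
from typing import Any, Dict, List

# Mirrors the "required" arrays of the schema.
REQUIRED_TOP_LEVEL = ("name", "description", "license", "extent", "links")
REQUIRED_EXTENT = ("spatial", "temporal")
REQUIRED_LINK = ("href", "rel")


def missing_required(dataset: Dict[str, Any]) -> List[str]:
    """List the required keys (as dotted paths) that `dataset` lacks."""
    missing = [key for key in REQUIRED_TOP_LEVEL if key not in dataset]
    extent = dataset.get("extent")
    if isinstance(extent, dict):
        missing += [f"extent.{key}" for key in REQUIRED_EXTENT if key not in extent]
    links = dataset.get("links")
    if isinstance(links, list):
        for index, link in enumerate(links):
            if isinstance(link, dict):
                missing += [f"links[{index}].{key}" for key in REQUIRED_LINK if key not in link]
    return missing


# Trimmed from the Sentinel-2 example; the description is shortened.
example = {
    "name": "COPERNICUS/S2",
    "description": "Sentinel-2 is a wide-swath, high-resolution, multi-spectral imaging mission.",
    "license": "proprietary",
    "extent": {
        "spatial": [180.0, -56.0, -180.0, 83.0],
        "temporal": ["2015-06-23T00:00:00Z", None],
    },
    "links": [
        {
            "rel": "self",
            "href": "https://storage.cloud.google.com/earthengine-test/catalog/COPERNICUS_S2.json",
        }
    ],
}

print(missing_required(example))  # []
```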
I think a single provider is enough for the vast majority of the cases.
Cool. Are you saying we should just mention that this will usually just be one provider? Or that we should shift from array to a single object?
I suggest shifting to a single object to reduce complexity. (In the internal EE catalog, I started with providers as a list and found that this is not necessary in 99.9% of cases, and the remaining 0.1% is better handled by provenance.) I'm open to revisiting this, but I'd like to see more examples first.
I think this needs more discussion, but I think an issue would be the better place as this PR is already very messy.