This repository contains a copy of Pangeo's cloud data catalog, formatted to follow the SpatioTemporal Asset Catalog (STAC) specification. The root STAC catalog can be found at:
https://raw.githubusercontent.com/pangeo-data/pangeo-datastore-stac/master/master/catalog.json
Currently the catalogs contain:
- Consolidated metadata Zarr group/arrays (represented through STAC Collections with assets)
In time they should be able to hold:
- Earth System Model (ESM) collections (through a pending ESM extension for STAC)
- CSV-based dataframes
The motivation behind this project is to have a version of the current cloud data catalog which can be searched and browsed from any programming language. At the moment, the current YAML-based catalogs are only accessible through Python using intake. This means that any server-side code accessing these catalogs must be written in Python, which has historically played a big role in how we have generated the website containing previews of all cataloged data:
- Originally, the website was created using a static site generator; however, this approach ran into issues once we began cataloging data which required authentication to be accessed, which could not be done through GitHub.
- We later moved to a dynamic Flask-based website, powered by Google App Engine; this allowed us to get the proper authentication to load dataset previews on demand, but the site frequently ran into memory issues which made many datasets impossible to view.
With the introduction of intake-stac, an intake extension which allows Python users to browse STAC catalogs, there is no longer a need for the catalogs themselves to be tied to intake. Thus, a move to JSON-based STAC catalogs gives a variety of new languages (in particular JavaScript, Ruby, and PHP) access to the catalogs, without leaving behind the initial Python users.
All of the Pangeo STAC catalogs use version 1.0.0-beta.2 of the STAC specification.
Currently, the Pangeo STAC catalog follows STAC specifications for an absolute published catalog.
All preexisting intake catalogs correspond to STAC catalogs, while datasets and data collections correspond to STAC collections, with the extensions required to access the data listed under the stac_extensions field.
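As a rough illustration of this layout, the sketch below walks a catalog by following its child links and reports each document's id and stac_extensions. It is a minimal sketch using only the Python standard library, not the project's own tooling: the helper names (fetch_json, walk_stac, summarize) and the example.com documents are hypothetical, and the in-memory EXAMPLE mapping stands in for the network so the snippet runs offline.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def fetch_json(url):
    """Download and parse one STAC JSON document."""
    with urlopen(url) as response:
        return json.load(response)


def walk_stac(url, fetch=fetch_json):
    """Yield (url, document) for a catalog and everything reachable via 'child' links."""
    doc = fetch(url)
    yield url, doc
    for link in doc.get("links", []):
        if link.get("rel") == "child":
            # In an absolute published catalog the hrefs are already absolute;
            # urljoin leaves those unchanged and also resolves relative ones.
            yield from walk_stac(urljoin(url, link["href"]), fetch)


def summarize(url, fetch=fetch_json):
    """Return (id, stac_extensions) for every document under the given catalog."""
    return [(doc["id"], doc.get("stac_extensions", [])) for _, doc in walk_stac(url, fetch)]


# Hypothetical documents (not real Pangeo entries) used in place of the network.
EXAMPLE = {
    "https://example.com/catalog.json": {
        "stac_version": "1.0.0-beta.2",
        "id": "root",
        "description": "Example root catalog",
        "links": [
            {"rel": "child", "href": "https://example.com/ocean/collection.json"}
        ],
    },
    "https://example.com/ocean/collection.json": {
        "stac_version": "1.0.0-beta.2",
        "stac_extensions": ["collection-assets"],
        "id": "ocean",
        "description": "Example collection",
        "links": [],
    },
}

print(summarize("https://example.com/catalog.json", EXAMPLE.__getitem__))
```

To try this against the published catalog, call summarize with the root catalog URL given at the top of this document and the default fetch function; that performs one HTTP request per catalog or collection.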
There is still a lot of work to be done before this catalog can be considered equivalent to the current cloud data catalog. In particular:
- Representing Zarr stores using the collection-assets extension
- Finishing the specifications for the ESM extension to allow ESM collections to be represented
- Filling in metadata fields in the catalog/collections with relevant information (such as description, extent, providers, and license)
- Making sure the catalogs validate using stac-validator in conjunction with continuous integration
- Making sure that validated catalogs can be accessed using intake-stac
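Progress on the metadata item in the list above could be tracked with a small check like the following. This is only a sketch under stated assumptions: the helper name missing_metadata and the example collection are hypothetical, the check treats empty values as missing, and it is not a substitute for schema validation with stac-validator.

```python
# Metadata fields named in the to-do list above.
FIELDS_TO_CHECK = ("description", "extent", "providers", "license")


def missing_metadata(collection, fields=FIELDS_TO_CHECK):
    """Return the names of fields that are absent or empty in a collection dict."""
    return [name for name in fields if not collection.get(name)]


# Hypothetical collection with an empty description and no providers entry.
example_collection = {
    "id": "ocean",
    "description": "",
    "extent": {
        "spatial": {"bbox": [[-180, -90, 180, 90]]},
        "temporal": {"interval": [[None, None]]},
    },
    "license": "proprietary",
}

print(missing_metadata(example_collection))
```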