STAC and other Prior Art #3
@rsignell-usgs - when the government reopens (or on your "personal time"), it would be great if you could link us to some other standards you think we should be paying attention to. |
THREDDS catalog XML SPEC: Process for adding new types of data:
https://www.unidata.ucar.edu/software/thredds/v4.6/tds/catalog/InvCatalogSpec.html#Enumerations |
@rabernat, okay, here goes! We've been using the Open Geospatial Consortium's Catalog Service for the Web (CSW) for several years for cataloging here at USGS CMG and also for the IOOS (Integrated Ocean Observing System). Both the service we use for distributing model output (THREDDS) and the service we use for distributing sensor data (ERDDAP) can generate ISO 19115-2 metadata records on the fly for each dataset. This allows us to generate workflows that automatically pick up new measurements and models as they become available. I have a short (5 min) lightning talk at SciPy 2016 on "catalog driven workflows" that describes the whole system! 😸 The pycsw page is a good place to visit if you want to dig deeper: https://pycsw.org Here's a simple example in a Jupyter notebook: http://ioos.github.io/notebooks_demos/notebooks/2017-12-15-finding_HFRadar_currents/ We also have a couple of papers demonstrating the power of this approach: What @apawloski and I were exploring at the Pangeo Developer's meeting last year was creating ISO metadata records from xarray objects and allowing the GCS or S3 datasets supporting them to be discoverable using the same approach. |
Rich, unfortunately that YouTube video seems corrupted. I can't get it to play. |
So I am trying to take seriously the suggestion that @rsignell-usgs continues to pose: that we catalog our cloud-based datasets via ISO 19115-2 metadata records and OGC CSW services. The first step towards this is for me to educate myself about the specs themselves. The first hurdle I have hit is that ISO 19115-2 is not free! It costs $200 to even read it. Does anyone have a copy of this document that they can share? It makes me quite uncomfortable to standardize around a spec that you have to pay to read. Is this really how things work in the ISO world? OGC CSW is a free and open spec. However, I find it intimidatingly complex. Furthermore, based on my understanding, it describes a query service: to have an OGC CSW catalog, you need a server which provides an API to respond to queries. (Please correct me if I am misinterpreting this.) In contrast, with both THREDDS XML and STAC, the catalog can be a static file. pycsw seems like a good product. Based on my reading of the docs, it seems like it could provide a CSW-compliant cataloging service. It would be an additional service to stand up (and maintain, and pay for), but it doesn't look too hard. The crux seems to be loading records. I don't understand how we will generate these records from our existing and future cloud datasets. I suppose that is the exact project that @rsignell-usgs and @apawloski were working on. Do you have any example code that you can point us to for how this might work? |
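To make the "query service" point concrete, here is a minimal sketch of what a client sends to a CSW 2.0.2 server: a GetRecords request encoded as URL parameters. The endpoint `https://example.org/csw` and the helper `build_getrecords_url` are hypothetical, and the CQL text constraint on `csw:AnyText` is an assumption about what the target server supports. The sketch only builds the URL and does not contact any server.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; substitute a real CSW server (e.g. a pycsw deployment).
CSW_ENDPOINT = "https://example.org/csw"


def build_getrecords_url(endpoint, text, max_records=10):
    """Build a CSW 2.0.2 GetRecords URL that searches all text fields.

    Assumes the server accepts key-value-pair requests and CQL text
    constraints on csw:AnyText.
    """
    params = {
        "service": "CSW",
        "version": "2.0.2",
        "request": "GetRecords",
        "typeNames": "csw:Record",
        "elementSetName": "summary",
        "resultType": "results",
        "constraintLanguage": "CQL_TEXT",
        "constraint_language_version": "1.1.0",
        "constraint": f"csw:AnyText like '%{text}%'",
        "maxRecords": str(max_records),
    }
    # urlencode percent-escapes the constraint so it is safe inside a URL.
    return f"{endpoint}?{urlencode(params)}"


url = build_getrecords_url(CSW_ENDPOINT, "HF radar")
print(url)
```

The filtering happens on the server, which is why a CSW catalog needs a running service, whereas a static catalog file leaves filtering to the client.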
@rabernat, I discussed generating ISO records from xarray objects with @kwilcox (Kyle Wilcox) yesterday, and he thought it would be straightforward to implement. He does a lot of Python work for IOOS, and he's worked on owslib, which facilitates interaction with CSW. (Also, I just tried the YouTube link and it worked okay for me.) |
I'd recommend against generating ISO records from […]. Regarding the catalog implementation... it depends on what your requirements are. Do you need server-side filtering and indexing? Full-text search? How will users interact with the catalog - API? Website? Code? Notebooks? Might be a bigger discussion? |
@kwilcox this is super helpful! I wish I had known about cf-json when I implemented this PR in xarray. It is exactly what I was looking for at the time. Can you explain the relationship, if any, between cf-json and netcdf-ld? As for the catalog, for now we are just looking for a static catalog, i.e. a text file that can be parsed by humans and machines. We will eventually want to load it into intake for ingestion in python. |
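As an illustration of "a text file that can be parsed by humans and machines", here is a minimal sketch of a static catalog held as plain JSON and filtered on the client. The layout, field names (`datasets`, `href`, `variables`), and bucket URLs are all hypothetical and do not follow STAC, THREDDS, or any other spec mentioned in this thread.

```python
import json

# A hypothetical static catalog: plain JSON that could live next to the data.
# Field names and bucket URLs are illustrative, not taken from any spec.
CATALOG_TEXT = """
{
  "id": "example-catalog",
  "description": "Illustrative static catalog of cloud datasets",
  "datasets": [
    {"id": "sst-monthly", "format": "zarr",
     "href": "gs://example-bucket/sst-monthly.zarr",
     "variables": ["sst"]},
    {"id": "ocean-currents", "format": "zarr",
     "href": "gs://example-bucket/ocean-currents.zarr",
     "variables": ["u", "v"]}
  ]
}
"""


def find_datasets(catalog, variable):
    """Return the catalog entries that list the given variable."""
    return [d for d in catalog["datasets"] if variable in d["variables"]]


catalog = json.loads(CATALOG_TEXT)
matches = find_datasets(catalog, "sst")
for entry in matches:
    print(entry["id"], entry["href"])
```

No server is involved: the file can sit in object storage and any tool that reads JSON can search it.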
Also, you might be interested in the parallel discussion happening in radiantearth/stac-spec#361, about how to represent "data cubes" using STAC. |
I see |
Just to be clear, |
Has anybody experience with CovJSON? https://covjson.org/ |
There is a discussion of these various JSON representations of NetCDF/CF metadata here: |
When I say |
If the JSON file must hold the actual data, as seems to be the case for cf-json and CovJSON, it doesn't seem appropriate as a catalog (metadata) file. So either STAC (note: I may be biased as a STAC contributor) with the upcoming datacube extension or netcdf-ld sounds more appropriate for a catalog file. |
👍, I've started following the STAC datacube extension conversation. Just a note, while |
Excellent conversation folks! @jonblower FYI
I am not sure about cf-json, but for CovJSON, encoding of the actual range array values can be achieved by representing them in a separate document in a more efficient format. Various formats for the range are possible, but one attractive possibility is CovJSON itself, which provides a JSON encoding for a standalone multidimensional array (this can be compressed during data transfer for much greater efficiency). It may also, of course, be possible to use binary formats like NetCDF for this purpose. But many of these formats encode the full coverage (not just the range), and care must be taken to ensure that the RDF representation of the domain is consistent with that in the linked file. It is my understanding that use cases demonstrating this behavior are not overly common; however, the CovJSON specification does accommodate this in principle. |
Also, folks, for more discussion on the CovJSON side you will want to consult http://ceur-ws.org/Vol-1777/paper2.pdf. |
Thanks @lewismc for bringing me in here. I'm one of the authors of the CovJSON spec so happy to answer questions on this. Lewis is right that CovJSON was designed to accommodate the possibility of having metadata and data in separate files (which can be linked of course). The CoverageCollection object might answer @kwilcox's need for "a static catalog file describing an N dimensional dataset". One thing to bear in mind is the kind of metadata you want in your catalogue. CovJSON contains the same kind of metadata as a NetCDF file, e.g. a detailed description of the domain of the data, including the exact form of all the spatiotemporal axes, CRS definitions, variable definitions etc. Currently it does not contain "summary" metadata, such as the rough spatiotemporal bounding box, which can be useful for discovery. (However such information could be deduced from the CovJSON file.) |
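As a rough illustration of deducing summary metadata from the domain, here is a sketch that derives a bounding box and time range from a CovJSON-style grid domain. It assumes axes named `x`, `y`, and `t`, each given either as explicit `values` or as `start`/`stop`/`num`, and that time values are ISO 8601 strings in one consistent format so they sort lexicographically. The helper names are hypothetical, and cell bounds and CRS handling are ignored.

```python
def axis_extent(axis):
    """Return (min, max) for an axis given as values or as start/stop/num."""
    if "values" in axis:
        return min(axis["values"]), max(axis["values"])
    # Regularly spaced axis: the extent is set by its two end points.
    return min(axis["start"], axis["stop"]), max(axis["start"], axis["stop"])


def summarize_domain(domain):
    """Derive discovery-style summary metadata from a grid domain."""
    axes = domain["axes"]
    west, east = axis_extent(axes["x"])
    south, north = axis_extent(axes["y"])
    summary = {"bbox": [west, south, east, north]}
    if "t" in axes:
        # Assumes ISO 8601 strings in a single consistent format.
        t_min, t_max = axis_extent(axes["t"])
        summary["time_range"] = [t_min, t_max]
    return summary


# Illustrative domain, not taken from a real dataset.
domain = {
    "type": "Domain",
    "domainType": "Grid",
    "axes": {
        "x": {"start": -180.0, "stop": 180.0, "num": 361},
        "y": {"values": [-10.0, 0.0, 10.0]},
        "t": {"values": ["2019-01-01T00:00:00Z", "2019-02-01T00:00:00Z"]},
    },
}

summary = summarize_domain(domain)
print(summary)
```

A catalog builder could run something like this once per dataset and store the result alongside a link to the full CovJSON document.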
Reviving this discussion to see if there are any decisions or further considerations that have arisen since last Feb. I'm building a data catalog now for climate/xarray type data. I'm most familiar with STAC, but have also started dabbling with Intake. And now I need to read up on all the *json systems mentioned above. Is there any new consensus on how pangeo-data will be implementing a data catalog? |
I advocate for […]. My cataloging requirements are often at a higher level and I need to capture many different access points for the same data package. For example, a zarr data access url, analyses on the data output as static PNG images, a Jupyter notebook showing usage of the data, and a presentation given about the data. |
There have been a lot of developments. To handle the CMIP6 data in the cloud for the CMIP6 hackathon, we hacked together the ESM collection spec: https://github.com/NCAR/esm-collection-spec/ This is not STAC, but it is inspired by STAC. The hope is that we can eventually find a way to merge with STAC. Right now, the ESM collection spec couples very tightly with intake-esm: https://intake-esm.readthedocs.io/. Some basic usage examples can also be found at https://discourse.pangeo.io/t/using-ocean-pangeo-io-for-the-cmip6-hackathon/291 cc @andersy005 and @matt-long, who were instrumental in the development of ESM collection spec. |
@talldave - we have hacked together something called ESM collection spec: https://github.com/NCAR/esm-collection-spec/. It is STAC-like but it is its own thing. We came up with this to catalog the CMIP6 cloud data (https://pangeo-data.github.io/pangeo-datastore/cmip6_pangeo.html). Some flavor of it is implemented by intake-esm. We would love to have more people working on this, and we welcome your involvement. |
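For readers new to this style of catalog, here is a sketch of the general idea: a small JSON descriptor plus a table with one row per data asset, searched by column values. It is loosely inspired by the description above and is not the ESM collection spec itself. The descriptor fields (`attributes`, `path_column`), the model names, and the bucket paths are illustrative assumptions, and the table is inlined rather than loaded from `catalog_file` to keep the example self-contained.

```python
import csv
import io
import json

# Hypothetical descriptor; field names are illustrative, not from the spec.
DESCRIPTOR = json.loads("""
{"id": "example-collection",
 "catalog_file": "catalog.csv",
 "attributes": ["source_id", "experiment_id", "variable_id"],
 "path_column": "path"}
""")

# Inline stand-in for the CSV file the descriptor points at.
CATALOG_CSV = """source_id,experiment_id,variable_id,path
MODEL-A,historical,tas,gs://example-bucket/MODEL-A/historical/tas
MODEL-A,ssp585,tas,gs://example-bucket/MODEL-A/ssp585/tas
MODEL-B,historical,pr,gs://example-bucket/MODEL-B/historical/pr
"""


def search(rows, descriptor, **query):
    """Return rows whose attribute columns equal every value in the query."""
    unknown = set(query) - set(descriptor["attributes"])
    if unknown:
        raise KeyError(f"not searchable: {sorted(unknown)}")
    return [r for r in rows if all(r[k] == v for k, v in query.items())]


rows = list(csv.DictReader(io.StringIO(CATALOG_CSV)))
hits = search(rows, DESCRIPTOR, experiment_id="historical", variable_id="tas")
paths = [r[DESCRIPTOR["path_column"]] for r in hits]
print(paths)
```

Keeping the table separate from the descriptor lets the list of assets grow without touching the metadata that explains how to read it.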
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date. |
@rsignell-usgs mentioned that there are already a lot of standards / catalogs / services etc. in this space. It would be useful to enumerate these so we don't go about re-inventing the wheel too much.
I'll kick things off by linking to @cholmes's blog posts about why they decided to invent STAC:
A STAC Static catalog: