Storage profiles #148

mojodna · 2018-08-14T05:26:14Z

Static STAC Catalog entries include a provider element which isn't particularly well-specified. The current example is S3-specific:

    "provider": {
        "scheme": "s3",
        "region": "us-east-1",
        "requesterPays": "true"
    }

I propose that storage profiles be defined for each type of object store (or just plain HTTP) so that keys have more meaningful values. The above would then be:

    "provider": {
        "s3:region": "us-east-1",
        "s3:requesterPays": true
    }

However, this presumes that all Items falling beneath such a parent catalog assume the same attributes, which may not be correct. Instead, perhaps these should be asset properties and the provider block omitted:

    "assets": [
        {
            "href": "s3://.../...",
            "type": "...",
            "name": "...",
            "s3:region": "us-east-1", // can be inferred from the bucket name, but explicit is better than implicit
            "s3:requesterPays": true
        }
    ]

Proposed keys:

s3:region - name of the AWS region that a bucket is in, e.g. us-west-2 (enum)
s3:requesterPays - whether the S3 Requester Pays feature is enabled for this item; implies that AWS credentials must be used to access it (boolean)
s3:public - whether the item can be retrieved without S3 credentials; this allows clients to create HTTP(S) URLs and make items downloadable directly (boolean)
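To illustrate how a client might use these keys, here is a minimal sketch, not part of the proposal itself. It assumes the asset is a plain dict carrying the proposed `s3:*` keys, and that a public object can be reached through the S3 virtual-hosted-style endpoint `https://<bucket>.s3.<region>.amazonaws.com/<key>`. The function name `s3_asset_to_https` and the bucket name in the usage note are hypothetical.

```python
from urllib.parse import quote, urlparse


def s3_asset_to_https(asset):
    """Return an HTTPS URL for a public S3 asset, or None if one cannot be built.

    Assumes the asset dict uses the proposed s3:region, s3:public and
    s3:requesterPays keys (hypothetical until an extension defines them).
    """
    parsed = urlparse(asset["href"])
    if parsed.scheme != "s3":
        return None  # not an S3 href, nothing to translate

    # Requester Pays implies credentials are needed, so a plain URL is not enough.
    if asset.get("s3:requesterPays", False):
        return None
    if not asset.get("s3:public", False):
        return None

    region = asset.get("s3:region")
    if region is None:
        return None  # the region is needed to build the regional endpoint

    bucket = parsed.netloc
    key = parsed.path.lstrip("/")
    # quote() keeps "/" intact and percent-encodes characters such as spaces.
    return f"https://{bucket}.s3.{region}.amazonaws.com/{quote(key)}"
```

For example, an asset with `"href": "s3://example-bucket/scenes/a.tif"`, `"s3:region": "us-east-1"` and `"s3:public": true` would map to `https://example-bucket.s3.us-east-1.amazonaws.com/scenes/a.tif`, while the same asset with `"s3:requesterPays": true` would yield `None`.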

matthewhanson · 2018-08-14T10:54:49Z

@mojodna a few things

1. We discussed having object fields but decided to keep things simple originally, and currently we don't have any fields that are objects under properties, which simplifies searching. Object fields, such as links and assets, were moved out of properties to the top level for this reason. I think there are multiple fields that would make sense as object types, but we need to define how a user searches these fields.
2. This potentially duplicates a lot of info across items. In sat-api, collections don't include assets, and nothing specifically prevents parts of assets from being defined there, but we need to specify how these would end up getting merged. I will start a new ticket for this.
3. Looks like we need to specify an s3 extension that adds these additional fields, and likewise create one for Google Storage.
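As an illustration of the merge question raised above, here is a minimal sketch of one possible rule: storage fields declared on a parent (for example a collection) act as defaults, and values set on an individual asset take precedence. The thread does not settle on this rule; the function name `merge_storage_defaults` and the sample values are assumptions for illustration only.

```python
def merge_storage_defaults(defaults, asset):
    """Return a new asset dict with parent-level defaults filled in.

    Hypothetical rule: keys already present on the asset win over defaults.
    Neither input dict is modified.
    """
    merged = dict(defaults)  # start from the parent-level values
    merged.update(asset)     # asset-level values override them
    return merged


# Hypothetical parent-level defaults and an asset that overrides one of them.
collection_defaults = {"s3:region": "us-east-1", "s3:requesterPays": True}
asset = {"href": "s3://example-bucket/scene.tif", "s3:requesterPays": False}

merged_asset = merge_storage_defaults(collection_defaults, asset)
```

With this rule, `merged_asset` keeps the asset's own `s3:requesterPays` value and inherits `s3:region` from the parent, so the shared fields need to be written only once.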

jeffnaus · 2018-08-14T23:23:36Z

Can this be extended to include the access control that private datasets come with? For example, DG can catalog assets like our images, but we will not be making these publicly accessible. Can I make my own custom "DG" provider? If so, what would something like this look like?

m-mohr · 2018-08-19T06:33:01Z

I like the idea of profiles for storage. We were thinking about storage for the datasets, too, but we had defined a fixed set, which I am not so happy with. What do you think @simonff? Would that be a more flexible way to go for datasets, too?

simonff · 2018-08-20T04:41:57Z

Can we first list the top possibilities for storage? S3, GCS (Google Cloud Storage), MS Azure Storage, FTP/HTTP/HTTPS, DG's GBDX, Earth Engine, ... what else?

@jeffnaus: how would DG's clients expect to see storage information?

Note that the dataset subteam seemed to agree that it's better to separate the notion of Provider (the author/producer of the dataset) from the notion of Host/Storage (the physical location where the bytes are stored). Eg, in the case of the HTTP storage the Provider section would point at the human-readable homepage of the dataset, while the Host section would contain download links.

m-mohr · 2018-08-20T04:52:40Z

I'm not sure whether we need much information for hosts such as DG's GBDX, Earth Engine and openEO as they don't offer data for direct download, but only for processing purposes within their own infrastructure. I don't really know what I would expect there for openEO. What would you expect there for GEE @simonff?

#164 has information on how the dataset subteam defined Hosts and Provider in the datasets as of now.

simonff · 2018-08-20T05:06:54Z

While EE etc. don't offer data for direct download, it still seems valuable to support listing of assets in some standard format to make future catalog viewers more broadly useful and provide compliance with open standards. (Note that EE can already export any assets into GeoTIFFs for download. Right now there is no foolproof way to export the exact original bytes with the exact original extent and projection - one would need to tune the export parameters just so. But we keep the original bytes and metadata internally, so reconstructing an exact copy of the input asset is not impossible.)

m-mohr · 2018-08-20T05:21:53Z

So you would have stored a STAC catalog and would point to its location, but the STAC items would not refer to any downloadable assets? One asset would need to be referenced though (required by the spec). These catalogs would probably be stored on S3, GCS, HTTP and would not need a native GEE profile for storage, right? Or did I not get your point?

simonff · 2018-08-20T05:43:32Z

Right, a static STAC catalog is a file or a family of files that can be browsed over HTTP, AFAIU. EE would not know about them, though EE might provide a STAC API exposing the same information as the static files.

Sure, it makes sense for each item to reference at least one asset, but why do these assets have to be downloadable? Eg, one could imagine some future tool comparing listings of Landsat collections in EE and in DG to see who has the most recent data.

matthewhanson · 2018-08-20T09:03:26Z

Same thing with DG: you can't directly download initially, but after you order the data you can download it. I've implemented this for DG as a STAC item that gets updated locally with the download information once the user orders it.

But I don't see why a STAC item should be required to include any assets, perhaps all the assets are only retrievable via some function call. Granted, providers should be strongly encouraged to provide a thumbnail, but for some datasets even thumbnails aren't all that useful.

m-mohr · 2018-08-20T09:12:13Z

When I once asked about the requirement of at least one asset, I got these responses: https://gitter.im/SpatioTemporal-Asset-Catalog/Lobby?at=5ae2f6b61130fe3d36219434
Also, the catalog spec claims:

the point of the SpatioTemporal Asset Catalog is to be link to actual data, not to just reference metadata

matthewhanson · 2018-08-20T09:15:25Z

My own thinking on this has changed after working with GBDX. There are likely to be even more platforms in the future where assets aren't directly downloadable but require API calls through a RESTful API or an API library.

m-mohr · 2018-08-20T09:24:47Z

Sure, there are GEE, openEO, GBDX and more to come. So should we open an issue to remove this restriction?

Edit: Did so: #187

matthewhanson · 2018-08-20T10:46:06Z

+1

matthewhanson · 2023-04-04T16:23:47Z

This functionality is implemented in the storage extension:
https://github.com/stac-extensions/storage

mojodna added stac-static stac-sprint-3-discuss labels Aug 14, 2018

This was referenced Aug 20, 2018

Make assets optional #187

Closed

Dataset metadata spec #164

Merged

cholmes added the prio: should-have would be very good to have in the release label Aug 23, 2018

cholmes added this to the 0.6.0-RC1 milestone Aug 24, 2018

m-mohr added a commit that referenced this issue Oct 5, 2018

Implemented changes discussed in the last meeting (removing host, mak…

9ec7183

…ring datasets required etc.) Related issues: #225, #194, #148, #136, #78, #36.

m-mohr mentioned this issue Oct 5, 2018

Dataset improvements #262

Merged

m-mohr modified the milestones: 0.6.0-RC1, future Oct 9, 2018

matthewhanson mentioned this issue Nov 14, 2018

Implement storage profiles sat-utils/sat-stac#12

Closed

m-mohr added new extension and removed stac-sprint-3-discuss prio: should-have would be very good to have in the release labels Jul 18, 2019

m-mohr linked a pull request Feb 19, 2021 that will close this issue

requester pays and other storage details in Asset and Item #991

Closed

4 tasks

cholmes modified the milestones: future, new extensions Feb 26, 2021

matthewhanson closed this as completed Apr 4, 2023
