Storage profiles #148
Comments
@mojodna a few things
|
Can this be extended to include the access control that private datasets come with? For example, DG can catalog assets like our images, but we will not be making these publicly accessible. Can I make my own custom "DG" provider? If so, what would something like this look like? |
I like the idea of profiles for storage. We were thinking about storage for the datasets, too, but we had defined a fixed set, which I am not so happy with. What do you think @simonff? Would that be a more flexible way to go for datasets, too? |
Can we first list the top possibilities for storage? S3, GCS (Google Cloud Storage), MS Azure Storage, FTP/HTTP/HTTPS, DG's GBDX, Earth Engine, ... what else? @jeffnaus: how would DG's clients expect to see storage information? Note that the dataset subteam seemed to agree that it's better to separate the notion of Provider (the author/producer of the dataset) from the notion of Host/Storage (the physical location where the bytes are stored). Eg, in the case of the HTTP storage the Provider section would point at the human-readable homepage of the dataset, while the Host section would contain download links. |
I'm not sure whether we need much information for hosts such as DG's GBDX, Earth Engine and openEO as they don't offer data for direct download, but only for processing purposes within their own infrastructure. I don't really know what I would expect there for openEO. What would you expect there for GEE @simonff? #164 has information on how the dataset subteam defined Hosts and Provider in the datasets as of now. |
While EE etc. don't offer data for direct download, it still seems valuable to support listing of assets in some standard format to make future catalog viewers more broadly useful and provide compliance with open standards. (Note that EE can already export any assets into GeoTIFFs for download. Right now there is no foolproof way to export the exact original bytes with the exact original extent and projection - one would need to tune the export parameters just so. But we keep the original bytes and metadata internally, so reconstructing an exact copy of the input asset is not impossible.) |
So you would have stored a STAC catalog and would point to its location, but the STAC items would not refer to any downloadable assets? One asset would need to be referenced though (required by the spec). These catalogs would probably be stored on S3, GCS, HTTP and would not need a native GEE profile for storage, right? Or did I not get your point? |
Right, a static STAC catalog is a file or a family of files that can be browsed over HTTP, AFAIU. EE would not know about them, though EE might provide STAC API exposing the same information as static files. Sure, it makes sense for each item to reference at least one asset, but why do these assets have to be downloadable? Eg, one could imagine some future tool comparing listings of Landsat collections in EE and in DG to see who has the most recent data. |
Same thing with DG, you can't directly download initially, but after you order it you can download it. I've implemented this for DG as STAC item that gets updated locally with the download information once the user orders it. But I don't see why a STAC item should be required to include any assets, perhaps all the assets are only retrievable via some function call. Granted, providers should be strongly encouraged to provide a thumbnail, but for some datasets even thumbnails aren't all that useful. |
When I once asked about the requirement of at least one asset, I got these responses: https://gitter.im/SpatioTemporal-Asset-Catalog/Lobby?at=5ae2f6b61130fe3d36219434
|
My own thinking on this has changed after working with GBDX. There are likely to be even more platforms in the future where assets aren't directly downloadable but require API calls through a RESTful API or an API library. |
Sure, there are GEE, openEO, GBDX and more to come. So should we open an issue to remove this restriction? Edit: Did so: #187 |
+1 |
This functionality is implemented in the storage extension: |
Static STAC Catalog entries include a `provider` element which isn't particularly well-specified. The current example is S3-specific:

I propose that storage profiles be defined for each type of object store (or just plain HTTP) so that keys have more meaningful values. The above would then be:

However, this presumes that all Items falling beneath such a parent catalog assume the same attributes, which may not be correct. Instead, perhaps these should be `asset` properties and the `provider` block omitted:

Proposed keys:
- `s3:region` - name of the AWS region that a bucket is in, e.g. `us-west-2` (enum)
- `s3:requesterPays` - whether the S3 Requester Pays feature is enabled for this item; implies that AWS credentials must be used to access it (boolean)
- `s3:public` - whether the item can be retrieved without S3 credentials; this allows clients to create HTTP(S) URLs and make items downloadable directly (boolean)
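
To illustrate how a client might consume the proposed keys, here is a minimal Python sketch. It assumes the keys sit directly on an asset object next to an `href` holding an `s3://bucket/key` URI; that layout, the function name `public_https_url`, and the sample bucket are illustrative assumptions and not part of the proposal. The URL form used is the standard S3 virtual-hosted-style endpoint.

```python
from typing import Optional
from urllib.parse import quote, urlparse


def public_https_url(asset: dict) -> Optional[str]:
    """Return a direct HTTPS URL for an S3 asset, or None if credentials are needed.

    Hypothetical helper: assumes `asset["href"]` is an s3://bucket/key URI and
    that the proposed s3:* keys are stored on the asset itself.
    """
    parsed = urlparse(asset.get("href", ""))
    if parsed.scheme != "s3":
        return None

    # Requester Pays implies AWS credentials, so no anonymous URL can be built.
    if asset.get("s3:requesterPays", False):
        return None
    # Only objects flagged as public can be fetched without S3 credentials.
    if not asset.get("s3:public", False):
        return None

    region = asset.get("s3:region")
    if not region:
        return None

    bucket = parsed.netloc
    key = quote(parsed.path.lstrip("/"))  # percent-encode, keeping "/" separators
    return f"https://{bucket}.s3.{region}.amazonaws.com/{key}"


# Hypothetical asset; the bucket and key are made up for illustration.
example_asset = {
    "href": "s3://example-bucket/scenes/scene 1.tif",
    "s3:region": "us-west-2",
    "s3:requesterPays": False,
    "s3:public": True,
}
print(public_https_url(example_asset))
```

Treating a missing `s3:public` as "not public" is a conservative choice in this sketch: a client that cannot tell falls back to credentialed access rather than producing a URL that may fail.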