design for pangeo cloud datastore service #1

Closed · rabernat opened this issue Nov 29, 2018 · 16 comments

@rabernat (Member)
People seem generally very pleased with the experience of using zarr + xarray in the cloud environment. However, different groups are uploading more and more zarr data, and it is becoming hard to keep track of what is where.

Currently our approach within the Pangeo NSF-funded group is the following:

I feel that the time has come for a more organized approach. What I have in mind is a standalone "Pangeo Data Store" service that provides the following features:

  • A consistent master database of all the datasets under its control (could just be an intake catalog)
  • Some sort of ingest mechanism, such as a staging bucket that people can upload to, separate from the production bucket
  • Some basic validation of uploaded datasets (a rough sketch follows below)
  • A nice web interface to browse the data and perhaps provide metadata to search engines

I don't think this has to be very heavy-handed or complex. (We can likely recycle functionality from existing libraries--intake, pydap, etc.--for lots of this.) But I think it is a problem we have to solve as we move beyond the experimentation phase.
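
To make the validation bullet concrete, here is a minimal sketch of what a check on a staged upload might look like, using gcsfs and xarray; the staging bucket path and the list of required attributes are assumptions for illustration, not an agreed convention.

```python
# A minimal sketch of the "basic validation" step, assuming uploads land in a
# hypothetical staging bucket before being promoted to production. The bucket
# path and the required attributes are illustrative, not an agreed convention.
import gcsfs
import xarray as xr

REQUIRED_ATTRS = ["title", "institution", "license"]  # hypothetical minimum

def validate_staged_zarr(path="pangeo-staging/some-dataset.zarr"):
    fs = gcsfs.GCSFileSystem(token="anon")
    ds = xr.open_zarr(fs.get_mapper(path), consolidated=False)

    problems = []
    for attr in REQUIRED_ATTRS:
        if attr not in ds.attrs:
            problems.append(f"missing global attribute: {attr}")
    for name, var in ds.data_vars.items():
        if "units" not in var.attrs:
            problems.append(f"variable {name!r} has no units attribute")
    return problems

if __name__ == "__main__":
    for problem in validate_staged_zarr():
        print(problem)
```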

Relevant to many ongoing discussions, e.g. #420, #365, #502.

@rabernat (Member Author) commented Dec 4, 2018

@NicWayand suggested we might also want to obtain DOIs for the datasets.

@kmpaul commented Dec 6, 2018

I guess this is a related topic, so I'll post it here, but this topic (i.e., something like "Data Management") could be one of the potential Pangeo Technical/Topical Subgroups, right?

I suppose we should wait for the Steering Council to decide what the technical groups should be, but I would say that I and @darothen would be very interested in being a part of this subgroup, were it created. (See #507)

@scottyhq (Member)
Thanks for opening up the discussion here, Ryan. Just wanted to draw attention to the STAC effort (SpatioTemporal Asset Catalog). This is aimed at minimal metadata descriptions, with the long-term vision of having data discoverable by standard web search rather than by maintaining specialized servers.

Some implementations are also exploring ways to automatically update catalogs when new data appears in cloud buckets:
https://github.com/fredliporace/cbers-2-stac

We are currently exploring ways to integrate STAC and Intake w/ NASA data:
intake/intake-datasets#2 (comment)

It could be interesting to try this out with the existing pangeo catalog (or just with one of the zarr datasets for starters).
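
For concreteness, a minimal STAC item describing one of the existing zarr datasets might look roughly like the sketch below; the dataset id, asset key, and media type are assumptions for illustration rather than anything from the spec or the repos above.

```python
# Rough sketch of a minimal STAC item pointing at a zarr store in a cloud
# bucket. The dataset id, "zarr-store" asset key, media type, and bucket path
# are illustrative assumptions, not an established convention.
import json

item = {
    "id": "sea-surface-height",
    "type": "Feature",
    "geometry": None,  # global gridded data, so no meaningful footprint polygon
    "properties": {"datetime": "2018-01-01T00:00:00Z"},
    "assets": {
        "zarr-store": {
            "href": "gs://pangeo-data/sea-surface-height.zarr",
            "type": "application/vnd+zarr",  # hypothetical media type
        }
    },
    "links": [{"rel": "parent", "href": "../catalog.json"}],
}

print(json.dumps(item, indent=2))
```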

@rabernat (Member Author)
@scottyhq -- great idea to use STAC. This is exactly the sort of project that I am hoping to leverage to build our catalogs.

Do you think the STAC spec is generic enough to accommodate all the pangeo use cases? What would be a concrete step we could take to evaluate STAC?

@rabernat (Member Author) commented Jan 7, 2019

I have created a new repo to hold my ideas and experiments related to a pangeo datastore:
https://github.com/pangeo-data/pangeo-stac

@jacobtomlinson (Member)
This is definitely an interesting area for us. I've been exploring the remote catalog functionality in intake, which would be another approach to this.

Whenever we chat about this in our team, this xkcd always comes up. We should put some thought into an approach where we add pointers to other catalogs as well as hosting a pangeo one. Perhaps some kind of aggregator.
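
For reference, the remote-catalog approach in intake looks roughly like this; the catalog URL and entry name below are hypothetical placeholders.

```python
# Sketch of the intake remote-catalog approach mentioned above. The catalog
# URL and the entry name are hypothetical placeholders.
import intake

cat = intake.open_catalog("https://example.org/pangeo/master.yaml")
print(list(cat))                  # names of the entries this catalog exposes

# entry = cat["some_dataset"]     # hypothetical entry name
# ds = entry.to_dask()            # lazily load as an xarray/dask object
```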

@rabernat (Member Author) commented Jan 7, 2019

@jacobtomlinson - glad to hear you are interested! Your engineering expertise would be very valuable here.

The bottom line for me is that our catalog must

  • follow some sort of language-independent standard. We can't be Python-only. I love intake, but it's only part of the solution: there is no spec for intake catalogs, and they're not designed to work outside of the intake library.
  • have some sort of web front end.

STAC addresses both of these issues. I am confident we will be able to translate STAC to intake (see intake/intake#224 (comment)), but the reverse is not true.

Beyond that, I am still in the whiteboarding stage. This is my latest monstrosity:
[schematic image]

I see a bot like opsdroid playing a big role in this. It would be great if the catalog could just be static JSON files in a git repository. But, given the complexity, I don't know what the right software architecture to manage this would be. I would love to get your thoughts.
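
To ground the claim about translating STAC to intake, here is a very rough sketch of what converting a single item might look like; the "zarr-store" asset convention and the driver arguments are assumptions on my part, and this is nothing like a complete converter.

```python
# Very rough sketch of translating one STAC item into an intake catalog entry.
# Assumes the item exposes its data under a "zarr-store" asset key (an
# illustrative convention, not part of the STAC spec) and that the
# intake-xarray "zarr" driver is available.
import yaml

def stac_item_to_intake_source(item):
    asset = item["assets"]["zarr-store"]
    return {
        item["id"]: {
            "description": item.get("properties", {}).get("description", ""),
            "driver": "zarr",
            "args": {"urlpath": asset["href"]},
        }
    }

item = {
    "id": "sea-surface-height",
    "properties": {"description": "Example gridded dataset"},
    "assets": {"zarr-store": {"href": "gs://pangeo-data/sea-surface-height.zarr"}},
}

print(yaml.safe_dump({"sources": stac_item_to_intake_source(item)}))
```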

@rabernat (Member Author) commented Jan 7, 2019

Also note that STAC catalogs can be nested (like intake ones). This would support our federated philosophy. We could have one "master" pangeo catalog that pointed to any number of sub-catalogs maintained independently by different groups.
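
A hypothetical sketch of that federation pattern, with a top-level catalog whose "child" links point at independently maintained sub-catalogs; all ids and URLs are invented for illustration.

```python
# Hypothetical sketch of the federated layout: one top-level Pangeo catalog
# whose "child" links point to sub-catalogs maintained by other groups.
# All ids and URLs are invented for illustration.
import json

master = {
    "stac_version": "0.6.0",  # version current at the time of writing
    "id": "pangeo",
    "description": "Top-level Pangeo catalog (illustrative)",
    "links": [
        {"rel": "self", "href": "https://example.org/pangeo/catalog.json"},
        {"rel": "child", "href": "https://example.org/ncar/catalog.json"},
        {"rel": "child", "href": "https://example.org/metoffice/catalog.json"},
    ],
}

print(json.dumps(master, indent=2))
```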

@jacobtomlinson (Member)
That sounds like a pretty neat workflow, and something we could definitely implement with opsdroid. STAC also looks good.

From a technical viewpoint that all looks perfectly doable. My preference is also for version-controlled static files.

There are other questions I would have from a governance point of view, such as commitments from the Pangeo group (how long should we agree to host data?) and the criteria for accepting data. For example, we are publishing MO data on AWS Earth and have agreed that AWS will review this arrangement every two years to see if both parties wish to continue. The bot could always open a new issue after a certain amount of time to discuss dataset retirement.
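
As a sketch of the "bot opens a retirement issue" idea, something like the following could run on a schedule; the repository name, token handling, and wording are placeholders, and a real deployment might live inside opsdroid rather than a bare script.

```python
# Sketch of the "open an issue when a dataset is due for review" idea, using
# the GitHub REST API directly. The repository, token handling, and message
# wording are placeholders; a real deployment might be an opsdroid skill.
import os
import requests

def open_retirement_issue(dataset_id, repo="pangeo-data/pangeo-datastore"):
    url = f"https://api.github.com/repos/{repo}/issues"
    payload = {
        "title": f"Review hosting of dataset {dataset_id}",
        "body": (
            f"The agreed hosting period for `{dataset_id}` is ending soon. "
            "Should it be renewed, migrated, or retired?"
        ),
    }
    headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}
    resp = requests.post(url, json=payload, headers=headers)
    resp.raise_for_status()
    return resp.json()["html_url"]
```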

The other thing would be signposting existing data, like AWS Earth. We could maintain a STAC catalog for our data, which Pangeo could point to. However, what about people who are publishing data but not using STAC?

@rabernat (Member Author) commented Jan 7, 2019

Your governance questions are good ones.

In the long term, I don't imagine that "Pangeo" as currently constituted will become a primary data provider. As an organization, we are too amorphous to make that sort of commitment.

Instead, my intended audience for this effort is organizations like NCAR, MO, Lamont, and NASA, or companies like ClimaCell and Jupiter Intel, who already host data and are convinced by our arguments that they should be hosting data in the cloud in zarr format. I think we have persuaded many people that this is a good idea in principle, but we don't have a good answer to the question, "now what?" By creating a datastore product, we will have something concrete to point such organizations to.

As for how to pay for it, I imagine that some catalogs will be part of public dataset programs on AWS and Google Cloud, while others will be paid for by the organizations themselves.

What I want to do here is create a product that will solve the problem of managing these datasets.

@jacobtomlinson (Member)
Ok that's interesting. I'm seeing two different strands in this conversation. One is about creating a way of automatically approving and ingesting new datasets into some cloud provider using a bot and GitHub. The other is about creating some standard tooling for managing datasets (including these new ones) for the Pangeo community. Perhaps it would be good to separate those concerns.

I can also imagine a couple of different use cases. One would be large institute-level initiatives to publish data. For example, NCAR may want to publish something; they are happy to pay for it, we want them to be able to access it from Pangeo, and we need a way to join up the dots, so your cataloguing implementation makes sense there. The other use case would be an individual researcher (potentially also at NCAR, for example) who has some data they want to publish but doesn't have access to the skills or funding to do so. In that case I see your ingestion mechanism also being useful.

@rabernat (Member Author) commented Jan 8, 2019

I would like to transfer this issue to the new https://github.com/pangeo-data/pangeo-datastore repository. Any objections?

rabernat transferred this issue from pangeo-data/pangeo on Jan 8, 2019
@martindurant
Proof-of-concept code in https://github.com/NCAR/intake-siphon/pull/2/files shows that, generally speaking, it is easy to coerce well-defined catalogue protocols to intake; in that case it was THREDDS links via siphon (siphon is basically a thin layer that decodes XML and could have been done by hand).
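
The same idea in outline: walk a THREDDS catalogue with siphon and emit intake-style source entries for its OPeNDAP endpoints. The catalogue URL is a placeholder and the entry layout is simplified.

```python
# Outline of the siphon-to-intake direction described above: walk a THREDDS
# catalogue and emit intake-style source entries for its OPeNDAP endpoints.
# The catalogue URL is a placeholder and the entry layout is simplified.
from siphon.catalog import TDSCatalog

def thredds_to_intake_sources(catalog_url):
    cat = TDSCatalog(catalog_url)
    sources = {}
    for name, ds in cat.datasets.items():
        if "OPENDAP" in ds.access_urls:
            sources[name] = {
                "driver": "opendap",  # provided by intake-xarray
                "args": {"urlpath": ds.access_urls["OPENDAP"]},
            }
    return sources

# Example (placeholder URL):
# thredds_to_intake_sources(
#     "https://thredds.example.org/thredds/catalog/some/path/catalog.xml")
```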

@scottyhq (Member)
Just posted a Binder demo with some initial experiments with STAC, intake, and Landsat data:
https://github.com/scottyhq/stac-intake-landsat

An important point is that a STAC catalog is very generic: it could describe a global gridded n-D zarr dataset from modeling, or it could describe millions of images from various sensors in different projections and resolutions.

I like the idea of pre-configured searches that give back a set of compatible STAC items that can be loaded directly into an xarray DataArray or Dataset. Maybe with sat-api:
https://github.com/sat-utils/sat-api

For Landsat it's pretty straightforward to get the temporal stack of colocated images for a specific Path and Row. That is enough for many applications, but if someone wants to study a regional process, it would be great to have a way to construct virtual mosaics or composites from many scenes. Do we want intake to do that behind the scenes? Another example: what if someone tries to load Landsat Band 1 (30 m) and Band 8 (15 m)? The image resolutions are optional metadata in STAC, so intake could either raise a 'not allowed' error or the plugin could do the necessary resampling.
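
A back-of-the-envelope sketch of the "pre-configured search" idea: query a sat-api style search endpoint over HTTP, then stack one band of the returned scenes along time with xarray. The endpoint URL, query field names, and asset keys are all assumptions, and resampling across resolutions is ignored.

```python
# Back-of-the-envelope sketch of a pre-configured search: query a sat-api
# style STAC search endpoint, then stack one band of the returned Landsat
# scenes along time with xarray. The endpoint URL, query field names, and
# asset keys are assumptions; mixed resolutions are not handled.
import requests
import xarray as xr

def landsat_band_stack(path, row, band="B4",
                       endpoint="https://sat-api.example.org/stac/search"):
    query = {
        "query": {"eo:column": {"eq": path}, "eo:row": {"eq": row}},
        "limit": 50,
    }
    items = requests.post(endpoint, json=query).json()["features"]

    slices, times = [], []
    for item in items:
        href = item["assets"][band]["href"]
        da = xr.open_rasterio(href).squeeze("band", drop=True)
        slices.append(da)
        times.append(item["properties"]["datetime"])

    stack = xr.concat(slices, dim="time")
    return stack.assign_coords(time=times)
```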

Would be great to get feedback from @apawloski on this!

charlesbluca added a commit that referenced this issue Apr 22, 2020
charlesbluca pushed a commit that referenced this issue Dec 1, 2020
@github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Dec 9, 2020
@github-actions bot
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
