design for pangeo cloud datastore service #1

Closed · rabernat opened this issue Nov 29, 2018 · 16 comments

@rabernat (Member)
People seem generally very pleased with the experience of using zarr + xarray in the cloud environment. However, different groups are uploading more and more zarr data, and it is becoming hard to keep track of what is where.

Currently our approach within the Pangeo NSF-funded group is the following:

I feel that the time has come for a more organized approach. What I have in mind is a standalone "Pangeo Data Store" service that provides the following features:

  • A consistent master database of all the datasets under its control (could just be an intake catalog)
  • Some sort of ingest mechanism, such as a staging bucket that people can upload to, separate from the production bucket
  • Some basic validation of uploaded datasets (a rough sketch follows below)
  • A nice web interface to browse the data and perhaps provide metadata to search engines

I don't think this has to be very heavy-handed or complex. (We can likely recycle functionality from existing libraries--intake, pydap, etc.--for lots of this.) But I think it is a problem we have to solve as we move beyond the experimentation phase.
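
To make the validation bullet concrete, here is a minimal sketch of what a check on a staged upload might look like, using gcsfs and xarray; the staging bucket path and the list of required attributes are assumptions for illustration, not an agreed convention.

```python
# A minimal sketch of the "basic validation" step, assuming uploads land in a
# hypothetical staging bucket before being promoted to production. The bucket
# path and the required attributes are illustrative, not an agreed convention.
import gcsfs
import xarray as xr

REQUIRED_ATTRS = ["title", "institution", "license"]  # hypothetical minimum

def validate_staged_zarr(path="pangeo-staging/some-dataset.zarr"):
    fs = gcsfs.GCSFileSystem(token="anon")
    ds = xr.open_zarr(fs.get_mapper(path), consolidated=False)

    problems = []
    for attr in REQUIRED_ATTRS:
        if attr not in ds.attrs:
            problems.append(f"missing global attribute: {attr}")
    for name, var in ds.data_vars.items():
        if "units" not in var.attrs:
            problems.append(f"variable {name!r} has no units attribute")
    return problems

if __name__ == "__main__":
    for problem in validate_staged_zarr():
        print(problem)
```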

Relevant to many ongoing discussions, e.g. #420, #365, #502.

@rabernat (Member Author) commented Dec 4, 2018

@NicWayand suggested we might also want to obtain DOIs for the datasets.

@kmpaul commented Dec 6, 2018

I guess this is a related topic, so I'll post it here, but this topic (i.e., something like "Data Management") could be one of the potential Pangeo Technical/Topical Subgroups, right?

I suppose we should wait for the Steering Council to decide what the technical groups should be, but I would say that I and @darothen would be very interested in being a part of this subgroup, were it created. (See #507)

@scottyhq (Member)
Thanks for opening up the discussion here, Ryan. Just wanted to draw attention to the STAC effort (SpatioTemporal Asset Catalog). This is aimed at minimal metadata descriptions, with the long-term vision of having data discoverable by standard web search rather than by maintaining specialized servers.

Some implementations are also exploring ways to automatically update catalogs when new data appears in cloud buckets:
https://github.com/fredliporace/cbers-2-stac

We are currently exploring ways to integrate STAC and Intake w/ NASA data:
intake/intake-datasets#2 (comment)

It could be interesting to try this out with the existing pangeo catalog (or just with one of the zarr datasets for starters).
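
For concreteness, a minimal STAC item describing one of the existing zarr datasets might look roughly like the sketch below; the dataset id, asset key, and media type are assumptions for illustration rather than anything from the spec or the repos above.

```python
# Rough sketch of a minimal STAC item pointing at a zarr store in a cloud
# bucket. The dataset id, "zarr-store" asset key, media type, and bucket path
# are illustrative assumptions, not an established convention.
import json

item = {
    "id": "sea-surface-height",
    "type": "Feature",
    "geometry": None,  # global gridded data, so no meaningful footprint polygon
    "properties": {"datetime": "2018-01-01T00:00:00Z"},
    "assets": {
        "zarr-store": {
            "href": "gs://pangeo-data/sea-surface-height.zarr",
            "type": "application/vnd+zarr",  # hypothetical media type
        }
    },
    "links": [{"rel": "parent", "href": "../catalog.json"}],
}

print(json.dumps(item, indent=2))
```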

@rabernat (Member Author)
@scottyhq -- great idea to use STAC. This is exactly the sort of project that I am hoping to leverage to build our catalogs.

Do you think the STAC spec is generic enough to accommodate all the pangeo use cases? What would be a concrete step we could take to evaluate STAC?

@rabernat (Member Author) commented Jan 7, 2019

I have created a new repo to hold my ideas and experiments related to a pangeo datastore:
https://github.com/pangeo-data/pangeo-stac

@jacobtomlinson (Member)
This is definitely an interesting area for us. I've been exploring the remote catalog functionality in intake, which would be another approach to this.

Whenever we chat about this in our team, this xkcd always comes up. We should put some thought into an approach where we add pointers to other catalogs as well as hosting a pangeo one. Perhaps some kind of aggregator.
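
For reference, the remote-catalog approach in intake looks roughly like this; the catalog URL and entry name below are hypothetical placeholders.

```python
# Sketch of the intake remote-catalog approach mentioned above. The catalog
# URL and the entry name are hypothetical placeholders.
import intake

cat = intake.open_catalog("https://example.org/pangeo/master.yaml")
print(list(cat))                  # names of the entries this catalog exposes

# entry = cat["some_dataset"]     # hypothetical entry name
# ds = entry.to_dask()            # lazily load as an xarray/dask object
```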

@rabernat (Member Author) commented Jan 7, 2019

@jacobtomlinson - glad to hear you are interested! Your engineering expertise would be very valuable here.

The bottom line for me is that our catalog must

  • follow some sort of language-independent standard. We can't be Python-only. I love intake, but it's only part of the solution: there is no spec for intake catalogs, and they're not designed to work outside of the intake library.
  • have some sort of web front end.

STAC addresses both of these issues. I am confident we will be able to translate STAC to intake (see intake/intake#224 (comment)), but the reverse is not true.

Beyond that, I am still in the whiteboarding stage. This is my latest monstrosity:
[schematic image]

I see a bot like opsdroid playing a big role in this. It would be great if the catalog could just be static JSON files in a git repository. But, given the complexity, I don't know what the right software architecture to manage this would be. I would love to get your thoughts.
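
To ground the claim about translating STAC to intake, here is a very rough sketch of what converting a single item might look like; the "zarr-store" asset convention and the driver arguments are assumptions on my part, and this is nothing like a complete converter.

```python
# Very rough sketch of translating one STAC item into an intake catalog entry.
# Assumes the item exposes its data under a "zarr-store" asset key (an
# illustrative convention, not part of the STAC spec) and that the
# intake-xarray "zarr" driver is available.
import yaml

def stac_item_to_intake_source(item):
    asset = item["assets"]["zarr-store"]
    return {
        item["id"]: {
            "description": item.get("properties", {}).get("description", ""),
            "driver": "zarr",
            "args": {"urlpath": asset["href"]},
        }
    }

item = {
    "id": "sea-surface-height",
    "properties": {"description": "Example gridded dataset"},
    "assets": {"zarr-store": {"href": "gs://pangeo-data/sea-surface-height.zarr"}},
}

print(yaml.safe_dump({"sources": stac_item_to_intake_source(item)}))
```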

@rabernat (Member Author) commented Jan 7, 2019

Also note that STAC catalogs can be nested (like intake ones). This would support our federated philosophy. We could have one "master" pangeo catalog that pointed to any number of sub-catalogs maintained independently by different groups.
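
A hypothetical sketch of that federation pattern, with a top-level catalog whose "child" links point at independently maintained sub-catalogs; all ids and URLs are invented for illustration.

```python
# Hypothetical sketch of the federated layout: one top-level Pangeo catalog
# whose "child" links point to sub-catalogs maintained by other groups.
# All ids and URLs are invented for illustration.
import json

master = {
    "stac_version": "0.6.0",  # version current at the time of writing
    "id": "pangeo",
    "description": "Top-level Pangeo catalog (illustrative)",
    "links": [
        {"rel": "self", "href": "https://example.org/pangeo/catalog.json"},
        {"rel": "child", "href": "https://example.org/ncar/catalog.json"},
        {"rel": "child", "href": "https://example.org/metoffice/catalog.json"},
    ],
}

print(json.dumps(master, indent=2))
```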

@jacobtomlinson (Member)
That sounds like a pretty neat workflow, and something we could definitely implement with opsdroid. STAC also looks good.

From a technical viewpoint that all looks perfectly doable. My preference is also for version-controlled static files.

There are other questions I would have from a governance point of view, such as commitments from the Pangeo group (how long should we agree to host data?) and the criteria for accepting data. For example, we are publishing MO data on AWS Earth and have agreed that AWS will review this arrangement every two years to see if both parties wish to continue. The bot could always open a new issue after a certain amount of time to discuss dataset retirement.
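
As a sketch of the "bot opens a retirement issue" idea, something like the following could run on a schedule; the repository name, token handling, and wording are placeholders, and a real deployment might live inside opsdroid rather than a bare script.

```python
# Sketch of the "open an issue when a dataset is due for review" idea, using
# the GitHub REST API directly. The repository, token handling, and message
# wording are placeholders; a real deployment might be an opsdroid skill.
import os
import requests

def open_retirement_issue(dataset_id, repo="pangeo-data/pangeo-datastore"):
    url = f"https://api.github.com/repos/{repo}/issues"
    payload = {
        "title": f"Review hosting of dataset {dataset_id}",
        "body": (
            f"The agreed hosting period for `{dataset_id}` is ending soon. "
            "Should it be renewed, migrated, or retired?"
        ),
    }
    headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}
    resp = requests.post(url, json=payload, headers=headers)
    resp.raise_for_status()
    return resp.json()["html_url"]
```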

The other thing would be signposting existing data, like AWS Earth. We could maintain a STAC catalog for our data, which Pangeo could point to. However, what about people who are publishing data but not using STAC?

@rabernat (Member Author) commented Jan 7, 2019

Your governance questions are good ones.

In the long term, I don't imagine that "Pangeo" as currently constituted will become a primary data provider. As an organization, we are too amorphous to make that sort of commitment.

Instead, my intended audience for this effort is organizations like NCAR, MO, Lamont, and NASA, or companies like ClimaCell and Jupiter Intel, who already host data and are convinced by our arguments that they should be hosting data in the cloud in zarr format. I think we have persuaded many people that this is a good idea in principle, but we don't have a good answer to the question, "now what?" By creating a datastore product, we will have something concrete to point such organizations to.

As for how to pay for it, I imagine that some catalogs will be part of public dataset programs on AWS and Google Cloud, while others will be paid for by the organizations themselves.

What I want to do here is create a product that will solve the problem of managing these datasets.

@jacobtomlinson (Member)
Ok that's interesting. I'm seeing two different strands in this conversation. One is about creating a way of automatically approving and ingesting new datasets into some cloud provider using a bot and GitHub. The other is about creating some standard tooling for managing datasets (including these new ones) for the Pangeo community. Perhaps it would be good to separate those concerns.

I can also imagine a couple of different use cases. One would be large institute-level initiatives to publish data. For example, NCAR may want to publish something; they are happy to pay for it, we want them to be able to access it from Pangeo, and we need a way to join up the dots, so your cataloguing implementation makes sense there. The other use case would be an individual researcher (potentially also at NCAR, for example) who has some data they want to publish but doesn't have access to the skills or funding to do so. In that case I see your ingestion mechanism also being useful.

@rabernat (Member Author) commented Jan 8, 2019

I would like to transfer this issue to the new https://github.com/pangeo-data/pangeo-datastore repository. Any objections?

rabernat transferred this issue from pangeo-data/pangeo on Jan 8, 2019
@martindurant
Proof-of-concept code in https://github.com/NCAR/intake-siphon/pull/2/files shows that, generally speaking, it is easy to coerce well-defined catalogue protocols to intake; in that case it was THREDDS links via siphon (siphon is basically a thin layer that decodes XML and could have been done by hand).
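
The same idea in outline: walk a THREDDS catalogue with siphon and emit intake-style source entries for its OPeNDAP endpoints. The catalogue URL is a placeholder and the entry layout is simplified.

```python
# Outline of the siphon-to-intake direction described above: walk a THREDDS
# catalogue and emit intake-style source entries for its OPeNDAP endpoints.
# The catalogue URL is a placeholder and the entry layout is simplified.
from siphon.catalog import TDSCatalog

def thredds_to_intake_sources(catalog_url):
    cat = TDSCatalog(catalog_url)
    sources = {}
    for name, ds in cat.datasets.items():
        if "OPENDAP" in ds.access_urls:
            sources[name] = {
                "driver": "opendap",  # provided by intake-xarray
                "args": {"urlpath": ds.access_urls["OPENDAP"]},
            }
    return sources

# Example (placeholder URL):
# thredds_to_intake_sources(
#     "https://thredds.example.org/thredds/catalog/some/path/catalog.xml")
```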

@scottyhq (Member)
Just posted a Binder demo with some initial experiments with STAC, intake, and Landsat data:
https://github.com/scottyhq/stac-intake-landsat

An important point is that a STAC catalog is very generic: it could describe a global gridded n-D zarr dataset from modeling, or it could describe millions of images from various sensors in different projections and resolutions.

I like the idea of pre-configured searches that give back a set of compatible STAC items that can be loaded directly into an xarray DataArray or Dataset. Maybe with sat-api:
https://github.com/sat-utils/sat-api

For Landsat it's pretty straightforward to get the temporal stack of colocated images for a specific Path and Row. That is enough for many applications, but if someone wants to study a regional process, it would be great to have a way to construct virtual mosaics or composites from many scenes. Do we want intake to do that behind the scenes? Another example: what if someone tries to load Landsat Band 1 (30 m) and Band 8 (15 m)? The image resolutions are optional metadata in STAC, so intake could either raise a 'not allowed' error or the plugin could do the necessary resampling.
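
A back-of-the-envelope sketch of the "pre-configured search" idea: query a sat-api style search endpoint over HTTP, then stack one band of the returned scenes along time with xarray. The endpoint URL, query field names, and asset keys are all assumptions, and resampling across resolutions is ignored.

```python
# Back-of-the-envelope sketch of a pre-configured search: query a sat-api
# style STAC search endpoint, then stack one band of the returned Landsat
# scenes along time with xarray. The endpoint URL, query field names, and
# asset keys are assumptions; mixed resolutions are not handled.
import requests
import xarray as xr

def landsat_band_stack(path, row, band="B4",
                       endpoint="https://sat-api.example.org/stac/search"):
    query = {
        "query": {"eo:column": {"eq": path}, "eo:row": {"eq": row}},
        "limit": 50,
    }
    items = requests.post(endpoint, json=query).json()["features"]

    slices, times = [], []
    for item in items:
        href = item["assets"][band]["href"]
        da = xr.open_rasterio(href).squeeze("band", drop=True)
        slices.append(da)
        times.append(item["properties"]["datetime"])

    stack = xr.concat(slices, dim="time")
    return stack.assign_coords(time=times)
```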

Would be great to get feedback from @apawloski on this!

charlesbluca added a commit that referenced this issue Apr 22, 2020
charlesbluca pushed a commit that referenced this issue Dec 1, 2020
@github-actions bot commented Dec 9, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

github-actions bot added the Stale label on Dec 9, 2020
@github-actions bot
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
