Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add hypergrib as as a grib reader #238

Open
TomNicholas opened this issue Sep 13, 2024 · 16 comments
Open

Add hypergrib as as a grib reader #238

TomNicholas opened this issue Sep 13, 2024 · 16 comments
Labels
enhancement New feature or request references generation Reading byte ranges from archival files upstream issue

Comments

@TomNicholas
Copy link
Member

I just learned about hypergrib (via this blog post).

I don't know much about how GRIB files are structured, but if @JackKelly makes a very performant GRIB reader that can:

  1. be called from python,
  2. return ChunkManifest objects and associated metadata (ideally using ChunkManifest.from_arrays() to avoid any intermediate representation),

then we could add it as a new reader to virtualizarr.readers.

In fact FYI my intention now is to probably make an entrypoint system for virtualizarr, in addition to plugging in to xarray's backend entrypoint system. That would allow this syntax for creating virtual xr.Dataset objects containing ManifestArrays:

vds = xr.open_dataset(grib_paths, engine='virtualizarr', reader='hypergrib')
@JackKelly
Copy link

Sounds awesome! My intention is definitely for hypergrib to play nicely with the existing ecosystem, including xarray and virtualizarr.

But hypergrib won't do anything vaguely useful for at least a few months. When I'm a bit further down the line, I'll get back in touch, so we can discuss the optimal way to plug hypergrib into virtualizarr 🙂

@JackKelly
Copy link

JackKelly commented Oct 1, 2024

Just to flag that, after @emfdavid pointed out that some GRIB datasets contain billions of GRIB messages (chunks), I've changed the design for hypergrib so that it no longer stores a manifest of chunks. After doing some more research, it seems likely that we'll soon see NWP GRIB datasets with trillions of chunks.

A kerchunk JSON manifest of a trillion chunks would be about 50 TB!

Instead, the plan is that hypergrib will generate GRIB filenames by exploiting the fact that most public GRIB datasets have well-defined file naming conventions. So hypergrib will load the .idx files on-the-fly. If that turns out to be too slow then we'll explore ways to store the .idx metadata in a more compact and cloud-friendly form.

We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels. And maybe some additional metadata for when NWPs are upgraded and, for example, more ensemble members are generated per run.

I'm guessing the lack of a manifest makes hypergrib less interesting to VirtualiZarr?? (Sorry!)

The discussion that triggered this design change is here: JackKelly/hypergrib#14 (comment)

The new design is sketched out here: https://github.com/JackKelly/hypergrib/blob/main/design.md

@TomNicholas
Copy link
Member Author

datasets with trillions of chunks

Yikes! It's not impossible to imagine handling this with the virtualizarr stack - it just becomes territory where you need out-of-core-computation just to concatenate / save the manifests (i.e. dask/cubed). It's the same scale of problem as processing TB's of numerical array data now.

Instead, the plan is that hypergrib will generate GRIB filenames by exploiting the fact that most public GRIB datasets have well-defined file naming conventions.

We will still store some metadata for each NWP dataset. But it'll just be the array shape, dimension names, and coordinate labels.

I wonder if this could be handled by having a ChunkManifest backed by numpy-like arrays which are actually lazy generators. Similar in spirit to the "functional/analytic coordinates" being discussed in xarray. I think this is the same idea as your linked discussion about "algorithmically" calculating the paths/offsets/lengths, I'm just suggesting wrapping that algorithm inside a numpy-like array / ChunkManifest object, such that you can still take advantage of virtualizarr & xarray for their IO & concatenation features.

@mdsumner
Copy link
Contributor

mdsumner commented Oct 1, 2024

If grib interleaving was predictable that would be interesting. I thought it was pretty random given what we'd seen in AMPS.

Why trillions of chunks? Is each time chunk tiny? It seems like it must be a poor format choice (for a specific case) if the index must be so big. Certainly there's a lot of array data out there that should just be a table.

@emfdavid
Copy link

emfdavid commented Oct 3, 2024

I'm just suggesting wrapping that algorithm inside a numpy-like array / ChunkManifest object, such that you can still take advantage of virtualizarr & xarray for their IO & concatenation features.

I realize the xarray dataset is required to have a finite number of chunks and finite dimensions (or maybe we can work around this?) but with the ability to compute all the chunk that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate?

I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data.

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 3, 2024

finite number of chunks and finite dimensions

Finite yes, but if you can cheaply compute the chunk references then how is finite a limitation?

with the ability to compute all the chunks that will ever exist for a dataset like HRRR or ERA5, what is left to concatenate?

You personally might not need to concatenate further for these particular datasets, but I bet you someone is immediately going to be like "I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..." 😄

I want to be able to call xr.open and have every variable for all time and all forecast horizons available. I think we can do that without loading any data.

I know, I'm not suggesting we load data. I'm saying that if you're going to build optimized readers and representations of grib datasets, then as you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework. The alternative would be creating a 3rd API for creating, joining, and writing virtual references.

@JackKelly
Copy link

Why trillions of chunks?

Well, I'll admit that I'm not aware of any NWP datasets today that have trillions of GRIB messages 🙂. But there datasets today that have tens of billions of GRIB messages (see the calculations here), and I could well imagine seeing datasets with trillions of chunks in the next few years.

In general, each "chunk" in GRIB is a 2-dimensional image, where the two dimensions are the horizontal spatial dimensions. For example, a single GRIB message might contain a 2D image which represents a prediction for temperature at the Earth's surface at a particular time.

@JackKelly
Copy link

I wonder if this could be handled by having a ChunkManifest backed by numpy-like arrays which are actually lazy generators.

I like the sound of this!

if you're going to build optimized readers and representations of grib datasets, then as you're going to want to save the references out to the same formats as virtualizarr does, and you're likely going to want to merge/concatenate/drop virtual variables at some point, you may as well see if you can fit the algorithmic inflation part into the virtualizarr framework.

Yup, sounds good to me! Not least as a nice way to generate VirtualiZarr / kerchunk manifests from hypergrib!

For @emfdavid's use-case (which is the same as my use-case!), the plan is that hypergrib will plug directly into xarray as a new backend (and will never "materialise" a manifest... instead hypergrib will always generate GRIB filenames on-the-fly). But I do see @TomNicholas' point that some folks will want to merge/concat/drop virtual variables and output kerchunk/VirtualiZarr manifests; and I'm very keen for hypergrib to plug into the existing Pangeo stack as easily as possible 🙂.

@emfdavid
Copy link

emfdavid commented Oct 3, 2024

"I have this weird dataset with multiple sets of GRIB files that I want to form part of the same Zarr store..."

heh - I have been that guy more often than not!

how is finite a limitation?

When you open the dataset you need to pick an end date, that defines the number of chunks and the dimensions of the dataset. Could we just pick something like a year in the future from when you call open? I hacked xreds xpublish to server the database backed grib manifest we build in big query. Reloading it every few minutes to get updates is a major drag. Reloading this new lazy dataset would much less onerous... but we still need to pick an end date for something we know goes into the future to varying degrees.

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 3, 2024

hypergrib will plug directly into xarray as a new backend (and will never "materialise" a manifest... instead hypergrib will always generate GRIB filenames on-the-fly)

Sounds good, but it's worth pointing out here how similar the problems of creating a new xarray backend to read data and writing a virtualizarr "reader" to generate references are. In both cases you look at files, create an array-like lazy metadata-only representation of the actual data and its location (either xarray's BackendArray or VZ's ManifestArray), and have a mechanism for materializing that data when you actually want it (for virtualizarr this mechanism usually involves writing those virtual references to some persistent kerchunk/zarr format first).

In fact, because its also useful for virtualizarr readers to also be able to actually load variables into memory (see the loadable_variables kwarg), the virtualizarr reader for a given filetype ends up kind of being a superset of the corresponding xarray backend for that filetype. As mentioned above, I plan to make this correspondence more explicit in the future by actually exposing an entrypoint system in virtualizarr - see #245.

merge/concat/drop virtual variables

Thinking about this more it's perhaps not trivial... The reason concatenation is possible in virtualizarr is because np.concat(list[ManifestArray, ...]) -> ManifestArray, which is ultimately possible because you can concatenate two ChunkManifest objects into a single new ChunkManifest object. But it's not clear to me that your analytic inflatable representation will be capable of that. That doesn't mean that this approach isn't useful, just that it's plausible that you end up with a representation where virtual variables can be opened, merged, dropped and serialized, but not concatenated. Note that xarray does not have a general solution to this problem either (pydata/xarray#4628). See also #246 for a generalized discussion of Generic Manifests.

but we still need to pick an end date for something we know goes into the future to varying degrees.

So IIUC you have datasets that are appended to frequently? And by reloading you mean that you don't want to reindex the entire thing just to append the new chunks? Cheaply appending the virtual references for the new chunks to an existing store is the same problem as discussed in #21 (and something the Earthmover folks should have better solution for soon...) I don't see why this requires picking an arbitrary end date in the distant future that hopefully includes all the data you plan to write - instead the data schema should be updated only when you actually do write (/create virtual references to) new data.

@emfdavid
Copy link

emfdavid commented Oct 4, 2024

If you are creating your own dataset with thousands of chunks having an explicit manifest is great and working to make it easy to append/concat is a critical feature for the xarray/zarr toolset. Manifest building and appending datasets are problems that I worked really hard to solve for Camus over the past year. Last week @JackKelly turned the problem on it's head, pointing out the the manifest could be computed algorithmically and I look at the problem completely differently now.

Our goal is working with enormous data sets we don't control, with millions of grib files, billions of chunks put in the cloud by NODD and ECMWF. They add 100,000 chunks a day to some of these datasets and for operational use, I want the latest data as it becomes available. Directly building native zarr datasets seems out of reach for these organizations at the moment, so we are looking for solutions to work with what we have got. I was planning to advocate for building a public database backed manifest that could be used with virtualizarr or with kerchunk as I have been doing privately at Camus for over 6 months now. Based on operational experience, this architecture is difficult and costly to manage operationally. I am sure my work leaves much to could be improved, but I am talking about datasets with variables have more than 100,000 chunks with more to append each hour as they land in the cloud bucket. I am using the dataset across dozens of distributed workers running big ML jobs.

Lazily computing the manifest itself is just a very different way of thinking about these big NWP archival datasets that could be much easier. I fully support and appreciate virtualizarr and we should definitely build bridges between these solutions (materializing a manifest where needed) but I suspect working with an xarray or zarr dataset free of the materialized manifest could be a very powerful solution in it's own right. There a lots of pitfalls that will make this paradigm shift difficult working with the existing APIs so having insight from the folks that build the existing stack would be fantastic. Can we do this - how?

@rabernat
Copy link
Collaborator

rabernat commented Oct 4, 2024

One alternative to chunk-level dataset aggregation is file-level dataset aggregation. This is actually the traditional approach that has been in use for decades (see e.g. ncml). Chunk-level aggregation emerged in the context of cloud computing, where the cost of peeking at each file is high due to the latency of HTTP requests. But if you somehow know a-priori exactly where the chunks are in each file, you can avoid that.

@TomNicholas
Copy link
Member Author

TomNicholas commented Oct 4, 2024

file-level dataset aggregation.

I'm not really sure what you mean in this context @rabernat . The client still needs to know where inside the file to read from, so we're back to chunks aren't we?

If anyone wants to chat about this we do have a bi-weekly virtualizarr meeting starting in a couple of minutes!

EDIT: In the unlikely event that you are trying to join today please say so - having some zoom issues.

@emfdavid
Copy link

emfdavid commented Oct 4, 2024

Can't make it this week - but I will try to join next time.

@mpiannucci
Copy link
Contributor

mpiannucci commented Oct 4, 2024

I was trying to join and couldnt get in and just saw this. So if you happen to see this i am willking to join!

@rabernat
Copy link
Collaborator

rabernat commented Oct 4, 2024

The client still needs to know where inside the file to read from, so we're back to chunks aren't we?

Yes, but there are many ways you can provide that information:

  • Via explicit chunk references (how Kerchunk / VirtualiZarr work)
  • By opening the file and looking at the metadata at read time rather than as a preprocessing step. This is what happens if you do open_mfdataset(list_list_of_hdf_files, engine='h5netcdf'). It's slow, but could be optimized further (see e.g. hydefix).
  • By somehow knowing a-priori where the chunks are without having to open the file first (as proposed by hypergrib).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request references generation Reading byte ranges from archival files upstream issue
Projects
None yet
Development

No branches or pull requests

6 participants