Using kerchunk to reference large sets of netcdf4 files #240
The total number of chunks in the target matters, as (for now) we are storing the references as inefficient JSON and loading them into python objects. If your JSON is 1GB, I assume you must truly have a huge number of references. Yes, this will be loaded by every worker; I am not sure (to be tested) whether the file location or the in-memory reference set is being passed. It is plausible that we could provide a way to pass only those references to a worker that we know, from the graph, it will be needing. In the near future, we mean to revisit the on-disc and in-memory data structure for the reference set, and implement laziness, so that references are loaded batch by batch as required.

If you omit the distributed client, you will reuse the original memory footprint of the references and, since the compute is GIL-releasing, still get parallelism. That's my recommendation for this particular compute. Another thing worth considering is the sheer size of the compute graph itself. It is possible, and efficient, to use
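A minimal sketch of that suggestion, assuming a combined reference file called combined.json on a local filesystem and an illustrative variable name (neither comes from the thread):

```python
# Hedged sketch of computing without a distributed client: the reference dict
# is held in memory once, and dask's threaded scheduler still gives parallelism
# because the chunk decoding releases the GIL.
# "combined.json" and "some_variable" are illustrative names.
import xarray as xr

ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},  # keep the dataset lazy / dask-backed
    backend_kwargs={
        "consolidated": False,
        "storage_options": {
            "fo": "combined.json",      # kerchunk reference file
            "remote_protocol": "file",  # references point at local paths
        },
    },
)

# No dask.distributed Client is created, so this runs on the threaded scheduler.
result = ds["some_variable"].mean().compute(scheduler="threads")
```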
Thanks very much for this helpful response, @martindurant! I've explored omitting the distributed client and rechunking, but my kerchunk reference dataset is always slower to compute on than if I just open the equivalent dataset directly with xarray.

Lazy loading of the reference set sounds awesome!
It would be interesting to get a profile for cell 4 - is all the time spent in JSON decoding (see #241)? Similarly, the distributed client will tell you what the workers are busy doing (you can get a profile report or just look on the dashboard). I am not familiar, what does
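A minimal way to collect such a profile for the open step, assuming the combined reference file is called combined.json (a hypothetical name):

```python
# Hedged sketch: profile construction of the reference filesystem to see how
# much time goes into JSON decoding. "combined.json" is an illustrative path.
import cProfile
import pstats

import fsspec

prof = cProfile.Profile()
prof.enable()
fs = fsspec.filesystem(
    "reference",
    fo="combined.json",
    remote_protocol="file",
    skip_instance_cache=True,  # force a fresh instance so the load is actually timed
)
prof.disable()

pstats.Stats(prof).sort_stats("cumulative").print_stats(20)
```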
I presume you mean this cell 4 (apologies, there were a few cells labelled "[4]" in the notebook)? Note the output below is truncated.

Then, the dask dashboard shows large amounts of unmanaged memory during the example compute operation. What's not immediately clear to me is why this example compute takes so much longer than if I just open/concat the dataset directly with xarray.
So we see that
and then load the new file instead. The second load should be much faster. This is with the very latest fsspec, by the way. We may decide to make this automatic in the future, and it won't matter for zarr v3.
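The snippet referred to above is not reproduced in this thread; the following is a hedged reconstruction of what such a consolidation step might look like for a version-1 kerchunk reference file (file names are illustrative):

```python
# Hedged reconstruction, not the exact snippet from the thread: build a zarr v2
# consolidated-metadata key directly inside the reference set and write it out
# as a new file, so a later open reads one ".zmetadata" key instead of every
# .zgroup/.zarray/.zattrs entry.
import json

import xarray as xr

with open("combined.json") as f:  # illustrative path
    refs = json.load(f)

# Gather every inlined metadata document ...
meta = {
    key: json.loads(value)
    for key, value in refs["refs"].items()
    if key.rsplit("/", 1)[-1] in (".zgroup", ".zarray", ".zattrs")
}
# ... and store it under a single key in zarr's consolidated-metadata format.
refs["refs"][".zmetadata"] = json.dumps(
    {"metadata": meta, "zarr_consolidated_format": 1}
)

with open("combined_consolidated.json", "w") as f:
    json.dump(refs, f)

# The new file can then be opened with consolidated metadata.
ds = xr.open_dataset(
    "reference://",
    engine="zarr",
    chunks={},
    backend_kwargs={
        "consolidated": True,
        "storage_options": {
            "fo": "combined_consolidated.json",
            "remote_protocol": "file",
        },
    },
)
```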
The other questions:
Thanks @martindurant. Yes, consolidating the metadata in this way pretty much halves the time taken to open the dataset. I'm excited to play with lazy reference formats if/when they become available. I'll close this issue in the meantime. Thanks for your help!
cc #237 (@jhamman); I would happily add consolidation to all kerchunk output, and could even remove the duplicate metadata keys. In v3, do you know if/when a directory listing would be needed? We need to ensure that reading a reference set can happen without an ls() operation, or provide an alternative way to get the listing other than going through all the keys.
Zarr v3's spec does not (yet) include consolidated metadata. This extension needs to be written by someone (zarr-developers/zarr-specs#136). In the v2 spec, consolidated metadata only consolidates the metadata keys. However, in v3, you might imagine a use case where a Zarr storage transformer is used to consolidate the listing of the chunk keys. This sounds a lot like the Kerchunk reference spec and could also be thought of as a general chunk manifest (zarr-developers/zarr-specs#82).

To answer your question specifically: in the xarray context,
OK, so either way we will need an alternative mechanism to list at least some things (top-level dirs for v2, meta files for v3), since we absolutely must avoid having to do string manipulation across all references.
Firstly, thanks for this great tool!
I’m trying to generate a kerchunk reference dataset for many chunked netcdf4 files that comprise a single climate model experiment. I’m doing this on a local file system, not in a cloud environment, similar to #123.
The climate model experiment in question comprises 2TB across 61 netcdf files (unfortunately I can't share these data). I generate a single reference json using the approach provided in the tutorial (code below). This all works well, and I can open my combined dataset using xarray.open_dataset and see that it has the correct structure and chunking.

However, when I try to perform a small compute on a variable in this dataset using a dask distributed cluster (with 4GB per worker) I immediately run out of memory. My reference json is ~1GB. Is this being loaded by each worker? I am confused because there are examples in the docs of this approach being applied to 80TB datasets. However, based on my simple example, I would've thought that the reference json(s) for an 80TB dataset would be prohibitively large. Am I doing something wrong/misunderstanding? Any advice would be much appreciated.