
Reduce required memory for inputs for MultiZarrToZarr and merge_vars #408

Closed
wachsylon opened this issue Jan 11, 2024 · 3 comments

Comments

@wachsylon

Hi,

I would like users to be able to open the generated kerchunked files without using much memory. I reached the point where I could no longer load all the single JSONs into memory for the MultiZarrToZarr function, so I had to create more than one output, let's say combined1.parq and combined2.parq. Users could then use a catalog entry with code like:

import xarray as xr

# Open both combined reference sets and merge them into one dataset
xr.open_mfdataset(
    [
        "reference::/combined1.parq",
        "reference::/combined2.parq",
    ],
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(
            lazy=True,
            remote_protocol="file",
        ),
        consolidated=False,
    ),
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

However, this requires gigabytes of memory. I don't know what exactly it is used for, but it is not workable. It also creates another level of potential confusion: why are there multiple kerchunks of multiple kerchunks?

So the other option would be to reduce the memory needed for merging with MultiZarrToZarr. However, a lot of kerchunk functions require in-memory JSON inputs; e.g., I also use the merge_vars function after MultiZarrToZarr. My current merge step looks roughly like the sketch below.
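For reference, this is the kind of code I mean (file names are placeholders); the problem is that every per-file reference set has to sit in memory at once:

import json
from kerchunk.combine import MultiZarrToZarr, merge_vars

# Every per-file reference set is loaded into memory at once --
# this is the step that stops scaling.
single_refs = [json.load(open(p)) for p in ["file1.json", "file2.json"]]

# Concatenate the single references along time
mzz = MultiZarrToZarr(
    single_refs,
    concat_dims=["time"],
    remote_protocol="file",
)
combined = mzz.translate()

# merge_vars also wants the full reference sets as inputs
merged = merge_vars([combined, json.load(open("more_vars.json"))])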

So what are your recommendations? Wouldn't it be nice to have an append kwarg for MultiZarrToZarr? Or is there already something for that?

Best,
Fabi

@wachsylon
Author

OK, I find that if you add one magic little keyword, chunks="auto", to the open_mfdataset command, the memory requirement shrinks to a minimum:

import xarray as xr

xr.open_mfdataset(
    [
        "reference::/combined1.parq",
        "reference::/combined2.parq",
    ],
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(
            lazy=True,
            remote_protocol="file",
        ),
        consolidated=False,
    ),
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
    chunks="auto",  # let dask pick chunk sizes instead of loading eagerly
)

That already helps a lot, so... we may close this unless you have even better ideas? :)

@martindurant
Member

Wouldn't it be nice to have an append kwarg for MultiZarrToZarr?

Yes, this is now possible as of #404, which has not had much testing yet.
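An untested sketch of the intended usage (argument names are illustrative; see the PR for the exact signature):

from kerchunk.combine import MultiZarrToZarr

# Append new per-file references to an already-combined reference set
# rather than re-running the full combine over all inputs.
mzz = MultiZarrToZarr.append(
    ["new_file.json"],               # new references to add (placeholder name)
    original_refs="combined1.parq",  # the existing combined output
    concat_dims=["time"],
    remote_protocol="file",
)
updated = mzz.translate()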

@martindurant
Member

Final comment here: of course kerchunk can't magically make xarray do the right thing, and I don't know why it used so much memory in the first place. However, we should be able to combine multiple datasets and save new metadata so that calling open_mfdataset becomes unnecessary in the general case. We can leave that as an aspiration.
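For contrast, with a single combined reference set the open side collapses to one plain open_dataset call, along these lines (the path is a placeholder):

import xarray as xr

# One combined reference set: no open_mfdataset merging needed
ds = xr.open_dataset(
    "reference::/combined.parq",  # placeholder for the single combined output
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(lazy=True, remote_protocol="file"),
        consolidated=False,
    ),
    chunks="auto",
)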

Closing this as nothing-to-be-done.
