
Reduce required memory for inputs for MultiZarrToZarr and merge_vars #408

Closed
wachsylon opened this issue Jan 11, 2024 · 3 comments

Comments

@wachsylon

Hi,

I would like users to be able to open the generated kerchunked files without using much memory. I reached the point where I could no longer load all the single JSONs into memory for the MultiZarrToZarr function, so I had to create more than one output, let's say combined1.parq and combined2.parq. Users could then use a catalog entry with code like:

import xarray as xr

# Open both combined reference sets and merge them into one dataset
xr.open_mfdataset(
    [
        "reference::/combined1.parq",
        "reference::/combined2.parq",
    ],
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(
            lazy=True,
            remote_protocol="file",
        ),
        consolidated=False,
    ),
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
)

However, this requires gigabytes of memory. I don't know what exactly it is used for, but it is not workable. It also creates another level of potential confusion: why are there multiple kerchunks of multiple kerchunks?

So the other option would be to reduce the memory needed for merging with MultiZarrToZarr. However, a lot of kerchunk functions require in-memory JSON inputs; e.g., I also use the merge_vars function after MultiZarrToZarr. My current merge step looks roughly like the sketch below.
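For reference, this is the kind of code I mean (file names are placeholders); the problem is that every per-file reference set has to sit in memory at once:

import json
from kerchunk.combine import MultiZarrToZarr, merge_vars

# Every per-file reference set is loaded into memory at once --
# this is the step that stops scaling.
single_refs = [json.load(open(p)) for p in ["file1.json", "file2.json"]]

# Concatenate the single references along time
mzz = MultiZarrToZarr(
    single_refs,
    concat_dims=["time"],
    remote_protocol="file",
)
combined = mzz.translate()

# merge_vars also wants the full reference sets as inputs
merged = merge_vars([combined, json.load(open("more_vars.json"))])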

So what are your recommendations? Wouldn't it be nice to have an append kwarg for MultiZarrToZarr? Or is there already something for that?

Best,
Fabi

@wachsylon
Author

OK, I find that if you add one magic little keyword, chunks="auto", to the open_mfdataset command, the memory requirement shrinks to a minimum:

import xarray as xr

xr.open_mfdataset(
    [
        "reference::/combined1.parq",
        "reference::/combined2.parq",
    ],
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(
            lazy=True,
            remote_protocol="file",
        ),
        consolidated=False,
    ),
    parallel=True,
    data_vars="minimal",
    coords="minimal",
    compat="override",
    chunks="auto",  # let dask pick chunk sizes instead of loading eagerly
)

That already helps a lot, so... we may close this unless you have even better ideas? :)

@martindurant
Member

Wouldn't it be nice to have an append kwarg for MultiZarrToZarr?

Yes, this is now possible as of #404, which has not had much testing yet.
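An untested sketch of the intended usage (argument names are illustrative; see the PR for the exact signature):

from kerchunk.combine import MultiZarrToZarr

# Append new per-file references to an already-combined reference set
# rather than re-running the full combine over all inputs.
mzz = MultiZarrToZarr.append(
    ["new_file.json"],               # new references to add (placeholder name)
    original_refs="combined1.parq",  # the existing combined output
    concat_dims=["time"],
    remote_protocol="file",
)
updated = mzz.translate()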

@martindurant
Member

Final comment here: of course kerchunk can't magically make xarray do the right thing, and I don't know why it used so much memory in the first place. However, we should be able to combine multiple datasets and save new metadata so that calling open_mfdataset becomes unnecessary in the general case. We can leave that as an aspiration.
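For contrast, with a single combined reference set the open side collapses to one plain open_dataset call, along these lines (the path is a placeholder):

import xarray as xr

# One combined reference set: no open_mfdataset merging needed
ds = xr.open_dataset(
    "reference::/combined.parq",  # placeholder for the single combined output
    engine="zarr",
    backend_kwargs=dict(
        storage_options=dict(lazy=True, remote_protocol="file"),
        consolidated=False,
    ),
    chunks="auto",
)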

Closing this as nothing-to-be-done.
