`zarr-python` v3 compatibility #516

mpiannucci · 2024-10-10T17:33:36Z

So far the only tested file type is HDF5. That's the only module that currently works with the new zarr-python in some way.

martindurant

There's much less change here than I might have thought.

kerchunk/tests/test_hdf.py

kerchunk/hdf.py

martindurant · 2024-10-10T17:49:06Z

kerchunk/hdf.py

@@ -496,6 +516,8 @@ def _translator(
                        if h5obj.fletcher32:
                            logging.info("Discarding fletcher32 checksum")
                            v["size"] -= 4
+                        key =  str.removeprefix(h5obj.name, "/") + "/" + ".".join(map(str, k))


This is the same as what _chunk_key did? Maybe make it a function with a comment saying it's a copy/reimplementation.

By the way, is h5obj.name not actually a string, so you could have done h5obj.name.removeprefix()?
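A minimal sketch of the helper suggested above, assuming `h5obj.name` is a plain `str` and the chunk index arrives as a tuple of integers. The name `chunk_key` is hypothetical; it is not taken from kerchunk or zarr.

```python
def chunk_key(name: str, index: tuple[int, ...]) -> str:
    """Build a zarr v2-style chunk key such as "depth/0.1".

    Reimplements the key layout from the diff above: the HDF5 object
    name without its leading "/", then the chunk indices joined by ".".
    """
    # str.removeprefix is available from Python 3.9 onwards.
    return name.removeprefix("/") + "/" + ".".join(map(str, index))


print(chunk_key("/depth", (0, 1)))
```

Calling the method on the string directly also answers the second question: if `name` is a `str`, `name.removeprefix("/")` and `str.removeprefix(name, "/")` are equivalent.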

martindurant · 2024-10-10T17:50:41Z

kerchunk/hdf.py

                    shape=h5obj.shape,
                    dtype=dt or h5obj.dtype,
                    chunks=h5obj.chunks or False,
                    fill_value=fill,
-                    compression=None,
+                    compressor=None,


So here, you could reintroduce the compressor

compressor = filters[-1]
filters = filters[:-1]

but obviously it depends on whether there are indeed any filters at all.

It would still need back compat, since filters-only datasts definitely exist.

Yeah, the big issue is that v3 cares about what type of operation it is and v2 doesn't, so moving them around doesn't necessarily fix that bug.

So there needs to be a change upstream?

Yes this: zarr-developers/zarr-python#2325
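A sketch of the split discussed in this thread, under the assumption that the caller can tell whether the last codec is a compressor. The function `split_codecs` and the `is_compressor` predicate are hypothetical, and the codecs are represented as plain strings for illustration. It does not address the upstream issue about codec types.

```python
def split_codecs(codecs, is_compressor):
    """Return (filters, compressor) from an ordered codec list.

    The last codec is treated as the compressor only if the predicate
    accepts it, so filters-only datasets keep all of their filters.
    """
    if codecs and is_compressor(codecs[-1]):
        return list(codecs[:-1]), codecs[-1]
    # Empty list, or a filters-only dataset: no compressor.
    return list(codecs), None


def looks_like_compressor(codec):
    # Illustrative rule only; real code would inspect the codec object.
    return codec in {"zlib", "blosc"}


print(split_codecs(["shuffle", "zlib"], looks_like_compressor))
```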

martindurant · 2024-10-10T17:51:36Z

kerchunk/hdf.py

+            for k, v in self.store_dict.items():
+                if isinstance(v, zarr.core.buffer.cpu.Buffer):
+                    key = str.removeprefix(k, "/")
+                    new_keys[key] = v.to_bytes()
+                    keys_to_remove.append(k)
+            for k in keys_to_remove:
+                del self.store_dict[k]


This is the hacky bit and could use some explanations. Even when requesting "v2", zarr makes Buffer objects, and the keys are also wrong?

Yeah so two issues here:

The keys we get from HDF are, for example, /depth/.zarray when they need to be depth/.zarray.

We can't JSON-serialize buffers, which is how the internal MemoryStore in v3 stores its data, so we need to convert the buffers to bytes before they can be serialized.

OK - would appreciate comments on the code saying this.
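The two fixes described above can be sketched with comments as follows. This is an illustration only: `FakeBuffer` is a hypothetical stand-in for `zarr.core.buffer.cpu.Buffer` so the example runs without zarr installed, and unlike the diff it builds a new dict and strips the leading "/" from every key rather than mutating the store in place.

```python
class FakeBuffer:
    """Hypothetical stand-in for zarr's Buffer; only to_bytes() is modelled."""

    def __init__(self, data: bytes):
        self._data = data

    def to_bytes(self) -> bytes:
        return self._data


def normalize_store(store_dict, buffer_type=FakeBuffer):
    """Return a copy of the store that is safe to serialize as references."""
    out = {}
    for key, value in store_dict.items():
        # Issue 1: keys arrive as "/depth/.zarray" but must be "depth/.zarray".
        clean_key = key.removeprefix("/")
        # Issue 2: buffer objects cannot be written to JSON, so convert to bytes.
        if isinstance(value, buffer_type):
            value = value.to_bytes()
        out[clean_key] = value
    return out


refs = normalize_store({"/depth/.zarray": FakeBuffer(b"{}"), "depth/0": b"x"})
print(refs)
```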

kerchunk/hdf.py

kerchunk/tests/test_hdf.py

mpiannucci · 2024-10-15T18:37:23Z

Also worth noting: zarr 3 doesn't support numcodecs codecs out of the box. There is a PR to help with this, zarr-developers/numcodecs#524 (see also the updated version, zarr-developers/numcodecs#597), but it would mean a change to codec names, which causes an incompatibility. For the initial icechunk examples we handle this in virtualizarr, but long term it probably belongs here so that kerchunk works standalone.

mpiannucci · 2024-10-21T13:18:19Z

Once zarr-developers/zarr-python#2425 goes in, it should unblock this to work fully with zarr-python v3.

We will also need to create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs so that a given dataset can be loaded in either v2 or v3 contexts (say if you kerchunk a grib file, then want to convert those references to an icechunk store)

martindurant · 2024-10-21T13:20:47Z

> create both numcodecs and zarr v3 codec versions of all the custom kerchunk codecs

Is that a subclass in each case, specifying that it is to be bytes->bytes?

mpiannucci · 2024-10-21T13:22:52Z

Sorry, the numcodecs versions exist as they are today. Yes, the v3 version would basically be subclassing zarr.abc.codec using the numcodecs implementations of the kerchunk codecs.

Although the grib decoder should really be bytes to array, I think.
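The wrapping idea can be illustrated with a small adapter. This is a sketch of the pattern only: both classes are hypothetical stand-ins, and the real zarr v3 codec base classes have their own method names and signatures, which are not reproduced here.

```python
import zlib


class ZlibV2Codec:
    """Stand-in for a numcodecs-style codec exposing encode/decode."""

    def encode(self, buf: bytes) -> bytes:
        return zlib.compress(buf)

    def decode(self, buf: bytes) -> bytes:
        return zlib.decompress(buf)


class BytesToBytesAdapter:
    """Hypothetical v3-side wrapper that declares its stage explicitly."""

    # v3 cares which stage a codec belongs to; v2 did not record this.
    stage = "bytes->bytes"

    def __init__(self, inner):
        self.inner = inner

    def encode(self, chunk: bytes) -> bytes:
        return bytes(self.inner.encode(chunk))

    def decode(self, chunk: bytes) -> bytes:
        return bytes(self.inner.decode(chunk))


adapter = BytesToBytesAdapter(ZlibV2Codec())
roundtrip = adapter.decode(adapter.encode(b"kerchunk"))
print(roundtrip)
```

A decoder that turns raw bytes into an array, as suggested for grib, would need a different stage declaration than the one shown here.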

mpiannucci · 2024-10-23T12:52:00Z

This is required upstream: fsspec/filesystem_spec#1734

mpiannucci · 2024-10-23T13:30:22Z

Also of note: to get this to work with zarr 3, we pass an fsspec ReferenceFilesystem to a zarr RemoteStore. This works fine when the data files live on remote filesystems that have async implementations (s3, http, etc.), but it does not work when the data files are on a filesystem with only a sync implementation (local, etc.). The tests depend heavily on the local filesystem, and I'm sure many other users do as well, so we need to figure out how sync filesystems can work with the async requirement of the zarr RemoteStore.
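One generic way to bridge a sync-only backend to an async caller is to run the blocking calls in worker threads. The sketch below uses only the standard library; `SyncFiles` and `AsyncWrapper` are hypothetical names, and this is not fsspec's or zarr's implementation. The method name `cat_file` is borrowed from fsspec purely for familiarity.

```python
import asyncio


class SyncFiles:
    """Stand-in for a filesystem that only offers blocking reads."""

    def __init__(self, files):
        self._files = dict(files)

    def cat_file(self, path: str) -> bytes:
        return self._files[path]


class AsyncWrapper:
    """Expose the blocking read as a coroutine by delegating to a thread."""

    def __init__(self, sync_fs):
        self._sync_fs = sync_fs

    async def cat_file(self, path: str) -> bytes:
        # asyncio.to_thread (Python 3.9+) keeps the event loop responsive.
        return await asyncio.to_thread(self._sync_fs.cat_file, path)


async def read_all(fs, paths):
    # gather preserves the order of the requested paths.
    return await asyncio.gather(*(fs.cat_file(p) for p in paths))


local = SyncFiles({"a.nc": b"aaa", "b.nc": b"bbb"})
chunks = asyncio.run(read_all(AsyncWrapper(local), ["a.nc", "b.nc"]))
print(chunks)
```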

mpiannucci · 2024-10-23T14:18:24Z

Grib works as long as it is read in zarr 2 format.

To be used with zarr 3, the codec needs to be ported over to the new abc.Codec class from zarr.

mpiannucci · 2024-10-23T15:25:28Z

I think one change that should be made for compatibility with zarr 3 and virtualizarr is that the Grib2 codec moves from filters to compressors.

martindurant · 2024-10-23T15:34:24Z

> the Grib2 codec moves from filters to compressors

It returns an array, though - this was already pointed out. I think there is an assumption that compressors map to byte->byte codecs.

mpiannucci · 2024-10-23T15:36:08Z

I'll change it back for now and will discuss with the zarr folks.

implement zarr3 support for grib

mpiannucci · 2024-11-26T20:37:55Z

State of the Union post merge with @moradology and installing latest fsspec and zarr 3

pip install git+https://github.com/fsspec/filesystem_spec

| Test file | Status | Notes |
| --- | --- | --- |
| `test_grib.py` | ✅ | datatree not tested |
| `test_netcdf.py` | 3 failures | tests depend on memory, which doesn't work with reference fs |
| `test_hdf.py` | 5 failures | Remote refs work fine |
| `test_df.py` | ✅ | |
| `test_utils.py` | 2 failures | |
| `test_fits.py` | 5 failures | `TypeError: Item key has incorrect type (expected slice, got int)` |
| `test_zarr.py` | 2 failures | |
| `test_combine.py` | All failed | |
| `test_combine_concat.py` | 8 failures | |
| `test_combine_dask.py` | 3 failures | |
| `test_xarray_backend.py` | 1 failure | `TypeError: Item key has incorrect type (expected slice, got int)` |

A lot of these are the same errors over and over again. The hardest part will be maintaining compatibility with the zarr-python 2 library, and I am not sure whether that should be a goal of this PR, @martindurant.

mpiannucci added 10 commits October 4, 2024 16:37

Save progress for next week

39722e7

Bump zarr python version

d3c7e37

Get some tests working others failing

25d7d14

get through single hdf to zarr

ffe5f9d

Save progress

5aef233

Cleanup, almost working with hdf

b9323d2

Closer...

0f17119

Updating tests

5c8806b

reorganize

80fedcd

Save progress

1f69a0b

martindurant reviewed Oct 10, 2024


mpiannucci added 5 commits October 10, 2024 15:30

Refactor to clean things up

d556e52

Fix circular import

b27e64c

Iterate

41d6e8e

Change zarr dep

7ade1a6

More conversion

492ddee

mpiannucci mentioned this pull request Oct 12, 2024

Virtual Dataset Workflow Tracking Issue earth-mover/icechunk#197

Open

5 tasks

Specify zarr version

6e5741c

TomNicholas mentioned this pull request Oct 17, 2024

Make kerchunk dependency entirely optional zarr-developers/VirtualiZarr#258

Closed

mpiannucci added 3 commits October 23, 2024 09:31

Working remote hdf tests

c0316ac

Working grib impl

59bd36c

Add back commented out code

187ced2

TomNicholas mentioned this pull request Oct 23, 2024

Update dependencies for xarray, zarr-python, icechunk, kerchunk zarr-developers/VirtualiZarr#268

Open

4 tasks

Make grib codec a compressor since its bytes to array

690ed21

Switch back

5019b15

mpiannucci mentioned this pull request Oct 23, 2024

Zarr Python 3 tracking issue #514

Open

4 tasks

mpiannucci and others added 6 commits October 26, 2024 16:42

Add first pass at grib zarr 3 codec

d96cf46

Fix typing

cbcb720

Fix some broken tests; use async filesystem wrapper

b88655f

Implement zarr3 compatibility for grib

73eaf33

Use zarr3 stores directly; avoid use of internal fs

3757199

Merge pull request #4 from moradology/fix/zarr3-grib-tests

9444ff8

implement zarr3 support for grib

mpiannucci added 12 commits November 26, 2024 16:25

Forward

d8848ce

More

1fa294e

Figure out async wrapper

543178d

Closer on hdf5

96b56cd

netcdf but failing

0808b05

grib passing

aef006e

Fix inline test

d9bf0dd

More

884fc68

standardize compressor name

1145f45

Fix one more hdf test

94ec479

Small tweaks

a9693d1

Hide fsspec import where necessary

7e9112a

maxrjones mentioned this pull request Nov 29, 2024

Dependency Issue for Kerchunk -> Icechunk via Virtualizarr zarr-developers/VirtualiZarr#321

Open
