Support for variable length string arrays #16

Merged: 18 commits, Nov 14, 2023

@minnerbe minnerbe commented Mar 11, 2023

This is the zarr implementation of the corresponding pull request in the N5 base repository. To compile, the .jar files from the issue in the base repository are needed.

The main points are:

  • Zarr reads and writes string arrays through numcodecs, which encodes the number of items in the whole array and the number of bytes in each string. A subclass of VLenStringDataBlock was added to encapsulate that logic.
  • Since reading in N5 is done via DataBlocks, which must know the number of (decompressed) bytes in advance, a workaround was necessary to make this work with all compression schemes that are currently covered by the tests.
  • Compatibility with the Python zarr package was tested. For this reason, the filter interface was fleshed out a bit, but not used in the actual (de-)serialization because that would have required more changes.
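
For readers unfamiliar with the byte layout mentioned in the first point, here is a minimal sketch of it. It assumes the numcodecs `vlen-utf8` layout: a 4-byte little-endian item count, followed by a 4-byte little-endian byte length and the UTF-8 bytes for each string. The class `VLenUtf8Sketch` and its methods are hypothetical names for illustration only; this is not the code added in this pull request.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of n5-zarr: illustrates the assumed
// vlen-utf8 layout (item count, then length-prefixed UTF-8 strings).
final class VLenUtf8Sketch {

    static byte[] encode(final String[] items) {
        final byte[][] encoded = new byte[items.length][];
        int total = 4; // 4-byte header holding the number of items
        for (int i = 0; i < items.length; ++i) {
            encoded[i] = items[i].getBytes(StandardCharsets.UTF_8);
            total += 4 + encoded[i].length; // length prefix + payload
        }
        final ByteBuffer buf = ByteBuffer.allocate(total).order(ByteOrder.LITTLE_ENDIAN);
        buf.putInt(items.length);
        for (final byte[] e : encoded) {
            buf.putInt(e.length);
            buf.put(e);
        }
        return buf.array();
    }

    static String[] decode(final byte[] data) {
        final ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
        final int n = buf.getInt(); // number of items in the whole array
        final String[] items = new String[n];
        for (int i = 0; i < n; ++i) {
            final byte[] bytes = new byte[buf.getInt()]; // bytes in this string
            buf.get(bytes);
            items[i] = new String(bytes, StandardCharsets.UTF_8);
        }
        return items;
    }
}
```

Note that the total size is only known after every string has been converted to UTF-8, which is why the serialized byte count cannot be derived from the block dimensions alone.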

For full disclosure: the last commit changes the public API by fixing a typo. I can force-push away this one if requested as it's technically unrelated to the issue.

minnerbe added 6 commits March 7, 2023 21:12
TODO:
* Write works but doesn't write number of serialized bytes. Hence, read
  does not allocate correct number of bytes.
* Filter is non-functional at the moment. Adapt this such that it writes
  the correct json data (as seen in Anndata).
In this version reading and writing arrays to/from Python works (tested
manually)
@minnerbe

Thanks to @axtimwalde, reading and writing string arrays works and is compatible with Python. The current solution no longer reads every Object array as a String array. The two cases are now separated in the constructor of DType, which also depends on the Filters.
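
To illustrate the distinction, here is a rough sketch of how an object array might be told apart from a string array. It assumes zarr v2 metadata, where both carry the dtype `|O` and string arrays are marked by a filter with id `vlen-utf8`. The class `ObjectDTypeSketch`, the enum `Kind`, and the method `classify` are hypothetical and do not mirror the actual DType constructor.

```java
import java.util.List;

// Hypothetical illustration, not the DType constructor from this PR.
final class ObjectDTypeSketch {

    enum Kind { STRING, OBJECT, OTHER }

    // dtype: the zarr v2 "dtype" string; filterIds: the "id" of each filter (may be null)
    static Kind classify(final String dtype, final List<String> filterIds) {
        if (!"|O".equals(dtype))
            return Kind.OTHER; // not an object array at all
        // Assumption: only the vlen-utf8 filter marks an object array as strings.
        final boolean isString = filterIds != null && filterIds.contains("vlen-utf8");
        return isString ? Kind.STRING : Kind.OBJECT;
    }
}
```

Under these assumptions, an `|O` array without the `vlen-utf8` filter stays a generic object array instead of being read as strings.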

For me, this would be good to go now.

@bogovicj bogovicj merged commit 395bf3f into saalfeldlab:master Nov 14, 2023