Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds checksum flag to zstd codec #519

Merged
merged 18 commits into from
Jun 24, 2024
Merged

Adds checksum flag to zstd codec #519

merged 18 commits into from
Jun 24, 2024

Conversation

normanrz
Copy link
Contributor

@normanrz normanrz commented Apr 22, 2024

This PR adds the checksum flag to the zstd codec. This is necessary to support the proposed zstd codec for Zarr3. We need it for the v3 refactoring of zarr-python.

TODO:

  • Unit tests and/or doctests in docstrings
  • Tests pass locally
  • Docstrings and API docs for any new/modified user-facing classes and functions
  • Changes documented in docs/release.rst
  • Docs build locally
  • GitHub Actions CI passes
  • Test coverage to 100% (Codecov passes)

@normanrz normanrz changed the title Adds checksum flag to zsdt codec Adds checksum flag to zstd codec Apr 22, 2024
@normanrz
Copy link
Contributor Author

normanrz commented May 8, 2024

I would love a review on this so that we can ship this in the upcoming zarr-python 3 release.

@d-v-b
Copy link
Contributor

d-v-b commented May 8, 2024

@mkitti would you mind giving this a look?

@martindurant
Copy link
Member

A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?

@normanrz
Copy link
Contributor Author

normanrz commented May 8, 2024

A good time to switch to cramjam? Does this zstd provide anything that that one doesn't?

cramjam also does not expose the checksum option.

@mkitti
Copy link
Contributor

mkitti commented May 8, 2024

I'm looking. We should normalize the implementation here such that negative compression levels are passed on to the C library and that the default compression level is the default compression level of the underlying C library.

@mkitti
Copy link
Contributor

mkitti commented May 8, 2024

The default CLEVEL here should be changed to 0. That should be passed on the C library directly without further logic.

DEFAULT_CLEVEL = 0

@normanrz
Copy link
Contributor Author

normanrz commented May 8, 2024

The default CLEVEL here should be changed to 0. That should be passed on the C library directly without further logic.

DEFAULT_CLEVEL = 0

I made that change. This is a breaking change, though.

@mkitti
Copy link
Contributor

mkitti commented Jun 19, 2024

I made that change. This is a breaking change, though.

The change is that we used to default to compression level 1 explicitly, overriding the configuration of the underlying Zstandard C library.

Now we pass 0 as the compression level, which uses the default compression level of the C library. The default compression level is usually 3.

For reference see
https://facebook.github.io/zstd/zstd_manual.html#Chapter5

Do we provide access to ZSTD_defaultCLevel(void)?

@normanrz
Copy link
Contributor Author

Do we provide access to ZSTD_defaultCLevel(void)?

We do not. For what would that be useful?

@normanrz
Copy link
Contributor Author

Are there any objections for changing the default compression level from 1 to 3 (i.e. default of the underlying zstd c library) @zarr-developers/python-core-devs?

Copy link

codecov bot commented Jun 19, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.91%. Comparing base (bef2e16) to head (696e582).
Report is 28 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #519   +/-   ##
=======================================
  Coverage   99.91%   99.91%           
=======================================
  Files          59       59           
  Lines        2312     2319    +7     
=======================================
+ Hits         2310     2317    +7     
  Misses          2        2           
Files with missing lines Coverage Δ
numcodecs/tests/test_zstd.py 100.00% <100.00%> (ø)

@mkitti
Copy link
Contributor

mkitti commented Jun 20, 2024

We do not. For what would that be useful?

That would be useful to checking the configuration of the Zstandard C Library. In other words, it would indicate what a compression level of 0 actually corresponds to. By default it is 3, but someone could have configured it differently at compile time:

https://github.com/facebook/zstd/blob/17b531501670781f37fc3e5070a29eede09bca3b/lib/zstd.h#L128-L130

@normanrz
Copy link
Contributor Author

But that would mean that people have a modified version of numcodecs with a modified zstd library within blosc. Given that a main purpose of numcodecs is to deliver wheels, that seems pretty unlikely to me.

@mkitti
Copy link
Contributor

mkitti commented Jun 24, 2024

Note that conda-forge builds Blosc with an independently configured zstandard library:
https://github.com/conda-forge/blosc-feedstock/blob/main/recipe%2Fmeta.yaml#L27

@normanrz
Copy link
Contributor Author

@mkitti I added the Zstd.{default,min,max}_level properties.

@mkitti
Copy link
Contributor

mkitti commented Jun 24, 2024

Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.

The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content in beginning.

@martindurant
Copy link
Member

we do know what the uncompressed size of a chunk should be from array information

Only if this is the only stage in the decompression pipeline. Multiple byte compression/encoding stages are allowed, or indeed, not shape-preserving array operations too.

Side note: I don't believe that zstd, zstandard or cramjam.zstd need the total size in order to decompress, but can use it as an optimization. Also, at least the latter can do decompress_into where the target could be the final array buffer (if it is a contiguous block).

@normanrz
Copy link
Contributor Author

Currently, we depend on Zstd frame content size being encoded. However, we do know what the uncompressed size of a chunk should be from array information. Perhaps we should not generate an error if the frame content size is not "known" to Zstandard because it is, in fact, known to us.

The frame content size may not be encoded if the encoder is using streaming mode and did not pledge the full size of the content in beginning.

I think that change would be out of scope for this PR. We should continue this discussion in a new issue.

@normanrz normanrz merged commit 5b12b15 into main Jun 24, 2024
45 checks passed
@normanrz normanrz deleted the zstd-checksum branch June 24, 2024 15:48
DimitriPapadopoulos pushed a commit to DimitriPapadopoulos/numcodecs that referenced this pull request Aug 10, 2024
* expose checksum toggle for zstd

* fixes zstd checksumming

* less fixtures

* write_checksum -> checksum

* adds release notes

* set default clevel to 0

* release

* update fixtures

* fix checksum flag

* add test for checksum

* adds wrapper codecs for the v2 codec pipeline

* docstring
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants