Draft ZEP 0007: Strings #47
Conversation
I like the overall intent of this proposal. I think we need a new "array -> bytes" codec that will support the new data type. Then the codec metadata could look something like:

{
  "name": "vlen",
  "configuration": {
    "data_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5, "shuffle": "bitshuffle", "typesize": 1, "blocksize": 0}}],
    "index_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5, "shuffle": "shuffle", "typesize": 4, "blocksize": 0}}],
    "index_data_type": "uint32"
  }
}

Having separate data and index codecs allows different compression options to be used --- e.g. in the example above we use bit-wise shuffling for the data but byte-wise shuffling for the index. One caveat is that if, as in the example above, the size of the encoded index is variable, then we would need to separately store the size of the index. Some compression formats may be self-delimiting and therefore not require that the size is stored, but we may not want to deal with that complexity.
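For illustration, a minimal sketch (not a concrete proposal) of one way to deal with a variable-size encoded index: prefix the chunk with the byte length of the encoded index.

```python
import struct

def encode_chunk(encoded_index: bytes, encoded_data: bytes) -> bytes:
    # Layout: [uint64 little-endian size of encoded index][encoded index][encoded data]
    return struct.pack("<Q", len(encoded_index)) + encoded_index + encoded_data

def decode_chunk(chunk: bytes) -> tuple[bytes, bytes]:
    (index_size,) = struct.unpack_from("<Q", chunk, 0)
    encoded_index = chunk[8:8 + index_size]
    encoded_data = chunk[8 + index_size:]
    return encoded_index, encoded_data
```

A self-delimiting index codec would make the prefix unnecessary, at the cost of the complexity mentioned above.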
@jbms, you raise a very good point. I was able to talk to Joris about this at the numfocus summit last week and got a lot of insight into how arrow does this. An arrow string array is described by a small set of fields, each of which corresponds to an underlying buffer.
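Concretely, here is a small pyarrow sketch (assuming pyarrow is installed) of the buffers backing a string array: a validity bitmap, an int32 offsets buffer, and a UTF-8 data buffer.

```python
import numpy as np
import pyarrow as pa

arr = pa.array(["hello", "world", None], type=pa.string())
validity, offsets, data = arr.buffers()  # validity bitmap, int32 offsets, utf-8 bytes

print(np.frombuffer(offsets, dtype=np.int32))  # [ 0  5 10 10]
print(data.to_pybytes())                       # b'helloworld'
```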
While how compression works with the IPC format isn't super well documented (apache/arrow#37756), we can find a description of it in the flatbuffer definitions. AFAICT, each buffer is compressed separately, but I believe you cannot specify different compressors for different buffers. There is also room in the specification for compressing the entire message instead of the buffers individually.

So, where does that leave us? Allowing separate compression of the underlying buffers may be useful, and I think becomes much more useful if more variable-length types are allowed. I would also like to keep the goal of very low-cost interoperability with Arrow. That said, I don't know that I love the idea of the codec proposed above. There may be a more parsimonious solution here that shares more with sharding + variable chunk sizes, instead of defining a new codec.
Where is the bytes codec defined? I believe this would be a basic "array -> bytes" codec.
I think arrow compatibility as far as the format of the offsets and data buffers is relatively easy to achieve. I am not so keen on trying to use the RecordBatch flatbuffers message format itself, since that adds a lot of baggage and complexity for what we could also accomplish with just a single 64-bit number.

It is true that there is some similarity with the sharding_indexed codec: in fact the sharding codec is just storing an array of variable-length byte strings. It differs from your proposed arrow-compatible format in that it stores both an offset and a length for each entry, to allow arbitrary ordering of sub-chunks. But it is not clear to me how useful it is to try to unify the sharding format with the vlen string format, since the use cases and expected access patterns are very different.

Can you explain the connection to variable-size chunks (i.e. a rectilinear grid)? Are you thinking more about sparse arrays? I agree that if we had a different case where we are storing multiple arrays in one chunk, such as storing a chunk using a sparse array encoding, we would probably also want to allow separate codecs for each array, and these could be specified as part of the json configuration for this "sparse" codec. As far as the binary format, I suppose it could make sense to try to unify the sparse array format and the vlen string format in some way, but I'm not sure there is really that much benefit and it would bring in a lot of added complexity.
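As a toy sketch of that difference (example numbers made up): an arrow-style index stores n + 1 offsets and derives lengths from consecutive entries, while a sharding_indexed-style index stores an explicit (offset, nbytes) pair per entry, so sub-chunks can be stored in any order.

```python
import numpy as np

# Arrow-style index: n + 1 offsets; entry i occupies [offsets[i], offsets[i+1]),
# so entries are contiguous and stored in order.
offsets = np.array([0, 5, 10, 12], dtype=np.uint64)
lengths = np.diff(offsets)  # [5, 5, 2]

# sharding_indexed-style index: an (offset, nbytes) pair per entry,
# so entries may be written in any order and need not be contiguous.
index = np.array([[10, 2], [0, 5], [5, 5]], dtype=np.uint64)
```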
Hi @ivirshup. I've fixed the RTD build issue in #51. The PR preview is https://zeps--47.org.readthedocs.build/en/47/draft/ZEP0007.html.
@ivirshup, thinking about your comment in the description: what would you like to see happen here on this PR before ZEP7 gets listed under https://zarr.dev/zeps/draft_zeps/?
@ivirshup: in light of the renewed interest in zarr-developers/zarr-specs#83 (comment), do you see yourself coming back to this, or are you interested in passing it off? (Some discussion during the ZEP meeting today)
We discussed this today at the zarr-python meeting. The above ideas are all good ones. The arrow approach of storing an offsets buffer and a data buffer seems to be the way most data formats today do it. However, it may also be valuable to have a V3 codec that is backwards compatible with the existing Zarr V2 VLen codecs: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/vlen.pyx

These codecs use an "interleaved" format, where the header stores the number of items and each item is prefixed by its byte length. You can see this in how Zarr V2 encodes data. Here's an example:

import zarr  # V2

strings = ["hello", "world", "my", "name", "is", "Ryan"]
store = zarr.MemoryStore()
array = zarr.array(strings, dtype=str, store=store, compressor=None)
buffer = store['0']

# header: uint32 little-endian item count
nitems = int.from_bytes(buffer[:4], byteorder="little")
offset = 4
for _ in range(nitems):
    # each item: uint32 little-endian length, followed by the item's bytes
    next_len = int.from_bytes(buffer[offset:offset+4], byteorder="little")
    offset += 4
    data = buffer[offset:offset+next_len]
    offset += next_len
    print(next_len, data)
I'm not sure I've got the time to follow this one up in the immediate future, so if someone else is interested in picking it up that would be great.
Thanks for renewing interest in this @rabernat. I've since experimented with variable-length data types in zarrs. My thoughts:
Unanswered questions:
Thanks for doing this work @LDeakin! Super helpful! I think your plan sounds great.
Having spent more time with Arrow, I find myself wishing we had the concept of "missing data" or "null values" more deeply integrated into Zarr. Do you have any thoughts on that?
The interpretation of the json fill value depends on the data type so there is no problem here, since we are also introducing a new data type. It is okay and expected that old implementations return an error when parsing zarr metadata that specifies unsupported features. It is only a problem if the old implementation does not return an error, but interprets the data incorrectly.
I think fixed-length strings introduce some additional questions and could be deferred.
I realise now that the
It might be better to keep it simple, though.
This is probably better suited to discussion in a new issue, but here are my thoughts. A ZEP0004 metadata convention for null/missing/mask values would be a step in the right direction; no support would be needed from Zarr implementations. But I think first-class support would be better:

{
  "name": "bikeshed",
  "configuration": {
    "values": [0.0, "NaN"],
    "index_codecs": [...],
    "array_to_bytes_codec": { "name": "bytes", "configuration": { "endian": "little" } }
  }
}
Potentially this sort of compression could be handled by an additional codec layered on top.
In arrow, a missing value is always a distinct value from any value within the domain of the data type, which is important if you need to preserve the full domain for non-missing values. In zarr I think it would most naturally be represented by some sort of separate mask array associated with the main array.
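For instance, a minimal sketch of such a convention (hypothetical layout, assuming zarr-python; the "values_mask" name is made up) could pair the data array with a sibling boolean array:

```python
import numpy as np
import zarr

root = zarr.group()
values = root.create_dataset("values", data=np.array([1.0, 2.0, 3.0]))
# True marks a missing element; readers that know the convention apply it as a mask.
mask = root.create_dataset("values_mask", data=np.array([False, True, False]))
```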
Finally getting around to posting this proposal that was initially put out on the zulip. You can see the initial conversation on the linked hackmd: https://hackmd.io/aSz4DAYnRRaoFPMQXrml3w
I've tried to be quite conservative in the definition here. The overall idea is: "use arrow's string type". I would like to get more feedback on this, especially from implementers.
cc: @normanrz