Draft ZEP 0007: Strings #47
Conversation
I like the overall intent of this proposal. I think we need a new "array -> bytes" codec that will support the new data type. Then the codec metadata could look something like:

{
  "name": "vlen",
  "configuration": {
    "data_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5, "shuffle": "bitshuffle", "typesize": 1, "blocksize": 0}}],
    "index_codecs": [{"name": "bytes"}, {"name": "blosc", "configuration": {"cname": "zstd", "clevel": 5, "shuffle": "shuffle", "typesize": 4, "blocksize": 0}}],
    "index_data_type": "uint32"
  }
}

Having separate data and index codecs allows different compression options to be used --- e.g. in the example above we use bit-wise shuffling for the data but byte-wise shuffling for the index. One caveat is that if, as in the example above, the size of the encoded index is variable, then we would need to separately store the size of the index. Some compression formats may be self-delimiting and therefore not require that the size is stored, but we may not want to deal with that complexity.
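For illustration, a minimal sketch (not a concrete proposal) of one way to deal with a variable-size encoded index: prefix the chunk with the byte length of the encoded index.

```python
import struct

def encode_chunk(encoded_index: bytes, encoded_data: bytes) -> bytes:
    # Layout: [uint64 little-endian size of encoded index][encoded index][encoded data]
    return struct.pack("<Q", len(encoded_index)) + encoded_index + encoded_data

def decode_chunk(chunk: bytes) -> tuple[bytes, bytes]:
    (index_size,) = struct.unpack_from("<Q", chunk, 0)
    encoded_index = chunk[8:8 + index_size]
    encoded_data = chunk[8 + index_size:]
    return encoded_index, encoded_data
```

A self-delimiting index codec would make the prefix unnecessary, at the cost of the complexity mentioned above.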
@jbms, you raise a very good point. I was able to talk to Joris about this at the numfocus summit last week and got a lot of insight into how arrow does this. An arrow string array is described by a small set of fields, each of which corresponds to an underlying buffer.
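Concretely, here is a small pyarrow sketch (assuming pyarrow is installed) of the buffers backing a string array: a validity bitmap, an int32 offsets buffer, and a UTF-8 data buffer.

```python
import numpy as np
import pyarrow as pa

arr = pa.array(["hello", "world", None], type=pa.string())
validity, offsets, data = arr.buffers()  # validity bitmap, int32 offsets, utf-8 bytes

print(np.frombuffer(offsets, dtype=np.int32))  # [ 0  5 10 10]
print(data.to_pybytes())                       # b'helloworld'
```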
While how compression works with the IPC format isn't super well documented (apache/arrow#37756), we can find a description of it in the flatbuffer definitions. AFAICT, each buffer is compressed separately, but I believe you cannot specify different compressors for different buffers. There is also room in the specification for compressing the entire message instead of the buffers individually.

So, where does that leave us? Allowing separate compression of the underlying buffers may be useful, and I think becomes much more useful if more variable-length types are allowed. I would also like to keep the goal of very low-cost interoperability with Arrow. That said, I don't know that I love the idea of the codec proposed above. There may be a more parsimonious solution here that shares more with sharding + variable chunk sizes, instead of defining a new codec.
Where is the bytes codec defined? I believe this would be a basic "array -> bytes" codec.
I think arrow compatibility as far as the format of the offsets and data buffers is relatively easy to achieve. I am not so keen on trying to use the RecordBatch flatbuffers message format itself, since that adds a lot of baggage and complexity for what we could also accomplish with just a single 64-bit number.

It is true that there is some similarity with the sharding_indexed codec: in fact the sharding codec is just storing an array of variable-length byte strings. It differs from your proposed arrow-compatible format in that it stores both an offset and a length for each entry, to allow arbitrary ordering of sub-chunks. But it is not clear to me how useful it is to try to unify the sharding format with the vlen string format, since the use cases and expected access patterns are very different.

Can you explain the connection to variable-size chunks (i.e. a rectilinear grid)? Are you thinking more about sparse arrays? I agree that if we had a different case where we are storing multiple arrays in one chunk, such as storing a chunk using a sparse array encoding, we would probably also want to allow separate codecs for each array, and these could be specified as part of the json configuration for this "sparse" codec. As far as the binary format, I suppose it could make sense to try to unify the sparse array format and the vlen string format in some way, but I'm not sure there is really that much benefit and it would bring in a lot of added complexity.
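As a toy sketch of that difference (example numbers made up): an arrow-style index stores n + 1 offsets and derives lengths from consecutive entries, while a sharding_indexed-style index stores an explicit (offset, nbytes) pair per entry, so sub-chunks can be stored in any order.

```python
import numpy as np

# Arrow-style index: n + 1 offsets; entry i occupies [offsets[i], offsets[i+1]),
# so entries are contiguous and stored in order.
offsets = np.array([0, 5, 10, 12], dtype=np.uint64)
lengths = np.diff(offsets)  # [5, 5, 2]

# sharding_indexed-style index: an (offset, nbytes) pair per entry,
# so entries may be written in any order and need not be contiguous.
index = np.array([[10, 2], [0, 5], [5, 5]], dtype=np.uint64)
```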
Hi @ivirshup. I've fixed the RTD build issue in #51. The PR preview is https://zeps--47.org.readthedocs.build/en/47/draft/ZEP0007.html.
@ivirshup, thinking about your comment in the description: what would you like to see happen here on this PR before ZEP7 gets listed under https://zarr.dev/zeps/draft_zeps/?
@ivirshup: in light of the renewed interest in zarr-developers/zarr-specs#83 (comment), do you see yourself coming back to this, or are you interested in passing it off? (Some discussion during the ZEP meeting today)
We discussed this today at the zarr-python meeting. The above ideas are all good ones. The arrow approach of storing an offsets buffer and a data buffer seems to be the way most data formats today do it. However, it may also be valuable to have a V3 codec that is backwards compatible with the existing Zarr V2 VLen codecs: https://github.com/zarr-developers/numcodecs/blob/main/numcodecs/vlen.pyx

These codecs use an "interleaved" format, where the header stores the number of items and each item is prefixed by its byte length. You can see this in how Zarr V2 encodes data. Here's an example:

import zarr  # V2

strings = ["hello", "world", "my", "name", "is", "Ryan"]
store = zarr.MemoryStore()
array = zarr.array(strings, dtype=str, store=store, compressor=None)
buffer = store['0']

# header: uint32 little-endian item count
nitems = int.from_bytes(buffer[:4], byteorder="little")
offset = 4
for _ in range(nitems):
    # each item: uint32 little-endian length, followed by the item's bytes
    next_len = int.from_bytes(buffer[offset:offset+4], byteorder="little")
    offset += 4
    data = buffer[offset:offset+next_len]
    offset += next_len
    print(next_len, data)
I'm not sure I've got the time to follow this one up in the immediate future, so if someone else is interested in picking it up that would be great.
Thanks for renewing interest in this @rabernat. I've since experimented with variable-length data types in zarrs. My thoughts:
Unanswered questions:
Thanks for doing this work @LDeakin! Super helpful! I think your plan sounds great.
Having spent more time with Arrow, I find myself wishing we had the concept of "missing data" or "null values" more deeply integrated into Zarr. Do you have any thoughts on that?
The interpretation of the json fill value depends on the data type so there is no problem here, since we are also introducing a new data type. It is okay and expected that old implementations return an error when parsing zarr metadata that specifies unsupported features. It is only a problem if the old implementation does not return an error, but interprets the data incorrectly.
I think fixed-length strings introduce some additional questions and could be deferred.
I realise now that the
It might be better to keep it simple, though.
This is probably better suited to discussion in a new issue, but here are my thoughts. A ZEP0004 metadata convention for null/missing/mask values would be a step in the right direction; no support would be needed from Zarr implementations. But I think first-class support would be better:

{
  "name": "bikeshed",
  "configuration": {
    "values": [0.0, "NaN"],
    "index_codecs": [...],
    "array_to_bytes_codec": { "name": "bytes", "configuration": { "endian": "little" } }
  }
}
Potentially this sort of compression could be handled by an additional codec layered on top.
In arrow, a missing value is always a distinct value from any value within the domain of the data type, which is important if you need to preserve the full domain for non-missing values. In zarr I think it would most naturally be represented by some sort of separate mask array associated with the main array.
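For instance, a minimal sketch of such a convention (hypothetical layout, assuming zarr-python; the "values_mask" name is made up) could pair the data array with a sibling boolean array:

```python
import numpy as np
import zarr

root = zarr.group()
values = root.create_dataset("values", data=np.array([1.0, 2.0, 3.0]))
# True marks a missing element; readers that know the convention apply it as a mask.
mask = root.create_dataset("values_mask", data=np.array([False, True, False]))
```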
Finally getting around to posting this proposal that was initially put out on the zulip. You can see the initial conversation on the linked hackmd: https://hackmd.io/aSz4DAYnRRaoFPMQXrml3w
I've tried to be quite conservative in the definition here. The overall idea is: "use arrow's string type". I would like to get more feedback on this, especially from implementers.
cc: @normanrz