I noticed that Amazon S3 (and apparently also Google) define a
limit of 1024 bytes for object keys. This limit apparently
applies to the whole key and not, say, to individual segments of
the key, where a segment is the name between '/' separators.
I know that for atmospheric-science netCDF-4 datasets, variable
names are used to encode a variety of properties such as dates
and locations. This often results in long variable names.
Additionally, deeply nested groups are also used to classify
sets of variables. Bottom line: it is probable that such
datasets will run up against the 1024-byte limit in the near
future.
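To make the concern concrete, here is a small illustration; the group and variable names below are hypothetical, invented purely for the example:

```python
# Hypothetical names: none of these come from a real dataset.
group_path = "/".join(
    "atmosphere_model_output_2021_reanalysis_level_%02d" % i
    for i in range(12)
)
var_name = "temperature_2m_daily_mean_2021_06_15_lat_45p25_lon_122p75"
chunk_key = f"{group_path}/{var_name}/0.0.0"

# S3 counts the limit in bytes of the UTF-8-encoded key, not characters.
print(len(chunk_key.encode("utf-8")))  # ~650 bytes; real hierarchies can go deeper
```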
So my question to the community is: how do we deal with the
1024-byte limit? Or do we ignore it?
One might hope that Amazon will up that limit Real-Soon-Now. My
guess is that a limit of 4096 bytes would be adequate to push
the problem off to a more distant future.
If such a length increase does not happen, then we may need to
rethink the Zarr layout so that this limit is circumvented.
Below are some initial thoughts about this. I hope I am not
overthinking this and that there is some simpler approach that I
have not considered.
One possible proposal is to use a structure where the long key
is replaced with the hash of the long key. This leads to an
inode-like system with a flat space of hash keys, where the
objects for those hash keys contain the metadata and chunk data.
In order to represent the group structure, one would need to
extend this so that some "inodes" are directory-like objects
that map a key segment to the hash key of the inode "contained"
in the directory.
I am sure there are other ways to do this. It may also be worth
asking about the purpose of the groups. Right now they serve
as a namespace and as a primitive indexing mechanism for the leaf
content-bearing objects. Perhaps they are superfluous.
In any case, the 1024 byte key-length limit is likely
to be a problem for Zarr in the near future.
The community needs to decide if it wants to ignore this
limitation or address it in some general way.
=Dennis Heimbigner
Unidata
Thanks, I'll try to see if I can add some of that into the spec.
I think that the length-limitation workaround might need to be on a per-store basis. At least in spec v3 there are the data/ and meta/ prefixes, so it would be easy to have the equivalent of "mount points"/references.
I'm not a huge fan of the hashing/inode-like approach, as this will likely mean a single place where we store the mapping, which would require locking and make listing more difficult.
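To illustrate what such a per-store "mount point"/reference might look like, here is a minimal sketch; the alias layout, the .reference object, and the MountPointStore name are all invented for illustration and are not part of the v3 spec:

```python
# Hypothetical key-shortening wrapper: when a key exceeds the store's
# limit, alias its long prefix to a short "mount" name and rewrite
# keys on the way through. Assumes the final segment itself is short.
class MountPointStore:
    MAX_KEY_BYTES = 1024

    def __init__(self, backing: dict):
        self.backing = backing   # stand-in for an S3-like store
        self.aliases = {}        # long prefix -> short mount name
        self.counter = 0

    def _resolve(self, key: str) -> str:
        if len(key.encode("utf-8")) <= self.MAX_KEY_BYTES:
            return key
        prefix, _, leaf = key.rpartition("/")
        alias = self.aliases.get(prefix)
        if alias is None:
            alias = f"mnt/{self.counter}"
            self.counter += 1
            self.aliases[prefix] = alias
            # Persist the reverse mapping so other readers can rebuild
            # the alias table by scanning mnt/*/.reference objects.
            self.backing[f"{alias}/.reference"] = prefix.encode("utf-8")
        return f"{alias}/{leaf}"

    def put(self, key: str, value: bytes) -> None:
        self.backing[self._resolve(key)] = value

    def get(self, key: str) -> bytes:
        return self.backing[self._resolve(key)]
```

Unlike the hashing scheme, only the overlong prefixes get remapped, so short keys keep their natural layout and prefix listing still works for them.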