.. This file is in restructured text format: https://docutils.sourceforge.io/rst.html
.. _zarr-core-specification-v3.0:
======================================
Zarr core specification (version 3.0)
======================================
Specification URI:
https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html
Editors:
* Alistair Miles (`@alimanfoo <https://github.com/alimanfoo>`_), Wellcome Sanger Institute
* Jonathan Striebel (`@jstriebel <https://github.com/jstriebel>`_), Scalable Minds
* Jeremy Maitin-Shepard (`@jbms <https://github.com/jbms>`_), Google
Corresponding ZEP:
`ZEP0001 — Zarr specification version 3 <https://zarr.dev/zeps/accepted/ZEP0001.html>`_
Issue tracking:
`GitHub issues <https://github.com/zarr-developers/zarr-specs/labels/core-protocol-v3.0>`_
Suggest an edit for this spec:
`GitHub editor <https://github.com/zarr-developers/zarr-specs/blob/main/docs/v3/core/v3.0.rst>`_
Copyright 2019-Present Zarr core development team. This work
is licensed under a `Creative Commons Attribution 3.0 Unported License
<https://creativecommons.org/licenses/by/3.0/>`_.
----
Abstract
========
This specification defines the Zarr format for N-dimensional typed arrays.
Status of this document
=======================
ZEP0001 was accepted on May 15th, 2023 via https://github.com/zarr-developers/zarr-specs/issues/227.
Introduction
============
This specification defines a format for multidimensional array data. This
type of data is common in scientific and numerical computing
applications. Many domains face computational challenges as
increasingly large volumes of data are being generated, for example,
via high resolution microscopy, remote sensing imagery, genome
sequencing or numerical simulation. The primary motivation for the
development of Zarr is to address this challenge by
enabling the storage of large multidimensional arrays in a way that is
compatible with parallel and/or distributed computing applications.
This specification supersedes the `Zarr storage
specification version 2
<https://zarr.readthedocs.io/en/stable/spec/v2.html>`_ (Zarr v2). The
Zarr v2 specification is implemented in several programming
languages and is used to store and analyse large
scientific datasets from a variety of domains. However, it has become
clear that there are several opportunities for modest but useful
improvements to be made in the format, and for establishing a foundation
that allows for greater interoperability, whilst also enabling a variety
of more advanced and specialised features to be explored and developed.
This specification also draws heavily on the `N5 API and
file-system specification <https://github.com/saalfeldlab/n5>`_, which
was developed in parallel to Zarr v2 with similar
goals and features. This specification defines a core set of features
at the intersection of both Zarr v2 and N5, and so aims to provide a
common target that can be fully implemented across multiple
programming environments and serve a wide range of applications.
We highlight the following areas motivating the
development of this specification.
Extensibility
-------------
The development of systems for storage of very large array-like data
is a very active area of research and development, and there are many
possibilities that remain to be explored. A goal of this specification
is to define a format with a number of clear extension points and
mechanisms, in order to provide a framework for freely building on and
exploring these possibilities. We aim to make this possible, whilst
also providing pathways for a graceful degradation of functionality
where possible, in order to retain interoperability. We also aim to
provide a framework for community-defined extensions, which can be
developed and published independently without requiring centralised
coordination of all specifications.
See :ref:`extension points <extensions_section>` below.
Interoperability
----------------
While the Zarr v2 and N5 specifications have each been implemented in
multiple programming languages, there is currently not feature parity
across all implementations. This is in part because the feature set
includes some features that are not easily translated or supported
across different programming languages. This specification aims to
define a set of core features that are useful and sufficient to
address a significant fraction of use cases, but are also
straightforward to implement fully across different programming
languages. Additional functionality can then be layered via
extensions, some of which may aim for wide adoption, some of which may
be more specialised and have more limited implementation.
Stability Policy
----------------
This core specification adheres to a ``MAJOR.MINOR`` version
number format. When incrementing the minor version, only additional features
can be added. Breaking changes require incrementing the major version.
A Zarr implementation that provides the read and write API by
implementing a specification ``X.Y`` can be considered compatible with all
datasets which only use features contained in version ``X.Y``.
For example, suppose spec ``X.1`` adds core feature "foo" compared to ``X.0``,
implementation A implements ``X.1``, and implementation B implements ``X.0``.
Data using feature "foo" can only be read with implementation A; B fails to open
it, as the key "foo" is unknown.
Data not using "foo" can be used with both implementations, even if it's written
with implementation B.
Therefore, data is only marked with the respective major version; unknown
features are auto-discovered via the metadata document.
Notably, this excludes extension points such as codecs, data types, chunk grids
and storage transformers from the compatibility of the core specification, as
well as store support. However, versioned extension points and stores are also
expected to follow this stability policy.
Document conventions
====================
Conformance requirements are expressed with a combination of
descriptive assertions and [RFC2119]_ terminology. The key words
"MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD",
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in the normative
parts of this document are to be interpreted as described in
[RFC2119]_. However, for readability, these words do not appear in all
uppercase letters in this specification.
All of the text of this specification is normative except sections
explicitly marked as non-normative, examples, and notes. Examples in
this specification are introduced with the words "for example".
Concepts and terminology
========================
This section introduces and defines some key terms and explains the
conceptual model underpinning the Zarr format.
The following figure illustrates the first part of the terminology:
..
The following image was produced with https://excalidraw.com/
and can be loaded there, as the source is embedded in the png.
.. image:: terminology-hierarchy.excalidraw.png
:width: 600
.. _hierarchy:
*Hierarchy*
A Zarr hierarchy is a tree structure, where each node in the tree
is either a group_ or an array_. Group nodes may have children but
array nodes may not. All nodes in a hierarchy have a name_ and a
path_. The root of a Zarr hierarchy may be either a group_ or an array_.
In the latter case, the hierarchy consists of just the single array.
.. _array:
.. _arrays:
*Array*
An array is a node in a hierarchy_. An array is a data structure
with zero or more dimensions_ whose lengths define the shape_ of
the array. An array contains zero or more data elements_. All
elements_ in an array conform to the same `data type`_. An array
may not have child nodes.
.. _group:
.. _groups:
*Group*
A group is a node in a hierarchy_ that may have child nodes.
.. _name:
.. _names:
*Name*
Each child node of a group has a name, which is a string of
characters with some additional constraints defined in the section
on `node names`_ below. Two sibling nodes cannot have the same
name.
.. _path:
.. _paths:
*Path*
Each node in a hierarchy_ has a path, a Unicode string that uniquely
identifies the node and defines its location within the hierarchy_. The root
node has a path of ``/``. The path of a non-root node is equal to the
concatenation of:
- the path of its parent node;
- the ``/`` character, unless the parent is the root node;
- the name_ of the node itself.
For example, the path ``"/foo/bar"`` identifies a node named ``"bar"``,
whose parent is named ``"foo"``, whose parent is the root of the hierarchy.
A path always starts with ``/``, and a non-root path cannot end with ``/``,
because node names_ must be non-empty and cannot contain ``/``.
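As a rough illustration of the concatenation rule above, the following Python
sketch builds the path of a child node from its parent's path and its own name.
The function name ``child_path`` is hypothetical and is not defined by this
specification.

```python
def child_path(parent_path: str, name: str) -> str:
    """Return the path of a child node from its parent's path and its name."""
    if not name or "/" in name:
        raise ValueError("node names must be non-empty and must not contain '/'")
    # The "/" separator is omitted when the parent is the root node.
    separator = "" if parent_path == "/" else "/"
    return parent_path + separator + name


print(child_path("/", "foo"))     # /foo
print(child_path("/foo", "bar"))  # /foo/bar
```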
.. _dimension:
.. _dimensions:
*Dimension*
An array_ has a fixed number of zero or more dimensions. Each dimension has
an integer length. This specification only considers the case where the
lengths of all dimensions are finite. However,
:ref:`extensions<extensions_section>` may be defined which allow a dimension
to have an infinite or variable length.
.. _shape:
*Shape*
The shape of an array_ is the tuple of dimension_ lengths. For
example, if an array_ has 2 dimensions_, where the length of the
first dimension_ is 100 and the length of the second dimension_ is
20, then the shape of the array_ is (100, 20). A shape can be the empty
tuple in the case of zero-dimension arrays (scalars).
.. _element:
.. _elements:
*Element*
An array_ contains zero or more elements. Each element is
identified by a tuple of integer coordinates, one for each
dimension_ of the array_. If all dimensions_ of an array_ have
finite length, then the number of elements in the array_ is given
by the product of the dimension_ lengths.
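A minimal Python sketch of this product, assuming the shape is given as a tuple
of finite, non-negative integers. The helper name ``num_elements`` is
hypothetical.

```python
import math


def num_elements(shape: tuple) -> int:
    """Number of elements in an array whose dimensions all have finite length."""
    # The product over an empty shape is 1, so a zero-dimensional array
    # (a scalar) holds exactly one element.
    return math.prod(shape)


print(num_elements((100, 20)))  # 2000
print(num_elements(()))         # 1
print(num_elements((0, 5)))     # 0
```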
.. _data type:
*Data type*
A data type defines the set of possible values that an array_ may
contain. For example, the 32-bit signed integer data type defines binary
representations for all integers in the range −2,147,483,648 to
2,147,483,647. This specification only defines a limited set of data types,
but extensions may define other data types.
.. _chunk:
.. _chunks:
*Chunk*
An array_ is divided into a set of chunks, where each chunk is a
hyperrectangle defined by a tuple of intervals, one for each
dimension_ of the array_. The chunk shape is the tuple of interval
lengths, and the chunk size (i.e., number of elements_ contained
within the chunk) is the product of its interval lengths.
The chunk shape elements are non-zero when the corresponding dimensions of
the arrays have non-zero length.
.. _grid:
.. _grids:
*Grid*
The chunks_ of an array_ are organised into a grid. This
specification only considers the case where all chunks_ have the
same chunk shape and the chunks form a regular grid. However,
extensions may define other grid types such as
rectilinear grids.
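The relationship between array shape, chunk shape and a regular grid can be
sketched in Python as follows. This is an illustration only: the function names
are hypothetical, and the use of ceiling division (so that a partial chunk at
the edge of the array still counts as one grid cell) is an assumption of this
sketch rather than something stated in this section.

```python
import math


def regular_grid_shape(array_shape, chunk_shape):
    """Number of chunks along each dimension of a regular grid."""
    if len(array_shape) != len(chunk_shape):
        raise ValueError("array shape and chunk shape must have the same length")
    # Ceiling division (assumption, see the text above); a zero-length
    # dimension contributes zero chunks.
    return tuple(
        0 if n == 0 else -(-n // c) for n, c in zip(array_shape, chunk_shape)
    )


def chunk_size(chunk_shape):
    """Number of elements in one chunk: the product of its interval lengths."""
    return math.prod(chunk_shape)


grid = regular_grid_shape((10000, 1000), (1000, 100))
print(grid, math.prod(grid))    # (10, 10) 100
print(chunk_size((1000, 100)))  # 100000
```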
.. _codec:
.. _codecs:
*Codec*
The list of *codecs* specified for an array_ determines the encoded byte
representation of each chunk in the store_.
.. _metadata document:
.. _metadata documents:
*Metadata document*
Each array_ or group_ in a hierarchy_ is represented by a metadata document,
which is a machine-readable document containing essential
processing information about the node. For example, an array_
metadata document specifies the number of dimensions_, shape_,
`data type`_, grid_, and codec_ for that array_.
.. _store:
.. _stores:
*Store*
The `metadata documents`_ and encoded chunk_ data for all nodes in a
hierarchy_ are held in a store as raw bytes. To enable a variety
of different store types to be used, this specification defines an
`Abstract store interface`_ which is a common set of operations that stores
may provide. For example, a directory in a file system can be a Zarr store,
where keys are file names, values are file contents, and files can be read,
written, listed or deleted via the operating system. Equally, an S3 bucket
can provide this interface, where keys are resource names, values are
resource contents, and resources can be read, written or deleted via HTTP.
.. _storage transformer:
.. _storage transformers:
*Storage transformer*
To provide performance enhancements or other optimizations,
storage transformers may intercept and alter the storage keys and bytes
of an array_ before they reach the underlying physical storage.
Upon retrieval, the original keys and bytes are restored within the
transformer. Any number of storage transformers can be registered and
stacked. In contrast to codecs, storage transformers can act on the
complete array, rather than individual chunks. See the
`storage transformers details`_ below.
.. _`storage transformers details`: #storage-transformers-1
The following figure illustrates the codec, store and storage transformer
terminology for a use case of reading from an array:
..
The following image was produced with https://excalidraw.com/
and can be loaded there, as the source is embedded in the png.
.. image:: terminology-read.excalidraw.png
:width: 600
.. _stored-representation:
Stored representation
=====================
A Zarr hierarchy_ is represented by the following set of key/value entries in an
underlying store_:
- The array_ or group_ metadata document for the root of a Zarr hierarchy_ is
stored under the key ``zarr.json``.
- The metadata document of a non-root array or group with hierarchy path ``P``
is obtained by stripping the leading ``/`` of the path and appending
``/zarr.json``. For example, the metadata document of an array or group with
path ``/foo/bar`` is ``foo/bar/zarr.json``.
- All chunk or other data of an array is stored under the key prefix determined
by its path. For a root array, the key prefix is obtained from the metadata
document key by stripping the trailing ``zarr.json``. For example, for a root
array, the prefix is the empty string. For a non-root array with hierarchy
path ``/foo/bar``, the prefix is ``foo/bar/``.
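The mapping from hierarchy paths to store keys described above can be sketched
in Python as follows. The function names ``metadata_key`` and ``key_prefix``
are hypothetical helpers introduced for illustration only.

```python
METADATA_FILE = "zarr.json"


def metadata_key(path: str) -> str:
    """Store key of the metadata document for the node at ``path``."""
    if not path.startswith("/"):
        raise ValueError("hierarchy paths must start with '/'")
    if path == "/":
        return METADATA_FILE
    # Strip the leading "/" and append "/zarr.json".
    return path[1:] + "/" + METADATA_FILE


def key_prefix(path: str) -> str:
    """Key prefix under which chunk or other data of the node is stored."""
    # Strip the trailing "zarr.json" from the metadata document key.
    return metadata_key(path)[: -len(METADATA_FILE)]


print(metadata_key("/"))         # zarr.json
print(metadata_key("/foo/bar"))  # foo/bar/zarr.json
print(repr(key_prefix("/")))     # ''
print(key_prefix("/foo/bar"))    # foo/bar/
```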
.. list-table:: Metadata Storage Key example
    :header-rows: 1

    * - Type
      - Path `P`
      - Key for Metadata at path `P`
    * - Array (Root)
      - `/`
      - `zarr.json`
    * - Group (Root)
      - `/`
      - `zarr.json`
    * - Group
      - `/foo`
      - `foo/zarr.json`
    * - Array
      - `/foo`
      - `foo/zarr.json`
    * - Group
      - `/foo/bar`
      - `foo/bar/zarr.json`
    * - Array
      - `/foo/baz`
      - `foo/baz/zarr.json`

.. list-table:: Data Storage Key example
    :header-rows: 1

    * - Path `P` of array
      - Chunk grid indices
      - Data key
    * - `/foo/baz`
      - `(1, 0)`
      - `foo/baz/c/1/0`
.. note::
When storing a Zarr hierarchy in a filesystem-like store (e.g. the local
filesystem or S3) as a sub-directory, it is recommended that the
sub-directory name ends with ``.zarr`` to indicate the start of a hierarchy
to users.
.. _metadata:
Metadata
========
This section defines the structure of metadata documents for Zarr hierarchies,
which consists of two types of metadata documents: array metadata documents, and
group metadata documents. Both types of metadata documents are stored under the
key ``zarr.json`` within the prefix of the array or group. Each type of
metadata document is described in the following subsections.
Metadata documents are defined here using the JSON
type system defined in [RFC8259]_. In this section, the terms "value",
"number", "string" and "object" are used to denote the types as
defined in [RFC8259]_. The term "array" is also used as defined in
[RFC8259]_, except where qualified as "Zarr array". Following
[RFC8259]_, this section also describes an object as a set of
name/value pairs. This section also defines how metadata documents are
encoded for storage.
.. _array-metadata:
Array metadata
--------------
Each Zarr array in a hierarchy must have an array metadata document, named
``zarr.json``. This document must contain a single object with the following
mandatory names:
``zarr_format``
^^^^^^^^^^^^^^^
An integer defining the version of the storage specification to which the
array store adheres, must be ``3`` here.
``node_type``
^^^^^^^^^^^^^^^
A string defining the type of hierarchy node element, must be ``array``
here.
``shape``
^^^^^^^^^
An array of integers providing the length of each dimension of the
Zarr array. For example, a value ``[10, 20]`` indicates a
two-dimensional Zarr array, where the first dimension has length
10 and the second dimension has length 20.
``data_type``
^^^^^^^^^^^^^
The data type of the Zarr array. If the data type is defined in
this specification, then the value must be the data type
identifier provided as a string. For example, ``"float64"`` for
little-endian 64-bit floating point number.
The ``data_type`` value is an extension point and may be defined by a data
type extension. If the data type is defined by an extension, then the value
may be either a plain string or an object containing the members ``name``
and optionally ``configuration``. A plain string is equivalent to
specifying an object with just a ``name`` member. The ``name`` is required
and its value must refer to a v3 data type specification. ``configuration``
is optional and its value is defined by the extension.
``chunk_grid``
^^^^^^^^^^^^^^
The chunk grid of the Zarr array. If the chunk grid is a regular chunk grid
as defined in this specification, then the value must be an object with the
names ``name`` and ``configuration``. The value of ``name`` must be the
string ``"regular"``, and the value of ``configuration`` an object with the
member ``chunk_shape``. ``chunk_shape`` must be an array of
integers providing the lengths of the chunk along each dimension of the
array. For example,
``{"name": "regular", "configuration": {"chunk_shape": [2, 5]}}``
means a regular grid where the chunks have length 2 along the first
dimension and length 5 along the second dimension.
The ``chunk_grid`` value is an extension point and may be defined by an
extension. If the chunk grid type is defined by an extension, then ``name``
must be a string referring to a v3 chunk grid specification. The
``configuration`` is optional and defined by the extension.
``chunk_key_encoding``
^^^^^^^^^^^^^^^^^^^^^^
The mapping from chunk grid cell coordinates to keys in the underlying
store.
The value must be an object with required string member ``name``, specifying
the encoding type, and optional member ``configuration`` specifying encoding
type-dependent parameters; the ``configuration`` value must be an object if
it is specified.
The following encodings are defined:
- ``default``
The ``configuration`` object may contain one optional member,
``separator``, which must be either ``"/"`` or ``"."``. If not specified,
``separator`` defaults to ``"/"``.
The key for a chunk with grid index (``k``, ``j``, ``i``, ...) is
formed by taking the initial prefix ``c``, and appending for each dimension:
- the ``separator`` character, followed by,
- the ASCII decimal string representation of the chunk index within that dimension.
For example, in a 3 dimensional array, with a separator of ``/`` the identifier
for the chunk at grid index (1, 23, 45) is the string ``"c/1/23/45"``. With a
separator of ``.``, the identifier is the string ``"c.1.23.45"``. The initial prefix
``c`` ensures that metadata documents and chunks have separate prefixes.
.. note:: A main difference with spec v2 is that the default chunk separator
changed from ``.`` to ``/``, as in N5. This decreases the maximum number of
items in hierarchical stores like directory stores.
.. note:: Arrays may have 0 dimensions (when for example representing scalars),
in which case the coordinate of a chunk is the empty tuple, and the chunk key
will consist of the string ``c``.
- ``v2``
The ``configuration`` object may contain one optional member,
``separator``, which must be either ``"/"`` or ``"."``. If not specified,
``separator`` defaults to ``"."``.
The identifier for a chunk with at least one dimension is formed by
concatenating for each dimension:
- the ASCII decimal string representation of the chunk index within that
dimension, followed by
- the ``separator`` character, except that it is omitted for the last
dimension.
For example, in a 3 dimensional array, with a separator of ``.`` the identifier
for the chunk at grid index (1, 23, 45) is the string ``"1.23.45"``. With a
separator of ``/``, the identifier is the string ``"1/23/45"``.
For chunk grids with 0 dimensions, the single chunk has the key ``"0"``.
.. note::
This encoding is intended only to allow existing v2 arrays to be
converted to v3 without having to rename chunks. It is not recommended
to be used when writing new arrays.
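The two encodings defined above can be sketched in Python as follows. The
function name ``encode_chunk_key`` and its signature are hypothetical; only the
resulting strings follow the rules given in this section.

```python
def encode_chunk_key(chunk_coords, name="default", separator=None):
    """Encode chunk grid coordinates with the ``default`` or ``v2`` encoding."""
    if name == "default":
        sep = "/" if separator is None else separator
    elif name == "v2":
        sep = "." if separator is None else separator
    else:
        raise ValueError(f"unsupported chunk key encoding: {name!r}")
    if sep not in ("/", "."):
        raise ValueError("separator must be '/' or '.'")

    # str() of a Python int gives its ASCII decimal representation.
    parts = [str(index) for index in chunk_coords]
    if name == "default":
        # Prefix "c", then the separator followed by the index, per dimension.
        return "c" + "".join(sep + part for part in parts)
    # v2: indices joined by the separator; zero dimensions map to "0".
    return sep.join(parts) if parts else "0"


print(encode_chunk_key((1, 23, 45)))                 # c/1/23/45
print(encode_chunk_key((1, 23, 45), separator="."))  # c.1.23.45
print(encode_chunk_key(()))                          # c
print(encode_chunk_key((1, 23, 45), name="v2"))      # 1.23.45
print(encode_chunk_key((), name="v2"))               # 0
```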
``fill_value``
^^^^^^^^^^^^^^
Provides an element value to use for uninitialised portions of the
Zarr array.
The permitted values depend on the data type:
``bool``
The value must be a JSON boolean (``false`` or ``true``).
Integers (``{uint,int}{8,16,32,64}``)
The value must be a JSON number with no fraction or exponent part that is
within the representable range of the data type.
IEEE 754 floating point numbers (``float{16,32,64}``)
The value may be either:
- A JSON number, that will be rounded to the nearest representable value.
- A JSON string of the form:
- ``"Infinity"``, denoting positive infinity;
- ``"-Infinity"``, denoting negative infinity;
- ``"NaN"``, denoting the not-a-number (NaN) value where the sign bit is
0 (positive), the most significant bit (MSB) of the mantissa is 1, and
all other bits of the mantissa are zero;
- ``"0xYYYYYYYY"``, specifying the byte representation of the floating
point number as an unsigned integer. For example, for ``float32``,
``"NaN"`` is equivalent to ``"0x7fc00000"``. This representation is
the only way to specify a NaN value other than the specific NaN value
denoted by ``"NaN"``.
.. warning::
While this NaN syntax is consistent with the syntax accepted by the
C99 ``strtod`` function, C99 leaves the meaning of the NaN payload
string implementation defined, which may not match the Zarr
definition.
Complex numbers (``complex{64,128}``)
The value must be a two-element array, specifying the real and imaginary
components respectively, where each component is specified as defined
above for floating point numbers.
For example, ``[1, 2]`` indicates ``1 + 2i`` and ``["-Infinity", "NaN"]``
indicates a complex number with real component of -inf and imaginary
component of NaN.
Raw data types (``r<N>``)
An array of integers, with length equal to ``<N>``, where each integer is
in the range ``[0, 255]``.
Extensions to the spec that define new data types must also define the JSON
fill value representation.
.. note::
The ``fill_value`` metadata field is required, but Zarr implementations
may provide an interface for creating a new array with which users can
leave the fill value unspecified, in which case a default fill value for
the data type will be chosen. However, the default fill value that is
chosen MUST be recorded in the metadata.
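As an illustration of the floating point rules above, the following Python
sketch decodes a parsed JSON fill value for ``float16``, ``float32`` or
``float64``. It is a sketch under assumptions, not a reference implementation:
the function name is hypothetical, the result is a Python ``float`` (IEEE 754
binary64), so values of narrower types are returned widened, and the exact bit
pattern of a NaN is not guaranteed to survive the conversion.

```python
import math
import struct

# struct format characters: (float format, unsigned integer format of equal width)
_FORMATS = {"float16": ("e", "H"), "float32": ("f", "I"), "float64": ("d", "Q")}


def decode_float_fill_value(value, data_type="float64"):
    """Decode a JSON fill value for a float data type into a Python float."""
    float_fmt, uint_fmt = _FORMATS[data_type]
    if isinstance(value, str):
        if value == "Infinity":
            return math.inf
        if value == "-Infinity":
            return -math.inf
        if value == "NaN":
            return math.nan
        if value.startswith("0x"):
            # Reinterpret the unsigned integer as the bits of the float.
            bits = int(value, 16)
            packed = struct.pack(">" + uint_fmt, bits)
            return struct.unpack(">" + float_fmt, packed)[0]
        raise ValueError(f"invalid floating point fill value: {value!r}")
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError("fill value must be a JSON number or string")
    # A pack/unpack round trip rounds to the nearest representable value;
    # struct raises OverflowError if the number is too large for the type.
    return struct.unpack(">" + float_fmt, struct.pack(">" + float_fmt, value))[0]


print(decode_float_fill_value("0x3f800000", "float32"))  # 1.0
print(decode_float_fill_value("-Infinity"))              # -inf
```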
``codecs``
^^^^^^^^^^
Specifies a list of codecs to be used for encoding and decoding chunks. The
value must be an array of objects, each object containing a member with
``name`` whose value is a string referring to a v3 codec specification. The
codec object may also contain a ``configuration`` object which consists of
the parameter names and values as defined by the corresponding codec
specification. Since an ``array -> bytes`` codec must be specified, the
list cannot be empty.
The following members are optional:
``attributes``
^^^^^^^^^^^^^^
The value must be an object. The object may contain any key/value
pairs, where the key must be a string and the value can be an arbitrary
JSON literal. Intended to allow storage of arbitrary user metadata.
.. note::
An extension to store user attributes in a separate document is being
discussed in https://github.com/zarr-developers/zarr-specs/issues/72.
.. note::
A proposal to specify metadata conventions (ZEP 4) is being discussed in
https://github.com/zarr-developers/zeps/pull/28.
``storage_transformers``
^^^^^^^^^^^^^^^^^^^^^^^^
Specifies a stack of `storage transformers`_. Each value in the list must be
an object containing the names ``name`` and optionally ``configuration``.
The ``name`` is required and the value must be a string referring to the
extension. The object may also contain a ``configuration`` object which
consists of the parameter names and values as defined by the corresponding
storage transformer specification. If the ``storage_transformers`` name is
absent, or its value is an empty list, no storage transformer is used.
``dimension_names``
^^^^^^^^^^^^^^^^^^^
Specifies dimension names, e.g. ``["x", "y", "z"]``. If specified, must be
an array of strings or null objects with the same length as ``shape``. An
unnamed dimension is indicated by the null object. If ``dimension_names`` is
not specified, all dimensions are unnamed.
For compatibility with Zarr implementations and applications that support
using dimension names to uniquely identify dimensions, it is recommended but
not required that all non-null dimension names are distinct (no two
dimensions have the same non-null name).
This specification also does not place any restrictions on the use of the
same dimension name across multiple arrays within the same Zarr hierarchy,
but extensions or specific applications may do so.
The array metadata object must not contain any other names.
Those are reserved for future versions of this specification.
An implementation must fail to open Zarr hierarchies, groups
or arrays with unknown metadata fields, with the exception of
objects with a ``"must_understand": false`` key-value pair.
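The rule about unknown metadata fields can be sketched in Python as follows.
This is a minimal illustration that checks names only, not the values of the
known fields. The function name and the choice of ``ValueError`` are
assumptions of this sketch.

```python
REQUIRED_NAMES = {
    "zarr_format", "node_type", "shape", "data_type",
    "chunk_grid", "chunk_key_encoding", "fill_value", "codecs",
}
OPTIONAL_NAMES = {"attributes", "storage_transformers", "dimension_names"}


def check_array_metadata_names(metadata: dict) -> None:
    """Raise ValueError if mandatory names are missing or unknown names appear."""
    missing = REQUIRED_NAMES.difference(metadata)
    if missing:
        raise ValueError(f"missing mandatory names: {sorted(missing)}")
    for name, value in metadata.items():
        if name in REQUIRED_NAMES or name in OPTIONAL_NAMES:
            continue
        # Unknown fields are tolerated only when they are objects carrying
        # a "must_understand": false key-value pair.
        if isinstance(value, dict) and value.get("must_understand") is False:
            continue
        raise ValueError(f"unknown metadata field: {name!r}")
```

A reader could call such a check right after parsing ``zarr.json`` and before
interpreting any of the fields, so that unsupported documents fail early.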
For example, the array metadata JSON document below defines a
two-dimensional array of 64-bit floating point numbers,
with 10000 rows and 1000 columns, divided into a regular chunk grid where
each chunk has 1000 rows and 100 columns, and thus there will be 100
chunks in total arranged into a 10 by 10 grid. Within each chunk the
binary values are laid out in C contiguous order and, as configured for
the ``bytes`` codec, encoded in big-endian byte order. No compression
codec is configured in this example::
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"dimension_names": ["rows", "columns"],
"data_type": "float64",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "big"
}
}],
"fill_value": "NaN",
"attributes": {
"foo": 42,
"bar": "apples",
"baz": [1, 2, 3, 4]
}
}
The following example illustrates an array with the same shape and chunking as
above, but using a (currently made up) extension data type::
{
"zarr_format": 3,
"node_type": "array",
"shape": [10000, 1000],
"data_type": {
"name": "datetime",
"configuration": {
"unit": "ns"
}
},
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [1000, 100]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"codecs": [{
"name": "bytes",
"configuration": {
"endian": "big"
}
}],
"fill_value": null
}
.. note::
Comparison with Zarr spec v2:
- ``dtype`` has been renamed to ``data_type``,
- ``chunks`` has been replaced with ``chunk_grid``,
- ``dimension_separator`` has been replaced with ``chunk_key_encoding``,
- ``order`` has been replaced by the :ref:`transpose <transpose-codec-v1>` codec,
- the separate ``filters`` and ``compressor`` fields have been combined into the single ``codecs`` field.
.. _group-metadata:
Group metadata
--------------
A Zarr group metadata object must contain the following mandatory keys:
``zarr_format``
^^^^^^^^^^^^^^^
An integer defining the version of the storage specification to which the
array store adheres, must be ``3`` here.
``node_type``
^^^^^^^^^^^^^^^
A string defining the type of hierarchy node element, must be ``group``
here.
Optional keys:
``attributes``
^^^^^^^^^^^^^^
The value must be an object. The object may contain any key/value
pairs, where the key must be a string and the value can be an arbitrary
JSON literal. Intended to allow storage of arbitrary user metadata.
For example, the JSON document below defines a group::
{
"zarr_format": 3,
"node_type": "group",
"attributes": {
"spam": "ham",
"eggs": 42
}
}
The group metadata object must not contain any other keys. Those are reserved
for future versions of this specification. An implementation must fail to open
Zarr hierarchies or groups with unknown metadata fields, with the exception of
objects with a ``"must_understand": false`` key-value pair.
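The rules above can be sketched as a small validation routine. This is only an illustrative sketch, not part of the specification: the function name ``validate_group_metadata`` is hypothetical, and it assumes that an unknown field may be ignored only when its value is a JSON object carrying ``"must_understand": false``.

```python
def validate_group_metadata(doc):
    """Return the group's attributes, or raise ValueError if the document is invalid."""
    if not isinstance(doc, dict):
        raise ValueError("group metadata must be a JSON object")

    # zarr_format must be the integer 3 (bool and float values are rejected).
    zarr_format = doc.get("zarr_format")
    if type(zarr_format) is not int or zarr_format != 3:
        raise ValueError("zarr_format must be the integer 3")

    if doc.get("node_type") != "group":
        raise ValueError("node_type must be 'group'")

    # attributes is optional, but must be an object when present.
    attributes = doc.get("attributes", {})
    if not isinstance(attributes, dict):
        raise ValueError("attributes must be a JSON object")

    known = {"zarr_format", "node_type", "attributes"}
    for key, value in doc.items():
        if key in known:
            continue
        # Assumption: only objects marked "must_understand": false may be ignored.
        if isinstance(value, dict) and value.get("must_understand") is False:
            continue
        raise ValueError(f"unknown metadata field: {key!r}")

    return attributes
```

Applied to the example group document above, such a routine would return the ``attributes`` object unchanged, while a document with an unrecognised top-level key would be rejected.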
Node names
==========
The root node has no name and is represented by the empty string ``""``.
Except for the root node, each node in a hierarchy must have a name,
which is a string of unicode code points. The following constraints
apply to node names:
* must not be the empty string (``""``)
* must not include the character ``"/"``
* must not be a string composed only of period characters, e.g. ``"."`` or ``".."``
* must not start with the reserved prefix ``"__"``
To ensure consistent behaviour across different storage systems and programming
languages, we recommend that users only use characters in the sets ``a-z``,
``A-Z``, ``0-9``, ``-``, ``_``, ``.``.
Node names are case sensitive, e.g., the names "foo" and "FOO" are **not**
identical.
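The constraints and the character recommendation above can be checked with a few lines of Python. This is a minimal sketch: the function name ``check_node_name`` and its return convention are hypothetical and are not defined by this specification.

```python
import re

# Characters recommended for portability: a-z, A-Z, 0-9, "-", "_", "."
_RECOMMENDED = re.compile(r"[A-Za-z0-9._-]+")


def check_node_name(name):
    """Raise ValueError if a non-root node name breaks a constraint.

    Returns True if the name also stays within the recommended character
    set, and False if it is valid but uses other characters.
    """
    if name == "":
        raise ValueError("node name must not be the empty string")
    if "/" in name:
        raise ValueError("node name must not include '/'")
    if set(name) == {"."}:
        raise ValueError("node name must not consist only of periods")
    if name.startswith("__"):
        raise ValueError("node name must not start with the reserved prefix '__'")
    return _RECOMMENDED.fullmatch(name) is not None
```

Because names are case sensitive, the check performs no case folding: ``"foo"`` and ``"FOO"`` are both accepted and remain distinct names.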
When using non-ASCII Unicode characters, we recommend that users use
case-folded NFKC-normalized strings following the
`General Security Profile for Identifiers of the Unicode Security Mechanisms (Unicode Technical Standard #39) <http://www.unicode.org/reports/tr39/#General_Security_Profile>`_.
This follows the
`Recommendations for Programmers (B) of the Unicode Security Considerations (Unicode Technical Report #36) <https://unicode.org/reports/tr36/#Recommendations_General>`_.
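As an illustration, a case-folded, NFKC-normalized form of a name can be approximated with Python's standard ``unicodedata`` module. This is a sketch only: the helper name ``normalize_node_name`` is hypothetical, and the code does not implement the full General Security Profile of UTS #39 (for example, it does not restrict which characters are allowed).

```python
import unicodedata


def normalize_node_name(name):
    """Return an approximate case-folded, NFKC-normalized form of name."""
    # Normalize compatibility characters first, then fold case.
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding can produce sequences that are no longer normalized,
    # so normalize once more.
    return unicodedata.normalize("NFKC", folded)
```

With this helper, a name written with full-width Latin letters and the plain ASCII spelling ``"foo"`` map to the same string.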
.. note::
A storage transformer for unicode normalization might be added later, see
https://github.com/zarr-developers/zarr-specs/issues/201.
.. note::
The underlying store might impose additional restrictions on node names,
such as the following:
* `260 characters path length limit in Windows <https://learn.microsoft.com/en-us/windows/win32/fileio/maximum-file-path-limitation>`_
* 1,024-byte UTF-8 object key limit for
`AWS S3 <https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html>`_
and `GCS <https://cloud.google.com/storage/docs/objects#naming>`_, with
additional constraints.
* `Windows paths are case-insensitive by default <https://learn.microsoft.com/en-us/windows/win32/fileio/naming-a-file#naming-conventions>`_
* `macOS paths are case-insensitive by default <https://support.apple.com/guide/disk-utility/file-system-formats-dsku19ed921c/mac>`_
.. note::
If a store requires an explicit byte string representation, the default
representation is the ``UTF-8`` encoded Unicode string.
.. note::
The prefix ``__zarr`` is reserved for core Zarr data, and extensions
can use other files and folders starting with ``__``.
Data types
==========
A data type describes the set of possible binary values that an array
element may take, along with some information about how the values
should be interpreted.
This core specification defines a limited set of data types to
represent boolean values, integers, and floating point
numbers. Extensions may define additional data types. All of the data
types defined here have a fixed size, in the sense that all values
require the same number of bytes. However, extensions may define
variable sized data types.
Note that the Zarr specification is intended to enable communication
of data between a variety of computing environments. The native byte
order may differ between machines used to write and read the data.
Each data type is associated with an identifier, which can be used in
metadata documents to refer to the data type. For the data types
defined in this specification, the identifier is a simple ASCII
string. However, extensions may use any JSON value to identify a data
type.
Core data types
---------------
.. list-table:: Data types
:header-rows: 1
* - Identifier
- Numerical type
* - ``bool``
- Boolean
* - ``int8``
- Integer in ``[-2^7, 2^7-1]``
* - ``int16``
- Integer in ``[-2^15, 2^15-1]``
* - ``int32``
- Integer in ``[-2^31, 2^31-1]``
* - ``int64``
- Integer in ``[-2^63, 2^63-1]``
* - ``uint8``
- Integer in ``[0, 2^8-1]``
* - ``uint16``
- Integer in ``[0, 2^16-1]``
* - ``uint32``
- Integer in ``[0, 2^32-1]``
* - ``uint64``
- Integer in ``[0, 2^64-1]``
* - ``float16`` (optionally supported)
- IEEE 754 half-precision floating point: sign bit, 5 bits exponent, 10 bits mantissa
* - ``float32``
- IEEE 754 single-precision floating point: sign bit, 8 bits exponent, 23 bits mantissa
* - ``float64``
- IEEE 754 double-precision floating point: sign bit, 11 bits exponent, 52 bits mantissa
* - ``complex64``
- real and imaginary components are each IEEE 754 single-precision floating point
* - ``complex128``
- real and imaginary components are each IEEE 754 double-precision floating point
* - ``r*`` (Optional)
- raw bits, variable size given by ``*``, limited to be a multiple of 8
In addition to these base types, an implementation should also handle the
raw/opaque pass-through type designated by the lower-case letter ``r`` followed
by the number of bits, which must be a multiple of 8. For example, ``r8``, ``r16``,
and ``r24`` should be understood as fall-back types of 1, 2, and 3 bytes in length, respectively.
Zarr v3 is limited to type sizes that are a multiple of 8 bits but may support
other type sizes in later versions of this specification.
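To illustrate how the core identifiers, including ``r*``, map to fixed element sizes, the following sketch returns the size in bytes for an identifier. The function name ``item_size`` is hypothetical, and the one-byte size for ``bool`` is an assumption of this sketch rather than something stated in the table above.

```python
import re

# Sizes in bytes of the fixed-size core data types.
# Assumption: bool occupies one byte.
_CORE_SIZES = {
    "bool": 1,
    "int8": 1, "int16": 2, "int32": 4, "int64": 8,
    "uint8": 1, "uint16": 2, "uint32": 4, "uint64": 8,
    "float16": 2, "float32": 4, "float64": 8,
    "complex64": 8, "complex128": 16,
}


def item_size(identifier):
    """Return the element size in bytes for a core data type identifier."""
    if identifier in _CORE_SIZES:
        return _CORE_SIZES[identifier]
    # r* raw types: "r" followed by a bit count that is a multiple of 8.
    match = re.fullmatch(r"r([1-9][0-9]*)", identifier)
    if match:
        bits = int(match.group(1))
        if bits % 8 == 0:
            return bits // 8
    raise ValueError(f"not a core data type identifier: {identifier!r}")
```

Under these assumptions, ``r24`` resolves to a 3-byte element, while ``r12`` is rejected because 12 is not a multiple of 8.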
.. note::
We are explicitly looking for more feedback and prototypes of code that use the ``r*``
raw-bits types with various endianness, and on whether the spec could be made clearer.
.. note::
Currently only fixed size elements are supported as a core data type.
There are many requests for variable length element encoding. There are many
ways to encode variable length and we want to keep flexibility. While we seem
to agree that for random access the most likely contender is to have two
arrays, one with the actual variable length data and one with fixed size
(pointer + length) to the variable size data, we do not want to commit to such
a structure.
See https://github.com/zarr-developers/zarr-specs/issues/62.
Chunk grids
===========
A chunk grid defines a set of chunks which contain the elements of an
array. The chunks of a grid form a tessellation of the array space,
which is a space defined by the dimensionality and shape of the
array. This means that every element of the array is a member of one
chunk, and there are no gaps or overlaps between chunks.
In general there are different possible types of grids. The core
specification defines the regular grid type, where all chunks are
hyperrectangles of the same shape. Extensions may define other grid
types, such as rectilinear grids where chunks are still
hyperrectangles but do not all share the same shape.
A grid type must also define rules for constructing an identifier for
each chunk that is unique within the grid, which is a string of ASCII
characters that can be used to construct keys to save and retrieve
chunk data in a store, see also the `Storage`_ section.
Regular grids
-------------
A regular grid is a type of grid where an array is divided into chunks
such that each chunk is a hyperrectangle of the same shape. The
dimensionality of the grid is the same as the dimensionality of the
array. Each chunk in the grid can be addressed by a tuple of non-negative
integers (`k`, `j`, `i`, ...) corresponding to the indices of the
chunk along each dimension.
The origin element of a chunk has coordinates in the array space (`k` *
`dz`, `j` * `dy`, `i` * `dx`, ...) where (`dz`, `dy`, `dx`, ...) are
the chunk sizes along each dimension.
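The relationship between grid indices and array coordinates can be written out in a few lines. This is a minimal sketch with hypothetical function names; it uses the chunk shape ``[1000, 100]`` from the metadata examples earlier in this document.

```python
def chunk_origin(chunk_index, chunk_shape):
    """Return the array coordinates of the origin element of a chunk."""
    if len(chunk_index) != len(chunk_shape):
        raise ValueError("chunk_index and chunk_shape must have the same length")
    # Each coordinate is the grid index times the chunk size along that dimension.
    return tuple(k * d for k, d in zip(chunk_index, chunk_shape))


def chunk_index_of(element, chunk_shape):
    """Return the grid index of the chunk containing an array element."""
    if len(element) != len(chunk_shape):
        raise ValueError("element and chunk_shape must have the same length")
    # Floor division inverts the origin computation.
    return tuple(c // d for c, d in zip(element, chunk_shape))


origin = chunk_origin((2, 3), (1000, 100))        # (2000, 300)
index = chunk_index_of((2500, 399), (1000, 100))  # (2, 3)
```

The second function is the inverse lookup that a reader performs when locating the chunk holding a given element.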
Thus the origin element of the chunk at grid index (0, 0, 0,
...) is at coordinate (0, 0, 0, ...) in the array space, i.e., the
grid is aligned with the origin of the array. If the length of any
array dimension is not perfectly divisible by the chunk length along
the same dimension, then the grid will overhang the edge of the array