Add a history of storage format versions et. al. (#5205)
[SC-50706](https://app.shortcut.com/tiledb-inc/story/50706/include-all-historical-storage-format-version-information-in-the-current-spec)

This PR updates the storage format specification to include information
about all past versions. The work is divided into two parts. First, a
new file was created that lists what changed in each storage format
version, going into more detail than the existing table (which will be
removed). After that, the fields in the various data structures will be
updated to indicate in which version they were introduced.

I sourced the changes in each storage format by searching for the
`version(\(\)|_)? (>=?|<=?|==) \d+` regex in the code, as well as
searching for references to [these named
constants](https://github.com/TileDB-Inc/TileDB/blob/41eb1cc9df9603fc82a6c9dbaf0ae0c25a8ace8f/tiledb/sm/misc/constants.cc#L706-L731).
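
As an illustration of what that search pattern catches, here is a minimal C++ sketch using `std::regex`. The function name `has_version_comparison` and the sample lines in the note below are hypothetical and not taken from the TileDB code base; the description does not say which tool ran the actual search.

```cpp
#include <regex>
#include <string>

// Hypothetical helper: reports whether a line of source code contains a
// format-version comparison, using the pattern quoted in the description.
// std::regex defaults to the ECMAScript grammar, which supports \d and \( \).
inline bool has_version_comparison(const std::string& line) {
  static const std::regex pattern(R"(version(\(\)|_)? (>=?|<=?|==) \d+)");
  return std::regex_search(line, pattern);
}
```

Under this sketch, a line such as `if (version_ >= 5) {` or `format_version() < 10` matches, while `version = 3` does not, because the pattern requires a comparison operator with a single space on each side.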

I also made other small fixes to the spec as I found them.

Information about past format versions of _groups_ will be added in
another PR.

---
TYPE: NO_HISTORY
DESC: The storage format specification was updated to include
information about all previous versions.

---------

Co-authored-by: KiterLuc <[email protected]>
teo-tsirpanis and KiterLuc authored Aug 28, 2024
1 parent 8b15119 commit 6921ebb
Showing 5 changed files with 176 additions and 32 deletions.
30 changes: 2 additions & 28 deletions format_spec/FORMAT_SPEC.md
@@ -11,36 +11,10 @@ title: Format Specification
- [Dictionary filter](filters/dictionary_encoding.md)
- RLE filter

## History

|Format version|TileDB version|Description|
|-|-|-|
|1|1.4|[Decouple format and library version](https://github.com/TileDB-Inc/TileDB/commit/610f087515b6de5c3290b09dab30c6943ec77feb)|
|2|1.5|[Always split coordinate tiles](https://github.com/TileDB-Inc/TileDB/commit/9394b38bdfbacd606d673896b4ae87e7968b7c2f)|
|3|1.6|[Parallelize fragment metadata loading](https://github.com/TileDB-Inc/TileDB/commit/a2eb6237e622c3a17691dbe04c9223ba099f7466)|
|4|1.7|[Remove KV storage](https://github.com/TileDB-Inc/TileDB/commit/e733f7baa85a41e25e5834a220234397d6038401)|
|5|2.0|[Split coordinates into individual files](https://github.com/TileDB-Inc/TileDB/commit/d3543bdbc4ee7c2ed1f2de8cee42b04c6ec8eafc)|
|6|2.1|[Implement attribute fill values](https://github.com/TileDB-Inc/TileDB/commit/eaafa47c97af0ee654a0ca2e97da7b8d941e672b)|
|7|2.2|[Nullable attribute support](https://github.com/TileDB-Inc/TileDB/commit/a7fd8d6dd74bb4fa1ae25a6f995da93812f92c20)|
|8|2.3|[Percent encode attribute/dimension file names](https://github.com/TileDB-Inc/TileDB/commit/97c5c4b0aa35cfd96197558ffc1189860b4adc6f)|
|9|2.3|[Name attribute/dimension files by index](https://github.com/TileDB-Inc/TileDB/commit/9a2ed1c22242f097300c2909baf6cb671a7ee33e)|
|10|2.4|[Added array schema evolution](https://github.com/TileDB-Inc/TileDB/commit/41e5e8f4b185f49777560d637b1d61de498364ce)|
|11|2.7|[Store integral cells, aka, don't split cells across chunks](https://github.com/TileDB-Inc/TileDB/commit/beab5113526b7156c8c6492542f1681555c8ae87)|
|12|2.8|[New array directory structure](https://github.com/TileDB-Inc/TileDB/commit/ce204ad1ea5b40f006f4a6ddf240d89c08b3235b)|
|13|2.9|[Add dictionary filter](https://github.com/TileDB-Inc/TileDB/commit/5637e8c678451c9d2356ccada118b504c8ca85f0)|
|14|2.10|[Consolidation with timestamps, add has_timestamps to footer](https://github.com/TileDB-Inc/TileDB/commit/31a3dce8db254efc36f6d28249febed41bba3bcd)|
|15|2.11|[Remove consolidate with timestamps config](https://github.com/TileDB-Inc/TileDB/commit/6b49739e79d804dc56eb0a7e422823ae6f002276)|
|16|2.12|[Implement delete strategy](https://github.com/TileDB-Inc/TileDB/commit/8d64b1f38177113379fa741016136dbd2b06fcfd)|
|17|2.14|[Add dimension labels and data order](https://github.com/TileDB-Inc/TileDB/commit/bb433fcf12dc74a38c7e843808ec1e593b16ce71)|
|18|2.15|[Dimension Labels no longer experimental](https://github.com/TileDB-Inc/TileDB/commit/c3a1bb47e7237f50e8ed9e33abfaa3161e23ff64)|
|19|2.16|[Vac files now use relative URIs](https://github.com/TileDB-Inc/TileDB/commit/ef3236a526b67c50138436a16f67ad274c2ca037)|
|20|2.17|[Enumerations](https://github.com/TileDB-Inc/TileDB/commit/c0d7c6a50fdeffbcc7d8c9ba4a29230fe22baed6)|
|21|2.19|[Tile metadata are now correctly calculated for nullable fixed size strings on dense arrays](https://github.com/TileDB-Inc/TileDB/commit/081bcc5f7ce4bee576f08b97de348236ac88d429)|
|22|2.25|[Add array current domain](https://github.com/TileDB-Inc/TileDB/commit/9116d3c95a83d72545520acb9a7808fc63478963)|

## Table of Contents

* **Array**
* [Format Version History](./history.md)
* [File hierarchy](./array_file_hierarchy.md)
* [Array Schema](./array_schema.md)
* [Fragment](./fragment.md)
@@ -53,4 +27,4 @@ title: Format Specification
* [Consolidated Fragment Metadata File](./consolidated_fragment_metadata_file.md)
* [Filter Pipeline](./filter_pipeline.md)
* [Timestamped Name](./timestamped_name.md)
* [Vacuum Pipeline](./vacuum_file.md)
* [Vacuum File](./vacuum_file.md)
169 changes: 169 additions & 0 deletions format_spec/history.md
@@ -0,0 +1,169 @@
---
title: Format version history
---

# Format Version History

## Version 22

Introduced in TileDB 2.25

* The _Current domain_ field was added to [array schemas](./array_schema.md#array-schema-file).

## Version 21

Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.

> [!NOTE]
> This version does not contain any changes to the storage format, but was introduced as an indicator for implementations to not rely on tile metadata for nullable fixed-size strings on dense arrays on previous versions.

## Version 20

Introduced in TileDB 2.17

* Arrays can have [enumerations](./enumeration.md).
* The bit-width reduction and positive delta filters are supported on data of date or time types.
* The [filter pipeline options](./filter_pipeline.md#filter-options) for the double-delta filter contain the _Reinterpret datatype_ field.

## Version 19

Introduced in TileDB 2.16

* [Vacuum files](./vacuum_file.md) contain relative paths to the location of the array.
* The [filter pipeline options](./filter_pipeline.md#filter-options) for the delta filter contain the _Reinterpret datatype_ field.

## Version 18

Introduced in TileDB 2.15

* Arrays can have [dimension labels](./array_schema.md#dimension-label).

## Version 17

Introduced in TileDB 2.14

* The _Order_ field was added to [attributes](./array_schema.md#attribute).
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile.

## Version 16

Introduced in TileDB 2.12

* Arrays can have [delete commit files](./delete_commit_file.md).
* Arrays can have [update commit files](./update_commit_file.md).
* The TileDB implementation currently supports writing update commit files as an experimental feature, but they are not yet considered when performing reads.
* Fragment metadata contain [tile processed conditions](./fragment.md#tile-processed-conditions).

## Version 15

Introduced in TileDB 2.11

* Consolidated fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).

## Version 14

Introduced in TileDB 2.10

* Consolidated fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).

## Version 13

Introduced in TileDB 2.9

* The [dictionary filter](./filters/dictionary_encoding.md) was added.

## Version 12

Introduced in TileDB 2.8

* The [array file hierarchy](./array_file_hierarchy.md) was updated to store fragments, commits and consolidated fragment metadata in separate subdirectories.
* The extension of commit files was changed to `.wrt`.
* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.
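
The offset-tile rule above, together with the related UTF-8 rule listed under Version 17, might be sketched as a single check. This is a rough illustration and not the TileDB implementation: `StringType`, `offsets_encoded_in_data_tile` and the boolean parameters are hypothetical names, and the function encodes only the two rules stated in this history.

```cpp
#include <cstdint>

// Hypothetical classification of the datatype of an attribute or dimension.
enum class StringType { Ascii, Utf8, NotString };

// Returns true when, according to the two rules in this history, the cell
// offsets are encoded as part of the data tile instead of the offset tiles.
inline bool offsets_encoded_in_data_tile(
    StringType type,
    std::uint32_t format_version,
    bool has_rle_filter,
    bool has_dictionary_filter) {
  switch (type) {
    case StringType::Ascii:
      // Version 12: ASCII strings with the RLE filter in the pipeline.
      return format_version >= 12 && has_rle_filter;
    case StringType::Utf8:
      // Version 17: UTF-8 strings with the RLE or dictionary filter.
      return format_version >= 17 &&
             (has_rle_filter || has_dictionary_filter);
    default:
      // Other combinations (for example ASCII with only the dictionary
      // filter) are not described in this history, so this sketch
      // conservatively returns false for them.
      return false;
  }
}
```

A reader could use such a check to decide whether to expect offsets in the offset tiles of a given fragment, though the authoritative behavior is whatever the filter pipeline code does for each version.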

## Version 11

Introduced in TileDB 2.7

* Fragment metadata contain [metadata](./fragment.md#tile-mins-maxes) (min/max value, sum, null count) for each tile.
* The TileDB implementation has been updated to never split cells when storing them in chunks.

## Version 10

Introduced in TileDB 2.4

* Arrays support schema evolution.
* Array schemas are stored in a `__schema` subdirectory, and have a [timestamped name](./timestamped_name.md).
* The _Array schema name_ field was added to the [fragment metadata footer](./fragment.md#footer).
* The _Footer length_ field of the [fragment metadata footer](./fragment.md#footer) is always written.
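
The footer additions listed for Versions 10, 14 and 15 can be summarized as simple version checks. The following is a minimal sketch under that reading; `FooterFields` and `footer_fields_for` are hypothetical names, and the sketch says nothing about the order or encoding of the fields inside the footer.

```cpp
#include <cstdint>

// Hypothetical summary of which optional footer fields a fragment metadata
// footer carries, based only on the versions named in this history.
struct FooterFields {
  bool has_array_schema_name;         // added in version 10
  bool has_includes_timestamps;       // added in version 14
  bool has_includes_delete_metadata;  // added in version 15
};

inline FooterFields footer_fields_for(std::uint32_t format_version) {
  return FooterFields{
      format_version >= 10,
      format_version >= 14,
      format_version >= 15,
  };
}
```

For example, under this sketch a version 14 fragment carries the _Array schema name_ and _Includes timestamps_ fields but not _Includes delete metadata_.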

## Version 9

Introduced in TileDB 2.3

* [Data files](./fragment.md#data-file) are named by the index of their attribute or dimension.
* The _URI_ fields of [Consolidated fragment metadata files](./consolidated_fragment_metadata_file.md) contain relative paths to the location of fragments in the array.

## Version 8

Introduced in TileDB 2.2.3

* [Data files](./fragment.md#data-file) are named by the name of their attribute or dimension, after percent encoding certain characters. These characters are `!#$%&'()*+,/:;=?@[]`, as specified in [RFC 3986](https://tools.ietf.org/html/rfc3986), as well as `"<>\|`, which are not allowed in Windows file names.
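
The percent encoding described above could look roughly like the following. This is a sketch, not the TileDB implementation: the helper name `percent_encode_name` is hypothetical, and emitting two uppercase hexadecimal digits after `%` is an assumption based on common RFC 3986 practice.

```cpp
#include <string>

// Hypothetical helper: percent-encodes the characters listed for Version 8
// in an attribute or dimension name. The uppercase hex form is an assumption.
inline std::string percent_encode_name(const std::string& name) {
  // The characters listed above: !#$%&'()*+,/:;=?@[] followed by "<>\|
  static const std::string to_encode = "!#$%&'()*+,/:;=?@[]\"<>\\|";
  static const char hex[] = "0123456789ABCDEF";

  std::string out;
  out.reserve(name.size());
  for (unsigned char c : name) {
    if (to_encode.find(static_cast<char>(c)) != std::string::npos) {
      // Replace the character with '%' and its two-digit hex code.
      out += '%';
      out += hex[c >> 4];
      out += hex[c & 0x0F];
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

With this sketch, a name such as `a/b` becomes `a%2Fb`, and names without any of the listed characters are returned unchanged.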

## Version 7

Introduced in TileDB 2.2

* Attributes can be nullable.
* The _Nullable_ and _Fill value validity_ fields were added to [attributes](./array_schema.md#attribute).
* The _Validity filters_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Fragment metadata contain validity [tile offsets](./fragment.md#tile-offsets).

## Version 6

Introduced in TileDB 2.1

* The _Fill value_ field was added to [attributes](./array_schema.md#attribute).

## Version 5

Introduced in TileDB 2.0

* Dimensions are stored in separate [data files](./fragment.md#data-file).
* Sparse arrays can have string dimensions and dimensions with different datatypes.
* The _Dimension datatype_, _Cell val num_ and _Filters_ fields were added to [dimensions](./array_schema.md#dimension).
* The _Domain size_ field was added to [dimensions](./array_schema.md#dimension). The domain of a dimension can have a variable size.
* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Committed fragments are indicated by the presence of an `.ok` file in the array's directory, with the same [timestamped name](./timestamped_name.md) as the fragment.

## Version 4

Introduced in TileDB 1.7

* Support for the [key-value store](https://tiledb-inc-tiledb.readthedocs-hosted.com/en/1.6.3/tutorials/kv.html) object type was removed. Key-value stores have been superseded by sparse arrays.

## Version 3

Introduced in TileDB 1.6

* The structure of [fragment metadata files](./fragment.md#fragment-metadata-file) was overhauled.
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
* The _Bounding coords_ field was removed.
* The _MBRs_ field was removed. MBRs are now stored in the R-Tree.
* Structures other than the footer like tile offsets, sizes and metadata are wrapped in their own generic tiles. This allows loading them lazily and in parallel.

## Version 2

Introduced in TileDB 1.5

* Cell coordinate values of each dimension are always stored next to each other, regardless of whether they are filtered with a compression filter or not.

## Version 1

Introduced in TileDB 1.4

* Initial version of the TileDB storage format.
2 changes: 1 addition & 1 deletion format_spec/vacuum_file.md
@@ -22,7 +22,7 @@ my_array # array folder
| ...
```

When located in the commits folder, it will include the URI of fragments (in the `__fragments` folder) that can be vaccumed. When located in the array metadata folder, it will include the URI or array metadata files that can be vaccumed.
When located in the commits folder, it will include the URI of fragments (in the `__fragments` folder) that can be vacuumed. When located in the array metadata folder, it will include the URI of array metadata files that can be vacuumed.

The vacuum file is a simple text file where each line contains a URI string:

5 changes: 3 additions & 2 deletions tiledb/sm/filter/filter_pipeline.h
@@ -288,8 +288,9 @@ class FilterPipeline {
FilterPipeline* pipeline, const EncryptionKey& encryption_key);

/**
* Checks if an attribute/dimension needs to be filtered in chunks or as a
* whole
* Checks if the offsets tiles of an attribute/dimension should be skipped
* from being written. This happens in filters that encode the offsets
* alongside the data.
*
* @param type Datatype of the input attribute/dimension
* @param version Array schema version
2 changes: 1 addition & 1 deletion tiledb/sm/fragment/fragment_metadata.cc
@@ -2040,7 +2040,7 @@ void FragmentMetadata::load_has_timestamps(Deserializer& deserializer) {
// ===== FORMAT =====
// has_delete_meta (char)
void FragmentMetadata::load_has_delete_meta(Deserializer& deserializer) {
// Get includes timestamps
// Get includes delete metadata
has_delete_meta_ = deserializer.read<char>();

// Rebuild index map