Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a history of storage format versions et. al. #5205

Merged
merged 13 commits into from
Aug 28, 2024

Conversation

teo-tsirpanis
Copy link
Member

@teo-tsirpanis teo-tsirpanis commented Jul 19, 2024

SC-50706

This PR updates the storage format specification to include information about all past versions. The work is divided into two parts. First, a new file was created that lists what changed in each storage format version, going in more detail than the existing table (which will be removed). After that, the fields in the various data structures will be updated to indicate in which version they were introduced.

I sourced the changes in each storage format by searching for the version(\(\)|_)? (>=?|<=?|==) \d+ regex in the code, as well as searching for references of these named constants.

I also made other small fixes to the spec as I found them.

Information about past format versions of groups will be added in another PR.


TYPE: NO_HISTORY
DESC: The storage format specification was updated to include information about all previous versions.

Copy link
Member Author

@teo-tsirpanis teo-tsirpanis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I linked each change to the pull request that introduced it.


## Version 22

Introduced in TileDB 2.25
Copy link
Member Author

@teo-tsirpanis teo-tsirpanis Jul 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not add the commit or PR that bumped the version, because several of the changes did not happen at the same time the version was bumped, and the purpose of this spec is that you don't need to read the source code to understand it.


Introduced in TileDB 2.25

* The _Current domain_ field was added to [array schemas](./array_schema.md).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for fixed-size strings.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.17

* Arrays can have [enumerations](./enumeration.md).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General feedback: should we be including links to PRs in the Markdown itself? If you don't think that's appropriate, we can skip, but in the past open-source projects I've worked on, it was common practice to publicly include the PRs in all release documentation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's definitely not required as I said above, and personally I would prefer to clearly separate the storage format specification from its implementation, but I'm not strongly opposed to it.

@ihnorton what do you think?


* Deleting data from arrays is supported. Arrays can have [delete commit files](./delete_commit_file.md).
* Updating data in arrays is supported. Arrays can have [update commit files](./update_commit_file.md).
* Fragment metadata contain [tile processed conditions](./fragment.md#tile-processed-conditions).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.12

* Deleting data from arrays is supported. Arrays can have [delete commit files](./delete_commit_file.md).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format_spec/history.md Outdated Show resolved Hide resolved
Introduced in TileDB 2.14

* The _Order_ field was added to [attributes](./array_schema.md#attribute).
* Cell offsets in dimensions or attributes of UTF-8 string type are not written in the offset tiles, if the RLE or dictionary filter exists in the filter pipeline. They are instead encoded as part of the data tile.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#3868


I am not entirely sure, from the storage format perspective does this change only mean that the offsets file will be empty?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so


* The [array file hierarchy](./array_file_hierarchy.md) was updated to store fragments, commits and consolidated fragment metadata in separate subdirectories.
* The extension of commit files was changed to `.wrt`.
* Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#2969


The same with the dictionary filter happened in version 13, but I am not including it because that's when the dictionary filter was added.

* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Committed fragments are indicated by the presence of an `.ok` file in the array's directory, with the same [timestamped name](./timestamped_name.md) as the fragment.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
* The _Allows dups_ field was added to [array schemas](./array_schema.md#array-schema-file).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.2.3

* [Data files](./fragment.md#data-file) are named by the name of their attribute or dimension, after percent encoding certain characters. These characters are `!#$%&'()*+,/:;=?@[]`, as specified in [RFC 3986](https://tools.ietf.org/html/rfc3986), as well as `"<>\|`, which are not allowed in Windows file names.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


## Version 8

Introduced in TileDB 2.2.3
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduced in 2.3.x, backported to 2.2.x in #2054.


Introduced in TileDB 2.0

* Dimensions are stored in separate [data files](./fragment.md#data-file).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to mention zipped coordinates here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My plan of documenting zipped coordinates is to update fragment.md in a subsequent PR to say something like "prior to format version 6 dimension data for sparse arrays are stored in a __coords file…".

What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good.

Comment on lines +116 to +119
* Attributes can be nullable.
* The _Nullable_ and _Fill value validity_ fields were added to [attributes](./array_schema.md#attribute).
* The _Validity filters_ field was added to [array schemas](./array_schema.md#array-schema-file).
* Fragment metadata contain validity [tile offsets](./fragment.md#tile-offsets).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Introduced in TileDB 2.1

* The _Fill value_ field was added to [attributes](./array_schema.md#attribute).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +132 to +135
* Sparse arrays can have string dimensions and dimensions with different datatypes.
* The _Dimension datatype_, _Cell val num_ and _Filters_ fields were added to [dimensions](./array_schema.md#dimension).
* The _Domain size_ field was added to [dimensions](./array_schema.md#dimension). The domain of a dimension can have a variable size.
* The _Domain datatype_ field was removed from [domains](./array_schema.md#domain).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines +136 to +137
* The [MBR](./fragment.md#mbr) structure has been updated to support variable-sized dimensions.
* The _Dimension number_ and _R-Tree datatype_ fields have been removed from [R-Trees](./fragment.md#r-tree).
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

format_spec/history.md Outdated Show resolved Hide resolved
format_spec/history.md Outdated Show resolved Hide resolved
Comment on lines +152 to +154
* The structure of [fragment metadata files](./fragment.md#fragment-metadata-file) was overhauled.
* The [footer](./fragment.md#footer) and [R-Tree](./fragment.md#r-tree) structures were added.
* The _Bounding coords_ field was removed.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#1197, specifically this diff


Introduced in TileDB 1.5

* Cell coordinate values of each dimension are always stored next to each other, regardless of whether they are filtered with a compression filter or not.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@teo-tsirpanis
Copy link
Member Author

teo-tsirpanis commented Jul 24, 2024

The first part of the work is done. For ease of review we can do the second part in a subsequent PR.

@teo-tsirpanis teo-tsirpanis marked this pull request as ready for review July 24, 2024 18:01
Copy link
Contributor

@nickvigilante nickvigilante left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good to me! Only one comment about including the issues directly in line with the release notes.


Introduced in TileDB 2.19

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General feedback: should we be including links to PRs in the Markdown itself? If you don't think that's appropriate, we can skip, but in the past open-source projects I've worked on, it was common practice to publicly include the PRs in all release documentation.

@teo-tsirpanis teo-tsirpanis requested a review from KiterLuc August 20, 2024 18:40
format_spec/history.md Outdated Show resolved Hide resolved
format_spec/history.md Outdated Show resolved Hide resolved

Introduced in TileDB 2.0

* Dimensions are stored in separate [data files](./fragment.md#data-file).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would try to mention zipped coordinates here.


Introduced in TileDB 2.9

* The [dictionary filter](./filters/dictionary_encoding.md) was added.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider mentioning:
Cell offsets in dimensions or attributes of ASCII string type are not written in the offset tiles, if the RLE filter exists in the filter pipeline. They are instead encoded as part of the data tile.


Introduced in TileDB 2.10

* Fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).
* Consolidated fragments can have timestamp files. The _Includes timestamps_ field was added to the [fragment metadata footer](./fragment.md#footer).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


Introduced in TileDB 2.11

* Fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).
* Consolidated fragments can have delete metadata files. The _Includes delete metadata_ field was added to the [fragment metadata footer](./fragment.md#footer).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That wasn't apparent from a read of the source code but sure, it's valuable to mention. Reminder for myself to update the documentation of this flag in a subsequent PR.

* The TileDB implementation has been updated to fix computing [tile metadata](./fragment.md#tile-mins-maxes) for nullable fixed-size strings on dense arrays.

> [!NOTE]
> This version does not contain any changes to the storage format, but was introduced as an indicator for implementations to not rely on tile metadata under certain circumstances on previous versions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only applies to fixed sized strings on dense arrays I think?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote this in the bullet point above and did not want to repeat myself.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm mostly worried someone reads this "not rely on tile metadata under certain circumstances on previous versions." and makes the wrong assumption. I think we should repeat for 100% clarity. This is the last issue for me in this PR, once we resolve it's good to go IMO.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

@KiterLuc KiterLuc merged commit 6921ebb into TileDB-Inc:dev Aug 28, 2024
62 checks passed
@teo-tsirpanis teo-tsirpanis deleted the format-history branch August 28, 2024 12:24
teo-tsirpanis added a commit that referenced this pull request Oct 25, 2024
[SC-54621](https://app.shortcut.com/tiledb-inc/story/54621/explicitly-mention-what-changed-in-each-storage-format-version-throughout-the-specification)

Continuing #5205, this PR goes through each format version and mentions
the changes to the main specification document. It also documents
structures like `__coords.tdb` and legacy fragment metadata, and fixes
any defects encountered in the meantime.

---
TYPE: FORMAT
DESC: The storage format specification was updated to document format
changes of previous versions throughout the main document.

---------

Co-authored-by: Nick Vigilante <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants