feat(serialization): Store relative offsets for tensor data and headers #170

Eta0 · 2024-07-18T08:12:00Z

Segment-relative offsets in headers

Closes #142. Helpful for #148, #150, and #159.

Overview

Instead of storing the absolute file position of header entries and tensor data, this change stores their offsets relative to the beginning of their respective segment (e.g. the header segment or the data segment). This makes it easier to create and edit various parts of a file's metadata independently of other parts of the metadata or the tensor data. Information about absolute offsets is instead stored on a segment-by-segment basis in a "layout" section of the metadata, preceding the metadata and header sections.

This separates the file header and tensor data enough that it should now even be possible to split them into two completely separate files, should that design prove helpful later on.

This change also aligns the end of the metadata segment to a 4096-byte block boundary. The header segment also ends on a 4096-byte boundary, as it did before. This makes it easier to manipulate these segments with fallocate when resolving #150 and #159 later.

Deserialization

This change requires minimal modifications to deserialization. In newer files, the layout section is read, and the absolute offsets listed in it are used to convert all subsequently-read relative offsets into absolute offsets.
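As a rough illustration of that conversion (not the code in this PR), the sketch below adds a segment's absolute start offset, as read from the layout section, to a relative offset. The names `SegmentStarts` and `to_absolute` and the example offsets are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentStarts:
    """Absolute start offsets taken from the layout section (hypothetical container)."""

    header: int
    data: int


def to_absolute(segment_start: int, relative_offset: int) -> int:
    """Convert a segment-relative offset into an absolute file offset."""
    if relative_offset < 0:
        raise ValueError("relative offsets are expected to be non-negative")
    return segment_start + relative_offset


# Example values only; real offsets come from the file being read.
starts = SegmentStarts(header=8192, data=12288)
header_entry_pos = to_absolute(starts.header, 256)
tensor_data_pos = to_absolute(starts.data, 4096)
```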

Structure

The binary format of the layout section is:

- Byte length of the layout section (64-bit)
- Version tag (8-bit, always 1 for now)
- Number of segments (8-bit)
- Metadata segment
  - Tag (8-bit, always 1)
  - Absolute start offset (64-bit)
  - Absolute end offset (64-bit)
- Header segment
  - Tag (8-bit, always 2)
  - Absolute start offset (64-bit)
  - Absolute end offset (64-bit)
- Tensor data segment
  - Tag (8-bit, always 3)
  - Absolute start offset (64-bit)
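To make the field order above concrete, here is a minimal sketch of packing and parsing such a layout section with `struct`. It is not the implementation in this PR. It assumes little-endian, unsigned fields, and it assumes the byte-length field counts only the bytes that follow it; the description above does not settle either point. All function and constant names are hypothetical.

```python
import struct
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    tag: int
    start: int
    end: Optional[int]  # None for the tensor data segment (no end offset stored)


_LENGTH = struct.Struct("<Q")      # byte length of the layout section
_PREAMBLE = struct.Struct("<BB")   # version tag, number of segments
_BOUNDED = struct.Struct("<BQQ")   # tag, absolute start, absolute end
_UNBOUNDED = struct.Struct("<BQ")  # tag, absolute start
TAG_METADATA, TAG_HEADER, TAG_DATA = 1, 2, 3


def pack_layout(
    metadata: Tuple[int, int], header: Tuple[int, int], data_start: int
) -> bytes:
    body = _PREAMBLE.pack(1, 3)
    body += _BOUNDED.pack(TAG_METADATA, *metadata)
    body += _BOUNDED.pack(TAG_HEADER, *header)
    body += _UNBOUNDED.pack(TAG_DATA, data_start)
    # Assumption: the length prefix excludes its own 8 bytes.
    return _LENGTH.pack(len(body)) + body


def parse_layout(buf: bytes) -> List[Segment]:
    (length,) = _LENGTH.unpack_from(buf, 0)
    pos = _LENGTH.size
    expected_end = pos + length
    version, count = _PREAMBLE.unpack_from(buf, pos)
    pos += _PREAMBLE.size
    if version != 1:
        raise ValueError(f"unsupported layout version: {version}")
    segments = []
    for _ in range(count):
        tag = buf[pos]  # peek at the tag to decide which record shape follows
        if tag in (TAG_METADATA, TAG_HEADER):
            _, start, end = _BOUNDED.unpack_from(buf, pos)
            pos += _BOUNDED.size
        elif tag == TAG_DATA:
            _, start = _UNBOUNDED.unpack_from(buf, pos)
            end = None
            pos += _UNBOUNDED.size
        else:
            raise ValueError(f"unknown segment tag: {tag}")
        segments.append(Segment(tag, start, end))
    if pos != expected_end:
        raise ValueError("layout section length does not match its contents")
    return segments


raw = pack_layout((4096, 8192), (8192, 12288), 12288)
layout = parse_layout(raw)
```

Reading the tag before choosing a record shape is what lets a parser like this tolerate segments that carry more or fewer fields.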

The tag fields aren't strictly necessary, but are present to make it slightly easier to write a deserializer that could handle the metadata segment being removed in the future, or optional segments, or variations of segments that include more or less information. I'm open to suggestions on improving the format for the layout section. Other potentially odd choices at present:

- End offsets refer to the allocated space for the segment, including padding, and do not represent the length of meaningful content in a segment
  - Length of meaningful content may be encoded elsewhere in the metadata as appropriate
- End offsets are included for the metadata and header segments, even though they are currently equal to the start offset of the following segment
  - This is here for convenience and in case segments somehow need to not be written back-to-back in a single file at some point
- Tensor data doesn't have an end offset listed
  - The end of the tensor data segment is assumed to be the end of the file
  - This lessens the need to update the layout segment after every tensor (bulk-)write
  - We could probably add an end offset anyway if we wanted to since that information is available to the serializer during each prologue synchronization
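For the last point, a reader could recover the bounds of the tensor data segment from the file size alone. The sketch below shows one way to do that under the stated assumption that the segment runs to the end of the file; `data_segment_bounds` is a hypothetical helper, not part of this PR.

```python
import os
import tempfile
from typing import Tuple


def data_segment_bounds(fd: int, data_start: int) -> Tuple[int, int]:
    """Return (start, end) of the tensor data segment, taking end = file size."""
    file_size = os.fstat(fd).st_size
    if file_size < data_start:
        raise ValueError("file is shorter than the tensor data start offset")
    return data_start, file_size


# Example with a scratch file: 12288 bytes of metadata and headers, 100 of data.
with tempfile.TemporaryFile() as f:
    f.write(b"\x00" * (12288 + 100))
    f.flush()
    bounds = data_segment_bounds(f.fileno(), 12288)
```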

If we want to support serialization to multiple files eventually, these fields could also be modified to include file identifiers, e.g. listing that the metadata is in file 0 but the tensor data segment is located in file 1, or potentially even something like listing multiple tensor data segments spread across multiple files.

Other bug fixes

Buffered writes at the beginning of a file were not being properly flushed before switching to writing with pwrite, causing the beginning of a file to sometimes be overwritten with unfinished data. This was a regression from v2.9.0, where the writer was always flushed after buffered writes; it is fixed again in this PR.
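The ordering issue can be reproduced in miniature. The sketch below is not the serializer's code: it writes a prefix through a buffered file object, flushes it, and only then patches bytes with `os.pwrite` (available on Unix-like systems). Without the `flush()` call, the buffered prefix would reach the file later and could overwrite the patched bytes. The function name and values are hypothetical.

```python
import os
import tempfile


def write_prefix_then_patch(
    path: str, prefix: bytes, patch: bytes, patch_offset: int
) -> None:
    with open(path, "wb") as f:
        f.write(prefix)  # may sit in the userspace buffer
        f.flush()        # hand the buffered bytes to the OS first
        # Positional write; does not go through the file object's buffer.
        os.pwrite(f.fileno(), patch, patch_offset)


with tempfile.TemporaryDirectory() as tmp:
    example_path = os.path.join(tmp, "example.bin")
    write_prefix_then_patch(example_path, b"\x00" * 16, b"DATA", 4)
    with open(example_path, "rb") as f:
        contents = f.read()
```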


wbrown

This looks good! 👍

My general comment would be that we should probably encode a data_length field. As you mentioned in your PR description, end_offset no longer meaningfully tells us the actual length of the data.

We may want to make it easier on ourselves to slice that data by having the actual data length available.

wbrown · 2024-08-27T14:11:38Z

tensorizer/serialization.py

+            self._metadata_end = self._metadata_start + 8 + approx_metadata_size
+            # Extend the metadata segment to end on a block boundary
+            # This allows later manipulation with the fallocate syscall
+            # on coöperating operating systems to extend this segment


This spelling of cooperating intended? :)

wbrown · 2024-08-27T14:12:20Z

tensorizer/serialization.py

+            # This allows later manipulation with the fallocate syscall
+            # on coöperating operating systems to extend this segment
+            # in the middle of the file
+            self._metadata_end -= self._metadata_end % -4096


Should block boundaries be a constant? i.e. do we know that the aligned boundaries are the same on ARM vs x86-64?
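For reference, the expression in the hunk above rounds up to the next multiple of 4096, because in Python `x % -n` falls in the range `(-n, 0]`. A minimal sketch of the same arithmetic with the block size pulled out into a named value is shown below. `BLOCK_SIZE` and `align_up` are hypothetical names, and whether 4096 is the right value on every platform remains the open question in the comment above.

```python
BLOCK_SIZE = 4096  # hypothetical named constant; the PR hard-codes 4096


def align_up(offset: int, block_size: int = BLOCK_SIZE) -> int:
    """Round offset up to the next multiple of block_size."""
    # offset % -block_size is zero or negative, so subtracting it rounds up.
    return offset - (offset % -block_size)


aligned = align_up(5000)
```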

wbrown · 2024-08-27T14:15:12Z

tensorizer/serialization.py

+    _FORMAT: ClassVar[struct.Struct] = struct.Struct(
+        "<"
+        "B"  # Layout version tag
+        "B"  # Number of entries


This is the number of metadata header entries, which is currently 3, correct? I ask because we have this as B. :)

Eta0 added the serialization and schema-change labels (schema-change: label for 3.0-dev work that involves schema changes) Jul 18, 2024

Eta0 requested review from wbrown and bchess July 18, 2024 08:12

Eta0 self-assigned this Jul 18, 2024

Eta0 linked an issue Jul 18, 2024 that may be closed by this pull request

[3.0] Make data_offset relative to the data section of the file #142

Open

Eta0 marked this pull request as ready for review July 18, 2024 17:11

Eta0 requested review from harubaru and removed request for bchess August 5, 2024 19:15

wbrown reviewed Aug 27, 2024
