Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(serialization): Store relative offsets for tensor data and headers #170

Open
wants to merge 1 commit into
base: v3.0.0-dev
Choose a base branch
from

Conversation

Eta0
Copy link
Contributor

@Eta0 Eta0 commented Jul 18, 2024

Segment-relative offsets in headers

Closes #142. Helpful for #148, #150, and #159.

Overview

Instead of storing the absolute file position of header entries and tensor data, this change stores their offsets relative to the beginning of their respective segment (e.g. the header segment, or the data segment). This makes it easier to create and edit various parts of a file's metadata independently from other parts of the metadata or the tensor data. Information about absolute offsets are instead stored on a segment-by-segment basis in a "layout" section of the metadata, preceding the metadata and header sections.

This separates the file header and tensor data enough that it should even be possible now to split them into two completely separate files, if that design may be helpful later on.

This change also aligns the end of the metadata segment to a 4096-byte block boundary. The header segment also ends on a 4096-byte boundary as was the case before. This allows these segments to more easily be manipulated with fallocate for resolving #150 and #159 later.

Deserialization

This change requires minimal modifications to deserialization. In newer files, the layout section is read, and the absolute offsets listed in it are used to convert all subsequently-read relative offsets into absolute offsets.

Structure

The binary format of the layout section is:

  • Byte length of the layout section (64-bit)
  • Version tag (8-bit, always 1 for now)
  • Number of segments (8-bit)
  • Metadata segment
    • Tag (8-bit, always 1)
    • Absolute start offset (64-bit)
    • Absolute end offset (64-bit)
  • Header segment
    • Tag (8-bit, always 2)
    • Absolute start offset (64-bit)
    • Absolute end offset (64-bit)
  • Tensor data segment
    • Tag (8-bit, always 3)
    • Absolute start offset (64-bit)

The tag fields aren't strictly necessary, but are present to make it slightly easier to write a deserializer that could handle the metadata segment being removed in the future, or optional segments, or variations of segments that include more or less information. I'm open to suggestions on improving the format for the layout section. Other potentially odd choices at present:

  1. End offsets refer to the allocated space for the segment, including padding, and do not represent the length of meaningful content in a segment
    • Length of meaningful content may be encoded elsewhere in the metadata as appropriate
  2. End offsets are included for the metadata and header segments, even though they are currently equal to the start offset of the following segment
    • This is here for convenience and in case segments somehow need to not be written back-to-back in a single file at some point
  3. Tensor data doesn't have an end offset listed
    • The end of the tensor data segment is assumed to be the end of the file
    • This lessens the need to update the layout segment after every tensor (bulk-)write
    • We could probably add an end offset anyway if we wanted to since that information is available to the serializer during each prologue synchronization

If we want to support serialization to multiple files eventually, these fields could also be modified to include file identifiers, e.g. listing that the metadata is in file 0 but the tensor data segment is located in file 1, or potentially even something like listing multiple tensor data segments spread across multiple files.

Other bug fixes

Buffered writes at the beginning of a file were not being properly flushed before switching to writing with pwrite, causing the beginning of a file to sometimes be overwritten with unfinished data. This was a regression from v2.9.0 where the writer was always flushed after buffered writes and is fixed again in this PR.

Instead of storing the absolute file position of header entries
and tensor data, this change stores their offsets relative to the
beginning of their respective segment (e.g. the header segment, or the
data segment). This makes it easier to create and edit headers without
needing to already know where all prior segments end.

Information about absolute offsets are instead stored on a
segment-by-segment basis in a "layout" section of the metadata,
preceding the metadata and header sections.
@Eta0 Eta0 added serialization schema-change label for 3.0-dev work that involves schema changes labels Jul 18, 2024
@Eta0 Eta0 requested review from wbrown and bchess July 18, 2024 08:12
@Eta0 Eta0 self-assigned this Jul 18, 2024
@Eta0 Eta0 linked an issue Jul 18, 2024 that may be closed by this pull request
@Eta0 Eta0 marked this pull request as ready for review July 18, 2024 17:11
@Eta0 Eta0 requested review from harubaru and removed request for bchess August 5, 2024 19:15
Copy link
Collaborator

@wbrown wbrown left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good! 👍

My general comment would be that we should probably encode a data_length field. As you mentioned in your PR description, end_offset no longer meaningfully tells us the actual length of the data.

We may want to make it easier on ourselves to slice that data by having the actual data length available.

self._metadata_end = self._metadata_start + 8 + approx_metadata_size
# Extend the metadata segment to end on a block boundary
# This allows later manipulation with the fallocate syscall
# on coöperating operating systems to extend this segment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This spelling of cooperating intended? :)

# This allows later manipulation with the fallocate syscall
# on coöperating operating systems to extend this segment
# in the middle of the file
self._metadata_end -= self._metadata_end % -4096
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should block boundaries be a constant? i.e. do we know that the aligned boundaries are the same on ARM vs x86-64?

_FORMAT: ClassVar[struct.Struct] = struct.Struct(
"<"
"B" # Layout version tag
"B" # Number of entries
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the number of metadata header entries, which is currently 3, correct? I ask because we have this as B. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
schema-change label for 3.0-dev work that involves schema changes serialization
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[3.0] Make data_offset relative to the data section of the file
2 participants