-
Notifications
You must be signed in to change notification settings - Fork 31
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(serialization): Store relative offsets for tensor data and headers #170
base: v3.0.0-dev
Are you sure you want to change the base?
Conversation
Instead of storing the absolute file position of header entries and tensor data, this change stores their offsets relative to the beginning of their respective segment (e.g. the header segment, or the data segment). This makes it easier to create and edit headers without needing to already know where all prior segments end. Information about absolute offsets are instead stored on a segment-by-segment basis in a "layout" section of the metadata, preceding the metadata and header sections.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good! 👍
My general comment would be that we should probably encode a data_length
field. As you mentioned in your PR description, end_offset
no longer meaningfully tells us the actual length of the data.
We may want to make it easier on ourselves to slice that data by having the actual data length available.
self._metadata_end = self._metadata_start + 8 + approx_metadata_size | ||
# Extend the metadata segment to end on a block boundary | ||
# This allows later manipulation with the fallocate syscall | ||
# on coöperating operating systems to extend this segment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This spelling of cooperating
intended? :)
# This allows later manipulation with the fallocate syscall | ||
# on coöperating operating systems to extend this segment | ||
# in the middle of the file | ||
self._metadata_end -= self._metadata_end % -4096 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should block boundaries be a constant? i.e. do we know that the aligned boundaries are the same on ARM vs x86-64?
_FORMAT: ClassVar[struct.Struct] = struct.Struct( | ||
"<" | ||
"B" # Layout version tag | ||
"B" # Number of entries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the number of metadata header entries, which is currently 3
, correct? I ask because we have this as B
. :)
Segment-relative offsets in headers
Closes #142. Helpful for #148, #150, and #159.
Overview
Instead of storing the absolute file position of header entries and tensor data, this change stores their offsets relative to the beginning of their respective segment (e.g. the header segment, or the data segment). This makes it easier to create and edit various parts of a file's metadata independently from other parts of the metadata or the tensor data. Information about absolute offsets are instead stored on a segment-by-segment basis in a "layout" section of the metadata, preceding the metadata and header sections.
This separates the file header and tensor data enough that it should even be possible now to split them into two completely separate files, if that design may be helpful later on.
This change also aligns the end of the metadata segment to a 4096-byte block boundary. The header segment also ends on a 4096-byte boundary as was the case before. This allows these segments to more easily be manipulated with
fallocate
for resolving #150 and #159 later.Deserialization
This change requires minimal modifications to deserialization. In newer files, the layout section is read, and the absolute offsets listed in it are used to convert all subsequently-read relative offsets into absolute offsets.
Structure
The binary format of the layout section is:
1
for now)1
)2
)3
)The tag fields aren't strictly necessary, but are present to make it slightly easier to write a deserializer that could handle the metadata segment being removed in the future, or optional segments, or variations of segments that include more or less information. I'm open to suggestions on improving the format for the layout section. Other potentially odd choices at present:
If we want to support serialization to multiple files eventually, these fields could also be modified to include file identifiers, e.g. listing that the metadata is in file 0 but the tensor data segment is located in file 1, or potentially even something like listing multiple tensor data segments spread across multiple files.
Other bug fixes
Buffered writes at the beginning of a file were not being properly flushed before switching to writing with
pwrite
, causing the beginning of a file to sometimes be overwritten with unfinished data. This was a regression from v2.9.0 where the writer was always flushed after buffered writes and is fixed again in this PR.