Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bit-pack the lexer's token info #4270

Merged
merged 21 commits into from
Sep 6, 2024

Conversation

chandlerc
Copy link
Contributor

@chandlerc chandlerc commented Sep 3, 2024

This makes each token info consist of 8 bytes of data:

  • 1 byte of the kind
  • 1 bit for whitespace tracking
  • 23 bits of payload
  • 32 bits for byte offset in the file

This builds directly on representing the location of the token as
a single 32-bit offset, now compressing the rest of the data into
a single 32-bit bitfield.

This adds some implementation limits: we can no longer lex more than
2^23 tokens in a single source file. Nor can we have more than 2^23
string literals, integer literals, real literals, or identifiers. Only
the first of these is even close to an issue, and even then seems
unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make
this work well, we also need to streamline how we create the tokens.
Otherwise, all the bit fiddling can end up erasing our gains. This PR
adds a number of APIs to manage creating and accessing the now
significantly more complex storage of token infos to try and help with
this.

One big change required to simplify the writes here is to switch from
computing whether a token has trailing space after-the-fact to
pre-computing whether a token will have leading space. That lets us have
the leading space information available immediately when forming the
token, and avoids doing a single bit flip afterward.

Another change that helps with this representation is to minimize the
updating of groups after-the-fact. The code now tries to set the opening
index directly when creating the closing token and only updates the
opening group afterward. Because of the bit packing, this is a reduction
of 0.5% of dynamic instructions in the compile benchmark, and has
dramatic improvements for the grouping symbol focused benchmarks.

All combined, this is a significant improvement on the lexer-focused
benchmarks despite the added complexity, and a significant win on our
compile time benchmarks due to both the lexer improvements and
downstream memory density improvements: 5-12% reduction in lex time,
growing larger as files get larger. About a 4.5% reduction in parse
time, and even a 1-2% reduction in total check time. =D

First, this replaces the separate line index and column index in the
token information with a single 32-bit byte offset of the token. This is
then used to compute line and column numbers with a binary search of the
line structure and then using that to compute the column within the
line. In practice, this is _much_ more efficient:

- Smaller token data structure. This will hopefully combine with
  a subsequent optimization PR that shrinks the token data structure
  still further.
- Fewer stores to form each token's information in the tight hot loop of
  the lexer.
- Less state to maintain while lexing, fewer computations while lexing.

We only have to search to build the line and column information off the
hot lexing path, and so this ends up being a significant win and shrinks
some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the
line length. Our source buffer already ensures we only have 2 GiB of
source with a nice diagnostic. I've just added a check to help document
this in the lexer. The line length can be avoided in all of the cases it
was being used, largely by looking at the next line's start and working
from there. This also precipitated cleaning up some code that dated from
when lines were only built during lexing rather than being pre-built,
which resulted in nice simplifications.

With this PR, I think it makes sense to re-name a bunch of methods on
`TokenizedBuffer`, but to an extent that was already needed as these
methods somewhat predate the more pervasive style conventions. I avoided
that here to keep this PR focused on the implementation change, I'll
create a subsequent PR to update the API to both better nomenclature and
remove deviations from our conventions.

The performance impact of this varies quite a bit...

The lexer's benchmark improves pretty consistent across the board on
both x86 and Arm. For x86, where I have nice comparison tools, it
appears 3% to 20% faster depending on the specific pattern. For Arm
server CPUs at least it seems a much smaller but still an improvement.

The overall compilation benchmarks however don't improve much with these
changes alone on x86. Significant reduction in instruction count
required for lexing, but the overall performance is bottlenecked
elsewhere in the overall compilation it seems. However, on Arm, despite
the more modest gains in special cases of lexing, this shows fairly
consistent 1-2% improvements in overall lexing performance on our
compilation benchmark. And the expectaiton is these improvements will
compound with subsequent work to further compact our representation.
This makes each token info consist of 8 bytes of data:
- 1 byte of the kind
- 1 bit for whitespace tracking
- 23 bits of payload
- 32 bits for byte offset in the file

This builds directly on representing the location of the token as
a single 32-bit offset, now compressing the rest of the data into
a single 32-bit bitfield.

This adds some implementation limits: we can no longer lex more than
2^23 tokens in a single source file. Nor can we have more than 2^23
string literals, integer literals, real literals, or identifiers. Only
the first of these is even close to an issue, and even then seems
unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make
this work well, we also need to streamline how we create the tokens.
Otherwise, all the bit fiddling can end up erasing our gains. This PR
adds a number of APIs to manage creating and accessing the now
significantly more complex storage of token infos to try and help with
this.

One big change required to simplify the writes here is to switch from
computing whether a token has trailing space after-the-fact to
pre-computing whether a token will have leading space. That lets us have
the leading space information available immediately when forming the
token, and avoids doing a single bit flip afterward.

Another change that helps with this representation is to minimize the
updating of groups after-the-fact. The code now tries to set the opening
index directly when creating the closing token and only updates the
opening group afterward. Because of the bit packing, this is a reduction
of 0.5% of dynamic instructions in the compile benchmark, and has
dramatic improvements for the grouping symbol focused benchmarks.

All combined, this is a significant improvement on the lexer-focused
benchmarks despite the added complexity, and a significant win on our
compile time benchmarks due to both the lexer improvements and
downstream memory density improvements: 5-12% reduction in lex time,
growing larger as files get larger. About a 4.5% reduction in parse
time, and even a 1-2% reduction in total check time. =D

wip to pre-compute grouping on write
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This generally looks good, although I have some questions about the AddLexedToken + TokenInfo constructor setup.

toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
Copy link
Contributor Author

@chandlerc chandlerc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the comments! I think I responded to all of them.

Also have updated to be on top of the trunk with the base PR merged in, so you should be able to view this more normally as an unstacked PR, sorry for the confusion there.

toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! For my part, I'm good with the change, with a few small comments that you can choose whether to make changes. Approving, assuming you'll address geoffromer's questions before merging.

toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@geoffromer geoffromer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gah, sorry, I saw your warning that this was a stacked PR, but then failed to set my diff view properly. I've resolved the comments that concern the previous PR.

toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
chandlerc and others added 3 commits September 5, 2024 00:17
toolchain/lex/tokenized_buffer.h Outdated Show resolved Hide resolved
toolchain/lex/lex.cpp Outdated Show resolved Hide resolved
Copy link
Contributor

@jonmeow jonmeow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this change is looking good. I'm not merging just in case geoffromer has more comments.

@geoffromer geoffromer added this pull request to the merge queue Sep 6, 2024
Merged via the queue into carbon-language:trunk with commit c43fa3a Sep 6, 2024
8 checks passed
dwblaikie pushed a commit to dwblaikie/carbon-lang that referenced this pull request Sep 6, 2024
This makes each token info consist of 8 bytes of data:
- 1 byte of the kind
- 1 bit for whitespace tracking
- 23 bits of payload
- 32 bits for byte offset in the file

This builds directly on representing the location of the token as
a single 32-bit offset, now compressing the rest of the data into
a single 32-bit bitfield.

This adds some implementation limits: we can no longer lex more than
2^23 tokens in a single source file. Nor can we have more than 2^23
string literals, integer literals, real literals, or identifiers. Only
the first of these is even close to an issue, and even then seems
unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make
this work well, we also need to streamline how we create the tokens.
Otherwise, all the bit fiddling can end up erasing our gains. This PR
adds a number of APIs to manage creating and accessing the now
significantly more complex storage of token infos to try and help with
this.

One big change required to simplify the writes here is to switch from
computing whether a token has trailing space after-the-fact to
pre-computing whether a token will have leading space. That lets us have
the leading space information available immediately when forming the
token, and avoids doing a single bit flip afterward.

Another change that helps with this representation is to minimize the
updating of groups after-the-fact. The code now tries to set the opening
index directly when creating the closing token and only updates the
opening group afterward. Because of the bit packing, this is a reduction
of 0.5% of dynamic instructions in the compile benchmark, and has
dramatic improvements for the grouping symbol focused benchmarks.

All combined, this is a significant improvement on the lexer-focused
benchmarks despite the added complexity, and a significant win on our
compile time benchmarks due to both the lexer improvements and
downstream memory density improvements: 5-12% reduction in lex time,
growing larger as files get larger. About a 4.5% reduction in parse
time, and even a 1-2% reduction in total check time. =D

---------

Co-authored-by: Jon Ross-Perkins <[email protected]>
Co-authored-by: Geoff Romer <[email protected]>
dwblaikie pushed a commit to dwblaikie/carbon-lang that referenced this pull request Sep 9, 2024
This makes each token info consist of 8 bytes of data:
- 1 byte of the kind
- 1 bit for whitespace tracking
- 23 bits of payload
- 32 bits for byte offset in the file

This builds directly on representing the location of the token as
a single 32-bit offset, now compressing the rest of the data into
a single 32-bit bitfield.

This adds some implementation limits: we can no longer lex more than
2^23 tokens in a single source file. Nor can we have more than 2^23
string literals, integer literals, real literals, or identifiers. Only
the first of these is even close to an issue, and even then seems
unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make
this work well, we also need to streamline how we create the tokens.
Otherwise, all the bit fiddling can end up erasing our gains. This PR
adds a number of APIs to manage creating and accessing the now
significantly more complex storage of token infos to try and help with
this.

One big change required to simplify the writes here is to switch from
computing whether a token has trailing space after-the-fact to
pre-computing whether a token will have leading space. That lets us have
the leading space information available immediately when forming the
token, and avoids doing a single bit flip afterward.

Another change that helps with this representation is to minimize the
updating of groups after-the-fact. The code now tries to set the opening
index directly when creating the closing token and only updates the
opening group afterward. Because of the bit packing, this is a reduction
of 0.5% of dynamic instructions in the compile benchmark, and has
dramatic improvements for the grouping symbol focused benchmarks.

All combined, this is a significant improvement on the lexer-focused
benchmarks despite the added complexity, and a significant win on our
compile time benchmarks due to both the lexer improvements and
downstream memory density improvements: 5-12% reduction in lex time,
growing larger as files get larger. About a 4.5% reduction in parse
time, and even a 1-2% reduction in total check time. =D

---------

Co-authored-by: Jon Ross-Perkins <[email protected]>
Co-authored-by: Geoff Romer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants