Bit-pack the lexer's token info #4270
Conversation
First, this replaces the separate line index and column index in the token information with a single 32-bit byte offset of the token. Line and column numbers are then computed with a binary search of the line structure, using the found line to compute the column within it. In practice, this is _much_ more efficient:

- Smaller token data structure. This will hopefully combine with a subsequent optimization PR that shrinks the token data structure still further.
- Fewer stores to form each token's information in the tight hot loop of the lexer.
- Less state to maintain and fewer computations while lexing.

We only have to search to build the line and column information off the hot lexing path, so this ends up being a significant win and shrinks some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the line length. Our source buffer already ensures we only have 2 GiB of source, with a nice diagnostic; I've just added a check to help document this in the lexer. The line length can be avoided everywhere it was being used, largely by looking at the next line's start and working from there. This also precipitated cleaning up some code that dated from when lines were only built during lexing rather than being pre-built, which resulted in nice simplifications.

With this PR, I think it makes sense to rename a bunch of methods on `TokenizedBuffer`, but to an extent that was already needed, as these methods somewhat predate the more pervasive style conventions. I avoided that here to keep this PR focused on the implementation change; I'll create a subsequent PR to update the API with better nomenclature and to remove deviations from our conventions.

The performance impact varies quite a bit. The lexer's benchmark improves fairly consistently across the board on both x86 and Arm.
For x86, where I have nice comparison tools, it appears 3% to 20% faster depending on the specific pattern. For Arm server CPUs the gain seems much smaller, but is still an improvement. The overall compilation benchmarks, however, don't improve much with these changes alone on x86: there is a significant reduction in the instruction count required for lexing, but overall performance seems to be bottlenecked elsewhere in the compilation. On Arm, despite the more modest gains on the lexing-specific cases, this shows fairly consistent 1-2% improvements in overall lexing performance on our compilation benchmark. The expectation is that these improvements will compound with subsequent work to further compact our representation.
This makes each token info consist of 8 bytes of data:

- 1 byte for the kind
- 1 bit for whitespace tracking
- 23 bits of payload
- 32 bits for the byte offset in the file

This builds directly on representing the location of the token as a single 32-bit offset, now compressing the rest of the data into a single 32-bit bitfield. This adds some implementation limits: we can no longer lex more than 2^23 tokens in a single source file, nor can we have more than 2^23 string literals, integer literals, real literals, or identifiers. Only the first of these is even close to an issue, and even then it seems unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make this work well, we also need to streamline how we create the tokens; otherwise, all the bit fiddling can end up erasing our gains. This PR adds a number of APIs to manage creating and accessing the now significantly more complex storage of token infos to help with this.

One big change required to simplify the writes is switching from computing whether a token has trailing space after the fact to pre-computing whether a token will have leading space. That makes the leading-space information available immediately when forming the token, and avoids doing a single-bit flip afterward.

Another change that helps with this representation is minimizing after-the-fact updates of groups. The code now sets the opening index directly when creating the closing token and only updates the opening group afterward. Because of the bit packing, this is a reduction of 0.5% of dynamic instructions in the compile benchmark, and a dramatic improvement on the grouping-symbol-focused benchmarks.
All combined, this is a significant improvement on the lexer-focused benchmarks despite the added complexity, and a significant win on our compile-time benchmarks due to both the lexer improvements and downstream memory density improvements: a 5-12% reduction in lex time, growing larger as files get larger; about a 4.5% reduction in parse time; and even a 1-2% reduction in total check time. =D

wip to pre-compute grouping on write
This generally looks good, although I have some questions about the `AddLexedToken` + `TokenInfo` constructor setup.
Co-authored-by: Jon Ross-Perkins <[email protected]>
Thanks for all the comments! I think I responded to all of them.
I've also updated to be on top of trunk with the base PR merged in, so you should be able to view this more normally as an unstacked PR; sorry for the confusion there.
Thanks for the changes! For my part, I'm good with the change, with a few small comments that you can choose whether to address. Approving, assuming you'll address geoffromer's questions before merging.
Gah, sorry, I saw your warning that this was a stacked PR, but then failed to set my diff view properly. I've resolved the comments that concern the previous PR.
Co-authored-by: Geoff Romer <[email protected]>
Co-authored-by: Jon Ross-Perkins <[email protected]>
I think this change is looking good. I'm not merging yet, just in case geoffromer has more comments.