Bit-pack the lexer's token info #4270
Conversation
First, this replaces the separate line index and column index in the token information with a single 32-bit byte offset of the token. Line and column numbers are then computed with a binary search of the line structure, using the found line to compute the column within it. In practice, this is _much_ more efficient:

- Smaller token data structure. This will hopefully combine with a subsequent optimization PR that shrinks the token data structure still further.
- Fewer stores to form each token's information in the tight hot loop of the lexer.
- Less state to maintain and fewer computations while lexing.

We only have to search to build the line and column information off the hot lexing path, so this ends up being a significant win and shrinks some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the line length. Our source buffer already ensures we only have 2 GiB of source, with a nice diagnostic; I've just added a check to help document this in the lexer. The line length can be avoided everywhere it was being used, largely by looking at the next line's start and working from there. This also precipitated cleaning up some code that dated from when lines were only built during lexing rather than being pre-built, which resulted in nice simplifications.

With this PR, I think it makes sense to rename a bunch of methods on `TokenizedBuffer`, but to an extent that was already needed, as these methods somewhat predate the more pervasive style conventions. I avoided that here to keep this PR focused on the implementation change; I'll create a subsequent PR to update the API with better nomenclature and to remove deviations from our conventions.

The performance impact varies quite a bit. The lexer's benchmark improves fairly consistently across the board on both x86 and Arm.
For x86, where I have nice comparison tools, it appears 3% to 20% faster depending on the specific pattern. For Arm server CPUs the gain seems much smaller, but is still an improvement. The overall compilation benchmarks, however, don't improve much with these changes alone on x86: there is a significant reduction in the instruction count required for lexing, but overall performance seems to be bottlenecked elsewhere in the compilation. On Arm, despite the more modest gains on the lexing-specific cases, this shows fairly consistent 1-2% improvements in overall lexing performance on our compilation benchmark. The expectation is that these improvements will compound with subsequent work to further compact our representation.
This makes each token info consist of 8 bytes of data:

- 1 byte for the kind
- 1 bit for whitespace tracking
- 23 bits of payload
- 32 bits for the byte offset in the file

This builds directly on representing the location of the token as a single 32-bit offset, now compressing the rest of the data into a single 32-bit bitfield. This adds some implementation limits: we can no longer lex more than 2^23 tokens in a single source file, nor can we have more than 2^23 string literals, integer literals, real literals, or identifiers. Only the first of these is even close to an issue, and even then it seems unlikely to ever be a problem in practice.

The memory efficiency here is great and the motivating goal. But to make this work well, we also need to streamline how we create the tokens; otherwise, all the bit fiddling can end up erasing our gains. This PR adds a number of APIs to manage creating and accessing the now significantly more complex storage of token infos to help with this.

One big change required to simplify the writes is switching from computing whether a token has trailing space after the fact to pre-computing whether a token will have leading space. That makes the leading-space information available immediately when forming the token, and avoids doing a single-bit flip afterward.

Another change that helps with this representation is minimizing after-the-fact updates of groups. The code now sets the opening index directly when creating the closing token and only updates the opening group afterward. Because of the bit packing, this is a reduction of 0.5% of dynamic instructions in the compile benchmark, and a dramatic improvement on the grouping-symbol-focused benchmarks.
All combined, this is a significant improvement on the lexer-focused benchmarks despite the added complexity, and a significant win on our compile-time benchmarks due to both the lexer improvements and downstream memory density improvements: a 5-12% reduction in lex time, growing larger as files get larger; about a 4.5% reduction in parse time; and even a 1-2% reduction in total check time. =D

wip to pre-compute grouping on write
This generally looks good, although I have some questions about the `AddLexedToken` + `TokenInfo` constructor setup.
Co-authored-by: Jon Ross-Perkins <[email protected]>
Thanks for all the comments! I think I responded to all of them.
I've also updated to be on top of trunk with the base PR merged in, so you should be able to view this more normally as an unstacked PR; sorry for the confusion there.
Thanks for the changes! For my part, I'm good with the change, with a few small comments that you can choose whether to address. Approving, assuming you'll address geoffromer's questions before merging.
Gah, sorry, I saw your warning that this was a stacked PR, but then failed to set my diff view properly. I've resolved the comments that concern the previous PR.
Co-authored-by: Geoff Romer <[email protected]>
Co-authored-by: Jon Ross-Perkins <[email protected]>
I think this change is looking good. I'm not merging yet, just in case geoffromer has more comments.