Shrink the lexer's token location and line data structures. #4269
Conversation
First, this replaces the separate line index and column index in the token information with a single 32-bit byte offset of the token. The offset is then used to compute line and column numbers: a binary search over the line structure finds the line, and the column is computed from the line's start. In practice, this is _much_ more efficient:

- Smaller token data structure. This will hopefully combine with a subsequent optimization PR that shrinks the token data structure still further.
- Fewer stores to form each token's information in the tight hot loop of the lexer.
- Less state to maintain while lexing, and fewer computations while lexing.

We only have to search to build the line and column information off the hot lexing path, so this ends up being a significant win and shrinks some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the line length. Our source buffer already ensures we only have 2 GiB of source, with a nice diagnostic; I've just added a check to help document this in the lexer. The line length can be avoided in all of the cases where it was being used, largely by looking at the next line's start and working from there. This also precipitated cleaning up some code that dated from when lines were only built during lexing rather than being pre-built, which resulted in nice simplifications.

With this PR, I think it makes sense to rename a bunch of methods on `TokenizedBuffer`, but to an extent that was already needed, as these methods somewhat predate the more pervasive style conventions. I avoided that here to keep this PR focused on the implementation change; I'll create a subsequent PR to update the API to both better nomenclature and remove deviations from our conventions.

There may also be a way to de-duplicate the binary search in the diagnostic location conversion and the main line accessor binary search, but it wasn't obvious to me that it would be a net savings, so I left it alone for now.

The performance impact of this varies quite a bit. The lexer's benchmark improves pretty consistently across the board on both x86 and Arm. For x86, where I have nice comparison tools, it appears 3% to 20% faster depending on the specific pattern. For Arm server CPUs, the gain seems much smaller, but it is still an improvement.

The overall compilation benchmarks, however, don't improve much with these changes alone on x86: there is a significant reduction in the instruction count required for lexing, but overall performance seems to be bottlenecked elsewhere in the compilation. However, on Arm, despite the more modest gains in special cases of lexing, this shows fairly consistent 1-2% improvements in overall lexing performance on our compilation benchmark. And the expectation is that these improvements will compound with subsequent work to further compact our representation.
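To make the shape of the change concrete, here is a minimal before/after sketch of the token and line records described above. The field names and exact widths are illustrative assumptions, not the actual Carbon toolchain definitions.

```cpp
#include <cstdint>

// Illustrative sketch only: names and exact widths are assumptions, not
// the real Carbon definitions.

// Before: each token records a line index and a column directly, and
// each line records both its start offset and its length.
struct TokenInfoBefore {
  int32_t kind;        // token kind and flags
  int32_t line_index;  // index into the line table
  int32_t column;      // column within that line
};
struct LineInfoBefore {
  int64_t start;   // byte offset of the line's first byte
  int32_t length;  // explicit line length
};

// After: a token stores only a 32-bit byte offset into the source
// buffer (safe because the source buffer is capped at 2 GiB), and a
// line stores only its 32-bit start. A line's length is implicit: the
// next line's start minus this line's start. Line and column for a
// token are recomputed on demand, off the hot lexing path.
struct TokenInfoAfter {
  int32_t kind;         // token kind and flags
  int32_t byte_offset;  // byte offset of the token in the source
};
struct LineInfoAfter {
  int32_t start;  // byte offset of the line's first byte
};
```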
Looks good! I think this also ends up being a nice simplification for the code.
@@ -1158,7 +1157,7 @@ auto Lexer::LexKeywordOrIdentifier(llvm::StringRef source_text,
   CARBON_CHECK(
       IsIdStartByteTable[static_cast<unsigned char>(source_text[position])]);

-  int column = ComputeColumn(position);
+  int32_t byte_offset = position;
Where you do this, would it be worth a comment reminding that `position` is modified later?

Maybe we should rename `position`, like `byte_cursor` might be clearer? (note, not asking for a rename here, just mulling whether it's worth it given the `byte_offset` name)
I have this in a few places; does it seem worth a comment in all of them, or just here?
I'm not sure how many. What I'm trying to figure out, particularly with a `byte_cursor` rename, is whether a different name is a better approach than just a comment.

If you don't think it's helpful enough, we can avoid a comment.
No, I think the comment makes sense; I've put it on all of them. And I can follow up with renames: there are a bunch of renamings that I think will make sense, but I want to flush the stack of optimizations first for merge simplicity.
toolchain/lex/tokenized_buffer.h
@@ -322,6 +307,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
                   SourceBuffer& source)
       : value_stores_(&value_stores), source_(&source) {}

+  auto FindLineIndexImpl(int32_t offset) const -> LineIndex;
Why just `offset` here but `byte_offset` elsewhere? Comments could also help explain what this is an offset into.
No reason, switched to `byte_offset`. Comment still useful?
Thanks!
I still appreciate comments on functions, but understand if you're going to decline.
Comment added (to the definition since this is supposed to be an implementation detail function). Happy to add / move as useful in follow up.
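For readers following the thread, the lookup being discussed can be pictured as a single binary search over the sorted line-start offsets. The sketch below is an illustration under assumed names (`line_starts_`, the `TokenizedBufferSketch` class, and this `LineIndex` stand-in are invented for the example), not the actual `TokenizedBuffer` implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for illustration; not the real Carbon type.
struct LineIndex {
  int32_t index;
};

class TokenizedBufferSketch {
 public:
  // Finds the line containing `byte_offset`: the last line whose start
  // is <= the offset, located with one binary search.
  auto FindLineIndexImpl(int32_t byte_offset) const -> LineIndex {
    assert(!line_starts_.empty() && byte_offset >= line_starts_.front());
    // Find the first line starting strictly *after* the offset; the
    // line we want is the one just before it.
    auto it = std::upper_bound(line_starts_.begin(), line_starts_.end(),
                               byte_offset);
    return LineIndex{static_cast<int32_t>(it - line_starts_.begin()) - 1};
  }

  // The column is the distance from that line's start to the offset.
  auto ComputeColumn(int32_t byte_offset) const -> int32_t {
    return byte_offset - line_starts_[FindLineIndexImpl(byte_offset).index];
  }

 private:
  std::vector<int32_t> line_starts_;  // sorted byte offsets of line starts
};
```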
Co-authored-by: Jon Ross-Perkins <[email protected]>
Thanks again, I think all the review is addressed so merging!