Shrink the lexer's token location and line data structures. #4269
Conversation
First, this replaces the separate line index and column index in the token information with a single 32-bit byte offset of the token. The offset is then used to compute line and column numbers: a binary search over the line structure finds the line, and the column is computed from the line's start. In practice, this is _much_ more efficient:

- Smaller token data structure. This will hopefully combine with a subsequent optimization PR that shrinks the token data structure still further.
- Fewer stores to form each token's information in the tight hot loop of the lexer.
- Less state to maintain while lexing, and fewer computations while lexing.

We only have to search to build the line and column information off the hot lexing path, so this ends up being a significant win and shrinks some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the line length. Our source buffer already ensures we only have 2 GiB of source, with a nice diagnostic; I've just added a check to help document this in the lexer. The line length can be avoided in all of the cases where it was being used, largely by looking at the next line's start and working from there. This also precipitated cleaning up some code that dated from when lines were only built during lexing rather than being pre-built, which resulted in nice simplifications.

With this PR, I think it makes sense to rename a bunch of methods on `TokenizedBuffer`, but to an extent that was already needed, as these methods somewhat predate the more pervasive style conventions. I avoided that here to keep this PR focused on the implementation change; I'll create a subsequent PR to update the API to both better nomenclature and remove deviations from our conventions.

There may also be a way to de-duplicate the binary search in the diagnostic location conversion and the main line accessor binary search, but it wasn't obvious to me that it would be a net savings, so I left it alone for now.

The performance impact of this varies quite a bit. The lexer's benchmark improves pretty consistently across the board on both x86 and Arm. For x86, where I have nice comparison tools, it appears 3% to 20% faster depending on the specific pattern. For Arm server CPUs, the gain seems much smaller, but it is still an improvement.

The overall compilation benchmarks, however, don't improve much with these changes alone on x86: there is a significant reduction in the instruction count required for lexing, but overall performance seems to be bottlenecked elsewhere in the compilation. However, on Arm, despite the more modest gains in special cases of lexing, this shows fairly consistent 1-2% improvements in overall lexing performance on our compilation benchmark. And the expectation is that these improvements will compound with subsequent work to further compact our representation.
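To make the shape of the change concrete, here is a minimal before/after sketch of the token and line records described above. The field names and exact widths are illustrative assumptions, not the actual Carbon toolchain definitions.

```cpp
#include <cstdint>

// Illustrative sketch only: names and exact widths are assumptions, not
// the real Carbon definitions.

// Before: each token records a line index and a column directly, and
// each line records both its start offset and its length.
struct TokenInfoBefore {
  int32_t kind;        // token kind and flags
  int32_t line_index;  // index into the line table
  int32_t column;      // column within that line
};
struct LineInfoBefore {
  int64_t start;   // byte offset of the line's first byte
  int32_t length;  // explicit line length
};

// After: a token stores only a 32-bit byte offset into the source
// buffer (safe because the source buffer is capped at 2 GiB), and a
// line stores only its 32-bit start. A line's length is implicit: the
// next line's start minus this line's start. Line and column for a
// token are recomputed on demand, off the hot lexing path.
struct TokenInfoAfter {
  int32_t kind;         // token kind and flags
  int32_t byte_offset;  // byte offset of the token in the source
};
struct LineInfoAfter {
  int32_t start;  // byte offset of the line's first byte
};
```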
Looks good! I think this also ends up being a nice simplification for the code.
@@ -1158,7 +1157,7 @@ auto Lexer::LexKeywordOrIdentifier(llvm::StringRef source_text,
   CARBON_CHECK(
       IsIdStartByteTable[static_cast<unsigned char>(source_text[position])]);

-  int column = ComputeColumn(position);
+  int32_t byte_offset = position;
Where you do this, would it be worth a comment reminding that `position` is modified later?

Maybe we should rename `position`, like `byte_cursor` might be clearer? (note, not asking for a rename here, just mulling whether it's worth it given the `byte_offset` name)
I have this in a few places; does it seem worth a comment in all of them, or just here?
I'm not sure how many. What I'm trying to figure out, particularly with a `byte_cursor` rename, is whether a different name is a better approach than just a comment.

If you don't think it's helpful enough, we can avoid a comment.
No, I think the comment makes sense; I've put it on all of them. And I can follow up with renames: there are a bunch of renamings that I think will make sense, but I want to flush the stack of optimizations first for merge simplicity.
toolchain/lex/tokenized_buffer.h
@@ -322,6 +307,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
                   SourceBuffer& source)
       : value_stores_(&value_stores), source_(&source) {}

+  auto FindLineIndexImpl(int32_t offset) const -> LineIndex;
Why just `offset` here but `byte_offset` elsewhere? Comments could also help explain what this is an offset into.
No reason, switched to `byte_offset`. Comment still useful?
Thanks!
I still appreciate comments on functions, but understand if you're going to decline.
Comment added (to the definition since this is supposed to be an implementation detail function). Happy to add / move as useful in follow up.
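For readers following the thread, the lookup being discussed can be pictured as a single binary search over the sorted line-start offsets. The sketch below is an illustration under assumed names (`line_starts_`, the `TokenizedBufferSketch` class, and this `LineIndex` stand-in are invented for the example), not the actual `TokenizedBuffer` implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for illustration; not the real Carbon type.
struct LineIndex {
  int32_t index;
};

class TokenizedBufferSketch {
 public:
  // Finds the line containing `byte_offset`: the last line whose start
  // is <= the offset, located with one binary search.
  auto FindLineIndexImpl(int32_t byte_offset) const -> LineIndex {
    assert(!line_starts_.empty() && byte_offset >= line_starts_.front());
    // Find the first line starting strictly *after* the offset; the
    // line we want is the one just before it.
    auto it = std::upper_bound(line_starts_.begin(), line_starts_.end(),
                               byte_offset);
    return LineIndex{static_cast<int32_t>(it - line_starts_.begin()) - 1};
  }

  // The column is the distance from that line's start to the offset.
  auto ComputeColumn(int32_t byte_offset) const -> int32_t {
    return byte_offset - line_starts_[FindLineIndexImpl(byte_offset).index];
  }

 private:
  std::vector<int32_t> line_starts_;  // sorted byte offsets of line starts
};
```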
Co-authored-by: Jon Ross-Perkins <[email protected]>
Thanks again, I think all the review is addressed so merging!