Shrink the lexer's token location and line data structures. #4269

Merged · 7 commits · Sep 4, 2024

Conversation

chandlerc (Contributor):
First, this replaces the separate line index and column index in the token information with a single 32-bit byte offset of the token. This is then used to compute line and column numbers with a binary search of the line structure and then using that to compute the column within the line. In practice, this is much more efficient:

  • Smaller token data structure. This will hopefully combine with a subsequent optimization PR that shrinks the token data structure still further.
  • Fewer stores to form each token's information in the tight hot loop of the lexer.
  • Less state to maintain while lexing, fewer computations while lexing.

We only have to search to build the line and column information off the hot lexing path, and so this ends up being a significant win and shrinks some of the more significant data structures.

Second, this shrinks the line start to a 32-bit integer and removes the line length. Our source buffer already ensures we only have 2 GiB of source with a nice diagnostic. I've just added a check to help document this in the lexer. The line length can be avoided in all of the cases it was being used, largely by looking at the next line's start and working from there. This also precipitated cleaning up some code that dated from when lines were only built during lexing rather than being pre-built, which resulted in nice simplifications.

With this PR, I think it makes sense to rename a bunch of methods on `TokenizedBuffer`, but to an extent that was already needed, as these methods somewhat predate the more pervasive style conventions. I avoided that here to keep this PR focused on the implementation change; I'll create a subsequent PR to update the API to better nomenclature and to remove deviations from our conventions.

There may also be a way to de-duplicate the binary search in the diagnostic location conversion and the main line accessor binary search, but it wasn't obvious to me that it would be a net savings, so left it alone for now.

The performance impact of this varies quite a bit...

The lexer's benchmark improves pretty consistently across the board on both x86 and Arm. For x86, where I have nice comparison tools, it appears 3% to 20% faster depending on the specific pattern. On Arm server CPUs the gain seems much smaller, but it is still an improvement.

The overall compilation benchmarks, however, don't improve much with these changes alone on x86. There is a significant reduction in the instruction count required for lexing, but overall performance seems bottlenecked elsewhere in compilation. On Arm, despite the more modest gains on the lexing benchmarks, this shows fairly consistent 1-2% improvements on our overall compilation benchmark. And the expectation is that these improvements will compound with subsequent work to further compact our representation.

jonmeow (Contributor) left a comment:

Looks good! I think this also ends up being a nice simplification for code.

```diff
@@ -1158,7 +1157,7 @@ auto Lexer::LexKeywordOrIdentifier(llvm::StringRef source_text,
   CARBON_CHECK(
       IsIdStartByteTable[static_cast<unsigned char>(source_text[position])]);

-  int column = ComputeColumn(position);
+  int32_t byte_offset = position;
```
jonmeow (Contributor):
Where you do this, would it be worth a comment reminding that position is modified later?

Maybe we should rename position, like byte_cursor might be clearer? (note, not asking for a rename here, just mulling whether it's worth it given the byte_offset name)

chandlerc (Contributor, Author):
I have this in a few places, does it seem worth a comment in all? just here?

jonmeow (Contributor):
I'm not sure how many. What I'm trying to figure out, particularly with a byte_cursor rename, is whether a different name is a better approach than just a comment.

If you don't think it's helpful enough, we can avoid a comment.

chandlerc (Contributor, Author):

No, I think the comment makes sense. I've put it on all of them. And can follow up with renames, there are a bunch of renamings that I think will make sense, but want to flush the stack of optimizations first for merge simplicity.

```diff
@@ -322,6 +307,7 @@ class TokenizedBuffer : public Printable<TokenizedBuffer> {
                   SourceBuffer& source)
       : value_stores_(&value_stores), source_(&source) {}

+  auto FindLineIndexImpl(int32_t offset) const -> LineIndex;
```
jonmeow (Contributor):
Why just offset here but byte_offset elsewhere? Comments could also help explain what this is an offset into.

chandlerc (Contributor, Author):
No reason, switched to byte_offset. Comment still useful?

jonmeow (Contributor):
Thanks!

I still appreciate comments on functions, but understand if you're going to decline.

chandlerc (Contributor, Author):
Comment added (to the definition since this is supposed to be an implementation detail function). Happy to add / move as useful in follow up.

Co-authored-by: Jon Ross-Perkins <[email protected]>
chandlerc (Contributor, Author) left a comment:
Thanks again, I think all the review is addressed so merging!


chandlerc added this pull request to the merge queue on Sep 3, 2024.
Merged via the queue into carbon-language:trunk with commit 97e98bc on Sep 4, 2024. 8 checks passed.
chandlerc deleted the opt-tok-buffer branch on September 4, 2024 at 00:06.
dwblaikie pushed a commit to dwblaikie/carbon-lang that referenced this pull request Sep 6, 2024
dwblaikie pushed a commit to dwblaikie/carbon-lang that referenced this pull request Sep 9, 2024
3 participants