Zero alloc lexer #322
Conversation
Force-pushed from 62f3ad1 to 186e21c
hi, thanks so much for the PR! After merging the streaming lexer, we want to punt on zero-alloc for a while, as we have a re-parsing feature coming up in the near future which may impose lifetime requirements on the lexer. Once that's done I think we would look at either zero-alloc or a small-string optimisation (which would also remove most allocations while still having an owned Token type). #115 should already help a fair bit. Perhaps you could pull the new benchmark into a separate PR so we can already see how that evolves, before going ahead with zero-alloc?
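To make the small-string idea concrete, here is a rough sketch of the general technique (illustrative only, not a committed design — crates like `smol_str` implement it properly): token text short enough to fit inline lives on the stack, so `Token` stays an owned type while most allocations disappear.

```rust
// Rough sketch of a small-string type: values up to 22 bytes are stored
// inline; longer ones fall back to a heap-allocated String. The 22-byte
// threshold is arbitrary for this illustration.
enum SmallStr {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            SmallStr::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            // Safe to unwrap: buf[..len] was copied from a valid &str.
            SmallStr::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallStr::Heap(s) => s,
        }
    }
}

fn main() {
    let short = SmallStr::new("query"); // stored inline, no allocation
    let long = SmallStr::new("a token value much longer than twenty-two bytes");
    assert_eq!(short.as_str(), "query");
    assert!(long.as_str().len() > 22); // heap-allocated fallback
}
```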
Force-pushed from 70230b6 to a7efa0e
Thanks for looking into this. I rebased this PR over main and integrated the fixes that were added in #357. You should be able to run the recently added benchmarks now. The changes seem to help a fair bit, but there is still roughly a 20% differential between this branch and main.
Force-pushed from a7efa0e to c6c4a15
Update as of 2022-12-15, with and without the lexer commit on main.
Force-pushed from c6c4a15 to 6db79c5
Thanks for the update! I see you already removed the Lexer/LexerIterator split so I don't have many notes here :) I generally support this direction but want to discuss w/ @lrlna first, which likely needs to wait til January.
cc @msvec80
Hi @allancalix, thanks again for putting all of this together!! This PR is rather large, and I want to break it down into more digestible parts. Can I ask you to pull benchmarks into their own PR? Those are super handy and can be merged right away. We need to think a bit more about the approach towards a zero-alloc lexer. I think the particular approach you took here does achieve substantial performance improvements, but it also significantly complicates the lexer, a part of the compiler that I'd like to keep as simple and as easily debuggable as possible. A few things I'd like to bring up for consideration:
On another note, I also wanted to ask how you came up with this approach. What sorts of considerations did you take into account already?
It's a reasonable step, though the lexer uses possibly many mutable strings to build up token values, which may limit the upside of this approach (from the docs: "If a string does not satisfy the aforementioned conditions, it is heap-allocated"). That said, it's certainly a smaller change and I'd be interested in how much performance improvement we see.
I would be happy to explore adding a new data structure with the goal of isolating complexity and improving the overall debuggability of the lexer (I hope I'm not misunderstanding the intention here).
This approach was heavily inspired by Zig's tokenizer (https://github.com/ziglang/zig/blob/master/lib/std/zig/tokenizer.zig#L409). My thinking was that if I could prevent the complexity from leaking out of the lexer into the parser, the number of changes to the lexer over time would be relatively small. I did update the lexer twice to keep up with the upstream bug fixes in the existing lexer.
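For a concrete picture of that state-machine shape, here is a minimal sketch with illustrative names (not this PR's actual code): the lexer carries an explicit state and advances one byte at a time, so no backtracking or `Peekable` lookahead is needed.

```rust
// Sketch of an explicit-state lexer loop. `State` and the token rules
// here are illustrative, not the PR's actual definitions.
#[derive(Clone, Copy)]
enum State {
    Start,
    Ident,
    Int,
}

/// Returns the end position and text of the token starting at `pos`.
fn next_token(src: &str, mut pos: usize) -> (usize, &str) {
    let bytes = src.as_bytes();
    let start = pos;
    let mut state = State::Start;
    loop {
        match (state, bytes.get(pos).copied()) {
            (State::Start, Some(b'a'..=b'z' | b'A'..=b'Z' | b'_')) => {
                state = State::Ident;
                pos += 1;
            }
            (State::Start, Some(b'0'..=b'9')) => {
                state = State::Int;
                pos += 1;
            }
            (State::Ident, Some(b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_'))
            | (State::Int, Some(b'0'..=b'9')) => pos += 1,
            // Any other byte, or end of input, terminates the token.
            _ => return (pos, &src[start..pos]),
        }
    }
}

fn main() {
    assert_eq!(next_token("foo 123", 0), (3, "foo"));
    assert_eq!(next_token("foo 123", 4), (7, "123"));
}
```

Note the returned token text is a borrowed slice of the input, which is also what makes the zero-alloc property fall out of this design.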
Force-pushed from 6db79c5 to 9ebe61c
Force-pushed from 9ebe61c to 9a86bc1
Force-pushed from 9a86bc1 to a516311
```diff
@@ -589,6 +593,7 @@ mod tests {
     }

     #[test]
+    #[ignore]
```
This is off by one because this version of the lexer includes the EOF token in the count while main does not. I don't know which one is more correct, so I'm ignoring it for now; excluding the EOF token from the count is not completely trivial.
@allancalix we are putting this into our next milestone. Stay tuned!
```graphql
mutation {
  messageSender(
    message: "some ok outer string "Tráfico" more ok outer string",
  ) {
    delivered
  }
}
```
the issue with this test is that a string value here must not contain quotation marks. Only block string values can contain (escaped) quotation marks. So Tráfico gets registered as a named token, and named tokens can only be literal letters and numbers. What was the utf-8 issue you ran into here?
I'll remove this test as it doesn't relate to the PR; we can add a fix once I understand the issue you're facing.
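For illustration, a hypothetical spec-valid variant of that input, wrapped in a Rust raw string the way a test might carry it (my sketch, not from the PR): only a block string (`"""..."""`) may contain unescaped quotation marks.

```rust
fn main() {
    // Hypothetical spec-valid test input: the inner quotes around
    // "Tráfico" are legal inside a block string, unlike a plain string.
    let input = r#"
mutation {
  messageSender(
    message: """some ok outer string "Tráfico" more ok outer string"""
  ) {
    delivered
  }
}
"#;
    println!("{input}");
}
```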
This test was added to cover a regression I found in real queries. The problem was that the lexer was capturing utf-8 characters for Name tokens, and this triggers a panic in the parser.
So with the byte sequence `"some ok outer string "Tráfico" more ok outer string"` I ended up with something like:

```
String "some ok outer string "
Ident  Tráfico   # triggers panic
String " more ok outer string"
```

I fixed the issue by using `prev_str` instead of `current_str` in the branch. This matches the previous lexer's behavior by creating something like:

```
String "some ok outer string "
Ident  Tr
Ident  fico
String " more ok outer string"
Error  á
```
@lrlna The alternate way of handling this, which might make more sense, is to produce a single `Ident Tráfico` token. The line below sometimes triggers a panic when indexing into string slices: depending on the character, the byte index in `name[1..]` can fall inside a multi-byte utf-8 character rather than on a character boundary.

```rust
if name.len() >= 2 && !name[1..].chars().all(is_remainder_char) {
```
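A minimal sketch of a boundary-safe alternative (`is_remainder_char` is stubbed here and may differ from the project's actual predicate): iterating with `chars()` avoids byte-index slicing entirely, so multi-byte characters cannot trigger the panic.

```rust
// Stub standing in for the project's is_remainder_char predicate
// (assumed shape, for illustration only).
fn is_remainder_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

// chars().skip(1) walks characters, never raw byte offsets, so there
// is no char-boundary panic even when a character is multi-byte.
fn rest_is_valid(name: &str) -> bool {
    name.chars().skip(1).all(is_remainder_char)
}

fn main() {
    assert!(rest_is_valid("Tr"));
    assert!(!rest_is_valid("Tráfico")); // 'á' fails the predicate, no panic
    // By contrast, &"áb"[1..] would panic: byte index 1 falls inside 'á'.
    assert!(rest_is_valid("áb"));
}
```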
@allancalix thank you so much for all your work on this and for keeping up with all the rebases over the last few months! We are merging it in today, aiming to publish sometime next week.
Thanks for all the feedback, I'm happy to see this get merged! |
Relates to #293
I'm hoping for some feedback on the general direction and the level of interest in including these changes in the upstream project.
This PR is a superset of the work done in #115 (attribution pending). It combines the lazy streaming lexer with the removal of string allocations from all tokens. This change is largely backwards compatible with the existing parser.
I apologize in advance for the amount of change in the lexer; it was quite difficult to maintain lifetime invariants with the number of mutable borrows from the `Peekable` adapter. This approach uses a finite state machine to minimize the amount of backtracking and lookahead required.
These changes have a considerable impact on parse times for small queries and a very large impact on the pathological query observed in the referenced ticket. At this point the existing and new implementations are compatible, though more testing is warranted (fuzzing, perhaps) given the magnitude of the changes.
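To make the zero-alloc direction concrete, here is a minimal sketch of the token shape involved (assumed names, not this PR's actual types): each token borrows its text from the source rather than owning a `String`, which is also where the lifetime requirements discussed above come from.

```rust
// Illustrative token shape for a zero-alloc lexer: the token borrows
// its text from the source, tying the token's lifetime to the input.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Name,
    StringValue,
}

#[derive(Debug, PartialEq)]
struct Token<'a> {
    kind: TokenKind,
    // &'a str instead of String: no allocation per token, but the
    // token cannot outlive the source text.
    text: &'a str,
}

fn main() {
    let source = String::from("{ delivered \"msg\" }");
    // Tokens borrow slices of `source`; nothing is copied or allocated.
    let tokens = [
        Token { kind: TokenKind::Name, text: &source[2..11] },
        Token { kind: TokenKind::StringValue, text: &source[12..17] },
    ];
    assert_eq!(tokens[0].text, "delivered");
    assert_eq!(tokens[1].text, "\"msg\"");
    // drop(source); // would not compile: the tokens still borrow `source`
}
```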
Comparison over main as of 2022-09-30
Remaining Work