Zero alloc lexer #322
Conversation
Force-pushed from 62f3ad1 to 186e21c
hi, thanks so much for the PR! After merging the streaming lexer, we want to punt on zero-alloc for a while, as we have a re-parsing feature coming up in the near future which may impose lifetime requirements on the lexer. Once that's done I think we would look at either zero-alloc or a small-string optimisation (which would also remove most allocations while still having an owned Token type). #115 should already help a fair bit. Perhaps you could pull the new benchmark into a separate PR so we can already see how that evolves, before going ahead with zero-alloc?
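To make the small-string idea concrete, here is a rough sketch of the general technique (illustrative only, not a committed design — crates like `smol_str` implement it properly): token text short enough to fit inline lives on the stack, so `Token` stays an owned type while most allocations disappear.

```rust
// Rough sketch of a small-string type: values up to 22 bytes are stored
// inline; longer ones fall back to a heap-allocated String. The 22-byte
// threshold is arbitrary for this illustration.
enum SmallStr {
    Inline { len: u8, buf: [u8; 22] },
    Heap(String),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= 22 {
            let mut buf = [0u8; 22];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            SmallStr::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            // Safe to unwrap: buf[..len] was copied from a valid &str.
            SmallStr::Inline { len, buf } => {
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallStr::Heap(s) => s,
        }
    }
}

fn main() {
    let short = SmallStr::new("query"); // stored inline, no allocation
    let long = SmallStr::new("a token value much longer than twenty-two bytes");
    assert_eq!(short.as_str(), "query");
    assert!(long.as_str().len() > 22); // heap-allocated fallback
}
```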
Force-pushed from 70230b6 to a7efa0e
Thanks for looking into this. I rebased this PR over main and integrated the fixes that were added in #357. You should be able to run the recently added benchmarks now. The changes seem to help a fair bit, but there is still roughly a 20% differential between this branch and main.
Force-pushed from a7efa0e to c6c4a15
Update as of 2022-12-15, with and without the lexer commit on main.
Force-pushed from c6c4a15 to 6db79c5
Thanks for the update! I see you already removed the Lexer/LexerIterator split so I don't have many notes here :) I generally support this direction but want to discuss w/ @lrlna first, which likely needs to wait til January.
cc @msvec80
Hi @allancalix, thanks again for putting all of this together!! This PR is rather large, and I want to break it down into more digestible parts. Can I ask you to pull benchmarks into their own PR? Those are super handy and can be merged right away. We need to think a bit more about the approach towards a zero-alloc lexer. I think the particular approach you took here does achieve substantial performance improvements, but it also significantly complicates the lexer, a part of the compiler that I'd like to keep as simple and as easily debuggable as possible. A few things I'd like to bring up for consideration:
On another note, I also wanted to ask how you came up with this approach. What sorts of considerations did you take into account already?
It's a reasonable step, though the lexer uses possibly many mutable strings to build up token values, which may limit the upside of this approach (from the docs: "If a string does not satisfy the aforementioned conditions, it is heap-allocated"). That said, it's certainly a smaller change and I'd be interested in how much performance improvement we see.
I would be happy to explore adding a new data structure with the goal of isolating complexity and improving the overall debuggability of the lexer (I hope I'm not misunderstanding the intention here).
This approach was heavily inspired by Zig's tokenizer (https://github.com/ziglang/zig/blob/master/lib/std/zig/tokenizer.zig#L409). My thinking was that if I could prevent the complexity from leaking out of the lexer into the parser, the number of changes to the lexer over time would be relatively small. I did update the lexer twice to keep up with the upstream bug fixes in the existing lexer.
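For a concrete picture of that state-machine shape, here is a minimal sketch with illustrative names (not this PR's actual code): the lexer carries an explicit state and advances one byte at a time, so no backtracking or `Peekable` lookahead is needed.

```rust
// Sketch of an explicit-state lexer loop. `State` and the token rules
// here are illustrative, not the PR's actual definitions.
#[derive(Clone, Copy)]
enum State {
    Start,
    Ident,
    Int,
}

/// Returns the end position and text of the token starting at `pos`.
fn next_token(src: &str, mut pos: usize) -> (usize, &str) {
    let bytes = src.as_bytes();
    let start = pos;
    let mut state = State::Start;
    loop {
        match (state, bytes.get(pos).copied()) {
            (State::Start, Some(b'a'..=b'z' | b'A'..=b'Z' | b'_')) => {
                state = State::Ident;
                pos += 1;
            }
            (State::Start, Some(b'0'..=b'9')) => {
                state = State::Int;
                pos += 1;
            }
            (State::Ident, Some(b'a'..=b'z' | b'A'..=b'Z' | b'0'..=b'9' | b'_'))
            | (State::Int, Some(b'0'..=b'9')) => pos += 1,
            // Any other byte, or end of input, terminates the token.
            _ => return (pos, &src[start..pos]),
        }
    }
}

fn main() {
    assert_eq!(next_token("foo 123", 0), (3, "foo"));
    assert_eq!(next_token("foo 123", 4), (7, "123"));
}
```

Note the returned token text is a borrowed slice of the input, which is also what makes the zero-alloc property fall out of this design.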
Force-pushed from 6db79c5 to 9ebe61c
Force-pushed from 9ebe61c to 9a86bc1
Force-pushed from 9a86bc1 to a516311
```diff
@@ -589,6 +593,7 @@ mod tests {
     }

     #[test]
+    #[ignore]
```
This is off by one because this version of the lexer includes the EOF token in the count while main does not. I don't know which one is more correct, so I'm ignoring it for now; excluding the EOF token from the count is not completely trivial.
@allancalix we are putting this into our next milestone. Stay tuned!
```graphql
mutation {
  messageSender(
    message: "some ok outer string "Tráfico" more ok outer string",
  ) {
    delivered
  }
}
```
the issue with this test is that a string value here must not contain quotation marks. Only block string values can contain (escaped) quotation marks. So Tráfico gets registered as a named token, and named tokens can only be literal letters and numbers. What was the utf-8 issue you ran into here?
I'll remove this test as it doesn't relate to the PR; we can add a fix once I understand the issue you're facing.
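For illustration, a hypothetical spec-valid variant of that input, wrapped in a Rust raw string the way a test might carry it (my sketch, not from the PR): only a block string (`"""..."""`) may contain unescaped quotation marks.

```rust
fn main() {
    // Hypothetical spec-valid test input: the inner quotes around
    // "Tráfico" are legal inside a block string, unlike a plain string.
    let input = r#"
mutation {
  messageSender(
    message: """some ok outer string "Tráfico" more ok outer string"""
  ) {
    delivered
  }
}
"#;
    println!("{input}");
}
```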
This test was added to cover a regression I found in real queries. The problem was that the lexer was capturing utf-8 characters for Name tokens, and this triggers a panic in the parser.
So with the byte sequence `"some ok outer string "Tráfico" more ok outer string"` I ended up with something like:

```
String "some ok outer string "
Ident  Tráfico   # triggers panic
String " more ok outer string"
```

I fixed the issue by using `prev_str` instead of `current_str` in the branch. This matches the previous lexer's behavior by creating something like:

```
String "some ok outer string "
Ident  Tr
Ident  fico
String " more ok outer string"
Error  á
```
@lrlna The alternate way of handling this, which might make more sense, is to produce a single `Ident Tráfico` token. The line below sometimes triggers a panic when indexing into string slices: depending on the character, the byte index in `name[1..]` can fall inside a multi-byte utf-8 character rather than on a character boundary.

```rust
if name.len() >= 2 && !name[1..].chars().all(is_remainder_char) {
```
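A minimal sketch of a boundary-safe alternative (`is_remainder_char` is stubbed here and may differ from the project's actual predicate): iterating with `chars()` avoids byte-index slicing entirely, so multi-byte characters cannot trigger the panic.

```rust
// Stub standing in for the project's is_remainder_char predicate
// (assumed shape, for illustration only).
fn is_remainder_char(c: char) -> bool {
    c.is_ascii_alphanumeric() || c == '_'
}

// chars().skip(1) walks characters, never raw byte offsets, so there
// is no char-boundary panic even when a character is multi-byte.
fn rest_is_valid(name: &str) -> bool {
    name.chars().skip(1).all(is_remainder_char)
}

fn main() {
    assert!(rest_is_valid("Tr"));
    assert!(!rest_is_valid("Tráfico")); // 'á' fails the predicate, no panic
    // By contrast, &"áb"[1..] would panic: byte index 1 falls inside 'á'.
    assert!(rest_is_valid("áb"));
}
```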
@allancalix thank you so much for all your work on this and for keeping up with all the rebases over the last few months! We are merging it in today, aiming to publish sometime next week.
Thanks for all the feedback, I'm happy to see this get merged! |
Relates to #293
I'm hoping for some feedback on the general direction and the level of interest in including these changes in the upstream project.
This PR is a superset of the work done in #115 (attribution pending). It combines the lazy streaming lexer with the removal of string allocations from all tokens. This change is largely backwards compatible with the existing parser.
I apologize in advance for the amount of change in the lexer; it was quite difficult to maintain lifetime invariants with the number of mutable borrows from the `Peekable` adapter. This approach uses a finite state machine to minimize the amount of backtracking and lookahead required.
These changes have a considerable impact on parse times for small queries and a very large impact on the pathological query observed in the referenced ticket. At this point the existing and new implementations are compatible, though more testing is warranted (fuzzing, perhaps) given the magnitude of the changes.
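To make the zero-alloc direction concrete, here is a minimal sketch of the token shape involved (assumed names, not this PR's actual types): each token borrows its text from the source rather than owning a `String`, which is also where the lifetime requirements discussed above come from.

```rust
// Illustrative token shape for a zero-alloc lexer: the token borrows
// its text from the source, tying the token's lifetime to the input.
#[derive(Debug, PartialEq)]
enum TokenKind {
    Name,
    StringValue,
}

#[derive(Debug, PartialEq)]
struct Token<'a> {
    kind: TokenKind,
    // &'a str instead of String: no allocation per token, but the
    // token cannot outlive the source text.
    text: &'a str,
}

fn main() {
    let source = String::from("{ delivered \"msg\" }");
    // Tokens borrow slices of `source`; nothing is copied or allocated.
    let tokens = [
        Token { kind: TokenKind::Name, text: &source[2..11] },
        Token { kind: TokenKind::StringValue, text: &source[12..17] },
    ];
    assert_eq!(tokens[0].text, "delivered");
    assert_eq!(tokens[1].text, "\"msg\"");
    // drop(source); // would not compile: the tokens still borrow `source`
}
```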
Comparison over main as of 2022-09-30
Remaining Work