
Zero alloc lexer #322

Merged (7 commits, Jul 5, 2023)

Conversation

@allancalix (Contributor) commented Oct 1, 2022

Relates to #293

I'm hoping for some feedback on the general direction + level of interest in including these changes to the upstream project.

This PR is a superset of the work done in #115 (attribution pending). It combines the lazy streaming lexer with the removal of string allocations from all tokens. This change is largely backwards compatible with the existing parser.

I apologize in advance for the amount of change in the lexer; it was quite difficult to maintain lifetime invariants with the number of mutable borrows required by the Peekable adapter. This approach uses a finite state machine to minimize the amount of backtracking and lookahead required.
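For readers unfamiliar with the technique, here is a minimal sketch of a state-machine lexer whose tokens borrow slices from the source string, so lexing itself never allocates. The `State`/`TokenKind` names are illustrative only and do not match this PR's actual types:

```rust
// Minimal FSM lexer sketch (illustrative names, not the PR's actual code).
// Tokens borrow &str slices from the input, so no allocation happens.
#[derive(Debug, PartialEq)]
enum TokenKind { Name, Int, Eof }

#[derive(Debug, PartialEq)]
struct Token<'a> { kind: TokenKind, text: &'a str }

enum State { Start, Name, Int }

fn next_token<'a>(src: &'a str, pos: &mut usize) -> Token<'a> {
    let bytes = src.as_bytes();
    let mut state = State::Start;
    let mut start = *pos;
    loop {
        let b = bytes.get(*pos).copied();
        match state {
            // Skip whitespace, then transition on the first significant byte.
            State::Start => match b {
                Some(c) if c.is_ascii_whitespace() => { *pos += 1; start = *pos; }
                Some(c) if c.is_ascii_alphabetic() || c == b'_' => { state = State::Name; *pos += 1; }
                Some(c) if c.is_ascii_digit() => { state = State::Int; *pos += 1; }
                _ => return Token { kind: TokenKind::Eof, text: "" },
            },
            // Consume the rest of an identifier; emit a zero-copy slice.
            State::Name => match b {
                Some(c) if c.is_ascii_alphanumeric() || c == b'_' => *pos += 1,
                _ => return Token { kind: TokenKind::Name, text: &src[start..*pos] },
            },
            // Consume the rest of an integer literal.
            State::Int => match b {
                Some(c) if c.is_ascii_digit() => *pos += 1,
                _ => return Token { kind: TokenKind::Int, text: &src[start..*pos] },
            },
        }
    }
}

fn main() {
    let src = "query hero 42";
    let mut pos = 0;
    assert_eq!(next_token(src, &mut pos).text, "query");
    assert_eq!(next_token(src, &mut pos).text, "hero");
    let t = next_token(src, &mut pos);
    assert_eq!((t.kind, t.text), (TokenKind::Int, "42"));
    assert_eq!(next_token(src, &mut pos).kind, TokenKind::Eof);
}
```

Because each match arm both consumes input and decides the next state, the machine never needs to backtrack or build up an intermediate `String`.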

These changes yield a considerable improvement in parse times for small queries and a very large improvement on the pathological query observed in the referenced ticket. At this point the existing and new implementations are compatible, though more testing (fuzzing, perhaps) is warranted given the magnitude of the changes.

Comparison over main as of 2022-09-30

parser_peek_n           time:   [8.7874 µs 8.8062 µs 8.8267 µs]                           
                        change: [-23.995% -23.824% -23.635%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

many_aliases            time:   [2.2570 ms 2.2611 ms 2.2657 ms]                          
                        change: [-35.767% -35.427% -35.112%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high severe

Remaining Work

  • Token limits were implemented after this work was done and seem to misbehave; investigate the #[ignore] tests and re-enable them once limits are fixed
  • Simplify Cursor type + implementation
  • Additional fuzzing
  • Add doc comments

@Geal mentioned this pull request Oct 19, 2022
@goto-bus-stop (Member)

Hi, thanks so much for the PR! After merging the streaming lexer, we want to punt on zero-alloc for a while, as we have a re-parsing feature coming up in the near future which may impose lifetime requirements on the lexer. Once that's done, I think we would look at either zero-alloc or a small-string optimisation (which would also remove most allocations while still having an owned Token type).

#115 should already help a fair bit. Perhaps you could pull the new benchmark into a separate PR so we can already see how that evolves, before going ahead with zero-alloc?

@allancalix force-pushed the zero-alloc-lexer branch 2 times, most recently from 70230b6 to a7efa0e, on November 14, 2022 22:19
@allancalix (Contributor, Author) commented Nov 14, 2022

Thanks for looking into this. I rebased this PR over main and integrated the fixes that were added in #357. You should be able to run the recently added benchmarks now.

The changes do seem to help a fair bit, but there is still roughly a 20% differential between 0.3 and this version of the lexer. Here are some benchmarks; note the difference between apollo_graphql_parser and apollo_fork_graphql_parser:

async_graphql_parser    time:   [7.0465 µs 7.0680 µs 7.0964 µs]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe

apollo_graphql_parser   time:   [5.0164 µs 5.0229 µs 5.0295 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild

apollo_fork_graphql_parser
                        time:   [3.9070 µs 3.9121 µs 3.9173 µs]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild

graphql_parser          time:   [5.3375 µs 5.3478 µs 5.3612 µs]
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) low severe
  2 (2.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

async_graphql_parser #2 time:   [2.0918 ms 2.0992 ms 2.1066 ms]

apollo_graphql_parser #2
                        time:   [2.0997 ms 2.1086 ms 2.1197 ms]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Benchmarking apollo_fork_graphql_parser #2: Warming up for 3.0000 s
Warning: Unable to complete 100 samples in 5.0s. You may wish to increase target time to 8.8s, enable flat sampling, or reduce sample count to 50.
apollo_fork_graphql_parser #2
                        time:   [1.6741 ms 1.6824 ms 1.6912 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

graphql_parser #2       time:   [2.0441 ms 2.0472 ms 2.0507 ms]
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

@allancalix (Contributor, Author)

Update as of 2022-12-15, comparing main with and without the lexer commit

many_aliases            time:   [1.9225 ms 1.9321 ms 1.9425 ms]                          
                        change: [-20.666% -20.165% -19.761%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

query_lexer             time:   [1.1044 µs 1.1071 µs 1.1099 µs]                         
                        change: [-46.599% -46.451% -46.308%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

query_parser            time:   [7.4076 µs 7.4286 µs 7.4505 µs]                          
                        change: [-21.579% -21.229% -20.915%] (p = 0.00 < 0.05)
                        Performance has improved.

     Running benches/supergraph.rs
supergraph_lexer        time:   [43.429 µs 43.524 µs 43.616 µs]                              
                        change: [-45.923% -45.811% -45.701%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 16 outliers among 100 measurements (16.00%)
  2 (2.00%) low mild
  3 (3.00%) high mild
  11 (11.00%) high severe

supergraph_parser       time:   [178.65 µs 179.14 µs 179.71 µs]                              
                        change: [-19.674% -19.286% -18.936%] (p = 0.00 < 0.05)
                        Performance has improved.

@goto-bus-stop (Member)

Thanks for the update! I see you already removed the Lexer/LexerIterator split, so I don't have many notes here :) I generally support this direction but want to discuss with @lrlna first, which likely needs to wait until January.

@gbossh commented Dec 21, 2022

cc @msvec80

@lrlna (Member) commented Dec 21, 2022

Hi @allancalix, thanks again for putting all of this together!!

This PR is rather large, and I want to break it down into more digestible parts. Can I ask you to pull benchmarks into their own PR? Those are super handy and can be merged right away.

We need to think a bit more about the approach towards a zero-alloc lexer. I think the particular approach you took here does achieve substantial performance improvements, but it also significantly complicates the lexer, a part of the compiler that I'd like to keep as simple and as easily debuggable as possible. A few things I'd like to bring up for consideration:

  1. Is there a path forward where we don't overload the existing Cursor and instead let it be solely a peekable iterator over chars? I am wondering whether introducing a Reader on top of the existing Cursor that keeps track of start_pos, pos, etc. might be a good approach: it would keep the current lexer simple and debuggable, and offload the handling of allocations (or lack thereof) onto a separate structure.

  2. I am also wondering about the introduction of a state machine for the lexer's token kinds. The state implemented here conflates two ideas: the state of the stream and the current token kind. That will make the lexer more complicated to debug in case of changes to the grammar or any bugs found. I am also not certain it is, in general, helpful to the idea of a zero-alloc lexer.

  3. My colleague @goto-bus-stop has been thinking about using the smol-str crate to help eliminate a bunch of the Strings in the lexer. While we figure out a good path forward for this zero-alloc lexer, we'll likely implement a smol-str improvement first, which should be quite helpful here (having your benchmarks on main would be super useful here too). What are your thoughts on this?
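To make the small-string idea concrete: a type like smol-str stores short strings inline in the struct itself and only heap-allocates past a size threshold (the smol_str docs state strings up to 23 bytes are inlined), so short identifier tokens avoid allocation even though the Token type stays owned. The toy type below illustrates the mechanism only; it is not smol-str's actual representation:

```rust
// Toy small-string type illustrating the mechanism behind crates like
// smol-str (this is NOT smol-str's actual layout or capacity).
const INLINE_CAP: usize = 22;

enum SmallStr {
    Inline { len: u8, buf: [u8; INLINE_CAP] },
    Heap(String),
}

impl SmallStr {
    fn new(s: &str) -> Self {
        if s.len() <= INLINE_CAP {
            // Short string: copy the bytes into the struct, no heap allocation.
            let mut buf = [0u8; INLINE_CAP];
            buf[..s.len()].copy_from_slice(s.as_bytes());
            SmallStr::Inline { len: s.len() as u8, buf }
        } else {
            // Long string: fall back to an owned heap allocation.
            SmallStr::Heap(s.to_owned())
        }
    }

    fn as_str(&self) -> &str {
        match self {
            SmallStr::Inline { len, buf } => {
                // Safe: the buffer holds a complete, valid UTF-8 string.
                std::str::from_utf8(&buf[..*len as usize]).unwrap()
            }
            SmallStr::Heap(s) => s,
        }
    }

    fn is_heap_allocated(&self) -> bool {
        matches!(self, SmallStr::Heap(_))
    }
}

fn main() {
    // Typical GraphQL identifiers are short, so they stay inline.
    let name = SmallStr::new("messageSender");
    assert!(!name.is_heap_allocated());
    assert_eq!(name.as_str(), "messageSender");

    // Long token values (e.g. big string literals) still hit the heap.
    let long = SmallStr::new("a token value well past the inline capacity");
    assert!(long.is_heap_allocated());
}
```

This is why the optimization helps most with identifier-heavy documents: long string literals still pay for a heap allocation.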

On another note, I also wanted to ask as to how you came up with this approach? What sorts of considerations did you take into account already?

@allancalix (Contributor, Author) commented Dec 21, 2022

This PR is rather large, and I want to break it down into more digestible parts.

Yeah 😅, my initial attempt to introduce this change was minimalist, but I ran into significant challenges:

  1. combining mutable borrows of the lexer itself with the large number of immutable borrows required to make this work made lifetimes difficult to resolve
  2. I found it difficult to translate the functions that depend on mutable strings to build up token values (e.g. lexer/mod.rs)

we'll likely implement a smol-str improvement first, which should be quite helpful here. What are your thoughts on this?

It's a reasonable step, though the lexer may use many mutable strings to build up token values, which could limit the upside of this approach (from the docs: "If a string does not satisfy the aforementioned conditions, it is heap-allocated"). That said, it's certainly a smaller change, and I'd be interested in how much performance improvement we see.

Is there a path forward where we don't overload the existing Cursor and let it solely be a peekable iterator over chars?

I would be happy to explore adding a new data structure with the goal of isolating complexity and improving the overall debuggability of the lexer (I hope I'm not misunderstanding the intention here).

On another note, I also wanted to ask as to how you came up with this approach? What sorts of considerations did you take into account already?

This approach was heavily inspired by Zig's tokenizer (https://github.com/ziglang/zig/blob/master/lib/std/zig/tokenizer.zig#L409). My thinking was that if I could prevent the complexity from leaking out of the lexer into the parser, the number of changes to the lexer over time would stay relatively small. I did update the lexer twice to keep up with bug fixes in the existing upstream lexer.

@@ -589,6 +593,7 @@ mod tests {
}

#[test]
#[ignore]
@allancalix (Contributor, Author) commented:

This is off by one because this version of the lexer includes the EOF token in the count while main does not. I don't know which is more correct, so I'm ignoring it for now; excluding the EOF token from the count is not completely trivial.

@allancalix marked this pull request as ready for review May 23, 2023 07:40
@lrlna (Member) commented May 26, 2023

@allancalix we are putting this into our next milestone. stay tuned!

@lrlna added this to the [email protected] milestone May 26, 2023
Comment on lines 1 to 7
mutation {
messageSender(
message: "some ok outer string "Tráfico" more ok outer string",
) {
delivered
}
}
A project member commented:

The issue with this test is that a string value here must not have quotation marks; only block string values can have (escaped) quotation marks. So Tráfico gets registered as a named token, and named tokens can only be literal letters and numbers. What was the UTF-8 issue that you ran into here?

I'll remove this test as it doesn't relate to the PR; we can add a fix once I understand the issue you're facing.

@allancalix (Contributor, Author) replied:

This test was added to cover a regression I found in real queries. The problem was that the lexer was capturing UTF-8 characters in Name tokens, which triggers a panic in the parser.

So with the byte sequence "some ok outer string "Tráfico" more ok outer string" I ended up with something like:

String "some ok outer string "
Ident Tráfico # triggers panic
String " more ok outer string"

I fixed the issue by using prev_str instead of current_str in the branch. This matches the previous lexer's behavior by creating something like:

String "some ok outer string "
Ident Tr
Ident fico
String " more ok outer string"

Error á

@lrlna The alternate way of handling this, which might make more sense, is to produce a single Ident Tráfico token. The line below sometimes panics when indexing into string slices because, depending on the character, name[1..] may fall inside a multi-byte UTF-8 character rather than on a boundary between characters.

if name.len() >= 2 && !name[1..].chars().all(is_remainder_char) {
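For illustration, a small sketch of why that slice can panic, assuming name is a &str; is_remainder_char is the PR's predicate, and is_ascii_alphanumeric stands in for it here:

```rust
fn main() {
    // `&name[1..]` slices by *byte* index, so it panics whenever byte 1
    // falls inside a multi-byte UTF-8 character.
    let ascii_start = "Tráfico"; // 'T' is one byte: byte 1 is a char boundary
    assert!(ascii_start.is_char_boundary(1)); // &ascii_start[1..] is safe

    let multibyte_start = "ática"; // 'á' is two bytes: byte 1 is NOT a boundary
    assert!(!multibyte_start.is_char_boundary(1)); // &multibyte_start[1..] panics

    // A boundary-safe equivalent of `name[1..].chars()`:
    let rest_ok = multibyte_start
        .chars()
        .skip(1) // skip the first *character*, not the first byte
        .all(|c| c.is_ascii_alphanumeric() || c == '_');
    assert!(rest_ok); // 't', 'i', 'c', 'a' all pass
}
```

Iterating with chars() instead of byte-slicing sidesteps the boundary question entirely, whichever token shape the lexer ultimately emits.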

@lrlna self-assigned this Jul 5, 2023
@lrlna (Member) commented Jul 5, 2023

@allancalix thank you so much for all your work on this and for keeping up with all the rebases over the last few months! We are merging it in today aiming to publish sometime next week.

@lrlna merged commit efa365e into apollographql:main Jul 5, 2023
@allancalix (Contributor, Author)

Thanks for all the feedback, I'm happy to see this get merged!
