
feat(parser): Stream tokens from the lexer #115

Merged: 22 commits, Nov 4, 2022
Conversation

@Geal (Contributor) commented Nov 18, 2021

This is an exploration of streaming the lexer. The current parser first allocates the full list of tokens, then reverses it, then transforms it into a syntax tree.
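Roughly, the pre-PR batch flow looks like this (a minimal sketch with a hypothetical Token type and whitespace tokenization, not the real apollo-parser lexer):

```rust
// Sketch of the batch approach: lex everything up front into a Vec, reverse
// it, and let the parser pop tokens from the end. Token is a stand-in type.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Eof,
}

fn lex_all(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = input
        .split_whitespace()
        .map(|s| Token::Name(s.to_string()))
        .collect();
    tokens.push(Token::Eof);
    // The parser wants to consume tokens cheaply from the end of the Vec,
    // so the whole list is reversed first: an extra O(n) pass.
    tokens.reverse();
    tokens
}

fn main() {
    let mut tokens = lex_all("query foo");
    assert_eq!(tokens.pop(), Some(Token::Name("query".to_string())));
    assert_eq!(tokens.pop(), Some(Token::Name("foo".to_string())));
    assert_eq!(tokens.pop(), Some(Token::Eof));
}
```

The allocation of the full Vec plus the reversal pass is the cost this PR sets out to remove.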

With the benchmark I added, on current main ( 6e8e8a8 ) we're at:

test bench_peek_n ... bench:      11,620 ns/iter (+/- 344)
test bench_supergraph ... bench:     418,199 ns/iter (+/- 6,446)

Here's the flamegraph when parsing the supergraph:

[Screenshot: flamegraph of the supergraph parse, 2021-11-18]

We can see that a large part of the time is spent in allocations in the lexing phase.

In 57647fe we see that we can avoid the reversal by replacing the token Vec with a VecDeque (though I think the errors are generated in the wrong order here).
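The VecDeque variant can be sketched as follows (hypothetical &str "tokens" stand in for the real Token type): the lexer pushes to the back, the parser pops from the front, and no reversal pass is needed.

```rust
use std::collections::VecDeque;

// Sketch: a VecDeque lets the lexer append in source order and the parser
// consume in the same order, removing the reverse() pass entirely.
fn lex_all(input: &str) -> VecDeque<&str> {
    let mut tokens = VecDeque::new();
    for t in input.split_whitespace() {
        tokens.push_back(t); // lexer appends in source order
    }
    tokens
}

fn main() {
    let mut tokens = lex_all("query foo");
    assert_eq!(tokens.pop_front(), Some("query")); // consumed front-to-back
    assert_eq!(tokens.pop_front(), Some("foo"));
    assert_eq!(tokens.pop_front(), None);
}
```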

Then we can remove that Vec entirely in 81f94f4 by transforming the Lexer into an iterator of Token or Error. The interesting thing here is that it's actually slower:

test bench_peek_n ... bench:      23,392 ns/iter (+/- 636)
test bench_supergraph ... bench:     725,191 ns/iter (+/- 27,976)
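A streaming lexer in this style can be sketched like this (a hypothetical, much-simplified tokenizer; the real lexer handles the GraphQL grammar and richer error types). Note that the tokens borrow from the input rather than copying it:

```rust
// Sketch of a streaming lexer: an Iterator that produces one token per call,
// with no intermediate Vec. Token and the String error are stand-in types.
#[derive(Debug, PartialEq)]
struct Token<'a>(&'a str);

struct Lexer<'a> {
    rest: &'a str, // unconsumed input
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        self.rest = self.rest.trim_start();
        if self.rest.is_empty() {
            return None;
        }
        // Cut the next whitespace-delimited token off the front of the input.
        let end = self.rest.find(char::is_whitespace).unwrap_or(self.rest.len());
        let (tok, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(Ok(Token(tok)))
    }
}

fn main() {
    let mut lexer = Lexer { rest: "query foo" };
    assert_eq!(lexer.next(), Some(Ok(Token("query"))));
    assert_eq!(lexer.next(), Some(Ok(Token("foo"))));
    assert_eq!(lexer.next(), None);
}
```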

We can see in the global graph that we removed all allocations in the lexer:
[Screenshot: global flamegraph with lexer allocations gone]

But if we zoom in, we see that the peek function is called very often:

[Screenshot: zoomed flamegraph showing frequent peek calls]

The parser calls that function regularly to make decisions while parsing the rest of the input. When tokens were preallocated, looking at the next one was nearly free, but now we end up re-lexing it often.

So in e743184 we store the next token in the Parser when it is first requested, to avoid re-lexing it. The cost is fixed, only enough memory for a single token, and we get better performance:

test bench_peek_n ... bench:       9,144 ns/iter (+/- 152)
test bench_supergraph ... bench:     297,023 ns/iter (+/- 11,814)

The graphs get smaller and we don't call peek as often.
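The one-token lookahead cache can be sketched like this (hypothetical, simplified types; the real Parser works over syntax tokens):

```rust
// Sketch of the one-token lookahead cache: peek() lexes the next token at
// most once and stores it; pop() takes the cached token if present.
// The buffer holds at most one token, so the memory cost is fixed.
struct Parser<I: Iterator<Item = &'static str>> {
    lexer: I,
    current: Option<&'static str>,
}

impl<I: Iterator<Item = &'static str>> Parser<I> {
    fn peek(&mut self) -> Option<&'static str> {
        if self.current.is_none() {
            self.current = self.lexer.next(); // lex only on the first request
        }
        self.current
    }

    fn pop(&mut self) -> Option<&'static str> {
        // Use the cached token if we have one, otherwise lex a fresh one.
        self.current.take().or_else(|| self.lexer.next())
    }
}

fn main() {
    let mut p = Parser { lexer: vec!["query", "foo"].into_iter(), current: None };
    assert_eq!(p.peek(), Some("query"));
    assert_eq!(p.peek(), Some("query")); // repeated peeks don't re-lex
    assert_eq!(p.pop(), Some("query"));
    assert_eq!(p.pop(), Some("foo"));
    assert_eq!(p.pop(), None);
}
```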

[Screenshots: flamegraphs after caching the next token]

To sum up: fewer allocations, a 20 to 30% improvement, and no change to the syntax tree creation code. I feel good about that :)

@Geal Geal requested a review from lrlna November 18, 2021 16:44
@lrlna (Member) commented Nov 18, 2021

Thanks for this @Geal! I will take a look and review early next week.

@Geal Geal force-pushed the streaming-lexer branch 3 times, most recently from 2894c2f to bc1dfc4 Compare November 24, 2021 09:17
@lrlna (Member) left a comment:

Thanks for all the diagnostics!

I really like the idea of a streaming lexer, so I am keen on getting this in.

First I'd like to tighten the implementation a bit, though. Since we are already going for a streaming implementation, we should have the Lexer take a Read and output a Write.

I also think the Lexer is becoming a bit bulky. There is now a LexerIterator, LexerResult and a Cursor. I'd like to see if we can simplify this a bit and make it more readable. Using Read and Write directly is likely going to help with this.

I would also like to see if the parser "navigation" functions, i.e. peek_token, peek_n, pop can be simplified a bit. I think the iteration can be done outside of the individual functions, e.g. parser.parse().

self.errors.as_slice()
}
}

#[derive(Clone, Debug)]
pub struct LexerIterator<'a> {
@lrlna (Member):

Since we are taking the streaming route, I think we should go all the way and have the Lexer take a Read and output a Write, the way a through stream would.

@Geal (Contributor, Author):

Why a Write? The output is not meant to be written to a byte buffer; it produces tokens to be consumed by the parser.
For Read, it is a bit tricky, because the next step in optimizing things would be to produce tokens that reference the input data instead of copying it, so we quickly run into issues like trying to refill the buffer while we still hold some tokens.
Can we assume that the input data will always fit into memory? Are you planning for other input types (like rope structures, which would make sense for editors)?

(Two review comments on crates/apollo-parser/src/parser/mod.rs, since resolved.)
@Geal (Contributor, Author) commented Nov 30, 2021

> I also think the Lexer is becoming a bit bulky. There is now a LexerIterator, LexerResult and a Cursor. I'd like to see if we can simplify this a bit and make it more readable. Using Read and Write directly is likely going to help with this.

I think I can replace LexerResult with a std Result. Also, LexerIterator can be renamed Lexer, and I'll add a collect() method that produces Vecs of tokens and errors.
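The shape of such a collect helper could look roughly like this (a sketch over plain Results; Token and Error are stand-in aliases, and lex_to_vecs is a hypothetical name):

```rust
// Sketch: an iterator of Result<Token, Error> split into separate token and
// error lists, which is what a lex()/collect()-style helper would return.
type Token = String;
type Error = String;

fn lex_to_vecs(items: impl Iterator<Item = Result<Token, Error>>) -> (Vec<Token>, Vec<Error>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for item in items {
        match item {
            Ok(t) => tokens.push(t),   // keep the token stream in order
            Err(e) => errors.push(e),  // collect errors separately
        }
    }
    (tokens, errors)
}

fn main() {
    let stream = vec![
        Ok("query".to_string()),
        Err("unexpected character".to_string()),
        Ok("foo".to_string()),
    ];
    let (tokens, errors) = lex_to_vecs(stream.into_iter());
    assert_eq!(tokens, vec!["query", "foo"]);
    assert_eq!(errors, vec!["unexpected character"]);
}
```

This keeps the streaming iterator as the primitive while still offering the old batch-style output for callers that want it.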

@Geal (Contributor, Author) commented Nov 30, 2021

Rebased over main.

@Geal (Contributor, Author) commented Nov 30, 2021

> I would also like to see if the parser "navigation" functions, i.e. peek_token, peek_n, pop can be simplified a bit. I think the iteration can be done outside of the individual functions, e.g. parser.parse().

I'm looking at it; there are ways to simplify them. I'm not sure iteration should be done outside, though, since we'd lose the benefit of streaming: iteration is driven by the needs of the tree builder.

let token = self
.tokens
.pop()
.expect("Could not eat a token from the AST");
@Geal (Contributor, Author):

I kept that possible panic, but is it really OK? There might be cases where it is triggered. I'll test that through fuzzing.

Geoffroy Couprie and others added 14 commits October 27, 2022 14:01
Make the Lexer type public to access it from the benchmarks
Thanks to the peek_* helper methods, this does not change the parser
radically, just the way we go through the tokens
this is currently twice as slow as accumulating all tokens then parsing,
because we regularly call the peek_* methods. They're basically free if
we accumulate in advance the tokens, but in streaming we are
recalculating them all the time
the parser is frequently looking at the next token to make a decision.
When we stored the list of tokens, that operation was essentially free,
but in streaming, we end up reparsing the next token multiple times. By
storing it in advance, we avoid that cost
@goto-bus-stop (Member) commented:
So I made the iterator the main lexer API now and added a .lex() method that returns a (Tokens, Errors) tuple. The public API is just for token in Lexer::new(source) or let (tokens, errors) = Lexer::new(source).lex().

For the Parser I think we can keep the lifetime for now, as the typical use case is to do:

let parser = Parser::new(source);
let tree = parser.parse();

And then you just use the tree, which is an owned structure. So you don't end up ever storing the parser with a lifetime on it.

When we add reparsing we can reconsider that, I think the API (from an ownership view, not the exact parameters) would roughly be this:

let mut parser = Parser::new(source); // ← already does the parse
let ast = parser.ast(); // get the already-parsed AST
parser.reparse(new_source);

Parser::new() can then return a structure that contains the owned syntax tree and doesn't need to reference the source string. So it should be easy to remove the lifetime then.
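The ownership argument can be sketched concretely (purely hypothetical types; the stand-in "parse" just counts whitespace-separated tokens): because the parser owns its syntax tree, no lifetime ties it to the source string, and reparse can simply replace the tree.

```rust
// Hypothetical sketch of a reparse-capable, lifetime-free Parser: it owns
// the syntax tree it produced, so the source string can be dropped after
// construction. SyntaxTree is a trivial stand-in.
struct SyntaxTree {
    node_count: usize,
}

struct Parser {
    ast: SyntaxTree,
}

impl Parser {
    fn new(source: &str) -> Self {
        // Already does the parse; the result is fully owned.
        Parser { ast: SyntaxTree { node_count: source.split_whitespace().count() } }
    }

    fn ast(&self) -> &SyntaxTree {
        &self.ast
    }

    fn reparse(&mut self, new_source: &str) {
        // Replace the owned tree; no borrow of the old source survives.
        self.ast = SyntaxTree { node_count: new_source.split_whitespace().count() };
    }
}

fn main() {
    let mut parser = Parser::new("query foo");
    assert_eq!(parser.ast().node_count, 2);
    parser.reparse("query foo bar");
    assert_eq!(parser.ast().node_count, 3);
}
```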

@goto-bus-stop goto-bus-stop requested a review from lrlna November 1, 2022 12:35
@lrlna (Member) left a comment:

Yay, let's get this in! I am sure @Geal will be very happy :)))

goto-bus-stop and others added 2 commits November 3, 2022 16:23
@Geal (Contributor, Author) commented Nov 3, 2022

awesome!

@goto-bus-stop goto-bus-stop changed the title Streaming lexer feat(parser): Stream tokens from the lexer Nov 4, 2022
@goto-bus-stop goto-bus-stop merged commit 87522b7 into main Nov 4, 2022
@goto-bus-stop goto-bus-stop deleted the streaming-lexer branch November 4, 2022 08:52