
feat(parser): Stream tokens from the lexer #115

Merged: 22 commits, Nov 4, 2022
Conversation

@Geal (Contributor) commented Nov 18, 2021

This is an exploration of streaming the lexer. The current parser first allocates the full list of tokens, then reverses it, then transforms it into a syntax tree.
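Roughly, the pre-PR batch flow looks like this (a minimal sketch with a hypothetical Token type and whitespace tokenization, not the real apollo-parser lexer):

```rust
// Sketch of the batch approach: lex everything up front into a Vec, reverse
// it, and let the parser pop tokens from the end. Token is a stand-in type.
#[derive(Debug, Clone, PartialEq)]
enum Token {
    Name(String),
    Eof,
}

fn lex_all(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = input
        .split_whitespace()
        .map(|s| Token::Name(s.to_string()))
        .collect();
    tokens.push(Token::Eof);
    // The parser wants to consume tokens cheaply from the end of the Vec,
    // so the whole list is reversed first: an extra O(n) pass.
    tokens.reverse();
    tokens
}

fn main() {
    let mut tokens = lex_all("query foo");
    assert_eq!(tokens.pop(), Some(Token::Name("query".to_string())));
    assert_eq!(tokens.pop(), Some(Token::Name("foo".to_string())));
    assert_eq!(tokens.pop(), Some(Token::Eof));
}
```

The allocation of the full Vec plus the reversal pass is the cost this PR sets out to remove.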

With the benchmark I added, on current main ( 6e8e8a8 ) we're at:

test bench_peek_n ... bench:      11,620 ns/iter (+/- 344)
test bench_supergraph ... bench:     418,199 ns/iter (+/- 6,446)

Here's the flamegraph when parsing the supergraph:

[Screenshot: flamegraph of the supergraph parse, 2021-11-18]

We can see that a large part of the time is spent in allocations in the lexing phase.

In 57647fe we see that we can avoid the reversal by replacing the token Vec with a VecDeque (though I think the errors are generated in the wrong order here).
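The VecDeque variant can be sketched as follows (hypothetical &str "tokens" stand in for the real Token type): the lexer pushes to the back, the parser pops from the front, and no reversal pass is needed.

```rust
use std::collections::VecDeque;

// Sketch: a VecDeque lets the lexer append in source order and the parser
// consume in the same order, removing the reverse() pass entirely.
fn lex_all(input: &str) -> VecDeque<&str> {
    let mut tokens = VecDeque::new();
    for t in input.split_whitespace() {
        tokens.push_back(t); // lexer appends in source order
    }
    tokens
}

fn main() {
    let mut tokens = lex_all("query foo");
    assert_eq!(tokens.pop_front(), Some("query")); // consumed front-to-back
    assert_eq!(tokens.pop_front(), Some("foo"));
    assert_eq!(tokens.pop_front(), None);
}
```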

Then we can remove that Vec entirely in 81f94f4 by transforming the Lexer into an iterator of Token or Error. The interesting thing here is that it's actually slower:

test bench_peek_n ... bench:      23,392 ns/iter (+/- 636)
test bench_supergraph ... bench:     725,191 ns/iter (+/- 27,976)
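A streaming lexer in this style can be sketched like this (a hypothetical, much-simplified tokenizer; the real lexer handles the GraphQL grammar and richer error types). Note that the tokens borrow from the input rather than copying it:

```rust
// Sketch of a streaming lexer: an Iterator that produces one token per call,
// with no intermediate Vec. Token and the String error are stand-in types.
#[derive(Debug, PartialEq)]
struct Token<'a>(&'a str);

struct Lexer<'a> {
    rest: &'a str, // unconsumed input
}

impl<'a> Iterator for Lexer<'a> {
    type Item = Result<Token<'a>, String>;

    fn next(&mut self) -> Option<Self::Item> {
        self.rest = self.rest.trim_start();
        if self.rest.is_empty() {
            return None;
        }
        // Cut the next whitespace-delimited token off the front of the input.
        let end = self.rest.find(char::is_whitespace).unwrap_or(self.rest.len());
        let (tok, rest) = self.rest.split_at(end);
        self.rest = rest;
        Some(Ok(Token(tok)))
    }
}

fn main() {
    let mut lexer = Lexer { rest: "query foo" };
    assert_eq!(lexer.next(), Some(Ok(Token("query"))));
    assert_eq!(lexer.next(), Some(Ok(Token("foo"))));
    assert_eq!(lexer.next(), None);
}
```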

We can see in the global graph that we removed all allocations in the lexer:
[Screenshot: global flamegraph with lexer allocations gone]

But if we zoom in, we see that the peek function is called very often:

[Screenshot: zoomed flamegraph showing frequent peek calls]

The parser calls that function regularly to make decisions while parsing the rest of the input. When tokens were preallocated, looking at the next one was nearly free, but now we end up re-lexing it often.

So in e743184 we store the next token in the Parser when it is first requested, to avoid re-lexing it. The cost is fixed, only enough memory for a single token, and we get better performance:

test bench_peek_n ... bench:       9,144 ns/iter (+/- 152)
test bench_supergraph ... bench:     297,023 ns/iter (+/- 11,814)

The graphs get smaller and we don't call peek as often.
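The one-token lookahead cache can be sketched like this (hypothetical, simplified types; the real Parser works over syntax tokens):

```rust
// Sketch of the one-token lookahead cache: peek() lexes the next token at
// most once and stores it; pop() takes the cached token if present.
// The buffer holds at most one token, so the memory cost is fixed.
struct Parser<I: Iterator<Item = &'static str>> {
    lexer: I,
    current: Option<&'static str>,
}

impl<I: Iterator<Item = &'static str>> Parser<I> {
    fn peek(&mut self) -> Option<&'static str> {
        if self.current.is_none() {
            self.current = self.lexer.next(); // lex only on the first request
        }
        self.current
    }

    fn pop(&mut self) -> Option<&'static str> {
        // Use the cached token if we have one, otherwise lex a fresh one.
        self.current.take().or_else(|| self.lexer.next())
    }
}

fn main() {
    let mut p = Parser { lexer: vec!["query", "foo"].into_iter(), current: None };
    assert_eq!(p.peek(), Some("query"));
    assert_eq!(p.peek(), Some("query")); // repeated peeks don't re-lex
    assert_eq!(p.pop(), Some("query"));
    assert_eq!(p.pop(), Some("foo"));
    assert_eq!(p.pop(), None);
}
```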

[Screenshots: flamegraphs after caching the next token]

To sum up: fewer allocations, a 20 to 30% improvement, and no change to the syntax tree creation code. I feel good about that :)

@Geal Geal requested a review from lrlna November 18, 2021 16:44
@lrlna (Member) commented Nov 18, 2021

Thanks for this @Geal! I will take a look and review early next week.

@Geal Geal force-pushed the streaming-lexer branch 3 times, most recently from 2894c2f to bc1dfc4 Compare November 24, 2021 09:17
@lrlna (Member) left a comment:

Thanks for all the diagnostics!

I really like the idea of a streaming lexer, so I am keen on getting this in.

First I'd like to tighten the implementation a bit, though. Since we are already going for a streaming implementation, we should have the Lexer take a Read and output a Write.

I also think the Lexer is becoming a bit bulky. There is now a LexerIterator, LexerResult and a Cursor. I'd like to see if we can simplify this a bit and make it more readable. Using Read and Write directly is likely going to help with this.

I would also like to see if the parser "navigation" functions, i.e. peek_token, peek_n, pop can be simplified a bit. I think the iteration can be done outside of the individual functions, e.g. parser.parse().

self.errors.as_slice()
}
}

#[derive(Clone, Debug)]
pub struct LexerIterator<'a> {
@lrlna (Member):

Since we are taking the streaming route, I think we should go all the way and have the Lexer take a Read and output a Write, the way a through stream would.

@Geal (Contributor, Author):

Why a Write? The output is not meant to be written to a byte buffer; it produces tokens to be consumed by the parser.
For Read, it is a bit tricky, because the next step in optimizing things would be to produce tokens that reference the input data instead of copying it, so we quickly run into issues like trying to refill the buffer while we still hold some tokens.
Can we assume that the input data will always fit into memory? Are you planning for other input types (like rope structures, which would make sense for editors)?

(Two review comments on crates/apollo-parser/src/parser/mod.rs, since resolved.)
@Geal (Contributor, Author) commented Nov 30, 2021

> I also think the Lexer is becoming a bit bulky. There is now a LexerIterator, LexerResult and a Cursor. I'd like to see if we can simplify this a bit and make it more readable. Using Read and Write directly is likely going to help with this.

I think I can replace LexerResult with a std Result. Also, LexerIterator can be renamed Lexer, and I'll add a collect() method that produces Vecs of tokens and errors.
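The shape of such a collect helper could look roughly like this (a sketch over plain Results; Token and Error are stand-in aliases, and lex_to_vecs is a hypothetical name):

```rust
// Sketch: an iterator of Result<Token, Error> split into separate token and
// error lists, which is what a lex()/collect()-style helper would return.
type Token = String;
type Error = String;

fn lex_to_vecs(items: impl Iterator<Item = Result<Token, Error>>) -> (Vec<Token>, Vec<Error>) {
    let mut tokens = Vec::new();
    let mut errors = Vec::new();
    for item in items {
        match item {
            Ok(t) => tokens.push(t),   // keep the token stream in order
            Err(e) => errors.push(e),  // collect errors separately
        }
    }
    (tokens, errors)
}

fn main() {
    let stream = vec![
        Ok("query".to_string()),
        Err("unexpected character".to_string()),
        Ok("foo".to_string()),
    ];
    let (tokens, errors) = lex_to_vecs(stream.into_iter());
    assert_eq!(tokens, vec!["query", "foo"]);
    assert_eq!(errors, vec!["unexpected character"]);
}
```

This keeps the streaming iterator as the primitive while still offering the old batch-style output for callers that want it.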

@Geal (Contributor, Author) commented Nov 30, 2021

Rebased over main.

@Geal (Contributor, Author) commented Nov 30, 2021

> I would also like to see if the parser "navigation" functions, i.e. peek_token, peek_n, pop can be simplified a bit. I think the iteration can be done outside of the individual functions, e.g. parser.parse().

I'm looking at it; there are ways to simplify them. I'm not sure iteration should be done outside, though, since we'd lose the benefit of streaming: iteration is driven by the needs of the tree builder.

let token = self
.tokens
.pop()
.expect("Could not eat a token from the AST");
@Geal (Contributor, Author):

I kept that possible panic, but is it really OK? There might be cases where it is triggered. I'll test that through fuzzing.

Geoffroy Couprie and others added 14 commits October 27, 2022 14:01
Make the Lexer type public to access it from the benchmarks
Thanks to the peek_* helper methods, this does not change the parser
radically, just the way we go through the tokens
this is currently twice as slow as accumulating all tokens then parsing,
because we regularly call the peek_* methods. They're basically free if
we accumulate in advance the tokens, but in streaming we are
recalculating them all the time
the parser is frequently looking at the next token to make a decision.
When we stored the list of tokens, that operation was essentially free,
but in streaming, we end up reparsing the next token multiple times. By
storing it in advance, we avoid that cost
@goto-bus-stop (Member) commented:
So I made the iterator the main lexer API now and added a .lex() method that returns a (Tokens, Errors) tuple. The public API is just for token in Lexer::new(source) or let (tokens, errors) = Lexer::new(source).lex().

For the Parser I think we can keep the lifetime for now, as the typical use case is to do:

let parser = Parser::new(source);
let tree = parser.parse();

And then you just use the tree, which is an owned structure. So you don't end up ever storing the parser with a lifetime on it.

When we add reparsing we can reconsider that, I think the API (from an ownership view, not the exact parameters) would roughly be this:

let mut parser = Parser::new(source); // ← already does the parse
let ast = parser.ast(); // get the already-parsed AST
parser.reparse(new_source);

Parser::new() can then return a structure that contains the owned syntax tree and doesn't need to reference the source string. So it should be easy to remove the lifetime then.
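The ownership argument can be sketched concretely (purely hypothetical types; the stand-in "parse" just counts whitespace-separated tokens): because the parser owns its syntax tree, no lifetime ties it to the source string, and reparse can simply replace the tree.

```rust
// Hypothetical sketch of a reparse-capable, lifetime-free Parser: it owns
// the syntax tree it produced, so the source string can be dropped after
// construction. SyntaxTree is a trivial stand-in.
struct SyntaxTree {
    node_count: usize,
}

struct Parser {
    ast: SyntaxTree,
}

impl Parser {
    fn new(source: &str) -> Self {
        // Already does the parse; the result is fully owned.
        Parser { ast: SyntaxTree { node_count: source.split_whitespace().count() } }
    }

    fn ast(&self) -> &SyntaxTree {
        &self.ast
    }

    fn reparse(&mut self, new_source: &str) {
        // Replace the owned tree; no borrow of the old source survives.
        self.ast = SyntaxTree { node_count: new_source.split_whitespace().count() };
    }
}

fn main() {
    let mut parser = Parser::new("query foo");
    assert_eq!(parser.ast().node_count, 2);
    parser.reparse("query foo bar");
    assert_eq!(parser.ast().node_count, 3);
}
```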

@goto-bus-stop goto-bus-stop requested a review from lrlna November 1, 2022 12:35
@lrlna (Member) left a comment:

Yay, let's get this in! I am sure @Geal will be very happy :)))

goto-bus-stop and others added 2 commits November 3, 2022 16:23
@Geal (Contributor, Author) commented Nov 3, 2022

awesome!

@goto-bus-stop goto-bus-stop changed the title Streaming lexer feat(parser): Stream tokens from the lexer Nov 4, 2022
@goto-bus-stop goto-bus-stop merged commit 87522b7 into main Nov 4, 2022
@goto-bus-stop goto-bus-stop deleted the streaming-lexer branch November 4, 2022 08:52