
fix: avoid collecting iterator during token resolving #112

Merged 12 commits into main on Oct 23, 2023

Conversation

@nfejzic (Collaborator) commented Oct 8, 2023

What was the problem?

In order to parse inline formats correctly, there is an intermediate step between the inlines lexer and parser. This step resolves the tokens so that they are correctly marked as opening, closing, or plain tokens. The step is isolated because it is very complex; pulling it out keeps that complexity out of the lexer and parser.

Since the TokenResolver needs to decide for each token whether it is an opening, closing, or plain token, it needs to see which tokens come after it. In particular, this is the case for potentially opening tokens. To make this easy, we previously collected the whole iterator coming from the lexer into a Vec. On large inputs this caused a very large allocation, which is very slow.

The fix

The fix is to use the iterator directly, without collecting it into a Vec. Since we still need to look ahead, a data structure makes it possible to look forward a dynamic number of tokens. We can now extend the look-ahead until the current token can be resolved, and only allocate again when necessary.
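The idea of a dynamically growing look-ahead over an iterator can be sketched as follows. This is not the PR's actual resolver code; the type names and the `VecDeque`-based buffering are illustrative assumptions:

```rust
use std::collections::VecDeque;

/// Sketch of a look-ahead wrapper over a token iterator (hypothetical type,
/// not the one from the PR). Tokens are buffered only as far as resolution
/// requires, instead of collecting the whole input into a `Vec` up front.
struct Lookahead<I: Iterator> {
    iter: I,
    buffer: VecDeque<I::Item>,
}

impl<I: Iterator> Lookahead<I> {
    fn new(iter: I) -> Self {
        Self { iter, buffer: VecDeque::new() }
    }

    /// Returns a reference to the `n`-th upcoming item (0 = next item),
    /// pulling from the underlying iterator only when the buffer is too short.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        while self.buffer.len() <= n {
            let item = self.iter.next()?;
            self.buffer.push_back(item);
        }
        self.buffer.get(n)
    }
}

impl<I: Iterator> Iterator for Lookahead<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Drain the look-ahead buffer first, then fall back to the iterator.
        self.buffer.pop_front().or_else(|| self.iter.next())
    }
}
```

A resolver built on such a wrapper can call `peek_nth` with increasing `n` until it finds the token that decides whether the current token opens or closes a format, so memory use is bounded by the longest unresolved span rather than the whole input.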

Fixes #107

Update:

After performing multiple benchmarks, it turned out that the change described above did not really improve performance; it performed more or less the same as the previous implementation. It did, however, remove the allocation of a Vec for all tokens when resolving inline tokens.

However, to improve performance, the following additional changes were made:

  1. Remove the check that inputs are the same in release mode when calling Symbol::flatten. This function should only be used internally, and we must uphold this invariant ourselves. The function may panic if the inputs are not the same (this is now documented).
  2. Improve the algorithm for interrupted tokens: instead of marking ranges of interrupted tokens, tokens are now simply removed from the stack of unresolved tokens, so they are not checked anymore. This significantly reduces the number of iterations in the hot loop (the biggest improvement).
  3. Use the fxhash crate's FxHashMap and FxHashSet in Substitutor; faster hashing improves performance.
  4. Use a global Substitutor to prevent multiple allocations of the same maps/sets.
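Points 3 and 4 can be illustrated with a minimal sketch of a process-wide substitutor. This is not the PR's actual Substitutor; the field, the example substitution, and the use of std's HashMap are assumptions (the PR uses fxhash's FxHashMap, which is a drop-in HashMap with a faster non-cryptographic hasher):

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical substitutor holding substitution maps. The PR's real type
/// uses fxhash::FxHashMap instead of std's HashMap for faster lookups.
struct Substitutor {
    aliases: HashMap<&'static str, &'static str>,
}

impl Substitutor {
    fn new() -> Self {
        // Hypothetical substitution, not taken from the PR.
        let aliases = HashMap::from([("...", "…")]);
        Self { aliases }
    }

    /// Global instance: the maps are allocated exactly once per process
    /// instead of once per parse, matching point 4 above.
    fn global() -> &'static Self {
        static INSTANCE: OnceLock<Substitutor> = OnceLock::new();
        INSTANCE.get_or_init(Substitutor::new)
    }

    /// Returns the substitution for `input`, or `input` itself if none exists.
    fn substitute<'a>(&self, input: &'a str) -> &'a str {
        self.aliases.get(input).copied().unwrap_or(input)
    }
}
```

With `Substitutor::global()`, every call site shares the same maps, so the hashing speed-up from fxhash and the one-time allocation compound.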

Quick Benchmark

Tested with this file:

```shell
$ wc -l amb.um
99998 amb.um

# Every line is the same as the following:
$ head -n 1 amb.um
Some ***text* bold**
```

Comparison between unimarkup-main (on main branch 6ff4562) and unimarkup-optimized:

[image: benchmark comparison]

@nfejzic nfejzic self-assigned this Oct 8, 2023
@nfejzic nfejzic marked this pull request as ready for review October 9, 2023 14:47
@mhatzl (Contributor) left a comment

Very nice improvements.
I only have two remarks, about dependencies and an unnecessary clone.

I skimmed over the resolver changes, because they will most likely become unnecessary with the new open-token approach I will introduce in the spec.
In short: once an open token is encountered, the element is valid until either an end token or the end of the input/element range is reached.
This should make inline parsing significantly simpler, but also improve UX, because one directly sees the impact after a correct open token.

inline/Cargo.toml (review thread resolved)
inline/src/lexer/resolver/mod.rs (review thread resolved)
@mhatzl mhatzl added the waiting-on-author Assignee or reviewer is awaiting response from issue/PR author label Oct 14, 2023
@nfejzic nfejzic added waiting-on-reviewer Author or assignee is awaiting response from reviewer and removed waiting-on-author Assignee or reviewer is awaiting response from issue/PR author labels Oct 21, 2023
@mhatzl mhatzl added waiting-on-author Assignee or reviewer is awaiting response from issue/PR author and removed waiting-on-reviewer Author or assignee is awaiting response from reviewer labels Oct 22, 2023
@mhatzl (Contributor) left a comment

Looks good now.
Should we merge it now, or wait on PR #111?

@nfejzic (Collaborator, Author) commented Oct 22, 2023

> Should we merge it now, or wait on PR #111?

Good question, I'm not really sure. Will there be conflicts? If not, it does not matter.

@mhatzl (Contributor) commented Oct 23, 2023

I think there might be some minor conflicts in the scanner module.
But they should be easy to resolve.

I will merge this PR.

@mhatzl mhatzl merged commit 5c9db5a into main Oct 23, 2023
3 checks passed
@mhatzl mhatzl deleted the iter-instead-vec-inlines branch October 23, 2023 10:08
@mhatzl mhatzl removed the waiting-on-author Assignee or reviewer is awaiting response from issue/PR author label Oct 23, 2023
Successfully merging this pull request may close these issues.

Improve performance of inlines parsing of large content