
fix: avoid collecting iterator during token resolving #112

Merged 12 commits into main on Oct 23, 2023

Conversation

@nfejzic (Collaborator) commented Oct 8, 2023

What was the problem?

In order to parse inline formats correctly, there is an intermediate step between the inlines lexer and parser. This step resolves the tokens so that they are correctly marked as opening, closing, or plain tokens. The step is isolated because it is very complex; pulling it out keeps that complexity out of the lexer and parser.

Since the TokenResolver needs to decide for each token whether it is an opening, closing, or plain token, it needs to see which tokens come after it. In particular, this is the case for potentially opening tokens. To make this easy, we previously collected the whole iterator coming from the lexer into a Vec. On large inputs this caused a very large allocation, which is very slow.

The fix

The fix is to use the iterator directly, without collecting it into a Vec. Since we still need to look ahead, a data structure makes it possible to look forward a dynamic number of tokens. We can now extend the look-ahead until the current token can be resolved, and only allocate again when necessary.
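The idea of a dynamically growing look-ahead over an iterator can be sketched as follows. This is not the PR's actual resolver code; the type names and the `VecDeque`-based buffering are illustrative assumptions:

```rust
use std::collections::VecDeque;

/// Sketch of a look-ahead wrapper over a token iterator (hypothetical type,
/// not the one from the PR). Tokens are buffered only as far as resolution
/// requires, instead of collecting the whole input into a `Vec` up front.
struct Lookahead<I: Iterator> {
    iter: I,
    buffer: VecDeque<I::Item>,
}

impl<I: Iterator> Lookahead<I> {
    fn new(iter: I) -> Self {
        Self { iter, buffer: VecDeque::new() }
    }

    /// Returns a reference to the `n`-th upcoming item (0 = next item),
    /// pulling from the underlying iterator only when the buffer is too short.
    fn peek_nth(&mut self, n: usize) -> Option<&I::Item> {
        while self.buffer.len() <= n {
            let item = self.iter.next()?;
            self.buffer.push_back(item);
        }
        self.buffer.get(n)
    }
}

impl<I: Iterator> Iterator for Lookahead<I> {
    type Item = I::Item;

    fn next(&mut self) -> Option<Self::Item> {
        // Drain the look-ahead buffer first, then fall back to the iterator.
        self.buffer.pop_front().or_else(|| self.iter.next())
    }
}
```

A resolver built on such a wrapper can call `peek_nth` with increasing `n` until it finds the token that decides whether the current token opens or closes a format, so memory use is bounded by the longest unresolved span rather than the whole input.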

Fixes #107

Update:

After performing multiple benchmarks, it turned out that the change described above did not really improve performance; it performed more or less the same as the previous implementation. It did, however, remove the allocation of a Vec for all tokens when resolving inline tokens.

However, to improve performance, the following additional changes were made:

  1. Remove the check that inputs are the same in release mode when calling Symbol::flatten. This function should only be used internally, and we must uphold this invariant ourselves. The function may panic if the inputs are not the same (this is now documented).
  2. Improve the algorithm for interrupted tokens: instead of marking ranges of interrupted tokens, tokens are now simply removed from the stack of unresolved tokens, so they are not checked anymore. This significantly reduces the number of iterations in the hot loop (the biggest improvement).
  3. Use the fxhash crate's FxHashMap and FxHashSet in Substitutor; faster hashing improves performance.
  4. Use a global Substitutor to prevent multiple allocations of the same maps/sets.
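Points 3 and 4 can be illustrated with a minimal sketch of a process-wide substitutor. This is not the PR's actual Substitutor; the field, the example substitution, and the use of std's HashMap are assumptions (the PR uses fxhash's FxHashMap, which is a drop-in HashMap with a faster non-cryptographic hasher):

```rust
use std::collections::HashMap;
use std::sync::OnceLock;

/// Hypothetical substitutor holding substitution maps. The PR's real type
/// uses fxhash::FxHashMap instead of std's HashMap for faster lookups.
struct Substitutor {
    aliases: HashMap<&'static str, &'static str>,
}

impl Substitutor {
    fn new() -> Self {
        // Hypothetical substitution, not taken from the PR.
        let aliases = HashMap::from([("...", "…")]);
        Self { aliases }
    }

    /// Global instance: the maps are allocated exactly once per process
    /// instead of once per parse, matching point 4 above.
    fn global() -> &'static Self {
        static INSTANCE: OnceLock<Substitutor> = OnceLock::new();
        INSTANCE.get_or_init(Substitutor::new)
    }

    /// Returns the substitution for `input`, or `input` itself if none exists.
    fn substitute<'a>(&self, input: &'a str) -> &'a str {
        self.aliases.get(input).copied().unwrap_or(input)
    }
}
```

With `Substitutor::global()`, every call site shares the same maps, so the hashing speed-up from fxhash and the one-time allocation compound.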

Quick Benchmark

Tested with this file:

```shell
$ wc -l amb.um
99998 amb.um

# Every line is the same as the following:
$ head -n 1 amb.um
Some ***text* bold**
```

Comparison between unimarkup-main (on main branch 6ff4562) and unimarkup-optimized:

[image: benchmark comparison]

@nfejzic nfejzic self-assigned this Oct 8, 2023
@nfejzic nfejzic marked this pull request as ready for review October 9, 2023 14:47
@mhatzl (Contributor) left a comment

Very nice improvements.
I only have two remarks, about dependencies and an unnecessary clone.

I skimmed over the resolver changes, because they will most likely become unnecessary with the new open-token approach I will introduce in the spec.
In short: once an open token is encountered, the element is valid until either an end token or the end of the input/element range is reached.
This should make inline parsing significantly simpler, but also improve UX, because one directly sees the impact after a correct open token.

inline/Cargo.toml (review thread resolved)
inline/src/lexer/resolver/mod.rs (review thread resolved)
@mhatzl mhatzl added the waiting-on-author Assignee or reviewer is awaiting response from issue/PR author label Oct 14, 2023
@nfejzic nfejzic added waiting-on-reviewer Author or assignee is awaiting response from reviewer and removed waiting-on-author Assignee or reviewer is awaiting response from issue/PR author labels Oct 21, 2023
@mhatzl mhatzl added waiting-on-author Assignee or reviewer is awaiting response from issue/PR author and removed waiting-on-reviewer Author or assignee is awaiting response from reviewer labels Oct 22, 2023
@mhatzl (Contributor) left a comment

Looks good now.
Should we merge it now, or wait on PR #111?

@nfejzic (Collaborator, Author) commented Oct 22, 2023

> Should we merge it now, or wait on PR #111?

Good question, I'm not really sure. Will there be conflicts? If not, it does not matter.

@mhatzl (Contributor) commented Oct 23, 2023

I think there might be some minor conflicts in the scanner module.
But they should be easy to resolve.

I will merge this PR.

@mhatzl mhatzl merged commit 5c9db5a into main Oct 23, 2023
3 checks passed
@mhatzl mhatzl deleted the iter-instead-vec-inlines branch October 23, 2023 10:08
@mhatzl mhatzl removed the waiting-on-author Assignee or reviewer is awaiting response from issue/PR author label Oct 23, 2023
Successfully merging this pull request may close these issues.

Improve performance of inlines parsing of large content