[Merged by Bors] - Lexer string interning #1758

Razican · 2021-12-22T14:51:56Z

This Pull Request is part of #279.

It adds a string interner to Boa, which allows many types to hold a `NonZeroUsize` instead of a heap-allocated string. This can move types to the stack (hopefully I'll be able to move `Token`, for example, and maybe some `Node` types too).

Note that the interner is for now only available in the lexer. Next steps (in this PR or future ones) would include also using interning in the parser, and finally in execution. The idea is that strings should be represented with a `Sym` until they are displayed.
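To make the idea concrete, here is a minimal sketch of how an interner can hand out `NonZeroUsize`-backed symbols. It is a toy written only with the standard library; `ToyInterner` and its methods are hypothetical names for illustration and are not the types this PR adds.

```rust
use std::collections::HashMap;
use std::num::NonZeroUsize;

/// Hypothetical symbol type: a cheap, `Copy` handle to an interned string.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub struct Sym(NonZeroUsize);

/// Toy interner (illustration only, not Boa's implementation).
#[derive(Debug, Default)]
pub struct ToyInterner {
    map: HashMap<String, Sym>,
    strings: Vec<String>,
}

impl ToyInterner {
    /// Returns the existing symbol for `s`, or stores `s` and returns a new one.
    pub fn get_or_intern(&mut self, s: &str) -> Sym {
        if let Some(&sym) = self.map.get(s) {
            return sym;
        }
        // Symbols start at 1, so the stored value is never zero.
        let index = NonZeroUsize::new(self.strings.len() + 1).expect("len + 1 is non-zero");
        let sym = Sym(index);
        self.strings.push(s.to_owned());
        self.map.insert(s.to_owned(), sym);
        sym
    }

    /// Looks the string up again, typically only when it has to be displayed.
    pub fn resolve(&self, sym: Sym) -> Option<&str> {
        self.strings.get(sym.0.get() - 1).map(String::as_str)
    }
}
```

Interning the same string twice yields the same `Sym`, so tokens can compare identifiers by comparing integers and only call `resolve` when text is needed.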

Speaking of display: I have changed the `ParseError` type so that it no longer contains anything that could hold a `Sym` (basically tokens). This might be a bit faster, but what matters is that we don't depend on the interner when displaying errors.

The issue I have now is how to display tokens. This requires the interner if we want to print identifiers, for example. The issue here is that Rust doesn't allow using a `fmt::Formatter` (only in nightly), which is making my head hurt. Maybe one of you can find a better way of doing this.
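One common way around this on stable Rust is a small borrowing adapter that carries the lookup table next to the token and implements `fmt::Display` itself. The sketch below only illustrates that pattern; `Names`, `Token` and `DisplayToken` are hypothetical stand-ins, and this is not necessarily the approach taken in the PR.

```rust
use std::fmt;

/// Stand-in for an interner: symbol `n` resolves to `names[n]`.
pub struct Names(pub Vec<String>);

/// Hypothetical token: identifiers carry only an index into `Names`.
pub enum Token {
    Identifier(usize),
    Punctuator(&'static str),
}

/// Borrowing adapter that pairs a token with the data needed to print it.
pub struct DisplayToken<'a> {
    token: &'a Token,
    names: &'a Names,
}

impl Token {
    /// Returns a value that implements `Display` by consulting `names`.
    pub fn display<'a>(&'a self, names: &'a Names) -> DisplayToken<'a> {
        DisplayToken { token: self, names }
    }
}

impl fmt::Display for DisplayToken<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self.token {
            Token::Identifier(idx) => match self.names.0.get(*idx) {
                Some(name) => f.write_str(name),
                None => f.write_str("<unknown identifier>"),
            },
            Token::Punctuator(p) => f.write_str(p),
        }
    }
}
```

With this shape, `format!("{}", token.display(&names))` works without the token type itself having to implement `Display`.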

Then, about `cursor.expect()`: this is the only place where we don't have the expected token type as a static string, so it fails to compile. We have the option of changing the type definition of `ParseError` to contain an owned string, but maybe we can avoid this by having a `&'static str` come from a `TokenKind` with the default values, such as "identifier" for an identifier. I'd like you to think about it; maybe we can just add that and avoid allocations there.
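As a rough illustration of the `&'static str` option, the sketch below maps each kind of token to a fixed description, so an error can name what was expected without allocating. The enum is heavily reduced and `expected_name` and `ExpectedError` are hypothetical names, not Boa's actual API.

```rust
/// Hypothetical, heavily reduced token kind.
pub enum TokenKind {
    Identifier(usize),
    NumericLiteral(f64),
    Punctuator(&'static str),
    EndOfFile,
}

impl TokenKind {
    /// Generic, allocation-free description of this kind of token.
    pub fn expected_name(&self) -> &'static str {
        match self {
            Self::Identifier(_) => "identifier",
            Self::NumericLiteral(_) => "numeric literal",
            // Punctuators already carry a static spelling.
            Self::Punctuator(p) => *p,
            Self::EndOfFile => "end of file",
        }
    }
}

/// Hypothetical error type that stores no owned strings and no symbols.
#[derive(Debug, PartialEq)]
pub struct ExpectedError {
    pub expected: &'static str,
    pub found: &'static str,
}
```

The trade-off is that the message says "identifier" rather than the identifier's actual name, which is what keeps the error independent of the interner.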

Oh, and this depends on the VM-only branch, so that has to be merged before :)

Another thing to check: should the interner be in its own module?

codecov · 2021-12-24T11:42:57Z

Codecov Report

Merging #1758 (0d38f91) into main (76a27ce) will decrease coverage by 1.29%.
The diff coverage is 59.01%.

@@            Coverage Diff             @@
##             main    #1758      +/-   ##
==========================================
- Coverage   57.02%   55.72%   -1.30%     
==========================================
  Files         199      201       +2     
  Lines       16842    17336     +494     
==========================================
+ Hits         9604     9661      +57     
- Misses       7238     7675     +437

| Impacted Files | Coverage Δ | |
|---|---|---|
| boa/src/builtins/function/mod.rs | `33.57% <0.00%> (-2.65%)` | ⬇️ |
| boa/src/builtins/typed_array/mod.rs | `3.79% <0.00%> (-0.01%)` | ⬇️ |
| boa/src/object/mod.rs | `28.76% <ø> (-0.53%)` | ⬇️ |
| boa_cli/src/main.rs | `6.25% <0.00%> (+0.36%)` | ⬆️ |
| boa_tester/src/exec/js262.rs | `0.00% <ø> (ø)` | |
| boa_unicode/src/lib.rs | `69.69% <ø> (ø)` | |
| boa_wasm/src/lib.rs | `0.00% <0.00%> (ø)` | |
| boa/src/syntax/parser/statement/mod.rs | `39.87% <33.03%> (-2.01%)` | ⬇️ |
| boa/src/syntax/parser/function/mod.rs | `45.28% <40.00%> (-4.37%)` | ⬇️ |
| ...arser/expression/primary/object_initializer/mod.rs | `44.48% <41.75%> (-4.97%)` | ⬇️ |
| ... and 116 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 76a27ce...0d38f91.

github-actions · 2021-12-25T19:26:49Z

Test262 conformance changes

VM implementation

| Test result | main count | PR count | difference |
|---|---|---|---|
| Total | 87,200 | 87,200 | 0 |
| Passed | 40,828 | 40,828 | 0 |
| Ignored | 19,493 | 19,493 | 0 |
| Failed | 26,879 | 26,879 | 0 |
| Panics | 0 | 0 | 0 |
| Conformance | 46.82% | 46.82% | 0.00% |

Razican · 2021-12-25T20:10:51Z

I will try to solve #503 with this.

Razican · 2021-12-26T10:36:40Z

I tried to start implementing this for the parser, but the changes are huge. I created a project and will create new issues to implement the interner for the parser, the compiler and the executor, but I think this is ready for review and merge.

I did some benchmarks to see which backend was faster, and for now, I selected the fastest one in my machine. This might change once we implement the interner in more places of the engine.

RageKnify

Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.

raskad · 2021-12-28T01:51:19Z

> Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.

@Razican correct me if I'm wrong, but I think that would not work, because the interner must be specific to one parsed block of code, so it can be deallocated when the code block is being dropped right?

Razican · 2021-12-28T07:59:40Z

> > Having the interner in a global hidden by an API might make it easier to develop boa (since we wouldn't need to pass its reference around everywhere), but I don't feel too strongly either way.
>
> @Razican correct me if I'm wrong, but I think that would not work, because the interner must be specific to one parsed block of code, so it can be deallocated when the code block is being dropped right?

If you don't de-allocate, you could use the same interner multiple times, in theory, but if the application is running for long, you might have a huge memory usage.

RageKnify · 2021-12-28T10:37:09Z

Yeah, I guess depending on the use case you would want to drop the Interner to avoid a never shrinking heap structure.
If boa were embedded in something like node then I don't think this would matter, right? Since once everything is parsed we shouldn't intern that many more strings.
Even in the browser, the only problem is a browser tab which keeps evaluating JS with new identifiers, right?

As I said, I can accept this choice; I'm just not sure it makes much of a difference. If the Interner is part of the Context, then the CLI and future browsers (in the context of a given browser tab) would only have one at a time anyway.

Maybe for game engines or other embedding use cases it would be relevant to not use just 1 (Interner) since it would make sense to create new Contexts and drop old ones when the new ones are created. In that case I can see the possibility of the heap growing because the user keeps changing the name of their variables/functions with "updates" to their script.

(I feel this last point is good enough reason to keep it in the Context)

Razican · 2021-12-28T15:17:48Z

I'm also thinking of servers using it as their scripting language of choice. If they can run multiple scripts, on different days or so, I think this approach could be better.
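The memory argument in this thread can be sketched as follows: if each context owns its interner, dropping the context releases every string it interned, whereas a single global interner would keep growing. This is a toy model under that assumption; `ScriptContext` and `run_isolated` are hypothetical names and do not reflect Boa's `Context` API.

```rust
use std::collections::HashSet;

/// Toy interner that only ever grows while it is alive.
#[derive(Debug, Default)]
pub struct ToyInterner {
    strings: HashSet<String>,
}

impl ToyInterner {
    pub fn intern(&mut self, s: &str) {
        if !self.strings.contains(s) {
            self.strings.insert(s.to_owned());
        }
    }

    pub fn len(&self) -> usize {
        self.strings.len()
    }
}

/// Hypothetical stand-in for an engine context that owns its interner.
#[derive(Debug, Default)]
pub struct ScriptContext {
    interner: ToyInterner,
}

impl ScriptContext {
    /// Pretends to run a script by interning each whitespace-separated word.
    pub fn eval(&mut self, source: &str) {
        for word in source.split_whitespace() {
            self.interner.intern(word);
        }
    }

    pub fn interned_count(&self) -> usize {
        self.interner.len()
    }
}

/// Runs each script in a fresh context and reports how many strings it interned.
pub fn run_isolated(scripts: &[&str]) -> Vec<usize> {
    scripts
        .iter()
        .map(|src| {
            let mut ctx = ScriptContext::default();
            ctx.eval(src);
            // `ctx` and all of its interned strings are dropped when this closure returns.
            ctx.interned_count()
        })
        .collect()
}
```

A long-running server that calls something like `run_isolated` never holds more interned strings than its largest single script needs, while a shared context accumulates the union of all of them.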

jasonwilliams

LGTM

I agree with RageKnify's concern around this being passed everywhere, but I don't see any other option from what has been discussed.

jasonwilliams · 2022-01-21T22:09:16Z

boa_interner/src/lib.rs

+/// The string interner for Boa.
+///
+/// This is a type alias that makes it easier to reference it in the code.
+pub type Interner = StringInterner<BucketBackend<Sym>>;


How come you chose `BucketBackend` over `StringBackend`? I'm guessing because of the use of statics.

raskad · 2022-01-22

Statics seem to be the reason. It probably makes sense that we try out the `StringBackend` when we use the interner in the parser.

Razican · 2022-01-22

The reason was that I tried all backends locally and this was the one giving the best results, but I have no personal preference xD. In the future, once we use the interner everywhere, we can benchmark again.

raskad · 2022-01-22T01:27:08Z

bors r+

bors · 2022-01-22T01:30:17Z

Build failed:

Clippy

Razican · 2022-01-22T08:06:07Z

> Build failed:
>
> Clippy

I will rebase this; it seems to fail due to Rust 1.58 lints.

Razican · 2022-01-22T08:10:15Z

> LGTM
>
> I agree with Rageknify's concern around this being passed everywhere, but I don't see any other option from what has been discussed

One option here would be to have the interner in the parser, as a reference. Might make sense actually, and I might do so in the other PR, if you're OK with it.

Razican · 2022-01-22T09:55:26Z

I had to revert the upgrade to wasm-bindgen 0.2.79 due to rustwasm/wasm-bindgen#2774.

boa_wasm/Cargo.toml

RageKnify · 2022-01-22T18:03:15Z

bors r+

bors · 2022-01-22T18:19:12Z

Pull request successfully merged into main.

Build succeeded:

Razican · 2022-01-22T21:46:53Z

> > LGTM
> >
> > I agree with Rageknify's concern around this being passed everywhere, but I don't see any other option from what has been discussed
>
> One option here would be to have the interner in the parser, as a reference. Might make sense actually, and I might do so in the other PR, if you're OK with it.

Actually, now that I check it, the only way to do this would be to change all the methods to receive a &mut Parser, which would include a &mut Interner, but it might not be easy. I guess this should be done in a different PR.
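For reference, the shape being discussed might look roughly like this: the parser borrows the interner mutably for its whole lifetime, and parsing methods take `&mut self` so they can reach it. Everything here is a simplified, hypothetical sketch (symbols are plain `usize` indices and `parse_identifiers` just splits on whitespace), not the design of the follow-up PR.

```rust
/// Toy interner whose symbols are plain indices.
#[derive(Debug, Default)]
pub struct ToyInterner {
    strings: Vec<String>,
}

impl ToyInterner {
    pub fn get_or_intern(&mut self, s: &str) -> usize {
        if let Some(idx) = self.strings.iter().position(|x| x == s) {
            return idx;
        }
        self.strings.push(s.to_owned());
        self.strings.len() - 1
    }

    pub fn len(&self) -> usize {
        self.strings.len()
    }
}

/// Hypothetical parser that holds the interner by mutable reference.
pub struct Parser<'i> {
    interner: &'i mut ToyInterner,
}

impl<'i> Parser<'i> {
    pub fn new(interner: &'i mut ToyInterner) -> Self {
        Self { interner }
    }

    /// Every parsing method takes `&mut self`, which gives it access to the interner.
    pub fn parse_identifiers(&mut self, source: &str) -> Vec<usize> {
        source
            .split_whitespace()
            .map(|word| self.parse_identifier(word))
            .collect()
    }

    fn parse_identifier(&mut self, word: &str) -> usize {
        self.interner.get_or_intern(word)
    }
}
```

Once the parser is dropped, the mutable borrow ends and the caller can use the interner again, for example to resolve symbols for display.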

This builds on top of #1758 to try to bring #1763 to life. Something that should probably be done here would be to convert `JsString` to a `Sym` internally. Then, further optimizations could be done adding common strings to a custom interner type (those that we know statically). This is definitely work in progress, but I would like to have feedback on the API, and feel free to contribute. Co-authored-by: raskad

Razican requested review from raskad, jedel1043, HalidOdat, 0x7D2B, jasonwilliams, Lan2u, RageKnify and tofpie December 22, 2021 14:52

Razican force-pushed the feature/interner branch from 66aa45c to 50f0325 Compare December 24, 2021 10:07

Razican changed the base branch from main to justVM December 24, 2021 11:49

bors bot changed the base branch from justVM to main December 25, 2021 17:56

Razican force-pushed the feature/interner branch from 18a2d91 to a90a690 Compare December 25, 2021 19:18

Razican force-pushed the feature/interner branch from dee5b3e to ac882f1 Compare December 25, 2021 20:51

Razican changed the title ~~String interning~~ Lexer string interning Dec 26, 2021

Razican removed the execution Issues or PRs related to code execution label Dec 26, 2021

Razican added this to the v0.14.0 milestone Dec 26, 2021

Razican mentioned this pull request Dec 27, 2021

[Merged by Bors] - Interner support in the parser #1765

Closed

RageKnify approved these changes Dec 28, 2021

Razican force-pushed the feature/interner branch from 60034b8 to 6d63566 Compare January 2, 2022 13:13

jasonwilliams approved these changes Jan 21, 2022

raskad approved these changes Jan 22, 2022

Razican added 3 commits January 22, 2022 10:22

First version of the interner

6ba69ec

Dependency update

96c9fb6

Removed some clippy warnings in nightly

7057a08

Razican force-pushed the feature/interner branch from 6d63566 to 7057a08 Compare January 22, 2022 09:33

Reverting to wasm-bindgen 0.2.78 due to rustwasm/wasm-bindgen#2774

0d38f91

jasonwilliams reviewed Jan 22, 2022

bors bot changed the title ~~Lexer string interning~~ [Merged by Bors] - Lexer string interning Jan 22, 2022

bors bot closed this Jan 22, 2022

bors bot deleted the feature/interner branch January 22, 2022 18:19
