From microsoft/vscode#94

I think this also impacts Atom on Mac/Linux. Possible repro steps for atom / first-mate:

I believe the root cause of the slowness is the charOffset <-> byteOffset conversion:
This makes any string containing multi-byte characters match in O(N^2), since at every advancing step the code loops back to recompute the conversion.
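To illustrate the cost (a minimal hypothetical TypeScript sketch, not node-oniguruma's actual code):

```typescript
// UTF-8 byte length contributed by one UTF-16 code unit. A surrogate pair
// is counted entirely at the high surrogate (4 bytes), so the low
// surrogate contributes 0.
function utf8LengthOfUnit(code: number): number {
  if (code < 0x80) return 1;
  if (code < 0x800) return 2;
  if (code >= 0xd800 && code < 0xdc00) return 4; // high surrogate
  if (code >= 0xdc00 && code < 0xe000) return 0; // low surrogate
  return 3;
}

// Naive conversion: O(N) per call. Doing this at every advancing step of a
// scan over a string of length N makes the whole match O(N^2).
function charOffsetToByteOffset(str: string, charOffset: number): number {
  let byteOffset = 0;
  for (let i = 0; i < charOffset; i++) {
    byteOffset += utf8LengthOfUnit(str.charCodeAt(i));
  }
  return byteOffset;
}

// Memoized conversion: build the whole table once in O(N); every later
// lookup is O(1). Computing this table up-front is what the OnigString
// proposal below amounts to.
function buildByteOffsetTable(str: string): Uint32Array {
  const table = new Uint32Array(str.length + 1);
  let byteOffset = 0;
  for (let i = 0; i < str.length; i++) {
    table[i] = byteOffset;
    byteOffset += utf8LengthOfUnit(str.charCodeAt(i));
  }
  table[str.length] = byteOffset;
  return table;
}
```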
In my opinion, the only way to make this conversion fast is to compute it once up-front and memoize it. However, the current caching mechanism is suboptimal, since each string is cached per OnigScanner: if a line needs 20 OnigScanners to tokenize, the conversion still happens 20 times. The current caching is also "leakish", as each OnigScanner keeps a reference to the last matched string. Moreover, it forces users of node-oniguruma to reuse a given OnigScanner instance as much as possible in order to benefit from caching.
I think that the users of node-oniguruma know best when and how they want to match a certain string.
I would like, when I get some free time, to work on a PR that does the following (a sketch of the resulting API shape follows the list):
- introduces & exposes to JS a new type called OnigString
- OnigString computes the conversion up-front in its constructor
- OnigString provides cache slots for all OnigRegExps
- OnigString has its lifecycle controlled by v8, so when JavaScript dereferences it, all cache slots and the cached conversion are GCed
- makes OnigRegExp::Search cached by accepting only an OnigString; each individual regular expression then uses the cache (no more need for the trick of an OnigScanner with precisely one regular expression)
- makes OnigScanner::FindNextMatchSync accept either a v8 String or an OnigString; if called with a v8 String, it immediately constructs an OnigString, i.e. does no caching across calls
- removes all other caching done through OnigStringContext
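To make the proposal concrete, here is a rough TypeScript sketch of the API shape it implies; all names and signatures below are tentative illustrations, not the library's current surface:

```typescript
// Tentative API sketch; these declarations describe the proposal, not an
// existing API.
interface OnigCaptureIndex {
  start: number;
  end: number;
  length: number;
}

interface OnigMatch {
  index: number; // which pattern won
  captureIndices: OnigCaptureIndex[];
}

declare class OnigString {
  // The charOffset <-> byteOffset conversion is computed once, here.
  constructor(content: string);
  readonly content: string;
}

declare class OnigRegExp {
  constructor(source: string);
  // Accepts only an OnigString, so every search hits its cache slots.
  searchSync(str: OnigString, startPosition?: number): OnigCaptureIndex[] | null;
}

declare class OnigScanner {
  constructor(patterns: string[]);
  // Accepts either type; a plain string is wrapped in a throw-away
  // OnigString, i.e. no caching across calls.
  findNextMatchSync(str: string | OnigString, startPosition?: number): OnigMatch | null;
}
```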
This would change the semantics & performance characteristics of node-oniguruma, requiring a new major version. After this change, JavaScript users will be able to use the library with or without caching:
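Something along these lines (hypothetical usage, assuming the API sketched above):

```typescript
const scanner = new OnigScanner(['\\w+', '"']);
const other = new OnigScanner(['\\d+']);

// Without caching: pass a plain string; the conversion is recomputed on
// every call.
scanner.findNextMatchSync('först line', 0);

// With caching: wrap the line once; every scanner or regexp that matches
// against it reuses the up-front conversion and its cache slots, and
// everything is GCed once `line` becomes unreachable.
const line = new OnigString('först line');
scanner.findNextMatchSync(line, 0);
other.findNextMatchSync(line, 0);
```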
I wanted to check with you @zcbenz @kevinsawicki if this is a change that takes node-oniguruma in a good direction and that you would agree with such a change ... before I invest time in it.
Thanks!