-
-
Notifications
You must be signed in to change notification settings - Fork 225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Incorrect implementation of unicode case folding #168
Comments
The proposal sounds great to me. Can you measure any signficant performance difference? |
Yes. It is either 30 times faster or 3 times slower depending on how you measure. Here's ASCII range: function test(fn, name) {
let str = ''
for (let i=0; i<=0x7F; i++) {
str += String.fromCodePoint(i)
}
let time = Date.now()
for (let i=0; i<200000; i++) {
fn(str)
}
console.log(name, Date.now() - time)
}
test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')
// Result:
// fold-case 715
// swapcase 1340
// built-in 45 Here's full unicode range: function test(fn, name) {
let str = ''
for (let i=0; i<0x10FFFF; i++) {
str += String.fromCodePoint(i)
}
let time = Date.now()
for (let i=0; i<20; i++) {
fn(str)
}
console.log(name, Date.now() - time)
}
test(require('fold-case'), 'fold-case')
test(require('swapcase'), 'swapcase')
test(s => s.toLowerCase().toUpperCase(), 'built-in')
// Result:
// fold-case 503
// swapcase 490
// built-in 1539 I assume most identifiers will be ascii, so first test applies. |
Looks good. Do you want to do a PR for this change? |
taku0
added a commit
to taku0/cmark-el
that referenced
this issue
Jan 11, 2020
We now use the built in str.toLowerCase().toUpperCase(), which @rlidwka has shown does an accurate unicode case fold. This allows us to remove a huge lookup table and should both decrease the size of the library and speed things up. Closes commonmark/commonmark.js#168. commonmark/commonmark.js@b8fc0c7 Author: John MacFarlane <[email protected]> Date: Fri Jan 10 09:19:35 2020 -0800
6 tasks
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Abstract: this code isn't up to unicode spec, partially incorrect (and can probably be replaced by a simple combination of lowercase/uppercase in modern js engines).
So I was looking at this test the other day: commonmark/commonmark-spec#582
And I was trying to figure out what's the difference between "naive" lowercasing and actual Unicode case folding as implemented in commonmark.
So I checked it with the spec published here:
http://unicode.org/Public/UCD/latest/ucd/CaseFolding.txt
And surely enough, I found a difference:
They are corresponding lower and upper case letter, even javascript knows that:
But commonmark.js doesn't treat them as the same character:
Funnily enough, original implementation (https://github.com/mathiasbynens/swapcase) supports this:
So it looks like what we're having here is not simply outdated, but was never correct to begin with.
It is my estimation that around 330 other characters in current implementation are processed incorrectly, as opposed to 170 in original
swapcase
, which are most likely due to changes in unicode between now and 5 years ago when it was published.I have a better solution. Use simple and naive
str.toLowerCase()
. It only gets 125 codepoints wrong, which is better.Or use
str.toUpperCase()
. It is actually almost perfect, as it only gets 6 characters incorrectly (list of exceptions: İ, ϴ, ẞ, Ω, K, Å - those are already uppercased, but have differently uppercased versions).If we combine those approaches,
str.toLowerCase().toUpperCase()
is observed to normalize ALL characters in the same way as case fold spec (as tested on node 12, earlier v8 versions have had some issues).So... why not use that, and save on 30kb of javascript code and its future maintenance time?
The text was updated successfully, but these errors were encountered: