-
Notifications
You must be signed in to change notification settings - Fork 157
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Switch internal representation to UTF8 #365
Conversation
I guess this also addresses #272 in one swoop |
What’s slower? |
I guess it'd be good to see what |
EDIT: Oops, I totally botched processing the CSV as Bodigrim points out below I'll see if I can build https://github.com/hasura/graphql-engine with this branch and run it through our benchmarking infra this week. EDIT: going back to the UTF-8 proposal it looks like for better or worse the only concrete acceptance criteria vis a vis performance are:
so above is more than acceptable according to that spec |
EDIT: Comment is no longer relevant to the discussion; see Bodigrim's response below. |
Thanks for the numbers @jberryman |
@jberryman I'm afraid your table does not make sense:
First two rows are corrupted, so let's start from the third one. Here Ratio does not match time measurements: 4610553675 / 23929606012 = 0.193, which means that UTF8 branch is 5x faster, not 4.361x slower. Everything else is misrepresenting original data in the similar way. Instead of sorting rows by Ratio, you sorted values in Ratio column on their own and reversed the table, apparently obtaining nonsensical results. |
Doh! Sorry Click for regressions...
Removing the "japanese" and "russian" benchmarks doesn't change the picture significantly. Like @chrisdone I'd be interested in an assessment of the regressions from @Bodigrim or someone who knows the benchmarks well. |
As I have written in the starting post, I'm working on a detailed analysis of performance. |
Performance reportThis report compares performance of The original Tom Harper's thesis, "Fusion on Haskell Unicode Strings" discusses performance differences between UTF-8, UTF-16 and UTF-32 encodings. Basically, UTF-32 offers the best performance for string operations in synthetic benchmarks, simple because it is a fixed-length encoding and parsing UTF-32-encoded buffer is no-op. Further, UTF-16 is worse, because characters can take 16 or 32 bits. However, parsing/printing codepoints is still very simple, there are only two branches, and since the vast majority of codepoints are 16-bits-long, CPU branch prediction works wonderfully. However, UTF-8 performance poses certain challenges. Code points can be represented as 1, 2, 3 or 4 bytes, their parsing and printing involves multiple bitwise operations, and CPU branch prediction becomes ineffective, because non-ASCII texts constantly switch between branches. Memory savings from UTF-8 hardly affect synthetic benchmarks, because they usually fit into CPU cache. Synthetic benchmarks also do not account for encoding/decoding data from external sources (usually in UTF-8) and measure pure processing time. That's why The key to a better UTF-8 performance is to avoid parsing/printing of codepoints as much as possible. For example, the most common operations on text are cutting and concatenation - and neither of these requires actual parsing of individual characters, they can be executed over opaque memory buffers. Given that external sources are most likely UTF-8 encoded, an application can spend its entire lifetime without ever interpreting a text character-by-character - thus achieving a very decent performance. Operations, which necessitate a character-by-character interpretation, are likely to regress with UTF-8. E. g., The original proposal stated three performance goals:
While the work on UTF-8 transition was underway, Next, here are results for git checkout master
cabal run text-benchmarks -- -t100 -p encode --csv text-master-encode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p encode --baseline text-master-encode.csv
As expected, we receive astonishing speed up for non-ASCII data. Results for English texts are a bit less impressive, but bear in mind that Moving to git checkout master
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --csv text-master-decode.csv
git checkout utf8
cabal run text-benchmarks -- -t100 -p '$2=="Pure" && $4=="decode"' --baseline text-master-decode.csv
Again, results for non-ASCII inputs are pretty much fantastic, while English texts are less impressive (but still 2x faster), for the similar reasons as above. With regards to the third stated goal, as mentioned earlier, the geometric mean over all benchmarks is 0.33. Obviously, this number does not quite characterise them in full and indeed it's a mixed bag. I'll cover most notable regressions (and most notable improvements!) later this week. |
Great report, thanks @Bodigrim. I am highly enthusiastic about this change. 👏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did a cursory look; limited time right now.
copyI semantics changed renaming for 16 -> 8 use iter* functions from Text directly
I'm really pleased to see this! I have had a look at each of the commits in order to get an overall impression of the PR. I don't feel that I have the expertise in this area to give a review though. |
Performance report: regressions(Bear in mind that I recently fixed more performance issues, so earlier reports are no longer fully relevant) I was composing the report below piecewise, so numbers do not belong to the same commit (because this is a moving target and benchmarks take hours) or the same machine. However, I believe it reliably characterises key classes of performance issues. I'm happy to delve deeper and answer questions about specific instances or use cases, if desirable.
|
Just running to confirm the benchmarks on my machine, I get the following with GHC 8.10.4: On
On
Some of the benches in roughly the same ballpark, but others are drastically faster. Notably, some of the lazy text on my machine are much faster. This is great! I'll have a substantive review shortly. |
Perhaps I've missed this, but how is the performance difference on non-x86-64 platforms? |
@L-as underlying |
Speaking of not-so-synthetic benchmarks, here is
|
…n downstream packages
|
I addressed all the feedback above and pushed changes. Unless there are critical bugs, I'd appreciate if we wrap up and merge the branch by the end of the week, so that further work is unblocked. This is a final call for reviews and approvals. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some questions about the doc updates.
@Bodigrim, is there a way to use the new text package in projects for testing, ahead of when it's included in GHC? or does this question not make sense? |
@ketzacoatl Are you looking for something like this: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work @Bodigrim - I read through what I could and it looks good to me.
@Bodigrim, more like your confirmation that you'd expect that type of reference to work. And I'll take your answer as a yes. Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved. Amazing job @Bodigrim 🎉
I think after 150 comments and 200 likes we are in a good position to merge. Thanks everyone for active participation, feel free to provide more feedback here or in separate issues. |
Congratulations @Bodigrim, fantastic work, well done for getting it across the line. |
This is old news, but the PR has been released as a part of |
Results
Benchmark results for GHC 8.10 can be found at https://gist.github.com/Bodigrim/365e388e080b17de45e80ab50a55fb4f. I'll publish a detailed analysis later, but here are the most notable improvements:
decodeUtf8
is up to 10x faster for non-ASCII texts.encodeUtf8
is ~1.5-4x faster for strict and up to 10x faster for lazyText
.take
/drop
/length
are up to 20x faster.toUpper
/toLower
are 10-30% faster.Eq
instance of strictText
is 10x faster.Ord
instance is typically 30%+ faster.isInfixOf
and search routines are up to 10x faster.replicate
ofChar
is up to 20x faster.Geometric mean of benchmark times is 0.33.
How to review
I'd like to encourage as many reviewers as possible, and the branch is structured to facilitate it. Each commit builds and passes tests, so they can be reviewed individually. Actual switch from UTF16 to UTF8 happens in 1369cd3 and remaining TODOs resolved in 99f3c48. Everything else is performance improvements. There are two commits of intimidating size: 2cb3b30, which dumps autogenerated case mappings, and c3ccdb3, which bundles amalgamated
simdutf8
. I tried to keep other commits to a reasonable size and scope, hopefully it is palatable.I'm happy to answer questions and add comments to any code, which is not clear enough. Do not hesitate to ask for this.
Known issues
This branch uses
simdutf8
library, which is massive and written in C++. Since this is a bit inconvenient for a boot package, @kozross is currently finalizing a pure C replacement, which will be submitted for review in a separate PR.