benchmarking/profiling #12

stevengj · 2014-07-18T15:42:08Z

It would be good to perform some benchmarking of utf8proc against ICU, and in general to perform some profiling to see if there are any easy targets for optimization.

jiahao · 2014-07-18T16:01:59Z

How about http://www.mediawiki.org/wiki/PHP_5.2_benchmarks#Unicode_normalization_benchmark

stevengj · 2014-07-18T16:11:31Z

Sounds like a decent benchmark. (Does PHP use ICU?)

stevengj · 2014-07-18T23:32:44Z

See also these Ruby benchmarks. The fastest code in that benchmark (although it didn't seem to look at utf8proc or ICU) was unf, which seems to be a wrapper around this unf_ext C++ code.

stevengj · 2014-07-18T23:41:05Z

The eprun library claims to be fast, albeit in pure Ruby.

A hopeful sign is that eprun was designed by a Unicode expert and includes benchmark data extracted from random Wikipedia pages.

stevengj · 2014-07-19T02:28:50Z

I just pushed a little benchmark program based on the eprun data files. I also pushed a corresponding benchmark of ICU (although it gives a slightly unfair advantage to ICU by preallocating a huge buffer for the output, whereas utf8proc figures out the output size dynamically).

Results on my machine:

$ cat bench.out 
Deutsch_.txt: 0.00390216
Japanese_.txt: 0.00458185
Korean_.txt: 0.00171503
Vietnamese_.txt: 0.0036274

$ cat icu.out 
Deutsch_.txt: 0.00011213
Japanese_.txt: 0.00026891
Korean_.txt: 9.649e-05
Vietnamese_.txt: 0.00151726

So, unless I am messing something up, ICU is significantly faster, albeit with a far more painful API.

Just to make sure that ICU is not "cheating" by doing some kind of caching that helps it for repeated normalization of the same string (the above benchmark loops 100x), I tried normalizing a single long string formed by concatenating the above files a few dozen times, and got 0.679s for utf8proc and 0.104s for ICU.

(Compiling with -O3 instead of -O2 makes only a slight (< 5%) difference.)

Would be nice to also benchmark against GNU libunistring and perhaps unf_ext from above.

stevengj · 2014-07-19T02:38:19Z

Note that eprun, which basically makes clever use of regular expressions to get decent performance in Perl or Ruby, is (unsurprisingly) significantly slower than utf8proc on my machine (though not terrible). Its times in seconds for NFKC from Perl eprun are:

Deutsch_.txt: 0.0102
Japanese_.txt: 0.0127
Korean_.txt: 0.0062
Vietnamese_.txt: 0.0089

I couldn't get the Ruby eprun working on my machine, but I don't really understand Ruby.

stevengj · 2014-07-19T12:18:04Z

In order to figure out the correct buffer size, utf8proc_map has to perform the canonical decomposition twice, so we have a factor of 2 penalty from that compared to the (somewhat artificial) way I am calling ICU. But this does not correspond to a factor of 2 overall, because decomposition is only part of the process. If I hack out this doubled decomposition, the benchmark numbers improve slightly to:

Deutsch_.txt: 0.00256355
Japanese_.txt: 0.00316227
Korean_.txt: 0.00118366
Vietnamese_.txt: 0.00255491
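
To make the cost of the sizing pass concrete, here is a minimal sketch of the two calling patterns in C. It is not utf8proc or ICU code: `toy_expand`, `toy_map_two_pass`, and `toy_map_prealloc` are hypothetical names, and the "decomposition" is a toy rule (every `a` becomes `ae`) chosen only so that the output size differs from the input size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a decomposition routine: expands every 'a' in
   src into "ae" and copies other bytes unchanged. Writes at most cap bytes
   to dst and always returns the size the full output needs. */
static size_t toy_expand(const char *src, size_t len, char *dst, size_t cap)
{
    size_t need = 0;
    assert(dst != NULL || cap == 0);
    for (size_t i = 0; i < len; i++) {
        if (need < cap) dst[need] = src[i];
        need++;
        if (src[i] == 'a') {
            if (need < cap) dst[need] = 'e';
            need++;
        }
    }
    return need;
}

/* Two-pass mapping: pass 1 only measures, pass 2 fills an exact-size buffer.
   The expansion work is therefore done twice. Caller frees the result. */
static char *toy_map_two_pass(const char *src, size_t len, size_t *out_len)
{
    size_t need = toy_expand(src, len, NULL, 0);   /* pass 1: size only */
    char *buf = malloc(need + 1);
    if (buf == NULL) return NULL;
    toy_expand(src, len, buf, need);               /* pass 2: real work */
    buf[need] = '\0';
    *out_len = need;
    return buf;
}

/* One-pass mapping into a caller-supplied buffer, similar in spirit to the
   preallocated setup used for the ICU benchmark. Returns the needed size;
   the output is complete only if that size is <= cap. */
static size_t toy_map_prealloc(const char *src, size_t len, char *dst, size_t cap)
{
    return toy_expand(src, len, dst, cap);
}
```

Calling `toy_map_two_pass("banana", 6, &n)` walks the input twice, while `toy_map_prealloc` walks it once but relies on the caller guessing a large enough `cap`. The trade is convenience and exact allocation against one extra traversal.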

stevengj · 2014-07-19T12:30:54Z

From gprof, it looks like 40% of the time is spent in utf8proc_decompose_char, 32% in utf8proc_iterate, 14% in utf8proc_decompose, and 11% in utf8proc_encode_char.

stevengj · 2014-07-19T18:57:32Z

GNU libunistring (benchmark added in a39c1a6), which has a similar API (it operates directly on UTF-8 data and does not require the output buffer to be preallocated), looks very comparable to utf8proc:

Deutsch_.txt: 0.00335499
Japanese_.txt: 0.00381766
Korean_.txt: 0.0018932
Vietnamese_.txt: 0.00256382

stevengj · 2014-12-08T02:37:51Z

We could make utf8proc_iterate faster if we assumed that the string was valid UTF-8, and hence could remove the checks. This is the case in Julia, because UTF-8 strings are validated when they are created.

One possibility would be to have an additional flag to assume valid input, in which case a codepath with fewer checks is called. (May be somewhat annoying to implement without a bunch of cut-and-paste, though some preprocessor hacks could be used: e.g. have a file with the decompose and iterate functions, and #include it twice with different #defines to get two versions of the functions.)
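
A rough sketch of the "one body, two versions" idea, assuming a single-file variant in which a macro plays the role of the file that would be included twice. The names `DEFINE_DECODE`, `decode_checked`, and `decode_unchecked` are hypothetical and the decoder is a simplified illustration, not the actual utf8proc_iterate code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Single-file stand-in for "#include the same file twice with different
   #defines": one decoder body, expanded once with CHECKED = 1 and once
   with CHECKED = 0. Returns the number of bytes consumed, 0 at end of
   input, or -1 on invalid input (checked variant only). */
#define DEFINE_DECODE(NAME, CHECKED)                                        \
static ptrdiff_t NAME(const uint8_t *s, ptrdiff_t len, int32_t *cp)         \
{                                                                           \
    assert(s != NULL && cp != NULL);                                        \
    if (len <= 0) return 0;                                                 \
    uint8_t b0 = s[0];                                                      \
    int n = b0 < 0x80 ? 1 : b0 < 0xE0 ? 2 : b0 < 0xF0 ? 3 : 4;              \
    if (CHECKED) {                                                          \
        /* reject stray continuation bytes, overlong 2-byte leads,    */    \
        /* out-of-range leads, truncation, and bad continuation bytes */    \
        if ((b0 >= 0x80 && b0 < 0xC2) || b0 > 0xF4 || n > len) return -1;   \
        for (int i = 1; i < n; i++)                                         \
            if ((s[i] & 0xC0) != 0x80) return -1;                           \
    }                                                                       \
    int32_t c = n == 1 ? b0 : (b0 & (0xFF >> (n + 1)));                     \
    for (int i = 1; i < n; i++) c = (c << 6) | (s[i] & 0x3F);               \
    if (CHECKED) {                                                          \
        /* reject overlong 3/4-byte forms, surrogates, and > U+10FFFF */    \
        if ((n == 3 && c < 0x800) || (n == 4 && c < 0x10000) ||             \
            c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF)) return -1;        \
    }                                                                       \
    *cp = c;                                                                \
    return n;                                                               \
}

DEFINE_DECODE(decode_checked, 1)    /* safe for arbitrary bytes */
DEFINE_DECODE(decode_unchecked, 0)  /* caller must guarantee valid UTF-8 */
```

Because `CHECKED` is a compile-time constant, an optimizing compiler can drop the validation branches from `decode_unchecked` entirely. The unchecked variant may read past the end of a truncated sequence, so it is only appropriate behind a flag whose contract is that the input was validated earlier.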

ScottPJones · 2015-05-07T10:39:15Z

> This is the case in Julia, because UTF-8 strings are validated when they are created.

@stevengj I agree that having valid UTF-8 input could be used to make utf8proc_iterate faster. Unfortunately, that's not currently the case in Julia (although @JeffBezanson [I think] and I would like to make it true). About libunistring: pretty please avoid anything with GPL (except in packages); it makes Julia useless (or even dangerous) for people like me who are trying to use Julia in a commercial project. For really frequent operations, I'll see if I can get this up to ICU speeds...

tkelman · 2015-12-04T03:47:03Z

The author of utf8rewind pings me once in a while on reddit to let me know he's been adding new features, so this would be another thing to compare against eventually: https://bitbucket.org/knight666/utf8rewind/overview

It's MIT licensed and pretty small like utf8proc is, so we could borrow anything that turned out to be worth using.

xhochy · 2020-06-17T10:01:18Z

In the Apache Arrow project, we are also evaluating utf8proc and are running some benchmarks; see apache/arrow#7449 (comment). Currently unilib seems to perform better for us.

stevengj · 2020-06-17T19:02:09Z

One could probably speed up upper/lowercase conversions, e.g. by adding a specialized table just for this, but it's not clear from the issue whether that functionality is actually performance critical or just a test case?
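
As a rough illustration of what a specialized table could look like, here is a sketch in C that answers simple uppercase mappings for U+0000..U+00FF from a dedicated array and defers everything else to a general lookup. All names (`fast_toupper`, `slow_toupper`, `upper_latin1`) are hypothetical, and `slow_toupper` is only a placeholder for whatever general property lookup the library would use; it is not how utf8proc implements case conversion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the general (slower) property lookup. A real
   one would consult the full Unicode data; this placeholder just returns
   its input unchanged. */
static int32_t slow_toupper(int32_t c) { return c; }

static int32_t upper_latin1[256];
static int upper_ready = 0;

/* Fill the dedicated table with simple uppercase mappings for
   U+0000..U+00FF. */
static void init_upper_table(void)
{
    for (int32_t c = 0; c < 256; c++) {
        if (c >= 'a' && c <= 'z')
            upper_latin1[c] = c - 0x20;
        else if (c >= 0xE0 && c <= 0xFE && c != 0xF7)  /* skip DIVISION SIGN */
            upper_latin1[c] = c - 0x20;
        else
            upper_latin1[c] = c;
    }
    /* Two Latin-1 letters whose uppercase lies outside Latin-1. */
    upper_latin1[0xB5] = 0x039C;  /* MICRO SIGN -> GREEK CAPITAL LETTER MU */
    upper_latin1[0xFF] = 0x0178;  /* y with diaeresis -> Y with diaeresis */
    upper_ready = 1;
}

/* Fast path: one array load for the low range, general lookup otherwise. */
static int32_t fast_toupper(int32_t c)
{
    assert(c >= 0 && c <= 0x10FFFF);  /* valid code point range */
    if (!upper_ready) init_upper_table();
    if (c < 256) return upper_latin1[c];
    return slow_toupper(c);
}
```

Whether such a table pays off depends on how much of the workload falls in the covered range, which is the open question about the Arrow use case.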

stevengj · 2020-11-23T19:36:15Z

This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/

However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

sgllama · 2023-04-02T17:04:57Z

> This thesis claims to have implemented similar functionality with a considerable speedup: https://bearworks.missouristate.edu/theses/2731/
>
> However, the source code does not seem publicly available. The thesis advisor is tragically deceased, and I'm not sure about the contact information for the author.

This may be the thesis author: https://www.linkedin.com/in/jpdurham

stevengj added the enhancement label Jul 18, 2014

stevengj added a commit that referenced this issue Jul 19, 2014

add benchmark (issue #12)

20cff07

stevengj mentioned this issue Mar 6, 2015

Ruby / PostgreSQL plug-ins #22

Open

jiahao mentioned this issue Sep 18, 2015

Feature request: Full Case Folding #54

Closed

xhochy mentioned this issue Jun 17, 2020

ARROW-9133: [C++] Add utf8_upper and utf8_lower apache/arrow#7449

Closed
