Poor performance (and excessive garbage collection) in utf8/utf16/utf32 conversion functions #10959

ScottPJones · 2015-04-23T14:46:24Z

These all tend to be rather slow because they create an empty array and append to it one character at a time. Also, instead of having specific converters for cases like:
ASCIIString -> UTF16String, UTF8String -> UTF16String, UTF32String -> UTF16String
ASCIIString -> UTF32String, UTF8String -> UTF32String, UTF16String -> UTF32String
all of these are handled by a single generic function (encode16 for UTF16String).

jiahao · 2015-04-23T15:23:51Z

We will gladly take a pull request to improve performance.

StefanKarpinski · 2015-04-23T15:24:40Z

But just pointing where there are performance issues is also helpful. Thanks, @ScottPJones!

stevengj · 2015-04-23T16:46:51Z

We should avoid a combinatorial explosion of optimized methods, though; it's not clear to me which of these conversions are performance-critical. Probably UTF8 and ASCII to/from UTF16 are the most important, for calling APIs that use UTF16 (e.g. Windows or ICU).

Note that the cost of repeated allocations can be avoided simply by calling sizehint on the array.
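To illustrate the sizehint suggestion above, here is a minimal sketch written against current Julia, where the function is spelled `sizehint!` (an assumption relative to the 2015 API discussed in this thread). The names `widen_push`, `widen_hinted`, and `widen_prealloc` are hypothetical and are not the code in utf16.jl; they only contrast growing an array one element at a time with reserving or allocating the space up front.

```julia
# Widen ASCII bytes to 16-bit code units, growing the output one element
# at a time. The array may be reallocated several times as it grows.
function widen_push(bytes::Vector{UInt8})
    out = UInt16[]
    for b in bytes
        push!(out, UInt16(b))
    end
    return out
end

# Same loop, but reserve capacity first so push! does not need to regrow.
function widen_hinted(bytes::Vector{UInt8})
    out = UInt16[]
    sizehint!(out, length(bytes))
    for b in bytes
        push!(out, UInt16(b))
    end
    return out
end

# Allocate the exact size once and fill by index (no push! at all).
function widen_prealloc(bytes::Vector{UInt8})
    out = Vector{UInt16}(undef, length(bytes))
    @inbounds for i in eachindex(bytes)
        out[i] = bytes[i]   # UInt8 is converted to UInt16 on assignment
    end
    return out
end
```

All three return the same code units; only the allocation pattern differs, which could be compared with `@time` or `@allocated` on a long input.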

ScottPJones · 2015-04-23T19:01:29Z

I will actually be using all of them: ASCIIString, UTF16String, and UTF32String (converting UTF8String to the most appropriate of the three) for my string processing... I've got to have UTF32String for all the emoji people are using in chat, Twitter, etc.!
I need all of them to perform well...
I don't think adding 2-3 optimized methods for each type will cause a combinatorial explosion, though...
there are already quite a few for each type...

stevengj · 2015-04-23T19:21:34Z

I don't quite understand your comment; all of the UTF-xx types can store emoji. Can you give an example of a common string-processing task that is substantially faster in UTF-32? ("Counting codepoints" is not usually a common need, and things like searching can be fast in UTF-8.) I'd rather focus on optimizing important UTF-8 operations...

ScottPJones · 2015-04-23T21:09:09Z

I frequently have to deal with records where the data is packed into fields based on character position... what was an O(1) operation to extract a field becomes an O(n) operation in Julia...
Because of that, I will use a byte array to process records that don't contain any characters > 255,
an array of 16-bit words to process records that only have characters up to 65535, and 32-bit characters for the rest.
So, if I have a data source with lots of records with emoji, I'll be using UTF-32 a lot...
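The scheme described above can be sketched roughly as follows, in current Julia rather than the 2015 string types. The helper names `narrowest_units`, `field_fixed`, and `field_utf8` are hypothetical, and the sample records are made up for illustration. The point is only that a fixed-width array can be indexed directly by character position, while a UTF-8 string has to be scanned to find the same positions.

```julia
# Pick the narrowest fixed-width element type that can hold every
# character of `s`: bytes for <= 255, 16-bit for <= 65535, else 32-bit.
function narrowest_units(s::AbstractString)
    maxcp = isempty(s) ? UInt32(0) : maximum(codepoint(c) for c in s)
    T = maxcp <= 0xff ? UInt8 : maxcp <= 0xffff ? UInt16 : UInt32
    return T[codepoint(c) for c in s]
end

# Fixed-width storage: character position == array index, so locating the
# field is O(1) (copying it out is proportional to the field length).
field_fixed(units::Vector, r::UnitRange{Int}) = String(Char.(units[r]))

# UTF-8 storage: character positions must be translated to byte indices
# by walking the string from the start, which is O(n) in the position.
function field_utf8(s::String, r::UnitRange{Int})
    lo = nextind(s, 0, first(r))   # byte index of the first(r)-th character
    hi = nextind(s, 0, last(r))    # byte index of the last(r)-th character
    return s[lo:hi]
end

rec = "0042José 😀ok"          # the name field occupies characters 5 to 8
units = narrowest_units(rec)    # needs 32-bit units because of the emoji
println(field_fixed(units, 5:8))
println(field_utf8(rec, 5:8))
```

Both calls extract the same field; the difference is in how the start of the field is found, not in the result.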

I really don't like UTF-8 for anything but a means of saving text to, or getting text from, an external source or destination... I came up with my own packed scheme for storing Unicode data much more efficiently 18 years ago... it made a huge difference in performance (because performance actually depended a lot more on how much of your data would fit in your shared-memory database cache than on the relatively small amount of time it took to pack/unpack the data into that format...)

stevengj · 2015-04-23T21:49:02Z

Wouldn't you just use a fast in-memory compression scheme nowadays, like Blosc, in that case?

ScottPJones · 2015-04-23T23:08:27Z

Many, many reasons... Blosc isn't going to help in compressing short fields of a small number of bytes... whereas my scheme did a very good job at just that... (I actually had a combination of two things: one to store variable-length elements in just a few bytes, and then a compaction scheme for Unicode strings that took advantage of the sorts of things you'd see in real-world text data, esp. data you'd find in databases, for different sorts of languages, such as Japanese...)
I might use an LZW sort of compression on LOBs for what I'm working on now... Blosc may be useful...

ScottPJones · 2015-04-24T02:16:24Z

I ran into a problem when testing my performance improvements for utf16()...
I found a test in strings.jl that tries to check the return value of is_valid_utf16(); however, my new code actually throws an error instead of producing an invalid UTF-16 string.
In fact, another problem is that the parser doesn't give an error on an invalid string literal... in this case:
"\ud800".

while loading strings.jl, in expression starting on line 1284
From worker 5: * linalg/triangular in 55.11 seconds
From worker 4: * numbers in 24.77 seconds
From worker 3: * dates in 27.11 seconds
ERROR: LoadError: LoadError: test error in expression: is_valid_utf16(utf16("\ud800")) == false
ArgumentError: missing trailing Unicode surrogate character at index 1 (0xd800)

What should I do?
I can change the expression in strings.jl to:

is_valid_utf16(UTF16String([0xd800,0x0]))

which will make the test work... there is still the larger issue of why Julia allows creating invalid UTF-8 string literals...

stevengj · 2015-04-24T15:36:21Z

I would just use is_valid_utf16(UTF16String([0xd800,0x0])). Whether we should allow creation of invalid UTF-8 literals should be discussed in a separate issue. (Well, it will always be possible by mucking around with s.data, but the question is how easy we should make it.)
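For readers unfamiliar with why `0xd800` on its own is invalid: it is a leading surrogate, which must be followed by a trailing surrogate in the range 0xdc00 to 0xdfff. The following is a minimal sketch of that pairing rule in current Julia. The names `is_lead`, `is_trail`, and `valid_utf16_units` are hypothetical; this is not the implementation of `is_valid_utf16` or of the checks added for this issue.

```julia
# Surrogate ranges as defined for UTF-16.
is_lead(u::UInt16)  = 0xd800 <= u <= 0xdbff   # leading (high) surrogate
is_trail(u::UInt16) = 0xdc00 <= u <= 0xdfff   # trailing (low) surrogate

# Return true if every leading surrogate is immediately followed by a
# trailing surrogate and no trailing surrogate appears on its own.
function valid_utf16_units(units::AbstractVector{UInt16})
    i = firstindex(units)
    while i <= lastindex(units)
        u = units[i]
        if is_lead(u)
            # A lead must have a trail right after it.
            (i < lastindex(units) && is_trail(units[i + 1])) || return false
            i += 2
        elseif is_trail(u)
            return false          # trail without a preceding lead
        else
            i += 1                # ordinary BMP code unit
        end
    end
    return true
end
```

With this sketch, `valid_utf16_units(UInt16[0xd800, 0x0000])` returns `false`, mirroring the test input suggested above, while the surrogate pair `UInt16[0xd83d, 0xde00]` is accepted.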

ScottPJones · 2015-04-24T23:13:24Z

No, Blosc and things like it are fine for large things, but don't help at all with a bunch of short string fields... Blosc in particular is optimized to handle lots of fixed-length things like ints, floats, doubles, etc.
To get performance it normally shuffles the data... (additionally, it uses multithreading to gain performance, which is fine if you don't have up to 64K other processes all wanting to go as fast as possible at the same time...)

ScottPJones · 2015-04-25T08:14:48Z

Here are my first results: using the nightly build, and a build from roughly the same time tonight, with my changes to utf16.jl and utf32.jl:
First, the nightly build:

Looping 1000000 times

strAscii: sizeof=16 length=16
strUTF8:  sizeof=26  length=16
strUTF16: sizeof=36 length=16
strUTF32: sizeof=64 length=16

ASCII length
elapsed time: 1.139e-6 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.024775922 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.060517614 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.22e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.006927128 seconds (22 MB allocated, 22.76% gc time in 1 pauses with 0 full sweep)
ASCII convert to UTF-16
elapsed time: 0.221942112 seconds (228 MB allocated, 8.68% gc time in 11 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 0.066822061 seconds (160 MB allocated, 13.26% gc time in 7 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 0.368932996 seconds (228 MB allocated, 4.69% gc time in 11 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 0.225690501 seconds (160 MB allocated, 5.14% gc time in 7 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 1.044142034 seconds (602 MB allocated, 6.45% gc time in 27 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 0.176482338 seconds (160 MB allocated, 7.91% gc time in 8 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 0.635408295 seconds (308 MB allocated, 7.21% gc time in 14 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 0.257192549 seconds (228 MB allocated, 6.14% gc time in 10 pauses with 0 full sweep)
Looping 10000 times

strAscii: sizeof=65536 length=65536
strUTF8:  sizeof=106496  length=65536
strUTF16: sizeof=147456 length=65536
strUTF32: sizeof=262144 length=65536

ASCII length
elapsed time: 4.01e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.820893334 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 2.351822159 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.07e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 6.3431e-5 seconds (234 kB allocated)
ASCII convert to UTF-16
elapsed time: 5.402134634 seconds (5011 MB allocated, 6.07% gc time in 228 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 1.629907455 seconds (2500 MB allocated, 9.79% gc time in 113 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 11.667746952 seconds (5011 MB allocated, 2.97% gc time in 228 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 8.249162805 seconds (2500 MB allocated, 2.66% gc time in 113 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 28.845708047 seconds (7425 MB allocated, 2.99% gc time in 339 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 5.833579708 seconds (2500 MB allocated, 3.52% gc time in 114 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 15.46437403 seconds (2267 MB allocated, 1.25% gc time in 103 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 6.703502905 seconds (5011 MB allocated, 4.79% gc time in 228 pauses with 0 full sweep)

Now my build:

Looping 1000000 times

strAscii: sizeof=16 length=16
strUTF8:  sizeof=26  length=16
strUTF16: sizeof=36 length=16
strUTF32: sizeof=64 length=16

ASCII length
elapsed time: 3.21e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.022165305 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.018233583 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 3.49e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.014409868 seconds (22 MB allocated, 19.62% gc time in 1 pauses with 0 full sweep)
ASCII convert to UTF-16
elapsed time: 0.070732848 seconds (129 MB allocated, 10.59% gc time in 6 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 0.096013652 seconds (160 MB allocated, 26.13% gc time in 8 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 0.144519669 seconds (129 MB allocated, 4.92% gc time in 5 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 0.147018295 seconds (160 MB allocated, 20.07% gc time in 8 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 0.137035481 seconds (114 MB allocated, 12.43% gc time in 5 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 0.117507099 seconds (160 MB allocated, 21.49% gc time in 7 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 0.117562026 seconds (114 MB allocated, 17.88% gc time in 6 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 0.112164033 seconds (129 MB allocated, 7.41% gc time in 6 pauses with 0 full sweep)
Looping 10000 times

strAscii: sizeof=65536 length=65536
strUTF8:  sizeof=106496  length=65536
strUTF16: sizeof=147456 length=65536
strUTF32: sizeof=262144 length=65536

ASCII length
elapsed time: 2.8e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.744378667 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.596741436 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.36e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.000111846 seconds (234 kB allocated)
ASCII convert to UTF-16
elapsed time: 1.029583397 seconds (1250 MB allocated, 9.28% gc time in 57 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 1.353186115 seconds (2500 MB allocated, 9.39% gc time in 113 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 3.340274404 seconds (1407 MB allocated, 3.67% gc time in 64 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 3.278156343 seconds (2500 MB allocated, 5.30% gc time in 114 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 2.792735616 seconds (1016 MB allocated, 3.78% gc time in 46 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 3.495122776 seconds (2500 MB allocated, 6.08% gc time in 114 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 3.431119066 seconds (1016 MB allocated, 2.92% gc time in 46 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 3.304848523 seconds (1407 MB allocated, 3.80% gc time in 64 pauses with 0 full sweep)

The conversions that I changed are significantly faster and allocate a lot less memory, and they now also fully check for validity when converting.

Rewrote a number of the conversions between ASCIIString, UTF8String, UTF16String, and UTF32String, and also rewrote length() for UTF16String().

timholy · 2015-04-25T08:59:54Z

I know nothing about strings, but the performance improvement looks great!

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().
Added over 150 lines of testing code to detect the above conversion problems
Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved). https://gist.github.com/ScottPJones/79ed895f05f85f333d84
Updated based on review comments
Changes to error handling and check_string
Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4
Updated to use unsafe_checkstring
Removed redundant argument documentation
Fix AbstractVector{UInt16} conversion
Remove support for converting Vector{UInt16} to UTF8String
Add Unicode validation function and fix UTF-16 conversion bugs

Added new `convert` methods that use the `check_string` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575

Fix #10959 bugs with UTF-16 conversions

Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575

Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical

Fix #10959 bugs with UTF-32 conversions

@inline

Update for search change Updated to use unsafe_checkstring, fix comments Update comments Remove @inline from test function Removed conversions with Vector{Char} Ensure all changes included

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

Fix #10959, fix #11463 bugs with UTF-8 conversions

jiahao added the unicode Related to unicode characters and encodings label Apr 23, 2015

StefanKarpinski added the performance Must go faster label Apr 23, 2015

ScottPJones added a commit to ScottPJones/julia that referenced this issue Apr 26, 2015

Fix JuliaLang#10959 performance issues with conversions between different UTF & ASCII string types

fbb0ed9

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 2, 2015

JuliaLang#10959 JuliaLang#11004 Improve performance of UTF conversions

ba1ef17

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 2, 2015

JuliaLang#10959 JuliaLang#11004 Get rid of trailing spaces on lines

8864f88

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015

Fix JuliaLang#10959 pure Julia code for UTF conversions

7278d95

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015

Fix JuliaLang#10959 Added @inbounds for better performance

7c6f5ac

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

c68bba0

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

0e103b8

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

ad6f907

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 16, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

0f7a072

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 17, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

e9ef5f5

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 21, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

c432c08

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 21, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

618fbf7

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 22, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

619851a

ScottPJones added a commit to ScottPJones/julia that referenced this issue May 22, 2015

Fix JuliaLang#10959 Fix Unicode bugs with string conversions, make up to 10x faster

dbf909c

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

9175a02

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015

Fix JuliaLang#10959 bugs with UTF-16/UTF-32 conversions

e555d2d

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

1758764

tkelman closed this as completed in a286ce0 Jul 1, 2015

tkelman added a commit that referenced this issue Jul 1, 2015

Merge pull request #11551 from ScottPJones/spj/fixutf

9071f14

Fix #10959 bugs with UTF-16 conversions

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 1, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

239ff39

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 1, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

93376fb

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

d598bea

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

d3f9619

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

d75b981

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 10, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

e17ecf4

Update for search change Updated to use unsafe_checkstring, fix comments

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 12, 2015

Fix JuliaLang#10959 problems with UTF-8 conversions

ca87916

Update for search change Updated to use unsafe_checkstring, fix comments

jakebolewski added a commit that referenced this issue Jul 19, 2015

Merge pull request #11607 from ScottPJones/spj/fixutf32

c08b1bb

Fix #10959 bugs with UTF-32 conversions

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 24, 2015

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

2fca588

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 27, 2015

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

37650ef

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 28, 2015

Fix JuliaLang#10959, fix JuliaLang#11463 bugs with UTF-8 conversions

91305f7

Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start

StefanKarpinski added a commit that referenced this issue Jul 28, 2015

Merge pull request #11624 from ScottPJones/spj/fixutf8

416a23e

Fix #10959, fix #11463 bugs with UTF-8 conversions
