Poor performance (and excessive garbage collection) in utf8/utf16/utf32 conversion functions #10959

Closed
ScottPJones opened this issue Apr 23, 2015 · 13 comments
Labels: performance, unicode

Comments

@ScottPJones
Contributor

These all tend to be rather slow because they create an empty array and append to it one character at a time. Also, instead of having specific converters for cases like:
ASCIIString -> UTF16String, UTF8String -> UTF16String, UTF32String -> UTF16String
ASCIIString -> UTF32String, UTF8String -> UTF32String, UTF16String -> UTF32String
these are all handled by a single generic function (encode16 for UTF16String).

@jiahao
Member

jiahao commented Apr 23, 2015

We will gladly take a pull request to improve performance.

@jiahao jiahao added the unicode Related to unicode characters and encodings label Apr 23, 2015
@StefanKarpinski
Member

But just pointing where there are performance issues is also helpful. Thanks, @ScottPJones!

@StefanKarpinski StefanKarpinski added the performance Must go faster label Apr 23, 2015
@stevengj
Member

We should avoid a combinatorial explosion of optimized methods, though; it's not clear to me which of these conversions are performance-critical. Probably UTF8 and ASCII to/from UTF16 are the most important, for calling APIs that use UTF16 (e.g. Windows or ICU).

Note that the cost of repeated allocations can be avoided simply by calling sizehint on the array.
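
The allocation point can be illustrated with a minimal sketch, written in Python rather than Julia and not taken from Base: count the 16-bit units the output needs, allocate the buffer once, then fill it in place. The function names `utf16_units_needed` and `encode_utf16_presized` are hypothetical, and the sketch does no validity checking.

```python
from array import array


def utf16_units_needed(text: str) -> int:
    # One 16-bit unit per code point up to U+FFFF, two (a surrogate pair) above.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)


def encode_utf16_presized(text: str) -> array:
    # Allocate the whole output once instead of growing it unit by unit.
    out = array("H", [0]) * utf16_units_needed(text)
    pos = 0
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split a supplementary code point into lead and trail surrogates.
            cp -= 0x10000
            out[pos] = 0xD800 | (cp >> 10)
            out[pos + 1] = 0xDC00 | (cp & 0x3FF)
            pos += 2
        else:
            out[pos] = cp
            pos += 1
    return out
```

The extra counting pass reads the input twice, but it avoids every reallocation and copy that appending would trigger as the buffer grows.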

@ScottPJones
Contributor Author

I will actually be using all of them: ASCIIString, UTF16String, and UTF32String (converting UTF8String to the most appropriate of the three) for my string processing. I've got to have UTF32String for all the emoji people are using in chat, Twitter, etc.!
I need all of them to perform well.
I don't think adding 2-3 optimized methods for each type will cause a combinatorial explosion, though;
there are already quite a few for each type.

@stevengj
Member

I don't quite understand your comment; all of the UTF-xx types can store emoji. Can you give an example of a common string-processing task that is substantially faster in UTF-32? ("Counting codepoints" is not usually a common need, and things like searching can be fast in UTF-8.) I'd rather focus on optimizing important UTF-8 operations...

@ScottPJones
Contributor Author

I frequently have to deal with records where the fields are packed by character position. What was an O(1) operation to extract a field becomes an O(n) operation in Julia.
Because of that, I will use a byte array to process records that don't contain any characters > 255,
an array of 16-bit words to process records that only have characters up to 65535, and 32-bit characters for the rest.
So, if I have a data source with lots of records with emoji, I'll be using UTF-32 a lot.
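
A minimal sketch of that width-selection idea, in Python for illustration only (the names `pack_fixed_width` and `field` are hypothetical, and the `"I"` typecode is assumed to be 32 bits wide, as it is on mainstream platforms):

```python
from array import array


def pack_fixed_width(text: str) -> array:
    """Store text in the narrowest fixed-width array that fits every code point."""
    top = max(map(ord, text), default=0)
    if top <= 0xFF:
        typecode = "B"  # 8-bit units: no character above 255
    elif top <= 0xFFFF:
        typecode = "H"  # 16-bit units: no character above 65535
    else:
        typecode = "I"  # 32-bit units (assumed width) for everything else
    return array(typecode, map(ord, text))


def field(units: array, start: int, stop: int) -> str:
    # Every unit is exactly one character, so a field at a known character
    # position is a plain slice; no scan from the start of the record.
    return "".join(map(chr, units[start:stop]))
```

With fixed-width units, the cost of extracting a field depends only on the field's length, not on where it sits in the record.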

I really don't like UTF-8 for anything but saving text to, or getting text from, an external source or destination. I came up with my own packed scheme for storing Unicode data much more efficiently 18 years ago, and it made a huge difference in performance (because performance depended a lot more on how much of your data would fit in your shared-memory database cache than on the relatively small amount of time it took to pack/unpack the data into that format).

@stevengj
Member

Wouldn't you just use a fast in-memory compression scheme nowadays, like Blosc, in that case?

@ScottPJones
Contributor Author

Many, many reasons. Blosc isn't going to help in compressing short fields of only a few bytes, whereas my scheme did a very good job at just that. (I actually had a combination of two things: one to store variable-length elements in just a few bytes, and a compaction scheme for Unicode strings that took advantage of the sorts of things you'd see in real-world text data, especially data you'd find in databases, for different sorts of languages, such as Japanese.)
I might use an LZW sort of compression on LOBs for what I'm working on now, so Blosc may be useful there.

@ScottPJones
Contributor Author

I ran into a problem when testing my performance improvements for utf16().
A test in strings.jl checks the return value of is_valid_utf16(), but my new code throws an error during the conversion instead of producing an invalid UTF-16 string for it to check.
Another problem is that the parser doesn't give an error on an invalid string literal, in this case:
"\ud800".

while loading strings.jl, in expression starting on line 1284
From worker 5: * linalg/triangular in 55.11 seconds
From worker 4: * numbers in 24.77 seconds
From worker 3: * dates in 27.11 seconds
ERROR: LoadError: LoadError: test error in expression: is_valid_utf16(utf16("\ud800")) == false
ArgumentError: missing trailing Unicode surrogate character at index 1 (0xd800)

What should I do?
I can change the expression in strings.jl to:

is_valid_utf16(UTF16String([0xd800,0x0]))

which will make the test work. There is still the larger issue of why Julia allows creating invalid UTF-8 string literals.

@stevengj
Member

I would just use is_valid_utf16(UTF16String([0xd800,0x0])). Whether we should allow creation of invalid UTF-8 literals should be discussed in a separate issue. (Well, it will always be possible by mucking around with s.data, but the question is how easy we should make it.)
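
For reference, the surrogate rule behind the "missing trailing Unicode surrogate character" error can be sketched as follows. This is a Python illustration, not the Julia implementation; the function merely borrows the name `is_valid_utf16` and takes a plain sequence of 16-bit unit values.

```python
def is_valid_utf16(units) -> bool:
    """Return True if every surrogate in a sequence of 16-bit units is paired."""
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A lead surrogate must be followed immediately by a trail surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A trail surrogate with no lead surrogate before it.
            return False
        else:
            i += 1
    return True
```

Under this rule the sequence `[0xd800, 0x0]` from the proposed test is invalid, because the lead surrogate 0xd800 is followed by 0x0 rather than a trail surrogate.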

@ScottPJones
Contributor Author

No, Blosc, and things like it, are fine for large things, but don't help at all with a bunch of short string fields. Blosc in particular is optimized to handle lots of fixed-length things like ints, floats, doubles, etc.;
to get performance it normally shuffles the data. Additionally, it uses multithreading to gain performance, which is fine if you don't have up to 64K other processes all wanting to go as fast as possible at the same time.

@ScottPJones
Contributor Author

Here are my first results: using the nightly build, and a build from roughly the same time tonight, with my changes to utf16.jl and utf32.jl:
First, the nightly build:

Looping 1000000 times

strAscii: sizeof=16 length=16
strUTF8:  sizeof=26  length=16
strUTF16: sizeof=36 length=16
strUTF32: sizeof=64 length=16

ASCII length
elapsed time: 1.139e-6 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.024775922 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.060517614 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.22e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.006927128 seconds (22 MB allocated, 22.76% gc time in 1 pauses with 0 full sweep)
ASCII convert to UTF-16
elapsed time: 0.221942112 seconds (228 MB allocated, 8.68% gc time in 11 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 0.066822061 seconds (160 MB allocated, 13.26% gc time in 7 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 0.368932996 seconds (228 MB allocated, 4.69% gc time in 11 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 0.225690501 seconds (160 MB allocated, 5.14% gc time in 7 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 1.044142034 seconds (602 MB allocated, 6.45% gc time in 27 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 0.176482338 seconds (160 MB allocated, 7.91% gc time in 8 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 0.635408295 seconds (308 MB allocated, 7.21% gc time in 14 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 0.257192549 seconds (228 MB allocated, 6.14% gc time in 10 pauses with 0 full sweep)
Looping 10000 times

strAscii: sizeof=65536 length=65536
strUTF8:  sizeof=106496  length=65536
strUTF16: sizeof=147456 length=65536
strUTF32: sizeof=262144 length=65536

ASCII length
elapsed time: 4.01e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.820893334 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 2.351822159 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.07e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 6.3431e-5 seconds (234 kB allocated)
ASCII convert to UTF-16
elapsed time: 5.402134634 seconds (5011 MB allocated, 6.07% gc time in 228 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 1.629907455 seconds (2500 MB allocated, 9.79% gc time in 113 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 11.667746952 seconds (5011 MB allocated, 2.97% gc time in 228 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 8.249162805 seconds (2500 MB allocated, 2.66% gc time in 113 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 28.845708047 seconds (7425 MB allocated, 2.99% gc time in 339 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 5.833579708 seconds (2500 MB allocated, 3.52% gc time in 114 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 15.46437403 seconds (2267 MB allocated, 1.25% gc time in 103 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 6.703502905 seconds (5011 MB allocated, 4.79% gc time in 228 pauses with 0 full sweep)

Now my build:

Looping 1000000 times

strAscii: sizeof=16 length=16
strUTF8:  sizeof=26  length=16
strUTF16: sizeof=36 length=16
strUTF32: sizeof=64 length=16

ASCII length
elapsed time: 3.21e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.022165305 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.018233583 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 3.49e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.014409868 seconds (22 MB allocated, 19.62% gc time in 1 pauses with 0 full sweep)
ASCII convert to UTF-16
elapsed time: 0.070732848 seconds (129 MB allocated, 10.59% gc time in 6 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 0.096013652 seconds (160 MB allocated, 26.13% gc time in 8 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 0.144519669 seconds (129 MB allocated, 4.92% gc time in 5 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 0.147018295 seconds (160 MB allocated, 20.07% gc time in 8 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 0.137035481 seconds (114 MB allocated, 12.43% gc time in 5 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 0.117507099 seconds (160 MB allocated, 21.49% gc time in 7 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 0.117562026 seconds (114 MB allocated, 17.88% gc time in 6 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 0.112164033 seconds (129 MB allocated, 7.41% gc time in 6 pauses with 0 full sweep)
Looping 10000 times

strAscii: sizeof=65536 length=65536
strUTF8:  sizeof=106496  length=65536
strUTF16: sizeof=147456 length=65536
strUTF32: sizeof=262144 length=65536

ASCII length
elapsed time: 2.8e-7 seconds (0 bytes allocated)
UTF-8 length
elapsed time: 0.744378667 seconds (0 bytes allocated)
UTF-16 length
elapsed time: 0.596741436 seconds (0 bytes allocated)
UTF-32 length
elapsed time: 4.36e-7 seconds (0 bytes allocated)
ASCII convert to UTF-8
elapsed time: 0.000111846 seconds (234 kB allocated)
ASCII convert to UTF-16
elapsed time: 1.029583397 seconds (1250 MB allocated, 9.28% gc time in 57 pauses with 0 full sweep)
ASCII convert to UTF-32
elapsed time: 1.353186115 seconds (2500 MB allocated, 9.39% gc time in 113 pauses with 0 full sweep)
UTF-8 convert to UTF-16
elapsed time: 3.340274404 seconds (1407 MB allocated, 3.67% gc time in 64 pauses with 0 full sweep)
UTF-8 convert to UTF-32
elapsed time: 3.278156343 seconds (2500 MB allocated, 5.30% gc time in 114 pauses with 0 full sweep)
UTF-16 convert to UTF-8
elapsed time: 2.792735616 seconds (1016 MB allocated, 3.78% gc time in 46 pauses with 0 full sweep)
UTF-16 convert to UTF-32
elapsed time: 3.495122776 seconds (2500 MB allocated, 6.08% gc time in 114 pauses with 0 full sweep)
UTF-32 convert to UTF-8
elapsed time: 3.431119066 seconds (1016 MB allocated, 2.92% gc time in 46 pauses with 0 full sweep)
UTF-32 convert to UTF-16
elapsed time: 3.304848523 seconds (1407 MB allocated, 3.80% gc time in 64 pauses with 0 full sweep)

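In the ones that I changed, there is a significant improvement in performance and a lot less memory is allocated. For example, UTF-16 to UTF-8 on the 65536-character strings went from 28.8 seconds and 7425 MB allocated to 2.8 seconds and 1016 MB. They also now completely check for validity when converting.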

ScottPJones added a commit to ScottPJones/julia that referenced this issue Apr 25, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String,
UTF16String, and UTF32String, and also rewrote length() for
UTF16String().
@timholy
Member

timholy commented Apr 25, 2015

I know nothing about strings, but the performance improvement looks great!

ScottPJones added a commit to ScottPJones/julia that referenced this issue Apr 26, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 2, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 2, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 7, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 14, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 16, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 17, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 21, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 21, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 22, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue May 22, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().

Added over 150 lines of testing code to detect the above conversion problems

Added (in a gist) code to show other conversion problems not yet fixed:
https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc

Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity
checking did not adversely affect performance (in fact, performance was greatly improved).
https://gist.github.com/ScottPJones/79ed895f05f85f333d84

Updated based on review comments

Changes to error handling and check_string

Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4

Updated to use unsafe_checkstring

Removed redundant argument documentation

Fix AbstractVector{UInt16} conversion

Remove support for converting Vector{UInt16} to UTF8String

Add Unicode validation function and fix UTF-16 conversion bugs
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015
Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jun 22, 2015
@tkelman tkelman closed this as completed in a286ce0 Jul 1, 2015
tkelman added a commit that referenced this issue Jul 1, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 1, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 1, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 1, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 6, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 9, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 10, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 10, 2015
Update for search change

Updated to use unsafe_checkstring, fix comments
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 12, 2015
Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Updated to use unsafe_checkstring, fix comments

Remove conversions from Vector{UInt32}

Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 12, 2015
Update for search change

Updated to use unsafe_checkstring, fix comments
jakebolewski added a commit that referenced this issue Jul 19, 2015
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 19, 2015
Update for search change

Updated to use unsafe_checkstring, fix comments

Update comments

Remove @inline from test function

Removed conversions with Vector{Char}

Ensure all changes included
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 24, 2015
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 27, 2015
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
ScottPJones added a commit to ScottPJones/julia that referenced this issue Jul 28, 2015
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
StefanKarpinski added a commit that referenced this issue Jul 28, 2015
5 participants