Poor performance (and excessive garbage collection) in utf8/utf16/utf32 conversion functions #10959
Comments
We will gladly take a pull request to improve performance.
But just pointing where there are performance issues is also helpful. Thanks, @ScottPJones!
We should avoid a combinatorial explosion of optimized methods, though; it's not clear to me which of these conversions are performance-critical. Probably UTF8 and ASCII to/from UTF16 are the most important, for calling APIs that use UTF16 (e.g. Windows or ICU). Note that the cost of repeated allocations can be avoided simply by calling
I actually will be using all of them, ASCIIString, UTF16String, and UTF32String (and converting UTF8String to the most appropriate of the 3) for my string processing... gotta have UTF32String for all the emoji people are using in chat, Twitter, etc.!
I don't quite understand your comment; all of the UTF-xx types can store emoji. Can you give an example of a common string-processing task that is substantially faster in UTF-32? ("Counting codepoints" is not usually a common need, and things like searching can be fast in UTF-8.) I'd rather focus on optimizing important UTF-8 operations...
I frequently have to deal with records where the data is packed by character position... what was an O(1) operation to extract a field becomes an O(n) operation in Julia... I really don't like UTF-8 for anything but a means of saving text to / getting text from an external destination / source... I came up with my own packed scheme for storing Unicode data much more efficiently 18 years ago... it made a huge difference in performance (because performance actually depended a lot more on how much of your data would fit in your shared memory database cache, not on the relatively small amount of time it took to pack/unpack the data into that format...)
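To make the O(1) versus O(n) point concrete, here is a minimal sketch, written in Python rather than Julia and not taken from this thread. The helper names `utf8_offset`, `utf8_field`, and `utf32_field` are hypothetical. It assumes well-formed input and fields addressed by code point position: with UTF-8 bytes the start of the field has to be found by scanning lead bytes from the beginning, while with UTF-32 the byte offset is just 4 times the character index.

```python
def utf8_offset(data: bytes, char_index: int) -> int:
    """Byte offset of the char_index-th code point; scans from the start, O(n)."""
    count = 0
    for pos, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # not a continuation byte, so a new code point starts here
            if count == char_index:
                return pos
            count += 1
    if count == char_index:  # index one past the last code point
        return len(data)
    raise IndexError("character index out of range")


def utf8_field(data: bytes, start: int, length: int) -> str:
    """Extract `length` code points starting at code point `start` from UTF-8 bytes."""
    begin = utf8_offset(data, start)
    end = utf8_offset(data, start + length)
    return data[begin:end].decode("utf-8")


def utf32_field(data: bytes, start: int, length: int) -> str:
    """Same extraction from UTF-32LE bytes; offsets are computed directly, O(1)."""
    return data[4 * start:4 * (start + length)].decode("utf-32-le")


record = "ab\u00e9\U0001F600cd"
print(utf8_field(record.encode("utf-8"), 2, 2))
print(utf32_field(record.encode("utf-32-le"), 2, 2))
```

The trade-off is the one discussed in this thread: the fixed-width form uses four bytes per code point, so the constant-time indexing is paid for in memory.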
Wouldn't you just use a fast in-memory compression scheme nowadays, like Blosc, in that case?
Many many reasons... Blosc isn't going to help in compressing short fields of a small number of bytes... whereas my scheme did a very good job at just that... (I actually had a combination of two things... one to store variable length elements in just a few bytes, and then a compaction scheme for Unicode strings, that took advantage of the sorts of things you'd see in real-world text data, esp. data you'd find in databases, for different sorts of languages, such as Japanese...)
I ran into a problem when testing my performance improvements for utf16()...
What should I do?
which will make the test work... there still is the larger issue of why Julia allows creating invalid UTF-8 string literals...
I would just use
No, Blosc, and things like it, are fine for large things, but don't help at all with a bunch of short string fields... Blosc in particular is optimized to handle lots of fixed length things like ints, floats, doubles, etc.
Here are my first results: using the nightly build, and a build from roughly the same time tonight, with my changes to utf16.jl and utf32.jl:
Now my build:
In the ones that I changed, there is a significant improvement in performance, a lot less memory is allocated, and they also completely check for validity when converting.
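As an illustration of what checking validity means for UTF-16 input, here is a minimal sketch in Python (not the Julia code from these changes). The function name `count_utf16_code_points` is hypothetical. It walks a sequence of 16-bit code units, rejects unpaired surrogates, and returns the number of code points, which is also the quantity a `length()` on a UTF-16 string has to compute.

```python
def count_utf16_code_points(units) -> int:
    """Validate surrogate pairing in UTF-16 code units and count code points.

    Raises ValueError on a high surrogate without a following low surrogate,
    or on a low surrogate that does not follow a high surrogate.
    """
    i = 0
    n = len(units)
    count = 0
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:  # high (leading) surrogate
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            i += 2  # a valid pair encodes one code point
        elif 0xDC00 <= u <= 0xDFFF:  # low (trailing) surrogate with no lead
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            i += 1
        count += 1
    return count


print(count_utf16_code_points([0x61, 0xD83D, 0xDE00]))
```

A single pass like this can both validate and measure the input, so a converter that runs it first knows the exact output size before allocating anything.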
Rewrote a number of the conversions between ASCIIString, UTF8String, UTF16String, and UTF32String, and also rewrote length() for UTF16String().
I know nothing about strings, but the performance improvement looks great!
…rent UTF & ASCII string types
… to 10x faster
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String. Rewrote length() for UTF16String(). Improved reverse() for UTF16String(). Added over 150 lines of testing code to detect the above conversion problems. Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc. Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved): https://gist.github.com/ScottPJones/79ed895f05f85f333d84. Updated based on review comments. Changes to error handling and check_string. Rebased against JuliaLang#11575. Updated comment to go before function, not indented by 4. Updated to use unsafe_checkstring. Removed redundant argument documentation.
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String. Rewrote length() for UTF16String(). Improved reverse() for UTF16String(). Added over 150 lines of testing code to detect the above conversion problems. Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc. Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved): https://gist.github.com/ScottPJones/79ed895f05f85f333d84. Updated based on review comments. Changes to error handling and check_string. Rebased against JuliaLang#11575. Updated comment to go before function, not indented by 4. Updated to use unsafe_checkstring. Removed redundant argument documentation. Fix AbstractVector{UInt16} conversion. Remove support for converting Vector{UInt16} to UTF8String. Add Unicode validation function and fix UTF-16 conversion bugs.
Added new `convert` methods that use the `check_string` function to validate input. Added tests for many sorts of valid/invalid data. Depends on PR JuliaLang#11551 and JuliaLang#11575.
Fix #10959 bugs with UTF-16 conversions
Added new `convert` methods that use the `checkstring` function to validate input. Added tests for many sorts of valid/invalid data. Depends on PR JuliaLang#11551 and JuliaLang#11575.
Added new `convert` methods that use the `checkstring` function to validate input. Added tests for many sorts of valid/invalid data. Depends on PR JuliaLang#11551 and JuliaLang#11575. Updated to use unsafe_checkstring, fix comments. Remove conversions from Vector{UInt32}. Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical.
Update for search change. Updated to use unsafe_checkstring, fix comments.
Fix #10959 bugs with UTF-32 conversions
Update for search change. Updated to use unsafe_checkstring, fix comments. Update comments. Remove @inline from test function. Removed conversions with Vector{Char}. Ensure all changes included.
Use generic is_valid_continuation from unicode/checkstring instead of is_utf8_continuation/is_utf8_start
These all tend to be rather slow because they create an empty array, and append to it 1 character at a time... also, instead of having specific converters for things like:
ASCIIString -> UTF16String, UTF8String -> UTF16String, UTF32String -> UTF16String
ASCIIString -> UTF32String, UTF8String -> UTF32String, UTF16String -> UTF32String
these are handled by a single generic function (encode16 for UTF16String)
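The allocation pattern described here can be sketched as follows. This is an illustration in Python, not the Julia code from `utf16.jl`, and the function names `utf16_units_append` and `utf16_units_prealloc` are hypothetical. The first version grows the output one code unit at a time; the second counts the required code units in a first pass, allocates the buffer once, and fills it in place.

```python
from array import array


def utf16_units_append(text: str) -> array:
    """Build UTF-16 code units by appending to an initially empty array."""
    out = array("H")
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out.append(cp)
        else:
            cp -= 0x10000
            out.append(0xD800 | (cp >> 10))    # high surrogate
            out.append(0xDC00 | (cp & 0x3FF))  # low surrogate
    return out


def utf16_units_prealloc(text: str) -> array:
    """Count the code units first, allocate once, then fill by index."""
    n = sum(2 if ord(ch) >= 0x10000 else 1 for ch in text)
    out = array("H", [0]) * n  # single allocation of the final size
    i = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            out[i] = cp
            i += 1
        else:
            cp -= 0x10000
            out[i] = 0xD800 | (cp >> 10)
            out[i + 1] = 0xDC00 | (cp & 0x3FF)
            i += 2
    return out


sample = "a\u00e9\u20ac\U0001F600"
print(list(utf16_units_append(sample)) == list(utf16_units_prealloc(sample)))
```

Both functions produce the same code units; the difference is only in how often the output buffer has to grow. How much that matters in practice depends on the runtime's array growth strategy, so any speedup should be measured rather than assumed.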