Add UTF encoding validity functions #11575
Prepare your 🍅s! I'm off for dinner, please let me know what you think of this and #11573.
@@ -84,6 +84,8 @@ include("iterator.jl")
include("osutils.jl")

# strings & printing
include("utferror.jl")
include("utftypes.jl")
You'll need to `include("utfcheck.jl")` here too.
Hmm... somehow that got lost... was definitely part of my source locally...
Force-pushed from 6714044 to f1c1583.
Please rebase relative to latest master now that #11573 is merged.
Of course! Thanks very much. Should I squash everything also?
It's only 2 commits here, and rebasing would remove 1 since that's now on master. If you want to amend your commit message while rebasing to make it a little more descriptive, that would be welcomed.
Force-pushed from f1c1583 to 79f2618.
As I mentioned at #11551 (comment), I think this code is essentially ready now and good to merge, assuming the tests pass.
Is that what you wanted, as far as the commit message? Thanks!
The indentation is a little odd, but it's fine. Nice and descriptive, I like that part.
Ah, I see what happened to the indentation... I updated that directly on GitHub.
    elseif ch < 0x800
        num2byte += 1
        flags |= UTF_UNICODE2
    elseif T != Vector{UInt16} && ch > 0x0ffff
Is the type check here actually necessary? If the element type is `UInt16`, then `typemax(UInt16) == 0xffff`, so `ch > 0xffff` should always be false, no?
It's checking against `Vector{UInt16}`, not `UInt16`, so I don't think Julia can figure it out by itself.
julia> function t16(x)
           local a::UInt32
           @inbounds a, i = next(x, 1)
           if a > 0xffff
               return 1
           else
               return 2
           end
       end
t16 (generic function with 1 method)

julia> code_native(t16,Tuple{Vector{UInt16}})
        .section __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 4
        pushq %rbp
        movq %rsp, %rbp
        movl $2, %eax
Source line: 4
        popq %rbp
        ret

julia> code_llvm(t16,Tuple{Vector{UInt16}})
define i64 @julia_t16_20941(%jl_value_t*) {
top:
  ret i64 2
}
that looks pretty successful to me.
Thanks! I changed that. I know that at one point (before Jeff put in the change that fixed the big performance regression I'd seen compared to 0.3.x) it seemed not to get that right (unless my eyes were playing tricks on me!), but I hadn't thought to try again. Also, my test wasn't the same as yours: it was using an AbstractString.
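The reason the compiler can drop the branch is that the element type bounds the value: no `UInt16` can exceed `0xffff`. A minimal sketch in current Julia follows (the 0.4-era `next(x, 1)` call in the session above no longer exists). The name `has_supplementary` is hypothetical and is not part of this PR.

```julia
# Hypothetical helper (not from the PR): reports whether any code unit lies
# above the Basic Multilingual Plane, i.e. is greater than 0xffff.
has_supplementary(units::AbstractVector{<:Unsigned}) = any(u -> u > 0xffff, units)

# The largest UInt16 is exactly 0xffff, so the comparison can never be true
# for UInt16 data and no explicit type check is needed for correctness.
typemax(UInt16) == 0xffff                          # true
has_supplementary(UInt16[0x0041, 0xd83d])          # false for any UInt16 input
has_supplementary(UInt32[0x00000041, 0x0001f600])  # true
```

Whether the branch is actually eliminated depends on the compiler version; `code_llvm` on the specialized method is the way to check, as in the session above.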
Force-pushed from 79f2618 to da43b2a.
The only failures are due to the OOM problems affecting everybody.
Added new `convert` methods that use the `check_string` function to validate input. Added tests for many sorts of valid/invalid data. Depends on PR JuliaLang#11551 and JuliaLang#11575.
end

function check_string(dat::Vector{UInt8}, len = sizeof(dat), pos = 0 ; options::Integer=0)
    " Validates and calculates number of characters in a UTF-8 encoded vector of UInt8
I believe @nalimilan said this in one of the other PRs, but the docstring convention is more likely to be docs immediately preceding the function, not just inside the function. Not positive though.
I hadn't seen that, but why introduce a different convention from what's very commonly in use (i.e. Python documentation)? I'd heard that void strings (which I took to mean any string, inside or outside of a method, that wasn't used as an argument, returned, or assigned to a variable, since it wasn't specified at all) would in the future be picked out for documentation.
Since there seems to be a lot of overlap between Python and Julia programmers (all the SciPy, NumPy, etc. people), I think it would be a very good thing to be consistent with that, if there is no compelling reason not to.
Compelling reason being Julia has multiple dispatch, with multiple method implementations for the same generic function but different argument types, which Python doesn't have. That and the `@doc` macro works for docstring-before-function, so I believe that's the direction that things are going.
The use of the `@doc` macro apparently will no longer be necessary (or even the `doc` string prefix) if the latest version of Docile were put in Base. I'm not sure why multiple dispatch feels like a compelling reason to you... This doesn't remove the ability to (later in the build process, at least) use `@doc doc"""...""" ->` syntax before the first function of a set of methods...
I think you are going to see this style whether you like it or not... just this week I had to tell a Python programmer learning Julia that he had to move his comments out of the function and use `@doc doc"""...""" ->` in order to get Help to pick it up. Allowing this style will make a lot of people moving from Python happy (or less confused, at least).
People coming from Doxygen, Javadoc or gtk-doc are used to having docs before the body of the function, so you can't please everybody. Anyway, didn't you support a solution similar to Doxygen? :-p
doxygen doesn't have anything to do with it... you can use doxygen with Python, which has the docstring inside the function.
Also, I'm not saying that the default shouldn't be to have the documentation before... but why couldn't both be allowed?
About doxygen - I just like having things documented, and using doxygen pushes that a bit, but I've never been particularly attached to the syntax, just to having the necessary information preserved to make cross-references, etc. Since nobody here seems to like the doxygen syntax, and somebody on another thread talked about parsing the comments to build the metadata, (which could then be used to produce output for doxygen), I decided that that was the better solution for now.
Can we just pick one convention, and use it without arguing please? Docstrings still need some work, but for code outside of base, docstring before function has been implemented, docstring inside of function hasn't. If someone wants to make a PR to support docstrings inside of functions in Docile or the base docs system, then that convention could be an alternate to accommodate people coming from python. For now, wouldn't it be easier to just make this change than waste more bandwidth resisting trivial changes that multiple reviewers are asking you to make?
I'm fine with changing it... I was responding via my phone until a couple of hours ago... out enjoying the day... I'm back at the laptop, doing it now.
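For reference, the docstring-before-function convention discussed in this thread might look like the sketch below in current Julia, where a plain string literal placed directly before a definition is attached as its documentation. The function `byte_count` is a hypothetical example, not code from this PR; it also shows that each method of a generic function can carry its own docstring, which is the multiple-dispatch point raised above.

```julia
"""
    byte_count(dat::Vector{UInt8})

Return the number of bytes in `dat`. (Hypothetical example, not from the PR.)
"""
byte_count(dat::Vector{UInt8}) = length(dat)

"""
    byte_count(s::AbstractString)

Return the number of bytes used to store `s`.
"""
byte_count(s::AbstractString) = sizeof(s)
```

At the REPL, `?byte_count` should then show both entries.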
Updated comment to go before function, not indented by 4
I believe I've handled everything... please take a look!
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().
Added over 150 lines of testing code to detect the above conversion problems.
Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved): https://gist.github.com/ScottPJones/79ed895f05f85f333d84
Updated based on review comments
Changes to error handling and check_string
Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4
Updated to use unsafe_checkstring
Removed redundant argument documentation
Ah, something that @mbauman's comment on the other thread reminded me of: to check bounds in |
I'm already changing it. It doesn't really matter that it's shorter, though; it is pretty important that it is correct! 😀 Thanks!
@nalimilan Turns out, there would still be problems, using |
@mbauman Could you please take a look at my use of |
Yes, I'm sorry, I was missing the context of this PR, and was answering all your questions (both here and in #11713) in the general case for all I think that's just fine for internal code like this, but this is outside of my purview and I won't be helping to maintain this code directly. Edit: I will be working more on array indexing, and I do not foresee the implementation of iteration for |
@mbauman Thanks very much. I'll be looking forward to your planned Interfaces manual page!
Adds `check_string` function, which checks a vector of bytes, 16-bit or 32-bit words, or an AbstractString for validity, for UTF-8, UTF-16, or UTF-32 encoding. By default, `Modified UTF-8 (long \0 encoding)` and `CESU-8 (surrogate pairs encoded as 2 UTF-8 3-byte sequences)` are allowed, but other overlong encoded sequences are not; this can be changed with the keyword options argument.
Add unit tests of all the errors found by `check_string`
Updated documentation to not use doxygen tags.
Move documentation strings from line after to line before
Add testing of valid strings
Improve/consolidate documentation
Add bounds checking
Change name to unsafe_checkstring, warn that doesn't check bounds
Add checkstring, which does check bounds
Add tests of bounds checking
Change order of start/end positions
Update bounds checking tests
Change 1 to start(dat)
Use checkbounds()
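The split between an unchecked worker and a bounds-checking wrapper described in this commit message can be sketched as follows in current Julia. This is only an illustration of the pattern under assumed names: `unsafe_count_ascii` and `count_ascii` are hypothetical and do not reproduce the PR's `unsafe_checkstring` or `checkstring`.

```julia
# Hypothetical unchecked worker: the caller must guarantee that
# startpos:endpos lies inside dat, because @inbounds disables bounds checks.
function unsafe_count_ascii(dat::Vector{UInt8}, startpos::Int, endpos::Int)
    n = 0
    @inbounds for i in startpos:endpos
        n += dat[i] < 0x80   # Bool promotes to Int: adds 1 for each ASCII byte
    end
    return n
end

# Hypothetical safe wrapper: validates the range once, then delegates.
function count_ascii(dat::Vector{UInt8}, startpos::Int = 1, endpos::Int = length(dat))
    checkbounds(dat, startpos:endpos)   # throws BoundsError if out of range
    return unsafe_count_ascii(dat, startpos, endpos)
end
```

Checking the range once up front keeps the inner loop free of per-element checks while still rejecting a bad range such as `count_ascii(UInt8[0x41], 1, 5)`.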
Force-pushed from c457bfb to e462158.
OK, I've changed this according to the very last round of suggestions in #11713.
Unless anyone objects strongly, I'll merge this in a day or two so we can continue on to the actual bug fixes. This part of the code looks pretty much done to me, so for the sake of future Unicode bugfix backporting if nothing else, let's get it on master.
Add UTF encoding validity functions
🎉 Thank you very much, @tkelman! I'll go rebase the rest of my PRs now...
Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String. Rewrote length() for UTF16String(). Improved reverse() for UTF16String(). Added over 150 lines of testing code to detect the above conversion problems Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved). https://gist.github.com/ScottPJones/79ed895f05f85f333d84 Updated based on review comments Changes to error handling and check_string Rebased against JuliaLang#11575 Updated comment to go before function, not indented by 4 Updated to use unsafe_checkstring Removed redundant argument documentation Fix AbstractVector{UInt16} conversion Remove support for converting Vector{UInt16} to UTF8String Add Unicode validation function and fix UTF-16 conversion bugs
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575
Added new `convert` methods that use the `checkstring` function to validate input Added tests for many sorts of valid/invalid data Depends on PR JuliaLang#11551 and JuliaLang#11575 Updated to use unsafe_checkstring, fix comments Remove conversions from Vector{UInt32} Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical
This introduces the `Base.check_string` function, with methods for handling UTF-8, UTF-16, and UTF-32 as vectors of `UInt8`, `UInt16`, and `UInt32`, respectively, as well as an `AbstractString` version that operates on Unicode characters (making sure that they are all valid code points, `0 <= ch < 0xd800` or `0xe000 <= ch <= 0x10ffff`).

There are options to accept or reject things like `Modified UTF-8` encoding, `CESU-8` encoding, or "overly long" encodings. (See `utfcheck.jl` for documentation.)

These methods either throw a `UnicodeError`, with information about which character was invalid and its position in the input, or they return a tuple of: the number of logical characters in the string; a bit flag to indicate what types of data were found (e.g. all ASCII, all Latin1, surrogates present, overlong characters present, etc.); the number of characters that would take 2 bytes to encode in UTF-8; the number that would take 3 bytes; and the number that would take 4 bytes.

This information is designed to be useful for future conversion routines, because it allows one to calculate the exact size needed to represent a valid string in UTF-8, UTF-16, or UTF-32, and also indicates whether an optimized "widening" or "narrowing" conversion can be done.
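
To illustrate how the returned counts give exact output sizes, here is a minimal sketch in current Julia. It is not the implementation in `utfcheck.jl`: the names `count_code_points`, `utf8_size`, and `utf16_units` are hypothetical, the bit flags and options are omitted, and `ArgumentError` stands in for `UnicodeError`.

```julia
# Hypothetical sketch (not the PR's code): classify code points by the number
# of bytes each needs in UTF-8, rejecting surrogates and out-of-range values.
function count_code_points(chars::AbstractVector{UInt32})
    num2byte = num3byte = num4byte = 0
    for (pos, ch) in enumerate(chars)
        if ch < 0x80
            # ASCII needs 1 byte; it is counted only in the total
        elseif ch < 0x800
            num2byte += 1
        elseif 0xd800 <= ch < 0xe000
            throw(ArgumentError("surrogate code point at position $pos"))
        elseif ch <= 0xffff
            num3byte += 1
        elseif ch <= 0x10ffff
            num4byte += 1
        else
            throw(ArgumentError("code point out of range at position $pos"))
        end
    end
    return (total = length(chars), num2byte = num2byte,
            num3byte = num3byte, num4byte = num4byte)
end

# Every character costs 1 byte, plus 1, 2, or 3 extra bytes for wider ones.
utf8_size(c) = c.total + c.num2byte + 2 * c.num3byte + 3 * c.num4byte
# Only characters above 0xffff need a second UTF-16 code unit.
utf16_units(c) = c.total + c.num4byte

s = "aé€😀"
c = count_code_points(UInt32.(collect(s)))
utf8_size(c) == sizeof(s)   # true: 1 + 2 + 3 + 4 = 10 bytes
```

With these counts a conversion routine could allocate its output buffer once, at exactly the right size, before writing any code units.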