Invalid second UTF8String indices don't cause errors #14158
Comments
Also, the docs for |
Done, including doc changes and tests. Cheers. |
Cool. Cheers. Closing. |
Rather than closing, if you want to make a PR to raise an error, it would be great. |
So both Plus it would be breaking for people who were doing the wrong thing previously (not right at the end of the string). Is everyone cool with that? Seems like it might be better for someone (hint hint) to actually address #9297? Happy to do it of course, I just want to get the green light first. |
Ha ok, well I'm going to point the finger at you guys when the hordes come baying for blood :) |
Of course, that wouldn't be backported to 0.4, to leave some time for people to adapt. We could also simply print a deprecation warning (see |
code point Plus a few tests fixes JuliaLang#14158
Say a UTF8String
Do we want this to throw "UnicodeError: invalid char index" because However, it does mean that What are your thoughts? See the WIP which implements this #14217 Btw, when you do |
those tests should be getting included by a main strings and/or unicode test |
ahh indeed they are |
@stevengj re This change is not really very breaking (didn't break anything obvious (AFAICT) or make any tests fail in base). I think it's the smallest possible change that makes the behaviour of the OP in this issue consistent. Anything that's more breaking, especially that conceptually changes the shorthand of |
Moving forward, I think the consensus has been that anything you put inside the brackets in You especially shouldn't be using (You wouldn't expect |
But e.g. |
String indices are best thought of as an opaque type that happens to be represented by integers. Technically, each index is a code-unit index, where a code unit is 1 byte for UTF-8, 2 bytes for UTF-16 and 4 bytes for UTF-32. In the future, the string index type may be even further abstracted. For example, in |
(Valid indices are always the indices of code units that begin the encoding of a code point.) |
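To make the code-unit indexing rule concrete, here is a minimal sketch. It assumes a current Julia 1.x release, where the built-in `String` type is UTF-8 encoded (the `UTF8String` type discussed in this thread predates that). The example string is purely illustrative.

```julia
# Assumption: Julia 1.x, where `String` is UTF-8 encoded.
s = "∀x"    # '∀' takes 3 code units (bytes) in UTF-8, 'x' takes 1

# Total number of code units in the encoded string.
n = ncodeunits(s)                          # 4

# Only code units that begin a code point are valid indices.
valid = [isvalid(s, i) for i in 1:n]       # [true, false, false, true]

# `eachindex` yields exactly the valid indices.
starts = collect(eachindex(s))             # [1, 4]

# `nextind` steps from one valid index to the next, skipping continuation bytes.
after_first = nextind(s, 1)                # 4
```

Treating the index as opaque means stepping with `nextind`/`prevind` rather than adding or subtracting 1.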
ok but say |
@JobJob, yes, because the code-unit index refers to the whole codepoint that follows. e.g. |
As explained in the manual: "Conceptually, a string is a partial function from indices to characters: for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings." Though probably this should be corrected to say "code-unit index" rather than "byte index". |
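The "partial function" view can be sketched as follows. This assumes Julia 1.x, where indexing at a code unit that does not begin a code point throws `StringIndexError` (the error type differed in the 0.4-era code under discussion). The helper name `char_or_nothing` is hypothetical and exists only for illustration.

```julia
# Assumption: Julia 1.x error types. `char_or_nothing` is a hypothetical helper.
function char_or_nothing(str::AbstractString, i::Integer)
    try
        return str[i]                       # defined only at valid indices
    catch err
        err isa StringIndexError || rethrow()
        return nothing                      # index falls inside a code point
    end
end

s = "∀x"
first_char  = char_or_nothing(s, 1)   # '∀'
inside_char = char_or_nothing(s, 2)   # nothing: byte 2 is a continuation byte
last_char   = char_or_nothing(s, 4)   # 'x'
```

An out-of-range index such as 5 is a different failure (`BoundsError`) and is deliberately rethrown here.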
Yeah I read that, it still makes more sense to me that if the indices are indexing code units then if you want code units 1 to 8 you should use |
No, because |
As a practical matter, any code that passes an invalid index is likely to be doing so because of a bug, e.g. it is using |
Looking into what breaks with |
btw @nalimilan seemed to be implying with this comment #14160 (comment) that an index at the end of the character might be ok to him (though it's not completely unambiguous what he meant by middle - I assumed not at the start or end). This code breaks: https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L48-L59 |
Oh actually the |
By "middle" I meant "anything not the first byte of a code point". I fully agree with @stevengj here: better be strict, as anyway people should never get indices starting elsewhere except when doing buggy computations. Why would |
because https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L50 it returns
|
Oh, right, that's trickier than it sounds. Your solution sounds OK to me. |
Hmm, this example makes me want to change my mind here; without accessing the internal |
Honestly, it's funny you say that because the same thing happened to me :) I actually started implementing it originally as per your current plan a few days back, and wrote the tests in the PR (that chop and chomp fail on) after seeing these potential bugs (if end indices have to pass |
Having said that, this is all string library code that should know what it's doing, so there's an argument that these are implementation-aware optimisations. Edit: actually, I guess what you're getting at is user code that would have to do things like that. I think |
Anyway, if we are continuing down this path, This could change to:
I'm certain there's a faster implementation for UTF16, UTF8. I'm off to bed now. I've pushed my changes to that same branch in the PR; with them, this test fails: https://github.com/JuliaLang/julia/blob/master/test/dates/io.jl#L192 I also get a REPL crash by doing the following:
Press up (to get the previous line), then backspace to delete the closing double quote.
Press down, then up, and the REPL crashes:
Cheers. |
The trouble is that if we allow/encourage |
Ok, as stated earlier, I agree that a string index type (or something like it) is needed to actually fix these problems, and the PR I proposed is just a minor fix for the inconsistency in the OP. Shall I revert the PR back to its original state with second indices at the end of code points allowed (and incorporating the other code improvements suggested by @stevengj)? Btw in the discussion above I wasn't familiar with the general usage of |
Very good to see somebody else taking an interest in fixing some of the string issues! |
So essentially only for UTF8? That was the original intent of the PR: #14217 the "original state" I mentioned reverting back to just above was the first commit of that PR, I then added the second commit to give people a look at what happens if you don't allow the second index to be at the end of a code point. |
At least among the current string types supported in Julia, yes, only |
@JobJob, after having slept on this I think we should go forward with disallowing ranges with invalid endpoints. Correct |
(Similarly for |
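As an illustration of what strict range endpoints mean for calling code, here is a small sketch. It assumes Julia 1.x behaviour, where a range whose endpoint is not the start of a code point throws `StringIndexError`. The variable names are illustrative and not taken from the PR.

```julia
# Assumption: Julia 1.x semantics for range indexing into a UTF-8 `String`.
s = "∀x"

# Index 2 lies inside the 3-byte encoding of '∀', so it is rejected as an endpoint.
threw = try
    s[1:2]
    false
catch err
    err isa StringIndexError
end

# Derive endpoints from valid indices instead of doing byte arithmetic.
last_start = lastindex(s)               # index where the last character begins (4)
head = s[1:prevind(s, last_start)]      # everything except the last character
tail = s[nextind(s, 1):end]             # everything except the first character
```

Code that computes endpoints with `prevind` and `nextind` never lands inside a code point, so it is unaffected by the stricter check.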
Agree that it's more semantically consistent this way, given the way e.g. s[1] works, but not sure if it's more intuitive or user friendly, still thinking about it. In any case it doesn't matter because it's all going to change hopefully for much the better. Pushed an initial attempt at |
PR updated, I'm fine with your decision. I think it's agreed all round that something like a string index type is needed to properly resolve these issues. Regarding the REPL issue, I wasn't able to fix it after a quick look at the code. I note that a very similar error was happening before this change, using the steps above but just hitting backspace after the final step. The main difference is that, with these changes, those steps crash the REPL completely. |
This works consistently now. In the OP's example: |
The issue seems to be that:
I tried changing https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L141 to
and it seems to fix the problem. Additionally
julia5 runtests.jl unicode string strings/basic strings/io strings/search strings/types strings/util
succeeds after making this change and recompiling. Should I add some tests and submit a PR?