Invalid second UTF8String indices don't cause errors #14158

JobJob · 2015-11-26T17:31:01Z

s = "🛀🛀" #a bath codepoint is 4 bytes long
s12 = s[1:2] # "🛀"
s56 = s[5:6] # "?" #the expected behaviour is this should be the same as s12?
sizeof(s12)  # 4
sizeof(s56)  # 2

The issue seems to be that:

nextind(s, 2) # 5
nextind(s, 6) # 7 (prob should be 9)

I tried changing https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L141 to

return sizeof(s)+1

and it seems to fix the problem. Additionally julia5 runtests.jl unicode string strings/basic strings/io strings/search strings/types strings/util succeeds after making this change and recompiling.

Should I add some tests and submit a PR?

The text was updated successfully, but these errors were encountered:

nalimilan · 2015-11-26T17:49:46Z

Also, the docs for nextind say "Returns a value greater than endof(str) at or after the end of the string." So they should be adapted to mention sizeof(str).

…JuliaLang#14158

JobJob · 2015-11-26T18:17:22Z

Done, including doc changes and tests. Cheers.

stevengj · 2015-11-26T19:14:41Z

You really shouldn't be relying on string ranges to work when you supply invalid indices. (This may not even be possible in the future; #9297.) See also the discussion in #5446.

JobJob · 2015-11-27T11:47:48Z

Cool. Cheers. Closing.

nalimilan · 2015-11-27T12:27:20Z

Rather than closing, if you want to make a PR to raise an error, it would be great.

JobJob · 2015-11-27T14:15:43Z

So both s[1:2] and s[5:6] would throw a UnicodeError: invalid character index? I'd need to rename the issue to something like "Invalid string indices don't cause errors"?

Plus it would be breaking for people who were doing the wrong thing previously (not right at the end of the string). Is everyone cool with that? Seems like might be better for someone (hint hint) to actually address #9297?

Happy to do it of course, just want to get the green light first.

stevengj · 2015-11-27T14:25:37Z

I think that throwing an error on any invalid range would be a good warmup for #9297. If this throws lots of errors, it's an indication that #9297 will be traumatic.

JobJob · 2015-11-27T14:28:43Z

Ha ok, well I'm going to point the finger at you guys when the hordes come baying for blood :)

nalimilan · 2015-11-27T20:15:08Z

Of course, that wouldn't be backported to 0.4, to leave some time for people to adapt. We could also simply print a deprecation warning (see depwarn) to ease the transition.

code point Plus a few tests fixes JuliaLang#14158

JobJob · 2015-12-01T18:07:33Z

Say a UTF8String s is 3 bytes long, e.g. this test:

# make sure substrings handle last code unit even if not start of code point
let s = "x\u0302" 
    #s is "x̂", sizeof(s) == 3, "\u0302" is the (circumflex) diacritic (2 bytes)
    @test s[1:3] == s
end

Do we want this to throw "UnicodeError: invalid char index" because isvalid(s,3) is false (as was briefly suggested here: #14160 (comment))? My inclination is no. If the start of the range is at the first byte of a code point and the end of the range is at the last byte of a code point, I think it should be regarded as valid.

However, it does mean that s[1:3] is the same as s[1:2], which to me is a little inconsistent, though it is consistent with s[2] returning Char(s[2:3]) (the circumflex in this example).

What are your thoughts? See the WIP which implements this #14217

Btw, when you do julia runtests.jl the tests in test/strings/*.jl and test/unicode/*.jl don't seem to be run, is that intentional?

tkelman · 2015-12-01T18:24:56Z

those tests should be getting included by a main strings and/or unicode test

JobJob · 2015-12-01T18:29:55Z

ahh indeed they are

JobJob · 2015-12-01T19:07:22Z

@stevengj re j not being allowed to be the last byte of a code point, it seems weird that since:
a) say s="🐨🐨" s[1:4] covers all the bytes of the first koala. Shouldn't that be ok? It's not like they have a partial (invalid) code point.
b) As it stands s[i:j] basically is a shorthand for s.data[i:j] (with some bounds checks and checking for not fully formed code points, plus the additional convenience of s[start of code point] returning the whole code point), that s[1:lastidx(s)] or equivalently s[1:sizeof(x)] would throw a unicode error and not be equivalent to s seems weird.

This change is not really very breaking (didn't break anything obvious (AFAICT) or make any tests fail in base). I think it's the smallest possible change that makes the behaviour of the OP in this issue consistent. Anything that's more breaking, especially that conceptually changes the shorthand of s[i:j] to being something other than an index into the bytes of s.data for UTF8Strings should be done with one of the more comprehensive proposals in #9297 IMO.

stevengj · 2015-12-01T20:24:07Z

Moving forward, I think the consensus has been that anything you put inside the brackets in s[...] should be valid indices. The indices are not bytes. Hence the endpoints of a non-empty range should be valid indices (the empty range s[1:0] should be allowed).

You especially shouldn't be using sizeof(s) to compute indices, that is just asking for trouble, especially if your code accepts any AbstractString. (Think of what this does for a UTF-32 string!) Anyone who is doing s[1:sizeof(s)] needs to get an exception ASAP, because their code is really ByteString-centric.

(You wouldn't expect a[1:sizeof(a)] to work for arrays, so why should it work for strings?)

JobJob · 2015-12-01T20:29:32Z

But e.g. s="🐨🐨" s[5:5] is not a byte index (well I guess it's not but what about s[5])? It's definitely not a code point or grapheme index. Currently it's just a little inconsistent IMO, that's why it needs a thorough overhaul. Are you happy with the PR in it's current state (pending the fixes you recommend) as a minor (if fairly inconsequential) fix for this particular case?

stevengj · 2015-12-01T20:32:43Z

String indices are best thought of as an opaque type that happens to be represented by integers. Technically, it is a code-unit index, where a code-unit is 1 byte for UTF-8 but 2-bytes for UTF-16 and 4 bytes for UTF-32. In the future, the string index type may be even further abstracted.

For example, in utf16("🐨🐨")[3:3], the 3 (a valid index) is not a byte index, because the code units are 16-bit words.

stevengj · 2015-12-01T20:34:34Z

(Valid indices are always the indices of code units that begin the encoding of a code point.)

JobJob · 2015-12-01T20:34:39Z

ok but say s[5:5]for the example above returns code units 5 to 8

stevengj · 2015-12-01T20:35:22Z

@JobJob, yes, because the code-unit index refers to the whole codepoint that follows. e.g. s[5] returns a codepoint, not a code unit.

stevengj · 2015-12-01T20:35:59Z

As explained in the manual, Conceptually, a string is a partial function from indices to characters — for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings. Though probably this should be corrected to say "code-unit index" rather than "byte index".

JobJob · 2015-12-01T20:38:01Z

Yeah I read that, it still makes more sense to me that if the indices are indexing code units then if you want code units 1 to 8 you should use s[1:8] not s[1:5]

stevengj · 2015-12-01T20:39:28Z

No, because 8 is not a valid string index. You are only allowed to index the code unit that begins a codepoint. Again, the fact that the indices are actually offsets of code units is an implementation detail that the user is not intended to muck around with. The string is a partial function from integers to codepoints.

stevengj · 2015-12-01T20:40:51Z

As a practical matter, any code that passes an invalid index is likely to be doing so because of a bug, e.g. it is using sizeof to compute indices, or is adding/subtracting 1 rather than calling nextind/prevind. It is better to throw an exception to catch the bug rather than to try to guess the user's true intention.

JobJob · 2015-12-01T20:48:25Z

Looking into what breaks with s[1:8] and s[1:4] being invalid (this wasn't the behaviour previously).

JobJob · 2015-12-01T21:22:58Z

btw @nalimilan seemed to be implying with this comment #14160 (comment) that an index at the end of the character might be ok to him (though it's not completely unambiguous what he meant by middle - I assumed not at the start or end).

This code breaks: https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L48-L59

JobJob · 2015-12-01T21:29:41Z

Oh actually the chomp! doesn't break, only the chomp for say s="foø\n"

nalimilan · 2015-12-01T21:32:33Z

By "middle" I meant "anything not the first byte of a code point". I fully agree with @stevengj here: better be strict, as anyway people should never get indices starting elsewhere except when doing buggy computations.

Why would chomp break? It indexes directly s.data, so the checks do not apply there. EDIT: And s[1:end-1] is guaranteed to work since the code checks that the last byte is a line break (one byte in ASCII and UTF-8).

JobJob · 2015-12-01T21:39:03Z

because https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L50 it returns s[1:end-1] which is s[1:4] and invalid (if the second index can't be the end of a code point). Is this the right change there?

chomp{T<:ByteString}(s::T) =
       (endof(s) < 1 || s.data[end]   != 0x0a) ? s :
       (endof(s) < 2 || s.data[end-1] != 0x0d) ? T(s.data[1:end-1]) : T(s.data[1:end-2])

nalimilan · 2015-12-01T21:41:22Z

Oh, right, that's trickier than it sounds. Your solution sounds OK to me.

stevengj · 2015-12-01T21:57:25Z

Hmm, this example makes me want to change my mind here; without accessing the internal s.data, the "correct" thing here would be s[1:prevind(s,endof(s))], which is awfully awkward. Maybe we should allow the end of a range to be the last codeunit in a codepoint after all.

JobJob · 2015-12-01T22:08:27Z

Honestly, it's funny you say that because the same thing happened to me :)

I actually started implementing it originally as per your current plan a few days back, and wrote the tests in the PR (that chop and chomp fail on) after seeing these potential bugs (if end indices have to pass isvalid), and then I saw that it was sort of assumed in a few places that s[i:j] is ok if j is the last byte of a code point (was focussed on utf8), and I thought you know what, changing this is probably quite a big change, and thus came to the view expressed above :)

JobJob · 2015-12-01T22:10:07Z

Having said that this is all string library code that should know what it's doing, so there's an argument that these are implementation aware optimisations. Edit: actually, I guess what you're getting at is user code that would have to do things like that. I think s[1:prevind(s,endof(s))] is probably the worst example, usually you'd have say a match m and would do say s[1:prevind(s,m.offset)] as opposed to say s[1:m.offset-1] to get the sub-string up to the match (the prevind version is not awful IMO)

JobJob · 2015-12-01T23:06:10Z

Anyway, if we are continuing down this path, chop("foø") fails with the current chop(s::AbstractString) = s[1:end-1]

This could change to:

chop(s::Union{ASCIIString, UTF32String}) = s[1:end-1]
chop(s::AbstractString) = s[1:prevind(s,endof(s))]

I'm certain there's a faster implementation for UTF16, UTF8.

I'm off to bed now, I've pushed my changes to that same branch in the PR with !isvalid(j) throwing, here are the other places I've seen failures:

this test fails: https://github.com/JuliaLang/julia/blob/master/test/dates/io.jl#L192

And I get an issue with the REPL crashing, by doing the following:

julia> s = "🐨🐨"
"🐨🐨"
julia>

press up (to get the previous line), then backspace to delete the end double quote

julia> s = "🐨🐨"
"🐨🐨"
julia> s = "🐨🐨

press down, then up and the repl crashes:

julia> s = "🐨🐨"
"🐨🐨"
julia> s = "🐨🐨  WARNING: Caught an exception in the keymap:
ERROR: UnicodeError: invalid character index
ERROR: type PrefixSearchState has no field p
 in run_interface at ./LineEdit.jl:1610
 in run_frontend at ./REPL.jl:864
 in run_repl at ./REPL.jl:167
 in _start at ./client.jl:419

Cheers.

nalimilan · 2015-12-02T09:30:33Z

The trouble is that if we allow/encourage s[1:idx-1], people will (quite logically) believe that s[1:idx+1] is fine too. All of this calls for the dedicated string index type. In the meantime, we could continue allowing to index the last byte of the code point, but I don't see it as a terrible issue: people need to use nextind() anyway.

JobJob · 2015-12-02T13:38:28Z

Ok, as stated earlier, I agree that a string index type (or something like it) is needed to actually fix these problems, and the PR I proposed is just a minor fix for the inconsistency in the OP. Shall I revert the PR back to its original state with second indices at the end of code points allowed (and incorporating the other code improvements suggested by @stevengj)?

Btw in the discussion above I wasn't familiar with the general usage of sizeof and had thought it was only defined for these string types, and since I had only really looked at UTF8 and ASCII implementations up to that point, was mistakenly assuming sizeof was effectively equivalent to lastidx (i.e. that they were the index of the last code unit in s.data). My point was that it seemed strange to me that s[1:lastidx(s)] was not equivalent to s - though I acknowledge that as lastidx is not exported it is an implementation detail that shouldn't be relied on.

ScottPJones · 2015-12-02T14:20:04Z

Very good to see somebody else taking an interest in fixing some of the string issues!
If possible, it would be good if for non-DirectIndexString types, that only the 1st and last code-units are allowed (and only for the end point), but not any middle code-units, if any (not an issue for UTF16String, since will only have 1 or 2 code-units per character).

JobJob · 2015-12-02T15:16:53Z

So essentially only for UTF8? That was the original intent of the PR: #14217 the "original state" I mentioned reverting back to just above was the first commit of that PR, I then added the second commit to give people a look at what happens if you don't allow the second index to be at the end of a code point.

ScottPJones · 2015-12-02T15:51:58Z

At least among the current string types supported in Julia, yes, only UTF8String is a problem.

stevengj · 2015-12-02T19:25:17Z

@JobJob, after having slept on this I think we should go forward with disallowing ranges with invalid endpoints. Correct s[1:end-1] cases seems like they should be rare enough that I don't think it makes sense for us to contort the semantics to handle this. Also, it should make the transition to a StringIndex type easier to catch these kinds of things now (and will increase the impetus to implement the latter).

stevengj · 2015-12-02T19:32:02Z

(Similarly for UTF16String: utf16("🐨x")[1:end-1] should be an error.)

JobJob · 2015-12-02T23:13:27Z

Agree that it's more semantically consistent this way, given the way e.g. s[1] works, but not sure if it's more intuitive or user friendly, still thinking about it. In any case it doesn't matter because it's all going to change hopefully for much the better.

Pushed an initial attempt at getindex UTF16. Will only have time to look at this more again perhaps later tomorrow.

code point Plus a few tests fixes JuliaLang#14158

JobJob · 2015-12-03T12:22:35Z

PR updated, I'm fine with your decision. I think it's agreed all round that something like a string index type is needed to properly resolve these issues.

Regarding the REPL issue, I wasn't able to fix it after a quick look at the code, I note a very similar error was happening before this change, using the steps above but just hitting backspace after the final step. The main difference is that those steps cause the REPL to crash completely with these changes.

code point Plus a few tests fixes JuliaLang#14158

JobJob · 2016-11-26T03:21:32Z

This works consistently now. In the OP's example: s[1:2] == s[5:6] == "🛀" and it doesn't error. Closing.

JobJob added a commit to JobJob/julia that referenced this issue Nov 26, 2015

Fix nextind(s,i) for UTF8Strings where endof(s) < i < sizeof(s). Fixes …

6622640

…JuliaLang#14158

JobJob mentioned this issue Nov 26, 2015

Fixes #14158 #14160

Closed

JobJob closed this as completed Nov 27, 2015

JobJob added a commit to JobJob/julia that referenced this issue Dec 1, 2015

Throw UnicodeError for s[i:j] when j is not at the start or the end of a

00c8cf5

code point Plus a few tests fixes JuliaLang#14158

JobJob changed the title ~~nextind UTF8String issue~~ Invalid second UTF8String indices don't cause errors Dec 1, 2015

JobJob mentioned this issue Dec 1, 2015

WIP: Throw UnicodeError for s[i:j], where j is not at the start of a code point #14217

Closed

JobJob reopened this Dec 1, 2015

JobJob added a commit to JobJob/julia that referenced this issue Dec 3, 2015

Throw UnicodeError for s[i:j] when j is not at the start or the end of a

d3b3174

code point Plus a few tests fixes JuliaLang#14158

JobJob added a commit to JobJob/julia that referenced this issue Dec 4, 2015

Throw UnicodeError for s[i:j] when j is not at the start or the end of a

f814cc5

code point Plus a few tests fixes JuliaLang#14158

JobJob closed this as completed Nov 26, 2016

nalimilan mentioned this issue Jun 26, 2017

checkbounds for strings #22548

Closed

Invalid second UTF8String indices don't cause errors #14158

Invalid second UTF8String indices don't cause errors #14158

Comments

JobJob commented Nov 26, 2015

nalimilan commented Nov 26, 2015

JobJob commented Nov 26, 2015

stevengj commented Nov 26, 2015

JobJob commented Nov 27, 2015

nalimilan commented Nov 27, 2015

JobJob commented Nov 27, 2015

stevengj commented Nov 27, 2015

JobJob commented Nov 27, 2015

nalimilan commented Nov 27, 2015

JobJob commented Dec 1, 2015

tkelman commented Dec 1, 2015

JobJob commented Dec 1, 2015

JobJob commented Dec 1, 2015

stevengj commented Dec 1, 2015

JobJob commented Dec 1, 2015

stevengj commented Dec 1, 2015

stevengj commented Dec 1, 2015

JobJob commented Dec 1, 2015

stevengj commented Dec 1, 2015

stevengj commented Dec 1, 2015

JobJob commented Dec 1, 2015

stevengj commented Dec 1, 2015

stevengj commented Dec 1, 2015

JobJob commented Dec 1, 2015

JobJob commented Dec 1, 2015

JobJob commented Dec 1, 2015

nalimilan commented Dec 1, 2015

JobJob commented Dec 1, 2015

nalimilan commented Dec 1, 2015

stevengj commented Dec 1, 2015

JobJob commented Dec 1, 2015

JobJob commented Dec 1, 2015

JobJob commented Dec 1, 2015

nalimilan commented Dec 2, 2015

JobJob commented Dec 2, 2015

ScottPJones commented Dec 2, 2015

JobJob commented Dec 2, 2015

ScottPJones commented Dec 2, 2015

stevengj commented Dec 2, 2015

stevengj commented Dec 2, 2015

JobJob commented Dec 2, 2015

JobJob commented Dec 3, 2015

JobJob commented Nov 26, 2016