Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Invalid second UTF8String indices don't cause errors #14158

Closed
JobJob opened this issue Nov 26, 2015 · 43 comments
Closed

Invalid second UTF8String indices don't cause errors #14158

JobJob opened this issue Nov 26, 2015 · 43 comments

Comments

@JobJob
Copy link

JobJob commented Nov 26, 2015

s = "🛀🛀" #a bath codepoint is 4 bytes long
s12 = s[1:2] # "🛀"
s56 = s[5:6] # "?" #the expected behaviour is this should be the same as s12?
sizeof(s12)  # 4
sizeof(s56)  # 2

The issue seems to be that:

nextind(s, 2) # 5
nextind(s, 6) # 7 (prob should be 9)

I tried changing https://github.com/JuliaLang/julia/blob/master/base/strings/basic.jl#L141 to

return sizeof(s)+1

and it seems to fix the problem. Additionally julia5 runtests.jl unicode string strings/basic strings/io strings/search strings/types strings/util succeeds after making this change and recompiling.

Should I add some tests and submit a PR?

@nalimilan
Copy link
Member

Also, the docs for nextind say "Returns a value greater than endof(str) at or after the end of the string." So they should be adapted to mention sizeof(str).

@JobJob
Copy link
Author

JobJob commented Nov 26, 2015

Done, including doc changes and tests. Cheers.

@stevengj
Copy link
Member

You really shouldn't be relying on string ranges to work when you supply invalid indices. (This may not even be possible in the future; #9297.) See also the discussion in #5446.

@JobJob
Copy link
Author

JobJob commented Nov 27, 2015

Cool. Cheers. Closing.

@JobJob JobJob closed this as completed Nov 27, 2015
@nalimilan
Copy link
Member

Rather than closing, if you want to make a PR to raise an error, it would be great.

@JobJob
Copy link
Author

JobJob commented Nov 27, 2015

So both s[1:2] and s[5:6] would throw a UnicodeError: invalid character index? I'd need to rename the issue to something like "Invalid string indices don't cause errors"?

Plus it would be breaking for people who were doing the wrong thing previously (not right at the end of the string). Is everyone cool with that? Seems like might be better for someone (hint hint) to actually address #9297?

Happy to do it of course, just want to get the green light first.

@stevengj
Copy link
Member

I think that throwing an error on any invalid range would be a good warmup for #9297. If this throws lots of errors, it's an indication that #9297 will be traumatic.

@JobJob
Copy link
Author

JobJob commented Nov 27, 2015

Ha ok, well I'm going to point the finger at you guys when the hordes come baying for blood :)

@nalimilan
Copy link
Member

Of course, that wouldn't be backported to 0.4, to leave some time for people to adapt. We could also simply print a deprecation warning (see depwarn) to ease the transition.

JobJob added a commit to JobJob/julia that referenced this issue Dec 1, 2015
@JobJob JobJob changed the title nextind UTF8String issue Invalid second UTF8String indices don't cause errors Dec 1, 2015
@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Say a UTF8String s is 3 bytes long, e.g. this test:

# make sure substrings handle last code unit even if not start of code point
let s = "x\u0302" 
    #s is "x̂", sizeof(s) == 3, "\u0302" is the (circumflex) diacritic (2 bytes)
    @test s[1:3] == s
end

Do we want this to throw "UnicodeError: invalid char index" because isvalid(s,3) is false (as was briefly suggested here: #14160 (comment))? My inclination is no. If the start of the range is at the first byte of a code point and the end of the range is at the last byte of a code point, I think it should be regarded as valid.

However, it does mean that s[1:3] is the same as s[1:2], which to me is a little inconsistent, though it is consistent with s[2] returning Char(s[2:3]) (the circumflex in this example).

What are your thoughts? See the WIP which implements this #14217

Btw, when you do julia runtests.jl the tests in test/strings/*.jl and test/unicode/*.jl don't seem to be run, is that intentional?

@JobJob JobJob reopened this Dec 1, 2015
@tkelman
Copy link
Contributor

tkelman commented Dec 1, 2015

those tests should be getting included by a main strings and/or unicode test

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

ahh indeed they are

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

@stevengj re j not being allowed to be the last byte of a code point, it seems weird that since:
a) say s="🐨🐨" s[1:4] covers all the bytes of the first koala. Shouldn't that be ok? It's not like they have a partial (invalid) code point.
b) As it stands s[i:j] basically is a shorthand for s.data[i:j] (with some bounds checks and checking for not fully formed code points, plus the additional convenience of s[start of code point] returning the whole code point), that s[1:lastidx(s)] or equivalently s[1:sizeof(x)] would throw a unicode error and not be equivalent to s seems weird.

This change is not really very breaking (didn't break anything obvious (AFAICT) or make any tests fail in base). I think it's the smallest possible change that makes the behaviour of the OP in this issue consistent. Anything that's more breaking, especially that conceptually changes the shorthand of s[i:j] to being something other than an index into the bytes of s.data for UTF8Strings should be done with one of the more comprehensive proposals in #9297 IMO.

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

Moving forward, I think the consensus has been that anything you put inside the brackets in s[...] should be valid indices. The indices are not bytes. Hence the endpoints of a non-empty range should be valid indices (the empty range s[1:0] should be allowed).

You especially shouldn't be using sizeof(s) to compute indices, that is just asking for trouble, especially if your code accepts any AbstractString. (Think of what this does for a UTF-32 string!) Anyone who is doing s[1:sizeof(s)] needs to get an exception ASAP, because their code is really ByteString-centric.

(You wouldn't expect a[1:sizeof(a)] to work for arrays, so why should it work for strings?)

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

But e.g. s="🐨🐨" s[5:5] is not a byte index (well I guess it's not but what about s[5])? It's definitely not a code point or grapheme index. Currently it's just a little inconsistent IMO, that's why it needs a thorough overhaul. Are you happy with the PR in it's current state (pending the fixes you recommend) as a minor (if fairly inconsequential) fix for this particular case?

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

String indices are best thought of as an opaque type that happens to be represented by integers. Technically, it is a code-unit index, where a code-unit is 1 byte for UTF-8 but 2-bytes for UTF-16 and 4 bytes for UTF-32. In the future, the string index type may be even further abstracted.

For example, in utf16("🐨🐨")[3:3], the 3 (a valid index) is not a byte index, because the code units are 16-bit words.

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

(Valid indices are always the indices of code units that begin the encoding of a code point.)

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

ok but say s[5:5]for the example above returns code units 5 to 8

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

@JobJob, yes, because the code-unit index refers to the whole codepoint that follows. e.g. s[5] returns a codepoint, not a code unit.

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

As explained in the manual, Conceptually, a string is a partial function from indices to characters — for some index values, no character value is returned, and instead an exception is thrown. This allows for efficient indexing into strings by the byte index of an encoded representation rather than by a character index, which cannot be implemented both efficiently and simply for variable-width encodings of Unicode strings. Though probably this should be corrected to say "code-unit index" rather than "byte index".

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Yeah I read that, it still makes more sense to me that if the indices are indexing code units then if you want code units 1 to 8 you should use s[1:8] not s[1:5]

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

No, because 8 is not a valid string index. You are only allowed to index the code unit that begins a codepoint. Again, the fact that the indices are actually offsets of code units is an implementation detail that the user is not intended to muck around with. The string is a partial function from integers to codepoints.

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

As a practical matter, any code that passes an invalid index is likely to be doing so because of a bug, e.g. it is using sizeof to compute indices, or is adding/subtracting 1 rather than calling nextind/prevind. It is better to throw an exception to catch the bug rather than to try to guess the user's true intention.

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Looking into what breaks with s[1:8] and s[1:4] being invalid (this wasn't the behaviour previously).

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

btw @nalimilan seemed to be implying with this comment #14160 (comment) that an index at the end of the character might be ok to him (though it's not completely unambiguous what he meant by middle - I assumed not at the start or end).

This code breaks: https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L48-L59

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Oh actually the chomp! doesn't break, only the chomp for say s="foø\n"

@nalimilan
Copy link
Member

By "middle" I meant "anything not the first byte of a code point". I fully agree with @stevengj here: better be strict, as anyway people should never get indices starting elsewhere except when doing buggy computations.

Why would chomp break? It indexes directly s.data, so the checks do not apply there. EDIT: And s[1:end-1] is guaranteed to work since the code checks that the last byte is a line break (one byte in ASCII and UTF-8).

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

because https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L50 it returns s[1:end-1] which is s[1:4] and invalid (if the second index can't be the end of a code point). Is this the right change there?

chomp{T<:ByteString}(s::T) =
       (endof(s) < 1 || s.data[end]   != 0x0a) ? s :
       (endof(s) < 2 || s.data[end-1] != 0x0d) ? T(s.data[1:end-1]) : T(s.data[1:end-2])

@nalimilan
Copy link
Member

Oh, right, that's trickier than it sounds. Your solution sounds OK to me.

@stevengj
Copy link
Member

stevengj commented Dec 1, 2015

Hmm, this example makes me want to change my mind here; without accessing the internal s.data, the "correct" thing here would be s[1:prevind(s,endof(s))], which is awfully awkward. Maybe we should allow the end of a range to be the last codeunit in a codepoint after all.

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Honestly, it's funny you say that because the same thing happened to me :)

I actually started implementing it originally as per your current plan a few days back, and wrote the tests in the PR (that chop and chomp fail on) after seeing these potential bugs (if end indices have to pass isvalid), and then I saw that it was sort of assumed in a few places that s[i:j] is ok if j is the last byte of a code point (was focussed on utf8), and I thought you know what, changing this is probably quite a big change, and thus came to the view expressed above :)

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Having said that this is all string library code that should know what it's doing, so there's an argument that these are implementation aware optimisations. Edit: actually, I guess what you're getting at is user code that would have to do things like that. I think s[1:prevind(s,endof(s))] is probably the worst example, usually you'd have say a match m and would do say s[1:prevind(s,m.offset)] as opposed to say s[1:m.offset-1] to get the sub-string up to the match (the prevind version is not awful IMO)

@JobJob
Copy link
Author

JobJob commented Dec 1, 2015

Anyway, if we are continuing down this path, chop("foø") fails with the current chop(s::AbstractString) = s[1:end-1]

This could change to:

chop(s::Union{ASCIIString, UTF32String}) = s[1:end-1]
chop(s::AbstractString) = s[1:prevind(s,endof(s))]

I'm certain there's a faster implementation for UTF16, UTF8.

I'm off to bed now, I've pushed my changes to that same branch in the PR with !isvalid(j) throwing, here are the other places I've seen failures:

this test fails: https://github.com/JuliaLang/julia/blob/master/test/dates/io.jl#L192

And I get an issue with the REPL crashing, by doing the following:

julia> s = "🐨🐨"
"🐨🐨"
julia>

press up (to get the previous line), then backspace to delete the end double quote

julia> s = "🐨🐨"
"🐨🐨"
julia> s = "🐨🐨

press down, then up and the repl crashes:

julia> s = "🐨🐨"
"🐨🐨"
julia> s = "🐨🐨  WARNING: Caught an exception in the keymap:
ERROR: UnicodeError: invalid character index
ERROR: type PrefixSearchState has no field p
 in run_interface at ./LineEdit.jl:1610
 in run_frontend at ./REPL.jl:864
 in run_repl at ./REPL.jl:167
 in _start at ./client.jl:419

Cheers.

@nalimilan
Copy link
Member

The trouble is that if we allow/encourage s[1:idx-1], people will (quite logically) believe that s[1:idx+1] is fine too. All of this calls for the dedicated string index type. In the meantime, we could continue allowing to index the last byte of the code point, but I don't see it as a terrible issue: people need to use nextind() anyway.

@JobJob
Copy link
Author

JobJob commented Dec 2, 2015

Ok, as stated earlier, I agree that a string index type (or something like it) is needed to actually fix these problems, and the PR I proposed is just a minor fix for the inconsistency in the OP. Shall I revert the PR back to its original state with second indices at the end of code points allowed (and incorporating the other code improvements suggested by @stevengj)?

Btw in the discussion above I wasn't familiar with the general usage of sizeof and had thought it was only defined for these string types, and since I had only really looked at UTF8 and ASCII implementations up to that point, was mistakenly assuming sizeof was effectively equivalent to lastidx (i.e. that they were the index of the last code unit in s.data). My point was that it seemed strange to me that s[1:lastidx(s)] was not equivalent to s - though I acknowledge that as lastidx is not exported it is an implementation detail that shouldn't be relied on.

@ScottPJones
Copy link
Contributor

Very good to see somebody else taking an interest in fixing some of the string issues!
If possible, it would be good if for non-DirectIndexString types, that only the 1st and last code-units are allowed (and only for the end point), but not any middle code-units, if any (not an issue for UTF16String, since will only have 1 or 2 code-units per character).

@JobJob
Copy link
Author

JobJob commented Dec 2, 2015

So essentially only for UTF8? That was the original intent of the PR: #14217 the "original state" I mentioned reverting back to just above was the first commit of that PR, I then added the second commit to give people a look at what happens if you don't allow the second index to be at the end of a code point.

@ScottPJones
Copy link
Contributor

At least among the current string types supported in Julia, yes, only UTF8String is a problem.

@stevengj
Copy link
Member

stevengj commented Dec 2, 2015

@JobJob, after having slept on this I think we should go forward with disallowing ranges with invalid endpoints. Correct s[1:end-1] cases seems like they should be rare enough that I don't think it makes sense for us to contort the semantics to handle this. Also, it should make the transition to a StringIndex type easier to catch these kinds of things now (and will increase the impetus to implement the latter).

@stevengj
Copy link
Member

stevengj commented Dec 2, 2015

(Similarly for UTF16String: utf16("🐨x")[1:end-1] should be an error.)

@JobJob
Copy link
Author

JobJob commented Dec 2, 2015

Agree that it's more semantically consistent this way, given the way e.g. s[1] works, but not sure if it's more intuitive or user friendly, still thinking about it. In any case it doesn't matter because it's all going to change hopefully for much the better.

Pushed an initial attempt at getindex UTF16. Will only have time to look at this more again perhaps later tomorrow.

JobJob added a commit to JobJob/julia that referenced this issue Dec 3, 2015
@JobJob
Copy link
Author

JobJob commented Dec 3, 2015

PR updated, I'm fine with your decision. I think it's agreed all round that something like a string index type is needed to properly resolve these issues.

Regarding the REPL issue, I wasn't able to fix it after a quick look at the code, I note a very similar error was happening before this change, using the steps above but just hitting backspace after the final step. The main difference is that those steps cause the REPL to crash completely with these changes.

JobJob added a commit to JobJob/julia that referenced this issue Dec 4, 2015
@JobJob
Copy link
Author

JobJob commented Nov 26, 2016

This works consistently now. In the OP's example: s[1:2] == s[5:6] == "🛀" and it doesn't error. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants