WIP: Throw UnicodeError for s[i:j], where j is not at the start of a code point #14217

JobJob · 2015-12-01T18:07:20Z

Previously, if j fell in the middle of a code point, it was moved to the end of that code point to be more forgiving. After discussion with @stevengj and @nalimilan in #14158, the recommendation was that throwing an error would be more sensible.
Also added some more tests.
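
To make the intended behaviour concrete, here is a minimal sketch on raw UTF-8 bytes. It is not the code from this PR: `is_start_byte` and `strict_slice` are hypothetical names, `ArgumentError` stands in for the `UnicodeError` used in the PR, and the snippet assumes a recent Julia 1.x where `Vector{UInt8}(::String)` is available.

```julia
# Hypothetical helpers for illustration only; not the implementation in this PR.
# A byte starts a code point unless it has the bit pattern 0b10xxxxxx.
is_start_byte(b::UInt8) = (b & 0xc0) != 0x80

# Return the bytes covering characters i through j, where both i and j must be
# the index of the first byte of a code point. Empty ranges are not handled.
function strict_slice(data::Vector{UInt8}, i::Int, j::Int)
    n = length(data)
    (1 <= i <= n && 1 <= j <= n) || throw(BoundsError(data, i:j))
    # ArgumentError is a stand-in for the PR's UnicodeError.
    is_start_byte(data[i]) || throw(ArgumentError("invalid start index $i"))
    is_start_byte(data[j]) || throw(ArgumentError("invalid end index $j"))
    # Extend past j to include the continuation bytes of the last character.
    k = j
    while k < n && !is_start_byte(data[k + 1])
        k += 1
    end
    return data[i:k]
end

data = Vector{UInt8}("aβc")   # bytes: 0x61, 0xce, 0xb2, 0x63
strict_slice(data, 1, 2)      # bytes of "aβ": index 2 is the first byte of 'β'
# strict_slice(data, 1, 3)    # throws: index 3 is the continuation byte of 'β'
```

Under the old, forgiving behaviour the commented-out call would have silently returned the same bytes as `strict_slice(data, 1, 2)`.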

stevengj · 2015-12-01T18:27:49Z

base/unicode/utf8.jl

@@ -101,26 +101,33 @@ sizeof(s::UTF8String) = sizeof(s.data)

 lastidx(s::UTF8String) = length(s.data)

+isfirstbyte(c::UInt8) = (c & 0xc0) != 0x80 # == !is_valid_continuation(c)


@code_llvm says that !is_valid_continuation(c) compiles to code identical to isfirstbyte(c), so I don't think we need a separate isfirstbyte function.

I wasn't sure whether it would be optimised to the same code; thanks for the clarification. For me it also aids readability, but I'm happy to remove it.

Better to remove.
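
For readers following the thread, the logical equivalence of the two predicates can be checked exhaustively, since there are only 256 byte values. The sketch below defines both functions locally so it is self-contained. The definition of `is_valid_continuation` is an assumption inferred from the comment in the diff, not copied from Base.

```julia
# Local stand-ins so the snippet runs on its own.
# isfirstbyte is taken from the diff; is_valid_continuation is an assumed
# definition inferred from the diff comment `# == !is_valid_continuation(c)`.
isfirstbyte(c::UInt8) = (c & 0xc0) != 0x80
is_valid_continuation(c::UInt8) = (c & 0xc0) == 0x80

# Compare the two predicates over every possible byte value.
same = all(isfirstbyte(UInt8(c)) == !is_valid_continuation(UInt8(c)) for c in 0:255)
```

This only shows that the two predicates agree on every input. Whether they compile to identical machine code is what `@code_llvm` was used to confirm in the review comment.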


JobJob mentioned this pull request Dec 1, 2015

Invalid second UTF8String indices don't cause errors #14158

Closed

tkelman added the unicode Related to unicode characters and encodings label Dec 1, 2015

stevengj reviewed Dec 1, 2015

JobJob force-pushed the jj/utf8strictidx branch from 3ca28d3 to a22a2d4 Compare December 3, 2015 12:06

JobJob changed the title ~~WIP: Throw UnicodeError for s[i:j], where j is not at the start or end of a code point~~ WIP: Throw UnicodeError for s[i:j], where j is not at the start of a code point Dec 3, 2015

JobJob added 2 commits December 4, 2015 14:27

Throw UnicodeError for s[i:j] when j is not at the start or the end of a code point

f814cc5

Plus a few tests. Fixes JuliaLang#14158

getindex(s::UTF8/16,i:j) now throws an error if isvalid(s,j) is false

7721cb3

Adds UTF16 getindex. Fixes chop and date format parsing. More tests. Improves Unicode invalid index error message.

JobJob force-pushed the jj/utf8strictidx branch from 8043f6d to 7721cb3 Compare December 4, 2015 12:32

JobJob closed this May 28, 2016
