add AbstractChar supertype of Char #26286
Conversation
An exception to my comment above about injectively mapping other character sets into Unicode is the Han unification, in which different character variants in non-Unicode Japanese encodings are mapped to the same codepoint in Unicode (which viewed them as "font" variations rather than semantically distinct characters). If you want to define an …
One thing we had discussed was whether continuing to use …
Sure ;)
base/char.jl
Outdated
@@ -24,6 +61,11 @@ function isoverlong(c::Char)
    is_overlong_enc(u)
end

# fallback: other AbstractChar types, by default, are assumed
# not to support malformed or overlong encodings.
ismalformed(c::AbstractChar) = false
Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own `AbstractChar` type, it's good to think about this question precisely.
All of the proposed `AbstractChar` types I've ever seen, other than `Char`, don't support malformed characters. So it makes sense to me to make this the default. You can always override it.

Another reason to make this the default is that if `ismalformed(c)` returns `true`, you need to define additional methods to decode the character.
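For a concrete picture, here is a minimal sketch of a type that leans on these fallbacks (`ASCIIChar` is hypothetical, not part of this PR):

```julia
# Hypothetical 8-bit character type whose constructor admits only valid
# ASCII, so it can never hold a malformed encoding; the proposed
# `ismalformed(::AbstractChar) = false` fallback is therefore correct for it.
primitive type ASCIIChar <: AbstractChar 8 end

function ASCIIChar(x::UInt32)
    x < 0x80 || throw(ArgumentError("not an ASCII codepoint: $x"))
    return reinterpret(ASCIIChar, UInt8(x))
end
Base.UInt32(c::ASCIIChar) = UInt32(reinterpret(UInt8, c))

# No ismalformed(::ASCIIChar) method is needed; only a type like Char,
# which can round-trip invalid data, has to override the fallback.
```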
base/char.jl
Outdated
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
hash(x::AbstractChar, h::UInt) =
    hash_uint64(((UInt32(x) + UInt64(0xd060fad0)) << 32) ⊻ UInt64(h))
Why not use `reinterpret`, so that invalid codepoints can be hashed (just like for `Char`)?
`AbstractChar` need not be reinterpretable to `UInt32`.
But you could convert to `Char` first...
Sure, that's a good point I guess: since equality is defined in terms of comparison to `Char`, hashing should match.
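Something along these lines would keep the two consistent (a sketch, not necessarily the committed definition):

```julia
# If == falls back to comparison via Char, hashing through Char preserves
# the invariant that x == y implies hash(x) == hash(y).
# (Char itself has a more specific hash method, so this doesn't recurse.)
Base.hash(x::AbstractChar, h::UInt) = hash(Char(x), h)
```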
base/strings/basic.jl
Outdated
* Only the index of the first code unit of a `Char` is a valid index
* The encoding of a `Char` is independent of what precedes or follows it
* Each `AbstractChar` in a string is encoded by one or more code units
* Only the index of the first code unit of a `AbstractChar` is a valid index
"an". Same below.
"try map(f, collect(s)) or a comprehension instead")) | ||
write(out, c′::Char) | ||
write(out, c′::AbstractChar) |
Do we require all `AbstractChar` implementations to implement `write` so that it uses UTF-8? This code relies on this assumption.
There is a fallback `write` method that works via conversion to `Char`.
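Presumably roughly like this (a sketch of the fallback idea):

```julia
# Fallback: encode any AbstractChar by converting to Char first;
# write(io, ::Char) then emits the character's UTF-8 bytes.
Base.write(io::IO, c::AbstractChar) = write(io, Char(c))
```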
OK. But I guess it should be mentioned in the docstring for `AbstractChar` too? It could sound natural to somebody implementing e.g. a `Latin1Char` to write it in ISO-8859-1, which wouldn't be correct.
Will do.
Alternatively, we could define `print` to always output UTF-8, and `write` to output a raw encoded value.
But I tend to think that the I/O encoding should be a property of the stream, not the string or character type, because usually all strings in a given stream should use the same encoding.
100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but `print(io, '∀')` where `io` is a UTF-16 encoded stream should definitely not write UTF-8 data to `io` just because `'∀'` is a UTF-8 encoded character type. If you have a `CharU16` value and a UTF-16 encoded stream, it should know that it can output code units directly, of course, but that's just a matter of providing the right `print` specializations.
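A sketch of what such a specialization could look like; everything here (`UTF16Stream`, `CharU16`, the BMP-only simplification) is hypothetical:

```julia
# Hypothetical UTF-16 character type (BMP-only for brevity) and stream wrapper.
primitive type CharU16 <: AbstractChar 16 end
Base.UInt32(c::CharU16) = UInt32(reinterpret(UInt16, c))
CharU16(x::UInt32) = reinterpret(CharU16, UInt16(x))  # InexactError for non-BMP x

struct UTF16Stream{T<:IO} <: IO
    io::T
end

# The stream picks the wire encoding: a CharU16 is already a UTF-16
# code unit, so it goes out verbatim...
Base.print(s::UTF16Stream, c::CharU16) = (write(s.io, reinterpret(UInt16, c)); nothing)
# ...while any other character type is transcoded through its codepoint.
Base.print(s::UTF16Stream, c::AbstractChar) = print(s, CharU16(UInt32(c)))
```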
base/strings/util.jl
Outdated
@@ -410,7 +410,7 @@ If `count` is provided, replace at most `count` occurrences.
or a regular expression.
If `r` is a function, each occurrence is replaced with `r(s)`
where `s` is the matched substring (when `pat` is a `Regex` or `AbstractString`) or
character (when `pat` is a `Char` or a collection of `Char`).
character (when `pat` is a `AbstractChar` or a collection of `AbstractChar`).
"an"
doc/src/manual/strings.md
Outdated
@@ -29,8 +29,7 @@ There are a few noteworthy high-level features about Julia's strings:
a string argument, you should declare the type as `AbstractString` in order to accept any string
type.
* Like C and Java, but unlike most dynamic languages, Julia has a first-class type representing
a single character, called `Char`. This is just a special kind of 32-bit primitive type whose numeric
value represents a Unicode code point.
a single character, called `AbstractChar`. This is just a special kind of 32-bit primitive type whose numeric value represents a Unicode code point.
Not necessarily 32-bit now. Maybe mention `Char` or say "by default"?
base/char.jl
Outdated
""" | ||
Char | ||
|
||
struct InvalidCharError{T<:AbstractChar} <: Exception |
Maybe `InexactError` could be used here? Just an idea.
Possibly, but I'm not sure if that change belongs in this PR.
Sorry, I hadn't realized this type already existed in master.
I'm not sure what …
The idea of `codepoint` is to make clear what the conversion is about.
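For example (a sketch of the idea, not necessarily the committed definition):

```julia
# codepoint answers exactly one question: which Unicode scalar value does
# this character represent, independent of its internal storage format?
codepoint(c::AbstractChar) = UInt32(c)

codepoint('a')  # 0x00000061
```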
@Keno, we could define the constructors (…

Rather than defining a single new type, it makes more sense to me to define a new function …
Very nice. I approve modulo comments.
base/char.jl
Outdated
rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
It seems a bit off that this supports any kind of `Number` rather than just `Integer`. Pre-existing issue, I know.
+(x::Integer, y::Char) = y + x
# fallbacks:
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
It seems better for the fallback comparisons to be done in `UInt32`.
I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by `Char`, but not vice versa. Converting to `UInt32` would mean that `ASCIIChar('a') == typemax(Char)` would throw an `InvalidCharError` rather than returning `false`.
Ah, I see what you're saying. Perhaps this then:
isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))
==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y)
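With methods like those, and reusing the hypothetical `ASCIIChar` from above (plus an assumed convenience constructor), mixed comparisons degrade gracefully:

```julia
ASCIIChar(c::Char) = ASCIIChar(UInt32(c))  # assumed convenience constructor

ASCIIChar('a') == typemax(Char)  # false: compared as Char, no InvalidCharError
isless(ASCIIChar('a'), 'b')      # true: compared as Char('a') < 'b'
```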
base/char.jl
Outdated
const hex_chars = UInt8['0':'9';'a':'z']

function show(io::IO, c::Char)
function show(io::IO, c::AbstractChar)
I don't think this is generic because of the call to `reinterpret(UInt32, c)` – we have no idea what the representation of an abstract character type is – it may not even be 32 bits or a bits type at all. We could extract `show_invalid(c::AbstractChar)` as a method that characters have to implement.
As noted in another thread, we could use `reinterpret(UInt32, Char(c))`.
Yes, the `reinterpret` call is only for invalid chars. I agree that refactoring this seems like a good choice.
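A sketch of that refactoring (`show_invalid` is the name suggested above, not settled API, and the valid-character branch is simplified):

```julia
# Generic show in terms of the AbstractChar interface: valid characters
# print via conversion to Char; malformed ones defer to a per-type hook.
function Base.show(io::IO, c::AbstractChar)
    if ismalformed(c)
        show_invalid(io, c)  # only types that admit malformed data define this
    else
        print(io, '\'', Char(c), '\'')  # simplified; Base's show also escapes
    end
end

# Char knows its own 32-bit representation, so it can implement the hook:
show_invalid(io::IO, c::Char) =
    print(io, "'\\x", string(reinterpret(UInt32, c), base = 16), "'")
```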
@@ -140,7 +140,7 @@ See also: [`getindex`](@ref), [`start`](@ref), [`done`](@ref),

start(s::AbstractString) = 1
done(s::AbstractString, i::Integer) = i > ncodeunits(s)
eltype(::Type{<:AbstractString}) = Char
eltype(::Type{<:AbstractString}) = Char # some string types may use another AbstractChar
Is there some way we can put an `::eltype(s)` type assert somewhere to catch this if it's wrong?
Not sure… where would it go?
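One possibility, sketched against the current `start`/`next`/`done` protocol (`checked_next` is hypothetical):

```julia
# Hypothetical checked iteration helper: if a string type's next() returns
# something other than its declared eltype, the typeassert fails loudly.
function checked_next(s::AbstractString, i::Int)
    c, i′ = next(s, i)
    return (c::eltype(typeof(s)), i′)
end
```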
"try map(f, collect(s)) or a comprehension instead")) | ||
write(out, c′::Char) | ||
write(out, c′::AbstractChar) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
100% agree: letting the value that's being printed determine the encoding would be a mess. We do need external support for encoded streams that handle this appropriately, but print(io, '∀')
where io
is a UTF-16 encoded stream it should definitely not write UTF-8 data to io
because '∀'
is a UTF-8 encode character type. If you have a CharU16
value and a UTF-16 encoded stream it should know that it can output code units directly, of course, but that's just a matter of providing the right print
specializations.
function length(g::GraphemeIterator)
    c0 = typemax(Char)
function length(g::GraphemeIterator{S}) where {S}
    c0 = eltype(S)(0x00000000)
Is `'\0'` a safe character for this? It's not a malformed character – does the Unicode spec ensure that it can't ever combine with anything?
Yes, `\0` has the Grapheme_Cluster_Break property value `Control`, so it can never combine with anything.
(I used `\0` here since hopefully every conceivable `AbstractChar` type can encode this.)
doc/src/manual/strings.md
Outdated
value represents a Unicode code point.
* Like C and Java, but unlike most dynamic languages, Julia has a first-class type for representing
a single character, called `AbstractChar`. The built-in `Char` subtype of `AbstractChar`
is a 32-bit primitive type that can represent any Unicode character.
Perhaps mention that `Char` represents Unicode characters as zero-padded UTF-8 bytes rather than as code point values? This seems like as good a place as any to mention that.
I thought that we don't want to document the `Char` representation, lest people rely on it not changing?
doc/src/manual/strings.md
Outdated
@@ -41,9 +41,10 @@ There are a few noteworthy high-level features about Julia's strings:

## [Characters](@id man-characters)

A `Char` value represents a single character: it is just a 32-bit primitive type with a special literal
An `AbstractChar` value represents a single character: it is just a 32-bit primitive type with a special literal
No longer accurate since an `AbstractChar` could have any representation at all. The simplest fix is just to leave this as `Char`.
(That is, there would be a fallback …)
re: …
…and write when encoding should be determined by the argument type
…SubString{String}
I updated the PR to make the …

I also did a pass over the source code and I found lots of …
@Keno, I tried getting rid of the …

But, as mentioned by @StefanKarpinski above, I think it should be safe to deprecate the implicit …
I'm actually much more comfortable with an implicit conversion to …
But fine to do in a different PR.
… boot.jl to char.jl
Added the …
base/char.jl
Outdated
their own implementations of `write` and `read`.
via `codepoint(char)` will not reveal this encoding because it always returns the
Unicode value of the character. `print(io, c)` of any `c::AbstractChar`
produces UTF-8 output by default (via conversion to `Char` if necessary).
Perhaps clarify that it's not necessary that `print(io, c)` produces UTF-8 but rather that the output encoding is determined by `io` – and the built-in `IO` types are all UTF-8.
Note that to really support non-UTF8 …

Not only do we want to know whether it is UTF-8 (so that e.g. the fast path for …

That will involve a separate design process, and should be non-breaking as far as I can tell, so it can happen outside of this PR.
AppVeyor CI failure seems to be an unrelated timeout.
Fixes #25302. Wherever practical, functions that took `::Char` arguments are changed to accept any `::AbstractChar`, with fallbacks defined as necessary.

For `T<:AbstractChar`, `UInt32(c::T)` is defined to return the Unicode codepoint represented by `c` (or throw an error if `c` lies outside of Unicode), and `T(x::UInt32)` should create a `T` from the Unicode codepoint `x` (or throw an error … `T` may represent only a subset of Unicode). This makes it possible to define generic fallbacks for comparison to `Char`, output, etcetera. Even ancient character sets like EBCDIC have a well-defined injective mapping into Unicode, so I don't think we sacrifice any useful generality this way.
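To make that contract concrete, a minimal hypothetical subtype might look like this (`Latin1Char` is illustrative only; the generic fallbacks are assumed to kick in for everything else):

```julia
# Latin-1 maps injectively into Unicode: bytes 0x00-0xff are exactly the
# codepoints U+0000-U+00FF, so the two required methods are trivial.
primitive type Latin1Char <: AbstractChar 8 end

# T(x::UInt32): construct from a Unicode codepoint, or throw when the
# codepoint lies outside the subset this type can represent.
function Latin1Char(x::UInt32)
    x ≤ 0xff || throw(ArgumentError("codepoint $x is not representable in Latin-1"))
    return reinterpret(Latin1Char, UInt8(x))
end

# UInt32(c::T): the Unicode codepoint represented by c.
Base.UInt32(c::Latin1Char) = UInt32(reinterpret(UInt8, c))

Latin1Char(0x000000e9) == 'é'   # true, via the generic == fallback
UInt32(Latin1Char(0x00000041))  # 0x00000041 ('A')
```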
, output, etcetera. Even ancient character sets like EBCDIC have a well-defined injective mapping into Unicode, so I don't think we sacrifice any useful generality this way.