add AbstractChar supertype of Char #26286
Changes from 7 commits
63e04bf
07c665d
402f9ed
cc5e445
26c6ade
fbfbcb3
954a5df
bd21bf9
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,75 @@ | ||
# This file is a part of Julia. License is MIT: https://julialang.org/license | ||
|
||
struct InvalidCharError <: Exception | ||
char::Char | ||
""" | ||
The `AbstractChar` type is the supertype of all character implementations | ||
in Julia. A character represents a Unicode code point, and can be converted | ||
to an integer via the [`codepoint`](@ref) function in order to obtain the | ||
numerical value of the code point, or constructed from the same integer. | ||
These numerical values determine how characters are compared with `<` and `==`, | ||
for example. New `T <: AbstractChar` types should define a `codepoint(::T)` | ||
method and a `T(::UInt32)` constructor, at minimum. | ||
|
||
A given `AbstractChar` subtype may be capable of representing only a subset | ||
of Unicode, in which case conversion from an unsupported `UInt32` value | ||
may throw an error. Conversely, the built-in [`Char`](@ref) type represents | ||
a *superset* of Unicode (in order to losslessly encode invalid byte streams), | ||
in which case conversion of a non-Unicode value *to* `UInt32` throws an error. | ||
The [`isvalid`](@ref) function can be used to check which codepoints are | ||
representable in a given `AbstractChar` type. | ||
|
||
Internally, an `AbstractChar` type may use a variety of encodings. Conversion | ||
via `codepoint(char)` will not reveal this encoding because it always returns the | ||
Unicode value of the character. `print(io, c)` of any `c::AbstractChar` | ||
produces UTF-8 output by default (via conversion to `Char` if necessary). | ||
|
||
`write(io, c)`, in contrast, may emit a different encoding depending on | ||
`typeof(c)`, and `read(io, typeof(c))` should read the same encoding as `write`. | ||
New `AbstractChar` types should typically provide their own implementations of | ||
`write` and `read`. | ||
""" | ||
AbstractChar | ||
|
||
""" | ||
Char(c::Union{Number,AbstractChar}) | ||
|
||
`Char` is a 32-bit [`AbstractChar`](@ref) type that is the default representation | ||
of characters in Julia. `Char` is the type used for character literals like `'x'` | ||
and it is also the element type of [`String`](@ref). | ||
|
||
In order to losslessly represent arbitrary byte streams stored in a `String`, | ||
a `Char` value may store information that cannot be converted to a Unicode | ||
codepoint — converting such a `Char` to `UInt32` will throw an error. | ||
The [`isvalid(c::Char)`](@ref) function can be used to query whether `c` | ||
represents a valid Unicode character. | ||
""" | ||
Char | ||
|
||
(::Type{T})(x::Integer) where {T<:AbstractChar} = T(UInt32(x)) | ||
(::Type{AbstractChar})(x::Number) = Char(x) | ||
(::Type{T})(x::AbstractChar) where {T<:Union{Number,AbstractChar}} = T(codepoint(x)) | ||
(::Type{T})(x::T) where {T<:AbstractChar} = x | ||
|
||
codepoint(c::Char) = UInt32(c) | ||
|
||
""" | ||
codepoint(c::AbstractChar) | ||
|
||
Return the Unicode codepoint (an unsigned integer) corresponding | ||
to the character `c` (or throw an exception if `c` does not represent | ||
a valid character). For `Char`, this is a `UInt32` value, but | ||
`AbstractChar` types that represent only a subset of Unicode may | ||
return a different-sized integer (e.g. `UInt8`). | ||
""" | ||
codepoint # defined for Char in boot.jl | ||
|
||
struct InvalidCharError{T<:AbstractChar} <: Exception | ||
Maybe
Possibly, but I'm not sure if that change belongs in this PR.
Sorry, I hadn't realized this type already existed in master. |
||
char::T | ||
end | ||
struct CodePointError <: Exception | ||
code::Integer | ||
struct CodePointError{T<:Integer} <: Exception | ||
code::T | ||
end | ||
@noinline invalid_char(c::Char) = throw(InvalidCharError(c)) | ||
@noinline code_point_err(u::UInt32) = throw(CodePointError(u)) | ||
@noinline invalid_char(c::AbstractChar) = throw(InvalidCharError(c)) | ||
@noinline code_point_err(u::Integer) = throw(CodePointError(u)) | ||
|
||
function ismalformed(c::Char) | ||
u = reinterpret(UInt32, c) | ||
|
@@ -24,6 +86,27 @@ function isoverlong(c::Char) | |
is_overlong_enc(u) | ||
end | ||
|
||
# fallback: other AbstractChar types, by default, are assumed | ||
# not to support malformed or overlong encodings. | ||
|
||
""" | ||
ismalformed(c::AbstractChar) | ||
|
||
Return `true` if `c` represents malformed (non-Unicode) data according to the | ||
encoding used by `c`. Defaults to `false` for non-`Char` types. See also | ||
[`show_invalid`](@ref). | ||
""" | ||
ismalformed(c::AbstractChar) = false | ||
Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own
All of the proposed Another reason to make this the default is that if |
||
|
||
""" | ||
isoverlong(c::AbstractChar) | ||
|
||
Return `true` if `c` represents an overlong UTF-8 sequence. Defaults | ||
to `false` for non-`Char` types. See also [`decode_overlong`](@ref) | ||
and [`show_invalid`](@ref). | ||
""" | ||
isoverlong(c::AbstractChar) = false | ||
|
||
function UInt32(c::Char) | ||
# TODO: use optimized inline LLVM | ||
u = reinterpret(UInt32, c) | ||
|
@@ -49,6 +132,15 @@ function decode_overlong(c::Char) | |
(u & 0x007f0000 >> 4) | (u & 0x7f000000 >> 6) | ||
end | ||
|
||
""" | ||
decode_overlong(c::AbstractChar) | ||
|
||
When [`isoverlong(c)`](@ref) is `true`, `decode_overlong(c)` returns | ||
the Unicode codepoint value of `c`. `AbstractChar` implementations | ||
that support overlong encodings should implement `Base.decode_overlong`. | ||
""" | ||
decode_overlong | ||
|
||
function Char(u::UInt32) | ||
u < 0x80 && return reinterpret(Char, u << 24) | ||
u < 0x00200000 || code_point_err(u)::Union{} | ||
|
@@ -69,50 +161,85 @@ function Char(b::Union{Int8,UInt8}) | |
0 ≤ b ≤ 0x7f ? reinterpret(Char, (b % UInt32) << 24) : Char(UInt32(b)) | ||
end | ||
|
||
convert(::Type{Char}, x::Number) = Char(x) | ||
convert(::Type{T}, x::Char) where {T<:Number} = T(x) | ||
convert(::Type{AbstractChar}, x::Number) = Char(x) # default to Char | ||
convert(::Type{T}, x::Number) where {T<:AbstractChar} = T(x) | ||
convert(::Type{T}, x::AbstractChar) where {T<:Number} = T(x) | ||
convert(::Type{T}, c::AbstractChar) where {T<:AbstractChar} = T(c) | ||
convert(::Type{T}, c::T) where {T<:AbstractChar} = c | ||
|
||
rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T) | ||
rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(codepoint(x), T) | ||
|
||
typemax(::Type{Char}) = reinterpret(Char, typemax(UInt32)) | ||
typemin(::Type{Char}) = reinterpret(Char, typemin(UInt32)) | ||
|
||
size(c::Char) = () | ||
size(c::Char,d) = convert(Int, d) < 1 ? throw(BoundsError()) : 1 | ||
ndims(c::Char) = 0 | ||
ndims(::Type{Char}) = 0 | ||
length(c::Char) = 1 | ||
firstindex(c::Char) = 1 | ||
lastindex(c::Char) = 1 | ||
getindex(c::Char) = c | ||
getindex(c::Char, i::Integer) = i == 1 ? c : throw(BoundsError()) | ||
getindex(c::Char, I::Integer...) = all(x -> x == 1, I) ? c : throw(BoundsError()) | ||
first(c::Char) = c | ||
last(c::Char) = c | ||
eltype(::Type{Char}) = Char | ||
|
||
start(c::Char) = false | ||
next(c::Char, state) = (c, true) | ||
done(c::Char, state) = state | ||
isempty(c::Char) = false | ||
in(x::Char, y::Char) = x == y | ||
size(c::AbstractChar) = () | ||
size(c::AbstractChar,d) = convert(Int, d) < 1 ? throw(BoundsError()) : 1 | ||
ndims(c::AbstractChar) = 0 | ||
ndims(::Type{<:AbstractChar}) = 0 | ||
length(c::AbstractChar) = 1 | ||
firstindex(c::AbstractChar) = 1 | ||
lastindex(c::AbstractChar) = 1 | ||
getindex(c::AbstractChar) = c | ||
getindex(c::AbstractChar, i::Integer) = i == 1 ? c : throw(BoundsError()) | ||
getindex(c::AbstractChar, I::Integer...) = all(x -> x == 1, I) ? c : throw(BoundsError()) | ||
first(c::AbstractChar) = c | ||
last(c::AbstractChar) = c | ||
eltype(::Type{T}) where {T<:AbstractChar} = T | ||
|
||
start(c::AbstractChar) = false | ||
next(c::AbstractChar, state) = (c, true) | ||
done(c::AbstractChar, state) = state | ||
isempty(c::AbstractChar) = false | ||
in(x::AbstractChar, y::AbstractChar) = x == y | ||
|
||
==(x::Char, y::Char) = reinterpret(UInt32, x) == reinterpret(UInt32, y) | ||
isless(x::Char, y::Char) = reinterpret(UInt32, x) < reinterpret(UInt32, y) | ||
hash(x::Char, h::UInt) = | ||
hash_uint64(((reinterpret(UInt32, x) + UInt64(0xd4d64234)) << 32) ⊻ UInt64(h)) | ||
widen(::Type{Char}) = Char | ||
|
||
-(x::Char, y::Char) = Int(x) - Int(y) | ||
-(x::Char, y::Integer) = Char(Int32(x) - Int32(y)) | ||
+(x::Char, y::Integer) = Char(Int32(x) + Int32(y)) | ||
+(x::Integer, y::Char) = y + x | ||
# fallbacks: | ||
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y)) | ||
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y) | ||
It seems better for the fallback comparisons to be done in
I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by Converting to
Ah, I see what you're saying. Perhaps this then:
isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))
==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y) |
||
hash(x::AbstractChar, h::UInt) = hash(Char(x), h) | ||
widen(::Type{T}) where {T<:AbstractChar} = T | ||
|
||
-(x::AbstractChar, y::AbstractChar) = Int(x) - Int(y) | ||
-(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) - Int32(y)) | ||
+(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) + Int32(y)) | ||
+(x::Integer, y::AbstractChar) = y + x | ||
|
||
# `print` should output UTF-8 by default for all AbstractChar types. | ||
# (Packages may implement other IO subtypes to specify different encodings.) | ||
# In contrast, `write(io, c)` outputs a `c` in an encoding determined by typeof(c). | ||
print(io::IO, c::Char) = (write(io, c); nothing) | ||
print(io::IO, c::AbstractChar) = print(io, Char(c)) # fallback: convert to output UTF-8 | ||
|
||
const hex_chars = UInt8['0':'9';'a':'z'] | ||
|
||
function show(io::IO, c::Char) | ||
function show_invalid(io::IO, c::Char) | ||
write(io, 0x27) | ||
u = reinterpret(UInt32, c) | ||
while true | ||
a = hex_chars[((u >> 28) & 0xf) + 1] | ||
b = hex_chars[((u >> 24) & 0xf) + 1] | ||
write(io, 0x5c, UInt8('x'), a, b) | ||
(u <<= 8) == 0 && break | ||
end | ||
write(io, 0x27) | ||
end | ||
|
||
""" | ||
show_invalid(io::IO, c::AbstractChar) | ||
|
||
Called by `show(io, c)` when [`isoverlong(c)`](@ref) or | ||
[`ismalformed(c)`](@ref) return `true`. Subclasses | ||
of `AbstractChar` should define `Base.show_invalid` methods | ||
if they support storing invalid character data. | ||
""" | ||
show_invalid | ||
|
||
# show c to io, assuming UTF-8 encoded output | ||
function show(io::IO, c::AbstractChar) | ||
I don't think this is generic because of the call to
As noted in another thread, we could use
Yes, the |
||
if c <= '\\' | ||
b = c == '\0' ? 0x30 : | ||
c == '\a' ? 0x61 : | ||
|
@@ -131,19 +258,13 @@ function show(io::IO, c::Char) | |
end | ||
end | ||
if isoverlong(c) || ismalformed(c) | ||
show_invalid(io, c) | ||
elseif isprint(c) | ||
write(io, 0x27) | ||
u = reinterpret(UInt32, c) | ||
while true | ||
a = hex_chars[((u >> 28) & 0xf) + 1] | ||
b = hex_chars[((u >> 24) & 0xf) + 1] | ||
write(io, 0x5c, 'x', a, b) | ||
(u <<= 8) == 0 && break | ||
end | ||
print(io, c) # use print, not write, to use UTF-8 for any AbstractChar | ||
write(io, 0x27) | ||
elseif isprint(c) | ||
write(io, 0x27, c, 0x27) | ||
else # unprintable, well-formed, non-overlong Unicode | ||
u = UInt32(c) | ||
u = codepoint(c) | ||
write(io, 0x27, 0x5c, c <= '\x7f' ? 0x78 : c <= '\uffff' ? 0x75 : 0x55) | ||
d = max(2, 8 - (leading_zeros(u) >> 2)) | ||
while 0 < d | ||
|
@@ -154,16 +275,16 @@ function show(io::IO, c::Char) | |
return | ||
end | ||
|
||
function show(io::IO, ::MIME"text/plain", c::Char) | ||
function show(io::IO, ::MIME"text/plain", c::T) where {T<:AbstractChar} | ||
show(io, c) | ||
if !ismalformed(c) | ||
print(io, ": ") | ||
if isoverlong(c) | ||
print(io, "[overlong] ") | ||
u = decode_overlong(c) | ||
c = Char(u) | ||
c = T(u) | ||
else | ||
u = UInt32(c) | ||
u = codepoint(c) | ||
end | ||
h = string(u, base = 16, pad = u ≤ 0xffff ? 4 : 6) | ||
print(io, (isascii(c) ? "ASCII/" : ""), "Unicode U+", h) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -563,6 +563,7 @@ export | |
bytes2hex, | ||
chomp, | ||
chop, | ||
codepoint, | ||
codeunit, | ||
codeunits, | ||
digits, | ||
|
Perhaps clarify that it's not necessary that `print(io, c)` produces UTF-8 but rather that the output encoding is determined by `io`, and the built-in `IO` types are all UTF-8.
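
To make the contract in the `AbstractChar` docstring concrete, here is a minimal sketch of a subtype that defines only the two required methods, `codepoint(::T)` and `T(::UInt32)`. The type name `ASCIIChar` and its field `x` are hypothetical and not part of this PR. The sketch assumes the generic fallbacks shown in the diff (`print` via conversion to `Char`, and `==` via `Char(x) == Char(y)`) are in place.

```julia
# Hypothetical subtype for illustration only; `ASCIIChar` is not defined by this PR.
struct ASCIIChar <: AbstractChar
    x::UInt8
    # The `T(::UInt32)` constructor the docstring asks for; rejects non-ASCII values.
    function ASCIIChar(u::UInt32)
        u < 0x80 || throw(ArgumentError("code point is outside the ASCII range"))
        return new(u % UInt8)
    end
end

# The `codepoint(::T)` method the docstring asks for.
Base.codepoint(c::ASCIIChar) = UInt32(c.x)

c = ASCIIChar(UInt32('A'))

# `print` goes through the generic fallback (conversion to `Char`),
# so a built-in IO receives UTF-8 bytes.
s = sprint(print, c)

# Comparison with a `Char` also goes through the generic fallback.
same = (c == 'A')
```

Because `write` and `read` are not defined here, this sketch says nothing about the type's own on-disk encoding; a real subtype would typically add those methods, as the docstring recommends.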