Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add AbstractChar supertype of Char #26286

Merged
merged 8 commits into from
Mar 7, 2018
Merged
Show file tree
Hide file tree
Changes from 7 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 3 additions & 0 deletions NEWS.md
Original file line number Diff line number Diff line change
Expand Up @@ -458,6 +458,9 @@ Library improvements
* The function `thisind(s::AbstractString, i::Integer)` returns the largest valid index
less or equal than `i` in the string `s` or `0` if no such index exists ([#24414]).

* `Char` is now a subtype of `AbstractChar`, and most of the functions that
take character arguments now accept any `AbstractChar` ([#26286]).

* `Irrational` is now a subtype of `AbstractIrrational` ([#24245]).

* Introduced the `empty` function, the functional pair to `empty!` which returns a new,
Expand Down
2 changes: 1 addition & 1 deletion base/arrayshow.jl
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@ methods. By default returns a string of the same width as original with a
centered cdot, used in printing of structural zeros of structured matrices.
Accept keyword args `c` for alternate single character marker.
"""
function replace_with_centered_mark(s::AbstractString;c::Char = '⋅')
function replace_with_centered_mark(s::AbstractString;c::AbstractChar = '⋅')
N = length(s)
return join(setindex!([" " for i=1:N],string(c),ceil(Int,N/2)))
end
Expand Down
11 changes: 4 additions & 7 deletions base/boot.jl
Original file line number Diff line number Diff line change
Expand Up @@ -143,7 +143,7 @@ export
Signed, Int, Int8, Int16, Int32, Int64, Int128,
Unsigned, UInt, UInt8, UInt16, UInt32, UInt64, UInt128,
# string types
Char, AbstractString, String, IO,
AbstractChar, Char, AbstractString, String, IO,
# errors
ErrorException, BoundsError, DivideError, DomainError, Exception,
InterruptException, InexactError, OutOfMemoryError, ReadOnlyMemoryError,
Expand Down Expand Up @@ -177,7 +177,8 @@ primitive type Float32 <: AbstractFloat 32 end
primitive type Float64 <: AbstractFloat 64 end

#primitive type Bool <: Integer 8 end
primitive type Char 32 end
abstract type AbstractChar end
primitive type Char <: AbstractChar 32 end

primitive type Int8 <: Signed 8 end
#primitive type UInt8 <: Unsigned 8 end
Expand Down Expand Up @@ -460,7 +461,7 @@ function write(io::IO, x::String)
end

show(io::IO, @nospecialize x) = ccall(:jl_static_show, Cvoid, (Ptr{Cvoid}, Any), io_pointer(io), x)
print(io::IO, x::Char) = ccall(:jl_uv_putc, Cvoid, (Ptr{Cvoid}, Char), io_pointer(io), x)
print(io::IO, x::AbstractChar) = ccall(:jl_uv_putc, Cvoid, (Ptr{Cvoid}, Char), io_pointer(io), x)
print(io::IO, x::String) = (write(io, x); nothing)
print(io::IO, @nospecialize x) = show(io, x)
print(io::IO, @nospecialize(x), @nospecialize a...) = (print(io, x); print(io, a...))
Expand Down Expand Up @@ -701,10 +702,6 @@ UInt32(x::BuiltinInts) = toUInt32(x)::UInt32
UInt64(x::BuiltinInts) = toUInt64(x)::UInt64
UInt128(x::BuiltinInts) = toUInt128(x)::UInt128

Char(x::Number) = Char(UInt32(x))
Char(x::Char) = x
(::Type{T})(x::Char) where {T<:Number} = T(UInt32(x))

(::Type{T})(x::T) where {T<:Number} = x

Int(x::Ptr) = bitcast(Int, x)
Expand Down
215 changes: 168 additions & 47 deletions base/char.jl
Original file line number Diff line number Diff line change
@@ -1,13 +1,75 @@
# This file is a part of Julia. License is MIT: https://julialang.org/license

struct InvalidCharError <: Exception
char::Char
"""
The `AbstractChar` type is the supertype of all character implementations
in Julia. A character represents a Unicode code point, and can be converted
to an integer via the [`codepoint`](@ref) function in order to obtain the
numerical value of the code point, or constructed from the same integer.
These numerical values determine how characters are compared with `<` and `==`,
for example. New `T <: AbstractChar` types should define a `codepoint(::T)`
method and a `T(::UInt32)` constructor, at minimum.

A given `AbstractChar` subtype may be capable of representing only a subset
of Unicode, in which case conversion from an unsupported `UInt32` value
may throw an error. Conversely, the built-in [`Char`](@ref) type represents
a *superset* of Unicode (in order to losslessly encode invalid byte streams),
in which case conversion of a non-Unicode value *to* `UInt32` throws an error.
The [`isvalid`](@ref) function can be used to check which codepoints are
representable in a given `AbstractChar` type.

Internally, an `AbstractChar` type may use a variety of encodings. Conversion
via `codepoint(char)` will not reveal this encoding because it always returns the
Unicode value of the character. `print(io, c)` of any `c::AbstractChar`
produces UTF-8 output by default (via conversion to `Char` if necessary).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps clarify that it's not necessary that print(io, c) produces UTF-8 but rather that the output encoding is determined by io – and the built-in IO types are all UTF-8.


`write(io, c)`, in contrast, may emit a different encoding depending on
`typeof(c)`, and `read(io, typeof(c))` should read the same encoding as `write`.
New `AbstractChar` types should typically provide their own implementations of
`write` and `read`.
"""
AbstractChar

"""
Char(c::Union{Number,AbstractChar})

`Char` is a 32-bit [`AbstractChar`](@ref) type that is the default representation
of characters in Julia. `Char` is the type used for character literals like `'x'`
and it is also the element type of [`String`](@ref).

In order to losslessly represent arbitrary byte streams stored in a `String`,
a `Char` value may store information that cannot be converted to a Unicode
codepoint — converting such a `Char` to `UInt32` will throw an error.
The [`isvalid(c::Char)`](@ref) function can be used to query whether `c`
represents a valid Unicode character.
"""
Char

(::Type{T})(x::Integer) where {T<:AbstractChar} = T(UInt32(x))
(::Type{AbstractChar})(x::Number) = Char(x)
(::Type{T})(x::AbstractChar) where {T<:Union{Number,AbstractChar}} = T(codepoint(x))
(::Type{T})(x::T) where {T<:AbstractChar} = x

codepoint(c::Char) = UInt32(c)

"""
codepoint(c::AbstractChar)

Return the Unicode codepoint (an unsigned integer) corresponding
to the character `c` (or throw an exception if `c` does not represent
a valid character). For `Char`, this is a `UInt32` value, but
`AbstractChar` types that represent only a subset of Unicode may
return a different-sized integer (e.g. `UInt8`).
"""
codepoint # defined for Char in boot.jl

struct InvalidCharError{T<:AbstractChar} <: Exception
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe InexactError could be used here? Just an idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Possibly, but I'm not sure if that change belongs in this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I hadn't realized this type already existed in master.

char::T
end
struct CodePointError <: Exception
code::Integer
struct CodePointError{T<:Integer} <: Exception
code::T
end
@noinline invalid_char(c::Char) = throw(InvalidCharError(c))
@noinline code_point_err(u::UInt32) = throw(CodePointError(u))
@noinline invalid_char(c::AbstractChar) = throw(InvalidCharError(c))
@noinline code_point_err(u::Integer) = throw(CodePointError(u))

function ismalformed(c::Char)
u = reinterpret(UInt32, c)
Expand All @@ -24,6 +86,27 @@ function isoverlong(c::Char)
is_overlong_enc(u)
end

# fallback: other AbstractChar types, by default, are assumed
# not to support malformed or overlong encodings.

"""
ismalformed(c::AbstractChar)

Return `true` if `c` represents malformed (non-Unicode) data according to the
encoding used by `c`. Defaults to `false` for non-`Char` types. See also
[`show_invalid`](@ref).
"""
ismalformed(c::AbstractChar) = false
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it really a good idea to define these fallbacks? They can be wrong for some types, and they are easy to implement. If you implement your own AbstractChar type, it's good to think about this question precisely.

Copy link
Member Author

@stevengj stevengj Mar 1, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the proposed AbstractChar types I've ever seen, other than Char, don't support malformed characters. So it makes sense to me to make this the default. You can always override it.

Another reason to make this the default is that if ismalformed(c) returns true, you need to define additional methods to decode the character.


"""
isoverlong(c::AbstractChar)

Return `true` if `c` represents an overlong UTF-8 sequence. Defaults
to `false` for non-`Char` types. See also [`decode_overlong`](@ref)
and [`show_invalid`](@ref).
"""
isoverlong(c::AbstractChar) = false

function UInt32(c::Char)
# TODO: use optimized inline LLVM
u = reinterpret(UInt32, c)
Expand All @@ -49,6 +132,15 @@ function decode_overlong(c::Char)
(u & 0x007f0000 >> 4) | (u & 0x7f000000 >> 6)
end

"""
decode_overlong(c::AbstractChar)

When [`isoverlong(c)`](@ref) is `true`, `decode_overlong(c)` returns
the Unicode codepoint value of `c`. `AbstractChar` implementations
that support overlong encodings should implement `Base.decode_overlong`.
"""
decode_overlong

function Char(u::UInt32)
u < 0x80 && return reinterpret(Char, u << 24)
u < 0x00200000 || code_point_err(u)::Union{}
Expand All @@ -69,50 +161,85 @@ function Char(b::Union{Int8,UInt8})
0 ≤ b ≤ 0x7f ? reinterpret(Char, (b % UInt32) << 24) : Char(UInt32(b))
end

convert(::Type{Char}, x::Number) = Char(x)
convert(::Type{T}, x::Char) where {T<:Number} = T(x)
convert(::Type{AbstractChar}, x::Number) = Char(x) # default to Char
convert(::Type{T}, x::Number) where {T<:AbstractChar} = T(x)
convert(::Type{T}, x::AbstractChar) where {T<:Number} = T(x)
convert(::Type{T}, c::AbstractChar) where {T<:AbstractChar} = T(c)
convert(::Type{T}, c::T) where {T<:AbstractChar} = c

rem(x::Char, ::Type{T}) where {T<:Number} = rem(UInt32(x), T)
rem(x::AbstractChar, ::Type{T}) where {T<:Number} = rem(codepoint(x), T)

typemax(::Type{Char}) = reinterpret(Char, typemax(UInt32))
typemin(::Type{Char}) = reinterpret(Char, typemin(UInt32))

size(c::Char) = ()
size(c::Char,d) = convert(Int, d) < 1 ? throw(BoundsError()) : 1
ndims(c::Char) = 0
ndims(::Type{Char}) = 0
length(c::Char) = 1
firstindex(c::Char) = 1
lastindex(c::Char) = 1
getindex(c::Char) = c
getindex(c::Char, i::Integer) = i == 1 ? c : throw(BoundsError())
getindex(c::Char, I::Integer...) = all(x -> x == 1, I) ? c : throw(BoundsError())
first(c::Char) = c
last(c::Char) = c
eltype(::Type{Char}) = Char

start(c::Char) = false
next(c::Char, state) = (c, true)
done(c::Char, state) = state
isempty(c::Char) = false
in(x::Char, y::Char) = x == y
size(c::AbstractChar) = ()
size(c::AbstractChar,d) = convert(Int, d) < 1 ? throw(BoundsError()) : 1
ndims(c::AbstractChar) = 0
ndims(::Type{<:AbstractChar}) = 0
length(c::AbstractChar) = 1
firstindex(c::AbstractChar) = 1
lastindex(c::AbstractChar) = 1
getindex(c::AbstractChar) = c
getindex(c::AbstractChar, i::Integer) = i == 1 ? c : throw(BoundsError())
getindex(c::AbstractChar, I::Integer...) = all(x -> x == 1, I) ? c : throw(BoundsError())
first(c::AbstractChar) = c
last(c::AbstractChar) = c
eltype(::Type{T}) where {T<:AbstractChar} = T

start(c::AbstractChar) = false
next(c::AbstractChar, state) = (c, true)
done(c::AbstractChar, state) = state
isempty(c::AbstractChar) = false
in(x::AbstractChar, y::AbstractChar) = x == y

==(x::Char, y::Char) = reinterpret(UInt32, x) == reinterpret(UInt32, y)
isless(x::Char, y::Char) = reinterpret(UInt32, x) < reinterpret(UInt32, y)
hash(x::Char, h::UInt) =
hash_uint64(((reinterpret(UInt32, x) + UInt64(0xd4d64234)) << 32) ⊻ UInt64(h))
widen(::Type{Char}) = Char

-(x::Char, y::Char) = Int(x) - Int(y)
-(x::Char, y::Integer) = Char(Int32(x) - Int32(y))
+(x::Char, y::Integer) = Char(Int32(x) + Int32(y))
+(x::Integer, y::Char) = y + x
# fallbacks:
isless(x::AbstractChar, y::AbstractChar) = isless(Char(x), Char(y))
==(x::AbstractChar, y::AbstractChar) = Char(x) == Char(y)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems better for the fallback comparisons to be done in UInt32.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally had it that way, but this way is more general. The issue is that all Unicode codepoints can be represented by Char, but not vice versa.

Converting to UInt32 would mean that ASCIIChar('a') == typemax(Char) would throw an InvalidCharError rather than returning false.

Copy link
Member

@StefanKarpinski StefanKarpinski Mar 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I see what you're saying. Perhaps this then:

isless(x::Char, y::AbstractChar) = isless(x, Char(y))
isless(x::AbstractChar, y::Char) = isless(Char(x), y)
isless(x::AbstractChar, y::AbstractChar) = isless(UInt32(x), UInt32(y))

==(x::Char, y::AbstractChar) = x == Char(y)
==(x::AbstractChar, y::Char) = Char(x) == y
==(x::AbstractChar, y::AbstractChar) = UInt32(x) == UInt32(y)

hash(x::AbstractChar, h::UInt) = hash(Char(x), h)
widen(::Type{T}) where {T<:AbstractChar} = T

-(x::AbstractChar, y::AbstractChar) = Int(x) - Int(y)
-(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) - Int32(y))
+(x::T, y::Integer) where {T<:AbstractChar} = T(Int32(x) + Int32(y))
+(x::Integer, y::AbstractChar) = y + x

# `print` should output UTF-8 by default for all AbstractChar types.
# (Packages may implement other IO subtypes to specify different encodings.)
# In contrast, `write(io, c)` outputs a `c` in an encoding determined by typeof(c).
print(io::IO, c::Char) = (write(io, c); nothing)
print(io::IO, c::AbstractChar) = print(io, Char(c)) # fallback: convert to output UTF-8

const hex_chars = UInt8['0':'9';'a':'z']

function show(io::IO, c::Char)
function show_invalid(io::IO, c::Char)
write(io, 0x27)
u = reinterpret(UInt32, c)
while true
a = hex_chars[((u >> 28) & 0xf) + 1]
b = hex_chars[((u >> 24) & 0xf) + 1]
write(io, 0x5c, UInt8('x'), a, b)
(u <<= 8) == 0 && break
end
write(io, 0x27)
end

"""
show_invalid(io::IO, c::AbstractChar)

Called by `show(io, c)` when [`isoverlong(c)`](@ref) or
[`ismalformed(c)`](@ref) return `true`. Subclasses
of `AbstractChar` should define `Base.show_invalid` methods
if they support storing invalid character data.
"""
show_invalid

# show c to io, assuming UTF-8 encoded output
function show(io::IO, c::AbstractChar)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is generic because of the call to reinterpret(UInt32, c) – we have no idea what the representation of an abstract character type is – it may not even be 32-bits or a bits type at all. We could extract show_invalid(c::AbstractChar) as a method that characters have to implement.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As noted in another thread, we could use reinterpret(UInt32, Char(c)).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the reinterpret call is only for invalid chars. I agree that refactoring this seems like a good choice.

if c <= '\\'
b = c == '\0' ? 0x30 :
c == '\a' ? 0x61 :
Expand All @@ -131,19 +258,13 @@ function show(io::IO, c::Char)
end
end
if isoverlong(c) || ismalformed(c)
show_invalid(io, c)
elseif isprint(c)
write(io, 0x27)
u = reinterpret(UInt32, c)
while true
a = hex_chars[((u >> 28) & 0xf) + 1]
b = hex_chars[((u >> 24) & 0xf) + 1]
write(io, 0x5c, 'x', a, b)
(u <<= 8) == 0 && break
end
print(io, c) # use print, not write, to use UTF-8 for any AbstractChar
write(io, 0x27)
elseif isprint(c)
write(io, 0x27, c, 0x27)
else # unprintable, well-formed, non-overlong Unicode
u = UInt32(c)
u = codepoint(c)
write(io, 0x27, 0x5c, c <= '\x7f' ? 0x78 : c <= '\uffff' ? 0x75 : 0x55)
d = max(2, 8 - (leading_zeros(u) >> 2))
while 0 < d
Expand All @@ -154,16 +275,16 @@ function show(io::IO, c::Char)
return
end

function show(io::IO, ::MIME"text/plain", c::Char)
function show(io::IO, ::MIME"text/plain", c::T) where {T<:AbstractChar}
show(io, c)
if !ismalformed(c)
print(io, ": ")
if isoverlong(c)
print(io, "[overlong] ")
u = decode_overlong(c)
c = Char(u)
c = T(u)
else
u = UInt32(c)
u = codepoint(c)
end
h = string(u, base = 16, pad = u ≤ 0xffff ? 4 : 6)
print(io, (isascii(c) ? "ASCII/" : ""), "Unicode U+", h)
Expand Down
2 changes: 1 addition & 1 deletion base/compiler/validation.jl
Original file line number Diff line number Diff line change
Expand Up @@ -209,7 +209,7 @@ is_valid_lvalue(x) = isa(x, Slot) || isa(x, SSAValue) || isa(x, GlobalRef)
function is_valid_argument(x)
if isa(x, Slot) || isa(x, SSAValue) || isa(x, GlobalRef) || isa(x, QuoteNode) ||
(isa(x,Expr) && (x.head in (:static_parameter, :boundscheck, :copyast))) ||
isa(x, Number) || isa(x, AbstractString) || isa(x, Char) || isa(x, Tuple) ||
isa(x, Number) || isa(x, AbstractString) || isa(x, AbstractChar) || isa(x, Tuple) ||
isa(x, Type) || isa(x, Core.Box) || isa(x, Module) || x === nothing
return true
end
Expand Down
1 change: 1 addition & 0 deletions base/exports.jl
Original file line number Diff line number Diff line change
Expand Up @@ -563,6 +563,7 @@ export
bytes2hex,
chomp,
chop,
codepoint,
codeunit,
codeunits,
digits,
Expand Down
1 change: 1 addition & 0 deletions base/filesystem.jl
Original file line number Diff line number Diff line change
Expand Up @@ -170,6 +170,7 @@ function read(f::File, ::Type{Char})
end
return reinterpret(Char, c)
end
read(f::File, ::Type{T}) where {T<:AbstractChar} = T(read(f, Char)) # fallback

function unsafe_read(f::File, p::Ptr{UInt8}, nel::UInt)
check_open(f)
Expand Down
Loading