remove `RevString`; efficient generic `reverseind` #24708
c5178ef to 81b8cc4
@StefanKarpinski Per discussion in #24414 will you rename |
(Spelling out |
After considering it for a while, I think I still prefer the trio of |
base/strings/string.jl
Outdated
"""
    ncodeunits(s::AbstractString)

The number of code units in a string. For eample, for UTF-8-like data such as
"example"
base/strings/string.jl
Outdated
The number of code units in a string. For eample, for UTF-8-like data such as
the default `String` type, the number of code units is the number of bytes in
the string, aka `sizeof(s)`. For a UTF-16 encoded string type, however, the
"a.k.a."
base/strings/substring.jl
Outdated
    reverse(s::AbstractString) -> AbstractString

Reverses a string.
    reverse(s::AbstractString) -> String
I'd keep `AbstractString` (or remove any mention of the return type), since an implementation e.g. for `UTF32String` would rather return a string of the same type.
Yes, sorry, I changed this before I realized that `reverse` needs to return the same type.
base/strings/substring.jl
Outdated
Technically, this function reverses the codepoints in a string, and its
main utility is for reversed-order string processing, especially for reversed
The mention of the "main utility" was useful IMHO. It helps understand that the result does not have any particular meaning except for searches. Else people can complain that "Julia doesn't know how to reverse a Unicode string correctly" due to things like `reverse("noël") == "l̈eon"`.
+1
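For readers following this thread, a small sketch of the behavior being described, assuming Julia 1.x. The sample string is written with an explicit combining diaeresis (U+0308) so the effect is unambiguous; the grapheme-based comparison uses the `Unicode` standard library and is only an illustration of the contrast, not something this PR proposes.

```julia
using Unicode   # standard library; provides `graphemes`

# 'e' followed by U+0308 COMBINING DIAERESIS renders as "noël".
s = "noe\u0308l"

# `reverse` reverses code points, so the combining mark ends up after 'l'.
r = reverse(s)
@assert r == "l\u0308eon"
@assert collect(r) == reverse(collect(s))

# Illustrative contrast: reversing by grapheme keeps the mark attached to 'e'.
g = join(reverse(collect(graphemes(s))))
@assert g == "le\u0308on"
```

The code point order is what matters for reversed-order searching, which is why the docstring's note about the function's main utility is worth keeping.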
base/strings/substring.jl
Outdated
reverse(s::RevString) = s.string

## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::AbstractString)::String
I thought in #24613 you wanted to remove the `reverse(::AbstractString)` fallback, and always return a string of the same encoding (so that `reverseind` works)?
So shouldn't this be `s::Union{String,SubString{String}}`, here?
Yes, I should definitely fix this. This is a holdover from the first pass through this when I thought that we could have `reverse` always produce a `String`.
base/strings/substring.jl
Outdated
## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::AbstractString)::String
    sprint() do io
If the input is a `String` or substring thereof, can't we do better than `sprint` because we know exactly the size of the buffer that is required? i.e. we can just allocate `v = StringVector(sizeof(s))`, write into that, and return `String(v)`? Also, you can just write codeunits directly, rather than converting to/from `Char` via `s[j]`...
Oh, I guess you are keeping `reverse(s::String)` in string.jl .... it seems like that routine could be generalized to work for `SubString{String}` too, by calling `codeunit` rather than converting the input to `Vector{UInt8}`.
👍 I'm honestly still just trying to get all the tests to pass 😅
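A rough sketch of the suggestion above, assuming Julia 1.x and valid UTF-8 input. The name `reverse_utf8` is hypothetical and this is not the code in the PR; it uses a plain `Vector{UInt8}` in place of the internal `StringVector` helper and copies each character's code units directly instead of going through `Char`.

```julia
# Hypothetical helper (not Base code): reverse a UTF-8 string by copying each
# character's code units into a preallocated byte buffer.
function reverse_utf8(s::Union{String,SubString{String}})
    n = ncodeunits(s)
    out = Vector{UInt8}(undef, n)    # exact buffer size is known up front
    units = codeunits(s)
    pos = n                          # last unfilled position in `out`
    i = 1
    while i <= n
        j = nextind(s, i)            # index where the next character starts
        len = j - i                  # number of code units in this character
        copyto!(out, pos - len + 1, units, i, len)
        pos -= len
        i = j
    end
    return String(out)
end

@assert reverse_utf8("no\u00ebl") == reverse("no\u00ebl")
```

Because it only relies on `ncodeunits`, `codeunits` and `nextind`, the same method body serves both `String` and `SubString{String}`.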
So the test situation is a bit of a shitshow. But they all seem to be unrelated... so yay?
Thanks for the careful review @nalimilan and @stevengj – I'll fix and update soon.
81b8cc4 to c46559e
base/strings/substring.jl
Outdated
## reverse an index i so that reverse(s)[i] == s[reverseind(s,i)]
function reverse(s::S)::S where {S<:AbstractString}
    # TODO: generic API for sprint to a particular encoding
I'm not sure whether it's a good idea to have this generic `reverse` method or not. It does ensure that the reversed string type is the same as the input string type, but it doesn't do so in a particularly efficient way. Calling `sprint` with a string encoding based on the input string type could make this both generic and efficient, but we don't currently have a way to do that in Base. Thoughts, @nalimilan, @stevengj?
I don't understand how this implementation ensures that the reversed string type is the same as the input string type. Doesn't `sprint()` always return `String`?
I'd just implement this for `String` and `SubString`, and leave it for custom implementations to implement it. We can always add a generic method after 1.0 if we want.
It has a return type annotation which implicitly calls convert for you. Too subtle?
Oh, I missed that; still not used to looking for return-type annotations. +1 for omitting this fallback entirely.
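Since the return-type annotation was easy to miss, here is a small sketch of the mechanism, assuming Julia 1.x. The names `as_float` and `reverse_generic` are hypothetical; `reverse_generic` mirrors the shape of the fallback under discussion but is not the PR's code.

```julia
# A return-type annotation converts the returned value to the declared type.
as_float(x)::Float64 = x
@assert as_float(1) === 1.0

# Hypothetical generic fallback in the style discussed above (not Base code).
function reverse_generic(s::S)::S where {S<:AbstractString}
    str = sprint() do io
        for c in reverse(collect(s))
            print(io, c)
        end
    end
    return str   # `str` is a String; the `::S` annotation converts it to S
end

sub = SubString("hello", 2, 4)       # "ell"
@assert reverse_generic(sub) isa SubString{String}
@assert reverse_generic(sub) == "lle"
```

The conversion keeps the types consistent but still builds an intermediate `String` first, which is the inefficiency raised earlier in the thread.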
ecc0bf0 to f1ea3aa
f1ea3aa to 9f9f39d
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then this generic `reverseind` is incorrect.
* In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
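The assumptions listed in this commit message can be checked on a concrete UTF-8 string. This is a minimal sketch, assuming Julia 1.x where `reverse` and `reverseind` are available for `String`; the sample string is an arbitrary choice covering 1-, 2-, 3- and 4-byte characters.

```julia
s = "a\u03b2\u2208\U1f680"
r = reverse(s)

# Same type and the same total number of code units.
@assert typeof(r) == typeof(s)
@assert ncodeunits(r) == ncodeunits(s)

# Each character keeps its own code units; only the character order changes.
units_of(str) = [collect(codeunits(string(c))) for c in str]
@assert units_of(r) == reverse(units_of(s))

# The invariant from the source comment: reverse(s)[i] == s[reverseind(s, i)].
for i in eachindex(r)
    @assert r[i] == s[reverseind(s, i)]
end
```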
9f9f39d to 5167f17
Well that's a delightful surprise! I guess the remaining question here is deprecation. Should we deprecate |
Ok, all yours, @ararslan!
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then this generic `reverseind` is incorrect.
* In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs.

Defining `reverseind` generically to handle corner cases correctly also required introducing `ncodeunits` and adjusting the behavior of `thisind` somewhat. I could make separate PRs for those changes, but since they're required here I figured I'd leave them here. The commits are logically separate and can/should be reviewed separately.

See also: #10593, #23612, #24103
Closes: #22611, #24613
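To illustrate how `ncodeunits` and `thisind` can combine into a generic index reversal, here is a hedged sketch assuming Julia 1.x semantics for both functions. The name `my_reverseind` is hypothetical, and the one-line definition is an assumption about the general shape of such a method rather than a quote of the PR's code.

```julia
# Hypothetical sketch (not quoted from the PR): mirror the code unit position,
# then round down to the start of the character containing it.
my_reverseind(s::AbstractString, i::Integer) = thisind(s, ncodeunits(s) - i + 1)

s = "\u03b1\u03b2\u03b3"             # three 2-byte characters: "αβγ"
@assert ncodeunits(s) == 6
@assert thisind(s, 6) == 5           # byte 6 belongs to the character at index 5

r = reverse(s)
for i in eachindex(r)
    # Same invariant as in the source comment, using the sketch above.
    @assert r[i] == s[my_reverseind(s, i)]
end
```

The sketch leans on the assumption stated in the description: each character occupies the same number of code units in `s` and in `reverse(s)`, so mirroring a position always lands inside the matching character.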