make reverseind generic #24613
Comments
I am now convinced this works, purely empirically. Fuzzer:
no errors. |
What you want is the number of code units, which I think should be Since we have a |
Yes, that's right. So the generic definitions would be:

    reverseind(s::AbstractString, i::Int) = prevind(s, ncodeunits(s)-i+2)
    ncodeunits(s::AbstractString) = nextind(s, endof(s))-1

and you'd have these specific definitions for speed:

    ncodeunits(s::String) = sizeof(s)
    ncodeunits(s::UTF16String) = sizeof(s) >> 1
    ncodeunits(s::UTF32String) = sizeof(s) >> 2

I think everything else falls out of the definition of |
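As a rough check of the fallback proposed above, the sketch below compares it against the built-in `ncodeunits` for a few UTF-8 strings. It assumes Julia 1.x, where `endof` is spelled `lastindex`, and the name `fallback_ncodeunits` is hypothetical, chosen only to avoid overwriting the method in Base.

```julia
# Hypothetical helper name (not part of Base): the generic fallback
# discussed above, written with `lastindex`, the Julia 1.x name for `endof`.
fallback_ncodeunits(s::AbstractString) = nextind(s, lastindex(s)) - 1

# For `String`, the number of code units is the byte size,
# so all three quantities should agree.
for s in ("abc", "αβγ", "naïve ✨")
    @assert fallback_ncodeunits(s) == ncodeunits(s) == sizeof(s)
end
```

The fallback needs only `nextind` and the last valid index, which is why a string type without byte-backed storage could still support it.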
I would just define |
[Not breaking, so removing the |
This relies on the indices into a string being the same as its code unit indices. We haven't formally required that before, but I think that we should – that's how all actual string types we've seen work and it's hard to imagine any other way to do this. |
It won't work if we define a |
I suspect we're not going to move ahead with #9297, but if we do then string types will just have to define |
I've realized that there's a complication here. The contract of s[reverseind(s,i)] == reverse(s)[i] However, there's an assumption baked into this which is a bit of an issue: the type and encoding of
We can only have a correct generic fallback for
|
@StefanKarpinski, recall that this was discussed in #23612, and in consequence we documented that |
I think that requiring people to define |
Having |
That seems to argue for |
So, no fallback for reverse(s::AbstractString)? I’m fine with that. |
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then this generic `reverseind` is incorrect.
* In order to have a correct generic `reverseind` one needs to assume that `reverse(s)` returns a string of the same type and encoding as `s` with code points in reverse order; one also needs to assume that the code units encoding each character remain the same when reversed. This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've fixed them well enough to work but they may be quite inefficient for long strings now. I'm not going to spend too much time on this since there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
It has always bothered me that strings have to define `reverseind`. But figuring out a correct generic definition for this function has eluded me – until now. I think I've finally figured it out:

* `reverseind(s,i)` gives the index in `s` of the character beginning at byte `i` in `reverse(s)`.
* `ncodeunits(s)-i+1` is the index of the end of that character in `s` and `ncodeunits(s)-i+2` is the index of the beginning of the next character in `s` (or the index right after the end of `s`).
* `prevind(s, ncodeunits(s)-i+2)` is always the index of the character in question in `s`.

In other words, this is a generic expression for `reverseind(s,i)` in terms of `prevind` and `ncodeunits`.

Edit: I've replaced `sizeof` with `ncodeunits` as suggested below.

This does actually work out:
The only problem with this is that it requires a generic definition of `sizeof(s)` which does not exist, and arguably should not exist for string types that may not be backed by bytes in the usual way. Instead, I would suggest using `nextind(s, endof(s))` and giving this some generic function name. This function is something that specific string types may want to overwrite, but that's much easier to do since for typical string types, it's just the storage size of the string.

cc: @stevengj
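The index arithmetic described in this issue can be spot-checked with a small sketch. It is not the original check from the issue. It assumes Julia 1.x and the built-in `String` type, and `generic_reverseind` is a hypothetical name used so that the method in Base is left untouched.

```julia
# Hypothetical helper name: the generic expression from this issue.
# ncodeunits(s) - i + 2 is the index just past the character in question,
# so `prevind` steps back to the start of that character.
generic_reverseind(s::AbstractString, i::Integer) =
    prevind(s, ncodeunits(s) - i + 2)

for s in ("abc", "αβγ", "naïve ✨")
    r = reverse(s)
    for i in eachindex(r)              # valid character indices of reverse(s)
        j = generic_reverseind(s, i)
        @assert j == reverseind(s, i)  # agrees with the definition in Base
        @assert s[j] == r[i]           # the contract s[reverseind(s,i)] == reverse(s)[i]
    end
end
```

This only exercises valid UTF-8 `String` values, where `reverse(s)` keeps the type and encoding of `s`; it says nothing about wrapper types such as `RevString`.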