Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make reverseind generic #24613

Closed
StefanKarpinski opened this issue Nov 15, 2017 · 14 comments
Closed

make reverseind generic #24613

StefanKarpinski opened this issue Nov 15, 2017 · 14 comments
Assignees
Labels
strings "Strings!"
Milestone

Comments

@StefanKarpinski
Copy link
Member

StefanKarpinski commented Nov 15, 2017

It has always bothered me that strings have to define reverseind. But figuring out a correct generic definition for this function has eluded me – until now. I think I've finally figured it out:

  • reverseind(s,i) gives the index in s of the character beginning at byte i in reverse(s).
  • Then ncodeunits(s)-i+1 is index of the end of that character in s and ncodeunits(s)-i+2 is the index of the beginning of the next character in s (or the index right after the end of s).
  • Therefore prevind(s, ncodeunits(s)-i+2) is always the index of the character in question in s. In other words, this is a generic expression for reverseind(s,i) in terms of prevind and ncodeunits.

Edit: I've replaced sizeof with ncodeunits as suggested below.

This does actually work out:

julia> s = "∀ x ∃ y"
"∀ x ∃ y"

julia> [reverseind(s,i) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1

julia> [prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
11-element Array{Int64,1}:
 11
 10
  7
  7
  7
  6
  5
  4
  1
  1
  1

The only problem with this is that it requires a generic definition of sizeof(s) which does not exist, and arguably should not exist for string types that may not be backed by bytes in the usual way. Instead, I would suggest using nextind(s, endof(s)) and giving this some generic function name. This function is something that specific string types may want to overwrite, but that's much easier to do since for typical string types, it's just the storage size of the string.

cc: @stevengj

@StefanKarpinski StefanKarpinski added strings "Strings!" triage This should be discussed on a triage call labels Nov 15, 2017
@oxinabox
Copy link
Contributor

oxinabox commented Nov 15, 2017

I am now convinced this works, purely empirically.

Fuzzer:


julia> newcheck(s)=[prevind(s, sizeof(s)-i+2) for i=1:sizeof(s)]
newcheck (generic function with 1 method)

julia> oldcheck(s)=[reverseind(s,i) for i=1:sizeof(s)]
oldcheck (generic function with 1 method)

julia> for t in [join(rand(Char, rand(1:100))) for _ in 1:10^5]
	Base.Test.@test(newcheck(t) == oldcheck(t))
end

no errors.

@stevengj
Copy link
Member

stevengj commented Nov 15, 2017

sizeof is wrong here, because it is the size in bytes. e.g. it will fail for a UTF-16 array.

What you want is the number of code units, which I think should be nextind(s,endof(s))-1.

Since we have a codeunit(s, i) function, it makes sense to have a lengthcodeunits(s) function or similar that gives the maximum index. Not sure of a good name.

@StefanKarpinski
Copy link
Member Author

Yes, that's right. So the generic definitions would be:

reverseind(s::AbstractString, i::Int) = prevind(s, ncodeunits(s)-i+2)
ncodeunits(s::AbstractString) = nextind(s, endof(s))-1

and you'd have these specific definitions for speed:

ncodeunits(s::String) = sizeof(s)
ncodeunits(s::UTF16String) = sizeof(s) >> 1
ncodeunits(s::UTF32String) = sizeof(s) >> 2

I think everything else falls out of the definition of prevind which is complex for String and UTF16String but just does i-1 for UTF32String.

@stevengj
Copy link
Member

I would just define ncodeunits(s::UTF16String) = length(s.data), but yes.

@StefanKarpinski
Copy link
Member Author

[Not breaking, so removing the triage label.]

@StefanKarpinski StefanKarpinski removed the triage This should be discussed on a triage call label Nov 16, 2017
@StefanKarpinski StefanKarpinski self-assigned this Nov 16, 2017
@StefanKarpinski
Copy link
Member Author

This relies on the indices into a string being the same as its code unit indices. We haven't formally required that before, but I think that we should – that's how all actual string types we've seen work and it's hard to imagine any other way to do this.

@stevengj
Copy link
Member

It won't work if we define a StringIndex type (#9297), because then the - 1 might not be defined.

@StefanKarpinski
Copy link
Member Author

I suspect we're not going to move ahead with #9297, but if we do then string types will just have to define ncodeunits directly and we can't provide a fallback for them anymore.

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Nov 20, 2017

I've realized that there's a complication here. The contract of reverseind is essentially the identity

s[reverseind(s,i)] == reverse(s)[i]

However, there's an assumption baked into this which is a bit of an issue: the type and encoding of reverse(s). Making reverseind generic in the way I've proposed assumes that the type and encoding of reverse(s) is the same as that of s. Until now reverse(::String) has returned a RevString, which has made changing this behavior harder than expected since the generic definition does not work. We can fix this particular issue, but this raises a basic question:

  1. Should reverse(s) have the same type and encoding as s? OR...
  2. Should reverse(s) return a String – i.e. normalize to standard string type?

We can only have a correct generic fallback for reverseind for one choice, not both – since they dictate different behaviors for reverseind. I'm inclined to go with option 1 for a couple of reasons:

  • If someone is working with a specific encoding, it's likely for a reason and we should respect that unless they specifically request changing encodings by converting types.
  • An efficient definition of reverseind is possible for the same type, under fairly reasonable assumptions that are valid, e.g. for UTF-8 and UTF-16 (and trivially UTF-32).
  • An efficient generic definition of reverseind between String and a generic encoding isn't possible in the same way (as far as I can tell).

@stevengj
Copy link
Member

stevengj commented Nov 20, 2017

@StefanKarpinski, recall that this was discussed in #23612, and in consequence we documented that reverse(s) always returns a String. (That would also argue against a generic reverseind.)

@StefanKarpinski
Copy link
Member Author

StefanKarpinski commented Nov 20, 2017

I think that requiring people to define reverseind for custom string types is pretty unfortunate. It's a complicated and very weird function, yet string types don't work correctly without defining it. That's pretty bad. The other approach that's possible is to double down on RevString and always have reverse(s) return RevString(s).

@stevengj
Copy link
Member

stevengj commented Nov 20, 2017

Having reverse(s) return RevString(s) seems like a step backwards. As discussed in #6165, the only realistic justification for having a reverse(s) function for strings in the first place is to support reverse-order processing with external libraries like PCRE, and only physically reversing the data will accomplish this.

@StefanKarpinski
Copy link
Member Author

That seems to argue for reverse(s) generally returning a string of the same type with the same encoding. String types that don't have that property will need to define their own reverseind.

@stevengj
Copy link
Member

So, no fallback for reverse(s::AbstractString)? I’m fine with that.

@StefanKarpinski StefanKarpinski added this to the 1.0 milestone Nov 22, 2017
StefanKarpinski added a commit that referenced this issue Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 22, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 27, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 28, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Nov 29, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
StefanKarpinski added a commit that referenced this issue Dec 4, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close #22611
Close #24613

See also: #10593 #23612 #24103
evetion pushed a commit to evetion/julia that referenced this issue Dec 12, 2017
These seem unrelated, but they're actually linked:

* If you reverse generic strings by wrapping them in `RevString` then
  then this generic `reverseind` is incorrect.

* In order to have a correct generic `reverseind` one needs to assume
  that `reverse(s)` returns a string of the same type and encoding as
  `s` with code points in reverse order; one also needs to assume that
  the code units encoding each character remain the same when reversed.
  This is a valid assumption for UTF-8, UTF-16 and (trivially) UTF-32.

Reverse string search functions are pretty messed up by this and I've
fixed them well enough to work but they may be quite inefficient for
long strings now. I'm not going to spend too much time on this since
there's other work going on to generalize and unify searching APIs.

Close JuliaLang#22611
Close JuliaLang#24613

See also: JuliaLang#10593 JuliaLang#23612 JuliaLang#24103
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!"
Projects
None yet
Development

No branches or pull requests

3 participants