Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change find() to return the same index type as pairs() #24774

Merged
merged 4 commits into from
Jan 10, 2018
Merged

Conversation

nalimilan
Copy link
Member

@nalimilan nalimilan commented Nov 25, 2017

This does not change anything for AbstractVectors and general iterables,
which continue to use linear indices. For other AbstractArrays, return
CartesianIndexes (rather than linear indices). For Dict, return
keys (previously not supported at all).

Relying on collect to choose the return element type allows supporting
any definition of pairs, including that for Dict, which creates a standard
Generator for which eltype returns Any.

Fixes #20684.

base/array.jl Outdated

julia> find(isodd, A)
2-element Array{CartesianIndex{2},1}:
CartesianIndex(1, 1)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A nice feature is that with #24651 the CartesianIndex part won't be repeated for each element, making the returned object quite natural.

@@ -1789,22 +1802,10 @@ julia> find(falses(3))
```
"""
function find(A)
nnzA = count(t -> t != 0, A)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Due to issues with the inference of the returned eltype (see commit message), I've taken a radical approach simply using collect instead of the custom loops. This means we no longer compute the length of the result before filling it. Benchmarks will be needed to check what's the best approach, but at first sight it doesn't sound obvious to me that doing two passes over the data is a good tradeoff, does it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Historically it was worth it, but might be worth benchmarking again. See also https://discourse.julialang.org/t/push-and-interfacing-to-the-runtime-library/7461, which suggests that two passes might still be faster (growing an array with push! is 3x slower than growing it with setindex!).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also https://discourse.julialang.org/t/half-vectorization/7399/3, which benchmarks some conditional comprehensions with somewhat alarming results. I think you really need to benchmark this change before we can decide.

@@ -1264,7 +1264,7 @@ function find(p::Function, S::SparseMatrixCSC)
end
sz = size(S)
I, J = _findn(p, S)
return sub2ind(sz, I, J)
return CartesianIndex.(I, J)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new behavior of find, findn could probably be removed as it gives almost the same information (struct of arrays vs. array of structs).

This method could also be optimized a bit by not allocating I and J first (but it already does less work than before).

@nalimilan
Copy link
Member Author

This was supposed to be a RFC, but it works so well with the new pairs/keys framework, and it breaks no tests, so it may actually be a real PR. The main question is of course performance (but currently BaseBenchmarks doesn't seem to test find systematically).

The extension of this change to findnext and findprev appears to be more difficult. Ideally, we would need to be able to iterate over keys(A) by calling next(keys(A), i), with i the index passed by the user. But I'm not sure we can assume that the iterator state is the index. Cf. @JeffBezanson's proposal to use Iterators.rest for findeach in the Search & Find Julep.

@nalimilan nalimilan added the search & find The find* family of functions label Nov 25, 2017
@nalimilan
Copy link
Member Author

Any comments?

@timholy
Copy link
Member

timholy commented Dec 4, 2017

The extension of this change to findnext and findprev appears to be more difficult. Ideally, we would need to be able to iterate over keys(A) by calling next(keys(A), i), with i the index passed by the user. But I'm not sure we can assume that the iterator state is the index.

Where does i come from? If there's no place where anything other than the iterator state gets back to the user, perhaps it would be safe.

@nalimilan
Copy link
Member Author

Where does i come from? If there's no place where anything other than the iterator state gets back to the user, perhaps it would be safe.

It's passed by the user, and it's currently documented to be a linear index. IIUC that's because it's supposed to be possible to pass the result of the previous findnext iteration, after adding 1 or callind nextind, so that you can get the next match.

Copy link
Member

@timholy timholy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is moving in the right direction, thanks!

base/array.jl Outdated
find(testf::Function, A) = collect(first(p) for p in _pairs(A) if testf(last(p)))

_pairs(A::Union{AbstractArray, Associative}) = pairs(A)
_pairs(iter) = zip(OneTo(typemax(Int)), iter) # safe for objects that don't implement length
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't you check the iterator trait here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? We don't guarantee that iterators implement pairs currently.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably use countfrom(1).

@@ -1789,22 +1802,10 @@ julia> find(falses(3))
```
"""
function find(A)
nnzA = count(t -> t != 0, A)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See also https://discourse.julialang.org/t/half-vectorization/7399/3, which benchmarks some conditional comprehensions with somewhat alarming results. I think you really need to benchmark this change before we can decide.

@timholy
Copy link
Member

timholy commented Dec 4, 2017

So if findfirst passes back an iterator state and the user always uses nextind, it would be safe, right? Is there any case where the meaning of i is potentially confuse-worthy? (Strings, presumably?)

@nalimilan
Copy link
Member Author

So if findfirst passes back an iterator state and the user always uses nextind, it would be safe, right?

It would be safe, but that iterator state needs to also be an index, since the whole point of findnext is to be able to index the collection with the result. So we would have to require that keys(c) uses indices as its iterator state for all collections. Maybe that's a reasonable requirement, I'm not sure.

Is there any case where the meaning of i is potentially confuse-worthy? (Strings, presumably?)

I'm not sure in what cases ambiguities could be a problem here. Can you develop?

@vtjnash
Copy link
Member

vtjnash commented Dec 4, 2017

But I'm not sure we can assume that the iterator state is the index
require that keys(c) uses indices as its iterator state for all collections

I think Associative might be the obvious counter-example, since in that case the state is an integer, but the keys are anything. But we might also want to just make it into a tautology, such that findnext is also just defined to work only for containers where the indices are equivalent to the keys.

@nalimilan
Copy link
Member Author

I think Associative might be the obvious counter-example, since in that case the state is an integer, but the keys are anything.

Indeed. And I guess it would be really inefficient to use the key itself as the iterator state.

But we might also want to just make it into a tautology, such that findnext is also just defined to work only for containers where the indices are equivalent to the keys.

That restriction is quite annoying considering that findnext would make sense e.g. for an ordered dict. Also the problem is not really "key" vs. "index", it's really "iterator state" vs. "key/index".

It seems that the only general solution would be to use something like findeach (which @JeffBezanson proposed in the Search & Find Julep), but in his proposal iterating over that object would only return the value and state, and not the index. We could have it return ((index, value), state) if we really wanted. Anyway that's post-1.0.

Overall, I guess you're right that we should state that findnext and findprev only work for collections for which nextind works, i.e. you call findnext, you can pass its return value to getindex and/or to nextind (or do + 1 in special cases where it's the same), and then call findnext again. Everything else cannot be supported.

@vtjnash
Copy link
Member

vtjnash commented Dec 4, 2017

but in his proposal iterating over that object would only return the value and state, and not the index

I think at that point it might be equivalent to Iterators.Filter? e.g.:
(g(first(pair)) for pair in pairs(assoc) if filter(last(pair)))

where filter might be something like Base.EqualTo(0)

@timholy
Copy link
Member

timholy commented Dec 5, 2017

EDIT: sorry, should have refreshed my browser window before posting this

There are iterable containers that may not implement indexing. For example, consider a run-length encoded container. You could imagine choosing to not implement getindex (which would have O(logN) performance rather than O(1) performance), and yet have findnext be a sensible operation.

For an AbstractArray, it would of course be better to return the index than the iterator state. But maybe for anything else it should be the iterator state? For a general iterable container, how do you implement findnext if you can't be sure that what gets passed in is the iterator state?

I'm not sure in what cases ambiguities could be a problem here. Can you develop?

Sure. enumerate is a good example, where what gets returned is a counter which might not be a valid index for vectors with indexing that doesn't start at 1. If that doesn't ever happen here, I think you're reasonably safe.

@nalimilan
Copy link
Member Author

I think at that point it might be equivalent to Iterators.Filter? e.g.:
(g(first(pair)) for pair in pairs(assoc) if filter(last(pair)))

where filter might be something like Base.EqualTo(0)

@vtjnash I'm not sure what you mean, isn't that mostly the same as what the PR does? What is g?

@timholy Yeah, for general iterables findnext could return the iterator state, and it would only be useful to pass to nextind and then findnext again. But that kind of use would be more suited to findeach, since the iterator state cannot be used for anything else than that (as opposed to arrays where you may want to do many useful things with the returned index).

@nalimilan
Copy link
Member Author

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@nalimilan
Copy link
Member Author

Let's try again, just in case: @nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: InterruptException:

Logs and partial data can be found here
cc @ararslan

@ararslan
Copy link
Member

ararslan commented Dec 8, 2017

The InterruptException was me. There was a server error so I restarted it. Oddly enough the server seemed to have survived through the error; that's actually the first time I've ever seen Nanosoldier publicly announce that it was interrupted.

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Something went wrong when running your job:

NanosoldierError: failed to run benchmarks against primary commit: failed process: Process(`sudo cset shield -e su nanosoldier -- -c ./benchscript.sh`, ProcessExited(1)) [1]

Logs and partial data can be found here
cc @ararslan

@ararslan
Copy link
Member

ararslan commented Dec 8, 2017

ERROR: LoadError: MethodError: no method matching isless(::CartesianIndex{2}, ::Int64)
Closest candidates are:
  isless(!Matched::Missing, ::Any) at missing.jl:47
  isless(!Matched::AbstractFloat, ::Real) at operators.jl:126
  isless(!Matched::Real, ::Real) at operators.jl:302
  ...
Stacktrace:
 [1] <(::CartesianIndex{2}, ::Int64) at ./operators.jl:227
 [2] getindex(::SparseMatrixCSC{Float64,Int64}, ::Array{CartesianIndex{2},1}, ::Array{CartesianIndex{2},1}) at ./sparse/sparsematrix.jl:2235
 [3] perf_sparse_fem(::Int64) at /home/nanosoldier/.julia/v0.7/BaseBenchmarks/src/problem/SparseFEM.jl:27
 [4] ##core#9020() at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/execution.jl:312
[5] ##sample#9021(::BenchmarkTools.Parameters) at /home/nanosoldier/.julia/v0.7/BenchmarkTools/src/execution.jl:318
...

I'm pretty sure that's because of

For other AbstractArrays, return CartesianIndexes (rather than linear indices)

which seems pretty breaking in general.

@nalimilan
Copy link
Member Author

OK. It's funny that tests don't cover it, but that benchmarks do. So we have to decide whether that's the right thing to do before possibly adapting the benchmarks.

@nalimilan nalimilan added the triage This should be discussed on a triage call label Jan 5, 2018
@nalimilan
Copy link
Member Author

I agree we should merge this soon so that we can make progress on related find issues. However I've realized that the new functions are type-unstable due to an inference issue (see #25433). Not sure whether we should hold this off until it's fixed or not. It would be nice to find a workaround.

I've also made a PR to fix BaseBenchmarks (JuliaCI/BaseBenchmarks.jl#157), and another one to make it possible to use LinearIndices with Compat (JuliaLang/Compat.jl#446), to get the previous behavior. We could also imagine adding a simpler/more efficient method, e.g. find(Int, pred, a).

Regarding the public API, I think the behavior could be refined a bit after merging the PR. Currently it uses pairs for AbstractArray and AbstractDict, and linear indices elsewhere. This is debatable for a few other types:

  • Linear indices (i.e. number of codepoints) for String make intuitive sense, but they cannot be used to index into the string. OTOH returning the actual string index could be weird.
  • Linear indices for NamedTuples could be replaced with the names of matching entries.
  • In general, for custom types, it would be simpler to decide that we use pairs/keys if it's defined, falling back on linear indices when it's not. But I'm not sure it's OK to use method_exists for that given the inference issues it creates.

NEWS.md Outdated
`AbstractDict` objects ([#24774]). In particular, this means that it returns
`CartesianIndex` objects for matrices and higher-dimensional arrays instead of
always returning linear indices as it was previously the case. Use
`Int[LinearIndices(size(a))[i] for i in find(f, a)]` to compute linear indices
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timholy Is there any reason why you cannot index LinearIndices objects with an array of cartesian indices? That would provide a much more convenient syntax.

Though as I noted in my last comment it would make sense to provide a find method so that no temporary allocation is needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand your question correctly, that's possible because

julia> LinearIndices <: AbstractArray
true

Here's a demo of what I mean:

julia> linear = LinearIndices(1:3, 1:5)
LinearIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}} with indices 1:3×1:5:
 1  4  7  10  13
 2  5  8  11  14
 3  6  9  12  15

julia> linear[3]
3

julia> linear[3,2]
6

julia> linear[[CartesianIndex(3,2), CartesianIndex(2,3)]]
2-element Array{Int64,1}:
 6
 8

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I wonder why I thought it didn't work... So everything is fine.

While I'm at it, should we also support linear[[1, 2]]? That would be convenient for backward compatibility, since you would be able to apply this operation to indices returned by find, and it would be a no-op on 0.6. It would also be more consistent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We have that already, too 😄. But it currently fails because we seem to be missing a size method for LinearIndices.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See #25541.

@JeffBezanson
Copy link
Member

For Strings at least I definitely think it should return real usable string indices (not codepoint number).

This does not change anything for AbstractVectors and general iterables,
which continue to use linear indices. For other AbstractArrays, return
CartesianIndexes (rather than linear indices). For Dicts, return keys
(previously not supported at all).

Relying on collect() to choose the return element type allows supporting
any definition of pairs(), including that for Dict, which creates a standard
Generator for which eltype() returns Any.
@nalimilan
Copy link
Member Author

OK, I've added a commit so that the same index type as keys returns is used for AbstractString, Tuple and NamedTuple. That sounds more consistent to me, and I think we should introduce a trait to distinguish indexable collections, for which we should always use keys/pairs.

Thanks to @JeffBezanson's fix to type stability, I think the PR is ready. However, if we want to avoid breaking Nanosolider, JuliaCI/BaseBenchmarks.jl#157 should be merged first.

@ararslan
Copy link
Member

@nanosoldier runbenchmarks(ALL, vs=":master")

@nanosoldier
Copy link
Collaborator

Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan

@nalimilan
Copy link
Member Author

nalimilan commented Jan 10, 2018

Nanosoldier report appears to contain lots of noise (with the usual suspects). It's surprising to see that some find benchmarks on Array now use less memory. OTOH find on generators uses less memory but is somewhat slower (by up to 36%), with some hard to explain differences depending on the element types. Finally, arrays and generators of Bool are much slower (up to 4 times), which is probably due to the fact that we no longer compute the length first to allocate the result beforehand instead of calling push!.

I'm merging despite these regressions so that we can finish the search & find API changes, but I have filed an issue to remember that we need to do something (e.g. add specialized AbstractArray{Bool} methods which preallocate, if push! cannot be made fast enough).

@nalimilan
Copy link
Member Author

I should have noted that the next step is now to change findfirst and findlast to return the same index type as find. findnext and findprev should also accept these indices, but this can be added without breaking the API since the input index type should determine the output index type.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
search & find The find* family of functions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants