-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Change find() to return the same index type as pairs() #24774
Conversation
base/array.jl
Outdated
|
||
julia> find(isodd, A) | ||
2-element Array{CartesianIndex{2},1}: | ||
CartesianIndex(1, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A nice feature is that with #24651 the CartesianIndex
part won't be repeated for each element, making the returned object quite natural.
@@ -1789,22 +1802,10 @@ julia> find(falses(3)) | |||
``` | |||
""" | |||
function find(A) | |||
nnzA = count(t -> t != 0, A) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Due to issues with the inference of the returned eltype (see commit message), I've taken a radical approach simply using collect
instead of the custom loops. This means we no longer compute the length of the result before filling it. Benchmarks will be needed to check what's the best approach, but at first sight it doesn't sound obvious to me that doing two passes over the data is a good tradeoff, does it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Historically it was worth it, but might be worth benchmarking again. See also https://discourse.julialang.org/t/push-and-interfacing-to-the-runtime-library/7461, which suggests that two passes might still be faster (growing an array with push!
is 3x slower than growing it with setindex!
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also https://discourse.julialang.org/t/half-vectorization/7399/3, which benchmarks some conditional comprehensions with somewhat alarming results. I think you really need to benchmark this change before we can decide.
base/sparse/sparsematrix.jl
Outdated
@@ -1264,7 +1264,7 @@ function find(p::Function, S::SparseMatrixCSC) | |||
end | |||
sz = size(S) | |||
I, J = _findn(p, S) | |||
return sub2ind(sz, I, J) | |||
return CartesianIndex.(I, J) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With the new behavior of find
, findn
could probably be removed as it gives almost the same information (struct of arrays vs. array of structs).
This method could also be optimized a bit by not allocating I
and J
first (but it already does less work than before).
This was supposed to be a RFC, but it works so well with the new The extension of this change to |
Any comments? |
Where does |
It's passed by the user, and it's currently documented to be a linear index. IIUC that's because it's supposed to be possible to pass the result of the previous |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like this is moving in the right direction, thanks!
base/array.jl
Outdated
find(testf::Function, A) = collect(first(p) for p in _pairs(A) if testf(last(p))) | ||
|
||
_pairs(A::Union{AbstractArray, Associative}) = pairs(A) | ||
_pairs(iter) = zip(OneTo(typemax(Int)), iter) # safe for objects that don't implement length |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't you check the iterator trait here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean? We don't guarantee that iterators implement pairs
currently.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably use countfrom(1)
.
@@ -1789,22 +1802,10 @@ julia> find(falses(3)) | |||
``` | |||
""" | |||
function find(A) | |||
nnzA = count(t -> t != 0, A) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See also https://discourse.julialang.org/t/half-vectorization/7399/3, which benchmarks some conditional comprehensions with somewhat alarming results. I think you really need to benchmark this change before we can decide.
So if |
It would be safe, but that iterator state needs to also be an index, since the whole point of
I'm not sure in what cases ambiguities could be a problem here. Can you develop? |
I think |
Indeed. And I guess it would be really inefficient to use the key itself as the iterator state.
That restriction is quite annoying considering that It seems that the only general solution would be to use something like Overall, I guess you're right that we should state that |
I think at that point it might be equivalent to where |
EDIT: sorry, should have refreshed my browser window before posting this There are iterable containers that may not implement indexing. For example, consider a run-length encoded container. You could imagine choosing to not implement For an
Sure. |
@vtjnash I'm not sure what you mean, isn't that mostly the same as what the PR does? What is @timholy Yeah, for general iterables |
@nanosoldier |
Let's try again, just in case: @nanosoldier |
The @nanosoldier |
I'm pretty sure that's because of
which seems pretty breaking in general. |
OK. It's funny that tests don't cover it, but that benchmarks do. So we have to decide whether that's the right thing to do before possibly adapting the benchmarks. |
0c4f930
to
2faeb23
Compare
I agree we should merge this soon so that we can make progress on related I've also made a PR to fix BaseBenchmarks (JuliaCI/BaseBenchmarks.jl#157), and another one to make it possible to use Regarding the public API, I think the behavior could be refined a bit after merging the PR. Currently it uses
|
NEWS.md
Outdated
`AbstractDict` objects ([#24774]). In particular, this means that it returns | ||
`CartesianIndex` objects for matrices and higher-dimensional arrays instead of | ||
always returning linear indices as it was previously the case. Use | ||
`Int[LinearIndices(size(a))[i] for i in find(f, a)]` to compute linear indices |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@timholy Is there any reason why you cannot index LinearIndices
objects with an array of cartesian indices? That would provide a much more convenient syntax.
Though as I noted in my last comment it would make sense to provide a find
method so that no temporary allocation is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand your question correctly, that's possible because
julia> LinearIndices <: AbstractArray
true
Here's a demo of what I mean:
julia> linear = LinearIndices(1:3, 1:5)
LinearIndices{2,Tuple{UnitRange{Int64},UnitRange{Int64}}} with indices 1:3×1:5:
1 4 7 10 13
2 5 8 11 14
3 6 9 12 15
julia> linear[3]
3
julia> linear[3,2]
6
julia> linear[[CartesianIndex(3,2), CartesianIndex(2,3)]]
2-element Array{Int64,1}:
6
8
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I wonder why I thought it didn't work... So everything is fine.
While I'm at it, should we also support linear[[1, 2]]
? That would be convenient for backward compatibility, since you would be able to apply this operation to indices returned by find
, and it would be a no-op on 0.6. It would also be more consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have that already, too 😄. But it currently fails because we seem to be missing a size
method for LinearIndices
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #25541.
For Strings at least I definitely think it should return real usable string indices (not codepoint number). |
This does not change anything for AbstractVectors and general iterables, which continue to use linear indices. For other AbstractArrays, return CartesianIndexes (rather than linear indices). For Dicts, return keys (previously not supported at all). Relying on collect() to choose the return element type allows supporting any definition of pairs(), including that for Dict, which creates a standard Generator for which eltype() returns Any.
OK, I've added a commit so that the same index type as Thanks to @JeffBezanson's fix to type stability, I think the PR is ready. However, if we want to avoid breaking Nanosolider, JuliaCI/BaseBenchmarks.jl#157 should be merged first. |
…ple, and add tests
@nanosoldier |
Your benchmark job has completed - possible performance regressions were detected. A full report can be found here. cc @ararslan |
Nanosoldier report appears to contain lots of noise (with the usual suspects). It's surprising to see that some I'm merging despite these regressions so that we can finish the search & find API changes, but I have filed an issue to remember that we need to do something (e.g. add specialized |
I should have noted that the next step is now to change |
This does not change anything for
AbstractVectors
and general iterables,which continue to use linear indices. For other
AbstractArrays
, returnCartesianIndex
es (rather than linear indices). ForDict
, returnkeys (previously not supported at all).
Relying on
collect
to choose the return element type allows supportingany definition of
pairs
, including that forDict
, which creates a standardGenerator
for whicheltype
returnsAny
.Fixes #20684.