Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a default implementation of length using iterate. #35947

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

andyferris
Copy link
Member

@andyferris andyferris commented May 20, 2020

Here is an alternative to #35946.

For all iterables with SizeUnknown we default to using iterate to get the length when requested. Algorithms can still take care to check IteratorSize on unknown iterables in order to know if the function is O(1) or not.

The particular (new) behavior I am seeking is:

julia> (length  skipmissing)([1, 2, missing, 4])
3

but I've had to write this defintion elsewhere before so I thought it made sense in Base. I also believe we should tie length to the iterate API, generically.

@andyferris andyferris force-pushed the ajf/default-length-definition branch from e1f3d41 to 4ae9169 Compare May 20, 2020 01:47
@andyferris andyferris added the iteration Involves iteration or the iteration protocol label May 20, 2020
@tkf
Copy link
Member

tkf commented May 20, 2020

I don't think Base.IteratorSize is enough to check if it is safe to use this implementation. It'd break (somewhat naive) generic code when the iterator is stateful. Maybe we need another iterator trait like IteratorState (with :> Union{PureState,MutableState}).

@JeffBezanson
Copy link
Member

I agree with @tkf; I think this violates the spirit of length. It might be ok for length to be O(n) provided the iterator indeed has a fixed, known length (e.g. a linked list), but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms. A statefulness/non-repeatability trait makes sense, but it's not 100% the same issue. For instance, you can read a whole file to count the number of lines, and then read it again to look at the data, but you probably don't want to do that, and if you do you should be pretty explicit.

@andyferris
Copy link
Member Author

Yeah - we could do with a trait to figure out if iterate is pure, and there would definitely be gotchas with this otherwise.

I'm assuming I should focus efforts on #35946 until then?

@tkf
Copy link
Member

tkf commented May 20, 2020

I don't think there is a way to implement iterate-based length reliably even we restrict it to skipmissing. It's very easy to come up with a case that this doesn't work:

julia> using Base.Broadcast: broadcasted, instantiate

julia> itr = (y ? missing : x for (x, y) in instantiate(broadcasted(tuple, 1:1000, broadcasted(rand, Bool))));

julia> Base.IteratorSize(itr)
Base.HasShape{1}()

julia> xs = skipmissing(itr);

julia> count(_ -> true, xs)
512

julia> count(_ -> true, xs)
470

@andyferris
Copy link
Member Author

reliably

@tkf Your example is that of a random iterator. You can't do anything reliable with it! 🤣 I would expect sum, count or any other reduction function to behave randomly - similarly for length.

but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms

Sorry @JeffBezanson I didn't quite understand this. Don't we use IteratorSize for that, rather than catch an error from length?

I think this violates the spirit of length.

and if you do you should be pretty explicit.

I guess that's my point - I'm not sure I undertand the spirit; I always assumed it was linked to the iterate API. When I type length ∘ skipmissing I already feel I am pretty explicitly saying I'm gonna do me some old-fashioned counting. I can call any mapreduce like function on such an iterable (like, say, count). How do I generically opt into counting how long the iterable is without resorting to typing count(_->true, itr)? Do we need a distinct function to represent that operation - i.e. would we like to export seperate functions for length (quicklength and slowlength, so to speak)? Should there be a single-argument count? (That doesn't make it ammenable to currying, though).

@andyferris
Copy link
Member Author

Is the argument that sometimes for some stateful iterators you may wish to "peek" at the length without popping everything off?

@andyferris
Copy link
Member Author

andyferris commented May 20, 2020

but currently we also use it to find out whether accessing an iterator should use incremental/online algorithms

Or is it this: If the length above some threshold, we use an incremental algorithm (that might be slower than an in-memory one)? Wouldn't the check already be something along the lines of IteratorSize(itr) != SizeUnknown(itr) && length(itr) < cutoff to avoid getting an error from length as it stands without this PR?

@tkf
Copy link
Member

tkf commented May 21, 2020

This PR can introduce a segfault in the programs that previously safely threw an error. For example:

function maptoints(f, xs)
    ys = Vector{Int}(undef, length(xs))
    for (i, x) in enumerate(xs)
        @inbounds ys[i] = f(x)
    end
    return ys
end

The tension here is length-as-reduction vs length-as-a-static-property. I think the best approach is to introduce a new function for the former.

@JeffBezanson
Copy link
Member

Yes, the idea is that there are some iterators you should not call length on (stateful ones), and I think a method error is the best way to convey that. Most of the time you call length before iterating, to know how many items there will be, so if length itself consumes the iterator it defeats the purpose. Sure, we can document that you need to check a trait first, but that's just not as clear as a method error. Subjectively, I guess length "feels" like a property you look up, and not something that can kick off an arbitrary process.

@tkf
Copy link
Member

tkf commented May 22, 2020

Maybe it's reasonable to have

nitems(x) = haslength(x) ? length(x) : count(_ -> true, x)

in Base? (The function name is somewhat random.)

It may be useful to have a common interface you can overload when you have a better implementation than count(_ -> true, x). For example, it's reasonable to have Base.SizeUnknown for Iterators.flatten(::Vector{<:Vector}) but also it's possible to implement a better length. In that case, you may want to redirect nitems to length. I'm not sure if this situation is common enough, though.

@martinholters
Copy link
Member

Ref. #35530 for sneaky bugs that can come up if the length changes unexpectedly.

@tkf
Copy link
Member

tkf commented May 27, 2020

Thinking about this more, I think it makes sense to have nitems in Base (or maybe in DataAPI.jl). There are some nice optimizations you can do with it:

nitems(x::Reverse) = nitems(x.itr)  # no need to go backward
nitems(x::Generator) = nitems(x.iter)  # no need to evaluate `f`
nitems(x::Accumulate) = nitems(x.itr)  # ditto
nitems(x::Flatten) = sum(nitems, x.it)  # inner iterators may have nice `nitems`

and so on.

@andyferris
Copy link
Member Author

andyferris commented May 27, 2020

Yes, something along the lines of nitems seems fine to me - my preference would be having this in Base for all to share though.

Can we just use count? Ideally we'd want to allow currying in the future, too. Unfortunately that means we'd need to infer what the 1-argument form might mean using Function or Callable, like so:

  1. count(predicate, iter) - eager evaluation
  2. count(predicate::Callable) - curried form
  3. count(iter) - eager evaluation equivalent to count(_ -> true, iter)
  4. count() - curried form equivalent to count(_ -> true)

EDIT: What I like is that the "count" verb gives the connotation that it might literally iterate through and count the items, as opposed to length which would remain safe & fast.

@tkf
Copy link
Member

tkf commented May 27, 2020

I can see that count(itr) would be a very appealing API for count(_ -> true, itr). But, for the same reason as we discussed in the mergewith PR #34296, I prefer not to make the dispatch on Callable to change the semantics.

@andyferris
Copy link
Member Author

andyferris commented May 27, 2020

Thanks, I couldn't quite recall the reasoning. @tkf wrote

But I think the list of *with functions is not large

Haha. Do we need to rename count to countwith? :trollface: But seriously the duplicating of generic functions is annoying; it would ne nice to have a pattern for higher-order functions with "default" function inputs so we didn't have to create twice as many generic functions...

@tkf
Copy link
Member

tkf commented May 27, 2020

3. count(iter) - eager evaluation equivalent to count(_ -> true, iter)

Actually, my comment was misleading as the reason why we can't use count(itr) as count(_ -> true, itr) is more mundane. We already have a definition for count([true, false, true]) that returns 2.

But I think the list of *with functions is not large

Haha. Do we need to rename count to countwith?

Oops :) Though that comment was in a different context. It was rather a byproduct that it was possible to get the curried form mergewith(f) in #34296. The main point was that we need to distinguish merge(f, dicts...) and merge(dicts...). So, given merge(a, b, c), you don't know the semantics of the operation without knowing the type of a. I think @JeffBezanson's comment #34296 (comment) is a nice summary. But, if we are going to add curried version for all reducers, I'll need to take back my comment because then we are going to need countwith, sumwith, uniquewith, and so on...

Anyway, I think currying is rather orthogonal to the current issue because we can't break how count(itr) already works.

@andyferris
Copy link
Member Author

andyferris commented May 27, 2020

True.

Damn - naming things is hard.

It seems that a lot of languages use length as the fast version and count for arbitrary iterators that might take O(N) time (sometimes size also makes an appearance). C# has fast length for most collections and count for any IEnumerable, similar to Java (and some popular Java libraries). Rust has std::iter::Iterator::count as the iterator version. Python has... sum(1 for i in itr)... clojure has count for iterables, etc. Chapel and C++ have count (and count_if) methods a bit like our two-argument count. I didn't come across any language that follows the one-argument semantics of counting trues.

A quick trip to the thesaurus only left me with tally, which isn't great... I can't think of anything else at the moment.

We already have a definition for count([true, false, true]) that returns 2.

I note that so does sum([true, false, true]); in fact the one-argument case seems entirely redundant with sum - except more restrictive?

@tkf
Copy link
Member

tkf commented May 27, 2020

Damn - naming things is hard.

Totally agree...

Thanks for the survey. Yeah, it's a bit unfortunate that what we have is incompatible with other languages. Looking at this, countif sounds like a better option.

in fact the one-argument case seems entirely redundant with sum - except more restrictive?

Two-argument case as well, I think.

@andyferris
Copy link
Member Author

Two-argument case as well, I think.

Yes - good point!

@tkf
Copy link
Member

tkf commented Jun 2, 2020

Another iterate-based method I frequently want is to get the last item of an iterable:

lastitem(xs) = foldl(right, xs)
right(_, x) = x

We can't use last since it's documented as O(1).

Like nitems, it can also have some optimizations:

lastitem(x::AbstractArray) = last(x)
lastitem(x::Reverse) = firstitem(x.itr)
lastitem(x::Filter) = firstitem(filter(x.flt, reverse(x.itr)))
lastitem(x::Generator) = x.f(lastitem(x.iter))
lastitem(x::Flatten) = lastitem(lastitem(x))

and so on. It'd also be better to have firstitem so that lastitem(::Reverse) and lastitem(::Filter) can be implemented efficiently.

I started to wonder if it makes sense to consistently use item in the name when it has iterate-based fallback. It's somewhat random but it's good to be predictable.

mergify bot pushed a commit to JuliaFolds/DataTools.jl that referenced this pull request Jul 28, 2020
@tkf tkf mentioned this pull request Sep 12, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
iteration Involves iteration or the iteration protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants