Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: clean up map and collect with iterator traits #15146

Merged
merged 3 commits into from
Mar 29, 2016
Merged

Conversation

JeffBezanson
Copy link
Member

see #15123

@mschauer
Copy link
Contributor

From #13276 I recall that it would be natural to assign an IsInfinite trait to Cycle, Repeated, and Count so that for example zips of cycles together with iterators with HasLength preserve HasLength.

@StefanKarpinski
Copy link
Member

We should probably be careful to distinguish "infinite" from "lazy" – i.e. you can't get the length of an IO object without reading it to the end but it does usually have an end (unless it's /dev/zero or something like that), whereas you can have iterators that are truly infinite as well.

@JeffBezanson
Copy link
Member Author

Yes, if people generally like this approach then we can add an IsInfinite flavor.

@JeffBezanson
Copy link
Member Author

I decided to try making HasEltype and HasLength the defaults, since nearly every iterator defines these (especially eltype). I think this will smooth the transition. I added IsInfinite as well, but fully hooking it up (e.g. zip of infinite plus haslength gives haslength) will take some work.


"""
collect(collection)

Return an array of all items in a collection. For associative collections, returns Pair{KeyType, ValType}.
"""
collect(itr) = collect(eltype(itr), itr)
collect(itr) = _collect(itr, iteratoreltype(itr), iteratorsize(itr))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

collect(repeated(1)) now gives a method-error instead of hanging? that's nice

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I suppose so. Would you like to take a stab at adding more support for IsInfinite? I think mostly we need definitions for combining it correctly with other iterators.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I plan to do this, I am reading now to see how it works.

@@ -208,6 +208,13 @@ reshape(a::AbstractArray, dims::Int...) = reshape(a, dims)

## from general iterable to any array

function copy!(dest, src)
for x in src
push!(dest, x)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't copy! overwrite rather than append in all other cases?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes; I did this because push! is the only generic "add to collection" function we have. This definition only works if copy(dest::AbstractArray, src) is also defined, which it is.

push! is key-preserving for non-arrays, but arrays can't do that trick since taking an element out of an array drops the key. Actually this definition can replace union! (sets) and merge! (dicts). Given that we have copy!, append!, union!, and merge!, it looks quite possible we have too many whole-collection functions and not enough single-element functions. append! goes with push!, but there isn't a function that pairs with copy!. Arrays could then implement copy!, but not the single-element partner for that function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you'd have to empty!(dest) first to make this actually implement copy!, otherwise as written this is doing append!.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what does append! mean on a dict or set? (it's not currently defined)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every other append! method is equivalent to this, doing for x in src; push!(dest, x); end, right?

There are other non-abstractarray collections other than dicts or sets so defining this with such a loose signature seems wrong. Why should copy! append to existing keys for Associative (or other) collections but overwrite existing values for arrays?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that in general append! is repeated push!. So it seems the solution is to have an f! such that copy! is repeated f!. f! works for unordered collections, or collections like dicts where the "elements" contain both a key and a value, so there is no question about "where" it will be inserted.

append! is in fact only defined for Vector and BitVector, so this hasn't been probed much. We also have union!, which is literally implemented with repeated push!, and merge! which could easily be. These certainly overwrite existing values. So arguably we should replace push! with f! in these cases. f! could be, say, a 2-argument form of insert!.

A bit more background: in the use case in this PR, I'm either incrementally growing an Array, or building some other kind of collection. So push! kind of makes sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of copy!, that f! seems like it should be setindex!. The generic implementation for dest::AbstractArray should probably be

for (i, x) in zip(eachindex(dest), src)
    dest[i] = x
end

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that looks fine. But the problem is that if you want to implement copy! for non-arrays, you need to use push! AFAICT. Which is not correct for arrays, hence this comment thread.

@timholy
Copy link
Member

timholy commented Mar 19, 2016

I have to interrupt my review for the remainder of the day, but aside from a big 👍 one general comment: this seems to suffer from the difference in our iteration protocol for arrays (which returns values) and associative collections (which returns key-value pairs). The closest we have for arrays is enumerate, and that's starting to look quite limited to me:

  • Indexing with integers is not always efficient
  • In (i,x) in enumerate(src), is i an index into src or a counter? I'm mentally using Fortran-style arrays (which might not start at 1) as an exercise in testing the generality of our constructs, and enumerate seems to fail on this score (the docs indicate it's an index, but the name suggests it's a counter).

I'm behind a ridiculously-slow connection or I'd provide a link, but if you haven't seen it, my recent "detangle..." and "iterate along a dimension..." PRs are also concerned with generalizing our iteration (over just arrays in this case). There seem to be a number of commonalities, so I'll try to keep a close eye on this.

@JeffBezanson
Copy link
Member Author

Exactly; the problem is that a single element of a set or dict has all the information you need to insert it in a different collection, but for arrays the keys are only part of the container and not the elements.

The only solution I can think of is to use a different name than push! for "add item to collection". push! would then mean specifically "add to the end of an ordered collection". We could also replace union! and merge! (which have very few definitions) with copy!.

@timholy
Copy link
Member

timholy commented Mar 21, 2016

The other option is to introduce a eachkv(obj), which always returns a key-value pair.

I originally thought that one other potential difference between arrays and associative collections is setindex!, which never generates a BoundsError for associative collections. However, an alternative conception removes this difference: the array "keys" are a complete set, so you can't generate a new one that is independent of all keys already used. With this perspective, it's much like a Dict{ASCIIString,Any} will complain if you index with a UTF8String that can't be represented in ASCII.

@JeffBezanson
Copy link
Member Author

eachkv will not help for sets unless we add indexing functions to sets, which we probably don't want to do. Independent of any of this though, there is arguably still an issue with push!, where it means "insert" for other collections but "insert at end" for arrays. @tkelman 's view seems to be that these meanings are too dissimilar to inhabit the same function, and I can see his point.

@timholy
Copy link
Member

timholy commented Mar 21, 2016

I agree with @tkelman and was trying to come up with a generic iteration scheme that lets you use setindex! instead. But that is indeed problematic for Sets.

immutable EltypeUnknown <: IteratorEltype end
immutable HasEltype <: IteratorEltype end

iteratoreltype(x) = iteratoreltype(typeof(x))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

haseltype? And then EltypeUnknown and EltypeKnown?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds ok. I was aiming to have all the iterator traits share the iterator prefix, but I guess having an element type does not need to be tied to iteration.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merit in the iterator prefix, too. iteratorhaseltype? Getting kinda long without underscores, though.

@JeffBezanson
Copy link
Member Author

Sounds like the best thing for me to do for now is to restrict the copy! method to Union{Associative,AbstractSet}?

@timholy
Copy link
Member

timholy commented Mar 24, 2016

Seems reasonable to me. Probably move it into dict.jl or something?

@JeffBezanson JeffBezanson force-pushed the jb/itertraits branch 4 times, most recently from d0d8ae8 to 337074c Compare March 24, 2016 16:42
@vtjnash
Copy link
Member

vtjnash commented Mar 24, 2016

@timholy going off your recent changes, would it be more correct to iterate the indexes instead?

copy!(dest, src) = for i in eachindex(src)
  dest[i] = src[i]
end

this might, for example, give more reasonable result when the types of src and dest are not the same

@JeffBezanson
Copy link
Member Author

Ready to merge I think?

@timholy
Copy link
Member

timholy commented Mar 24, 2016

@vtjnash, right now (and even moreso once ReshapedArrays merge), the best approach is to use a separate iterator for each array (so a zip(eachindex(dest), eachindex(src)) would be best). However, for further abstractions, I think we need to do more to ensure alignment yet allow each array to choose its most efficient iterator. I'm still mulling over the best way to express this.

@vtjnash
Copy link
Member

vtjnash commented Mar 24, 2016

i was thinking that for the definition of copy, this isn't actually iterating dest, it is actually iterating the indexes of src and assigning the equivalent index in dest to that value of the index in src. this makes it a correct copy for many more odd cases (for example, dict -> array, array -> dict, array -> array, dict -> dict). although other cases (like set) don't make sense, i think it should be the "right" behavior for map (I think map(set) -> set is wrong, since i think that the definition of map requires that map(f, a)[x] == f(a[x]).

@JeffBezanson
Copy link
Member Author

The trouble is that the notion of an iterator only specifies a sequence of values, not a sequence of values with indices. So one game to play here is to see how far we can get in generic functions assuming that collections have no more structure than, essentially, a multi-set. I think I've tended to assume that an array is an ordered multi-set, rather than a set of index=>value pairs.

@StefanKarpinski
Copy link
Member

An iterator could implicitly be considered to have integer indices.

@timholy
Copy link
Member

timholy commented Mar 24, 2016

I think I've tended to assume that an array is an ordered multi-set, rather than a set of index=>value pairs.

That's a good definition, and maybe all it should be. But I'm quite interested in some generalizations that are guaranteed to cause trouble 😈, and I suspect the multi-set won't prove to be enough of a foundation for deciding what the right behavior should be.

JeffBezanson and others added 3 commits March 28, 2016 13:19
…traits

make HasEltype and HasLength the default

this is a possible fix for #15342

type-accumulating Dict constructor

some tests for type accumulation in `map` and `Dict`
add `iteratorsize(::Type{StreamMapIterator})`
@samoconnor
Copy link
Contributor

Should I be using something like interatorsize(c) == HasLength() here ?

From https://github.com/samoconnor/julia/blob/retry_branch/base/pmap.jl#L74:

function batchsplit(c; min_batch_count=1, max_batch_size=100)

    # Split collection into batches, then peek at the first few batches...
    batches = partition(c, max_batch_size)
    head, tail = head_and_tail(batches, min_batch_count)

    # If there are not enough batches, use a smaller batch size...
    if length(head) < min_batch_count
        batch_size = max(1, div(sum(length, head), min_batch_count))
        return partition(flatten(head), batch_size)
    end

    return flatten((head, tail))
end

The goal is this:

  • c might be infinite
  • c might have length, but if length(c) is implemented by running through the iterator, we really want to avoid calling length(c) (it might be reading rows from a remote database) .
  • we want to know if c has at least enough items for min_batch_count batches of max_batch_size items.

Does the new iterator traits stuff allow asking the question: "Do we known, without non-trivial computation, that this collection has at least n items ?"

@JeffBezanson
Copy link
Member Author

No, it doesn't support that distinction yet. I think it would be ok for length to be O(n) e.g. in the case of a linked list, but if it involves I/O or is very expensive it doesn't seem like it should be implemented.

I can see the need for a "has at least n elements" function, since if n is small it is indeed reasonable to implement for expensive iterators. Of course there can be a fallback that just calls length.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants