RFC: clean up `map` and `collect` with iterator traits #15146

JeffBezanson · 2016-02-19T05:08:27Z

mschauer · 2016-02-21T11:05:24Z

From #13276 I recall that it would be natural to assign an IsInfinite trait to Cycle, Repeated, and Count so that for example zips of cycles together with iterators with HasLength preserve HasLength.

StefanKarpinski · 2016-02-22T16:40:57Z

We should probably be careful to distinguish "infinite" from "lazy" – i.e. you can't get the length of an IO object without reading it to the end but it does usually have an end (unless it's /dev/zero or something like that), whereas you can have iterators that are truly infinite as well.

JeffBezanson · 2016-02-26T00:55:01Z

Yes, if people generally like this approach then we can add an IsInfinite flavor.

JeffBezanson · 2016-03-02T19:03:57Z

I decided to try making HasEltype and HasLength the defaults, since nearly every iterator defines these (especially eltype). I think this will smooth the transition. I added IsInfinite as well, but fully hooking it up (e.g. zip of infinite plus haslength gives haslength) will take some work.

mschauer · 2016-03-02T21:50:30Z

base/array.jl


 """
    collect(collection)

 Return an array of all items in a collection. For associative collections, returns Pair{KeyType, ValType}.
 """
-collect(itr) = collect(eltype(itr), itr)
+collect(itr) = _collect(itr, iteratoreltype(itr), iteratorsize(itr))
+


collect(repeated(1)) now gives a method-error instead of hanging? that's nice

Yes, I suppose so. Would you like to take a stab at adding more support for IsInfinite? I think mostly we need definitions for combining it correctly with other iterators.

Yes, I plan to do this, I am reading now to see how it works.

tkelman · 2016-03-18T16:49:18Z

base/abstractarray.jl

@@ -208,6 +208,13 @@ reshape(a::AbstractArray, dims::Int...) = reshape(a, dims)

 ## from general iterable to any array

+function copy!(dest, src)
+    for x in src
+        push!(dest, x)


doesn't copy! overwrite rather than append in all other cases?

Yes; I did this because push! is the only generic "add to collection" function we have. This definition only works if copy(dest::AbstractArray, src) is also defined, which it is.

push! is key-preserving for non-arrays, but arrays can't do that trick since taking an element out of an array drops the key. Actually this definition can replace union! (sets) and merge! (dicts). Given that we have copy!, append!, union!, and merge!, it looks quite possible we have too many whole-collection functions and not enough single-element functions. append! goes with push!, but there isn't a function that pairs with copy!. Arrays could then implement copy!, but not the single-element partner for that function.

I feel like you'd have to empty!(dest) first to make this actually implement copy!, otherwise as written this is doing append!.

But what does append! mean on a dict or set? (it's not currently defined)

Every other append! method is equivalent to this, doing for x in src; push!(dest, x); end, right?

There are other non-abstractarray collections other than dicts or sets so defining this with such a loose signature seems wrong. Why should copy! append to existing keys for Associative (or other) collections but overwrite existing values for arrays?

I agree that in general append! is repeated push!. So it seems the solution is to have an f! such that copy! is repeated f!. f! works for unordered collections, or collections like dicts where the "elements" contain both a key and a value, so there is no question about "where" it will be inserted.

append! is in fact only defined for Vector and BitVector, so this hasn't been probed much. We also have union!, which is literally implemented with repeated push!, and merge! which could easily be. These certainly overwrite existing values. So arguably we should replace push! with f! in these cases. f! could be, say, a 2-argument form of insert!.

A bit more background: in the use case in this PR, I'm either incrementally growing an Array, or building some other kind of collection. So push! kind of makes sense.

In the case of copy!, that f! seems like it should be setindex!. The generic implementation for dest::AbstractArray should probably be

for (i, x) in zip(eachindex(dest), src) dest[i] = x end

Yes, that looks fine. But the problem is that if you want to implement copy! for non-arrays, you need to use push! AFAICT. Which is not correct for arrays, hence this comment thread.

timholy · 2016-03-19T09:50:10Z

I have to interrupt my review for the remainder of the day, but aside from a big 👍 one general comment: this seems to suffer from the difference in our iteration protocol for arrays (which returns values) and associative collections (which returns key-value pairs). The closest we have for arrays is enumerate, and that's starting to look quite limited to me:

Indexing with integers is not always efficient
In (i,x) in enumerate(src), is i an index into src or a counter? I'm mentally using Fortran-style arrays (which might not start at 1) as an exercise in testing the generality of our constructs, and enumerate seems to fail on this score (the docs indicate it's an index, but the name suggests it's a counter).

I'm behind a ridiculously-slow connection or I'd provide a link, but if you haven't seen it, my recent "detangle..." and "iterate along a dimension..." PRs are also concerned with generalizing our iteration (over just arrays in this case). There seem to be a number of commonalities, so I'll try to keep a close eye on this.

JeffBezanson · 2016-03-21T16:11:30Z

Exactly; the problem is that a single element of a set or dict has all the information you need to insert it in a different collection, but for arrays the keys are only part of the container and not the elements.

The only solution I can think of is to use a different name than push! for "add item to collection". push! would then mean specifically "add to the end of an ordered collection". We could also replace union! and merge! (which have very few definitions) with copy!.

timholy · 2016-03-21T16:56:27Z

The other option is to introduce a eachkv(obj), which always returns a key-value pair.

I originally thought that one other potential difference between arrays and associative collections is setindex!, which never generates a BoundsError for associative collections. However, an alternative conception removes this difference: the array "keys" are a complete set, so you can't generate a new one that is independent of all keys already used. With this perspective, it's much like a Dict{ASCIIString,Any} will complain if you index with a UTF8String that can't be represented in ASCII.

JeffBezanson · 2016-03-21T17:25:53Z

eachkv will not help for sets unless we add indexing functions to sets, which we probably don't want to do. Independent of any of this though, there is arguably still an issue with push!, where it means "insert" for other collections but "insert at end" for arrays. @tkelman 's view seems to be that these meanings are too dissimilar to inhabit the same function, and I can see his point.

timholy · 2016-03-21T18:14:43Z

I agree with @tkelman and was trying to come up with a generic iteration scheme that lets you use setindex! instead. But that is indeed problematic for Sets.

timholy · 2016-03-22T22:02:01Z

base/generator.jl

+immutable EltypeUnknown <: IteratorEltype end
+immutable HasEltype <: IteratorEltype end
+
+iteratoreltype(x) = iteratoreltype(typeof(x))


haseltype? And then EltypeUnknown and EltypeKnown?

Sounds ok. I was aiming to have all the iterator traits share the iterator prefix, but I guess having an element type does not need to be tied to iteration.

Merit in the iterator prefix, too. iteratorhaseltype? Getting kinda long without underscores, though.

JeffBezanson · 2016-03-24T15:04:09Z

Sounds like the best thing for me to do for now is to restrict the copy! method to Union{Associative,AbstractSet}?

timholy · 2016-03-24T15:38:19Z

Seems reasonable to me. Probably move it into dict.jl or something?

vtjnash · 2016-03-24T17:14:35Z

@timholy going off your recent changes, would it be more correct to iterate the indexes instead?

copy!(dest, src) = for i in eachindex(src)
  dest[i] = src[i]
end

this might, for example, give more reasonable result when the types of src and dest are not the same

JeffBezanson · 2016-03-24T17:45:31Z

Ready to merge I think?

timholy · 2016-03-24T18:25:56Z

@vtjnash, right now (and even moreso once ReshapedArrays merge), the best approach is to use a separate iterator for each array (so a zip(eachindex(dest), eachindex(src)) would be best). However, for further abstractions, I think we need to do more to ensure alignment yet allow each array to choose its most efficient iterator. I'm still mulling over the best way to express this.

vtjnash · 2016-03-24T20:29:19Z

i was thinking that for the definition of copy, this isn't actually iterating dest, it is actually iterating the indexes of src and assigning the equivalent index in dest to that value of the index in src. this makes it a correct copy for many more odd cases (for example, dict -> array, array -> dict, array -> array, dict -> dict). although other cases (like set) don't make sense, i think it should be the "right" behavior for map (I think map(set) -> set is wrong, since i think that the definition of map requires that map(f, a)[x] == f(a[x]).

JeffBezanson · 2016-03-24T20:43:24Z

The trouble is that the notion of an iterator only specifies a sequence of values, not a sequence of values with indices. So one game to play here is to see how far we can get in generic functions assuming that collections have no more structure than, essentially, a multi-set. I think I've tended to assume that an array is an ordered multi-set, rather than a set of index=>value pairs.

StefanKarpinski · 2016-03-24T20:46:53Z

An iterator could implicitly be considered to have integer indices.

timholy · 2016-03-24T22:32:57Z

I think I've tended to assume that an array is an ordered multi-set, rather than a set of index=>value pairs.

That's a good definition, and maybe all it should be. But I'm quite interested in some generalizations that are guaranteed to cause trouble 😈, and I suspect the multi-set won't prove to be enough of a foundation for deciding what the right behavior should be.

…traits make HasEltype and HasLength the default this is a possible fix for #15342 type-accumulating Dict constructor some tests for type accumulation in `map` and `Dict`

add `iteratorsize(::Type{StreamMapIterator})`

samoconnor · 2016-03-29T07:14:29Z

Should I be using something like interatorsize(c) == HasLength() here ?

From https://github.com/samoconnor/julia/blob/retry_branch/base/pmap.jl#L74:

function batchsplit(c; min_batch_count=1, max_batch_size=100)

    # Split collection into batches, then peek at the first few batches...
    batches = partition(c, max_batch_size)
    head, tail = head_and_tail(batches, min_batch_count)

    # If there are not enough batches, use a smaller batch size...
    if length(head) < min_batch_count
        batch_size = max(1, div(sum(length, head), min_batch_count))
        return partition(flatten(head), batch_size)
    end

    return flatten((head, tail))
end

The goal is this:

c might be infinite
c might have length, but if length(c) is implemented by running through the iterator, we really want to avoid calling length(c) (it might be reading rows from a remote database) .
we want to know if c has at least enough items for min_batch_count batches of max_batch_size items.

Does the new iterator traits stuff allow asking the question: "Do we known, without non-trivial computation, that this collection has at least n items ?"

JeffBezanson · 2016-03-29T14:28:33Z

No, it doesn't support that distinction yet. I think it would be ok for length to be O(n) e.g. in the case of a linked list, but if it involves I/O or is very expensive it doesn't seem like it should be implemented.

I can see the need for a "has at least n elements" function, since if n is small it is indeed reasonable to implement for expensive iterators. Of course there can be a fallback that just calls length.

JeffBezanson mentioned this pull request Mar 1, 2016

use mark_julia_const in codegen of typeof when type is known #15313

Merged

JeffBezanson force-pushed the jb/itertraits branch 2 times, most recently from 24ebce0 to 6b90b4d Compare March 2, 2016 19:01

mschauer reviewed Mar 2, 2016
View reviewed changes

This was referenced Mar 3, 2016

map(f, c) return type #15342

Closed

type-inference-independent comprehensions #7258

Closed

JeffBezanson force-pushed the jb/itertraits branch from 6b90b4d to 3630dd3 Compare March 8, 2016 03:42

This was referenced Mar 8, 2016

Zips, products and takes with infinite components #15398

Closed

zips, products and takes with infinite components #15400

Merged

JeffBezanson mentioned this pull request Mar 8, 2016

Unnecessary GC root for getfield of SSA immutable object #15402

Closed

JeffBezanson force-pushed the jb/itertraits branch from 6d8191b to ec3d68b Compare March 11, 2016 03:57

JeffBezanson force-pushed the jb/itertraits branch 2 times, most recently from 00c31f0 to c30cffc Compare March 18, 2016 16:18

tkelman reviewed Mar 18, 2016
View reviewed changes

timholy reviewed Mar 22, 2016
View reviewed changes

JeffBezanson force-pushed the jb/itertraits branch 4 times, most recently from d0d8ae8 to 337074c Compare March 24, 2016 16:42

JeffBezanson and others added 3 commits March 28, 2016 13:19

make map and collect more general and uniform by adding iterator …

a2ce230

…traits make HasEltype and HasLength the default this is a possible fix for #15342 type-accumulating Dict constructor some tests for type accumulation in `map` and `Dict`

zips, products and takes with infinite components

d60e002

length() of Take and Drop iterators

bd02ba4

add `iteratorsize(::Type{StreamMapIterator})`

JeffBezanson force-pushed the jb/itertraits branch from 337074c to bd02ba4 Compare March 28, 2016 17:19

JeffBezanson merged commit c45175f into master Mar 29, 2016

vtjnash deleted the jb/itertraits branch March 29, 2016 03:58

samoconnor mentioned this pull request Mar 29, 2016

@catch, retry, partition, asyncmap and pmap #15409

Merged

lmshk mentioned this pull request Apr 17, 2016

no method matching length(::Iterators.Chain) JuliaCollections/Iterators.jl#56

Closed

mschauer mentioned this pull request Apr 26, 2016

HasLength() is the default iteratorsize for all custom iterators? Even if they don't define length? (Breaks Collect) #15977

Closed

rfourquet mentioned this pull request May 21, 2016

map for Dict #5794

Open

oxinabox mentioned this pull request May 25, 2016

RFC: Knowing iteratorsize via type parameters JuliaCollections/Iterators.jl#60

Closed

tkelman added the needs docs Documentation for this change is required label Jul 1, 2016

tkelman referenced this pull request in docopt/DocOpt.jl Jul 22, 2016

add the iteratorsize trait (close #14)

3bf97db

tkelman removed the needs docs Documentation for this change is required label Jul 27, 2016

vtjnash mentioned this pull request Mar 7, 2018

map over a KeySet unexpectedly returns a Set #26359

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RFC: clean up `map` and `collect` with iterator traits #15146

RFC: clean up `map` and `collect` with iterator traits #15146

JeffBezanson commented Feb 19, 2016

mschauer commented Feb 21, 2016

StefanKarpinski commented Feb 22, 2016

JeffBezanson commented Feb 26, 2016

JeffBezanson commented Mar 2, 2016

mschauer Mar 2, 2016

JeffBezanson Mar 2, 2016

mschauer Mar 2, 2016

tkelman Mar 18, 2016

JeffBezanson Mar 18, 2016

tkelman Mar 18, 2016

JeffBezanson Mar 18, 2016

tkelman Mar 19, 2016

JeffBezanson Mar 19, 2016

timholy Mar 19, 2016

JeffBezanson Mar 21, 2016

timholy commented Mar 19, 2016

JeffBezanson commented Mar 21, 2016

timholy commented Mar 21, 2016

JeffBezanson commented Mar 21, 2016

timholy commented Mar 21, 2016

timholy Mar 22, 2016

JeffBezanson Mar 22, 2016

timholy Mar 22, 2016

JeffBezanson commented Mar 24, 2016

timholy commented Mar 24, 2016

vtjnash commented Mar 24, 2016

JeffBezanson commented Mar 24, 2016

timholy commented Mar 24, 2016

vtjnash commented Mar 24, 2016

JeffBezanson commented Mar 24, 2016

StefanKarpinski commented Mar 24, 2016

timholy commented Mar 24, 2016

samoconnor commented Mar 29, 2016

JeffBezanson commented Mar 29, 2016

RFC: clean up map and collect with iterator traits #15146

RFC: clean up map and collect with iterator traits #15146

Conversation

JeffBezanson commented Feb 19, 2016

mschauer commented Feb 21, 2016

StefanKarpinski commented Feb 22, 2016

JeffBezanson commented Feb 26, 2016

JeffBezanson commented Mar 2, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

timholy commented Mar 19, 2016

JeffBezanson commented Mar 21, 2016

timholy commented Mar 21, 2016

JeffBezanson commented Mar 21, 2016

timholy commented Mar 21, 2016

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JeffBezanson commented Mar 24, 2016

timholy commented Mar 24, 2016

vtjnash commented Mar 24, 2016

JeffBezanson commented Mar 24, 2016

timholy commented Mar 24, 2016

vtjnash commented Mar 24, 2016

JeffBezanson commented Mar 24, 2016

StefanKarpinski commented Mar 24, 2016

timholy commented Mar 24, 2016

samoconnor commented Mar 29, 2016

JeffBezanson commented Mar 29, 2016

RFC: clean up `map` and `collect` with iterator traits #15146

RFC: clean up `map` and `collect` with iterator traits #15146