Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Idea: Add sortedTo (and friends) to Iterator #19

Open
BalmungSan opened this issue Oct 23, 2020 · 15 comments
Open

Idea: Add sortedTo (and friends) to Iterator #19

BalmungSan opened this issue Oct 23, 2020 · 15 comments
Labels
enhancement New feature or request library:collections status:pending This enhancement request lacks consensus on whether or not to include it

Comments

@BalmungSan
Copy link
Contributor

I feel like having a sortedTo(factory) (as well as all the other sorting variants) on Iterator would be a good addition.

It would be good to discuss if it can produce any performance advantage over iterator.to(factory).sorted or if it can be considered useful just as an API improvement to avoid two calls.
It is also worth discussing if this would be useful for all collections in general; for example, list.sortedTo(ArraySeq).

@SethTisue
Copy link
Member

@joshlemer might have an opinion?

@joshlemer
Copy link

I think that having a fully lazy .sorted on Iterator and View would be really beneficial. I'm not sure that we'd need a .sortedTo(factory) because that optimization logic could be baked into the sorted iterator's .to method.

It would also come into handy with more than just terminal operations, if you could have a lazy sort:

case class Employee(name: String, salary: Double)

val employees: Iterable[Employee] = ???

val top3 = employees.{view, iterator}.sortBy(_.salary).take(3).toSeq

@joshlemer
Copy link

joshlemer commented Oct 24, 2020

the lazy sorting would open the door to optimizing the above .sorted.take(n) example by using a min-max heap to avoid doing a lot of sorting on the source collection. In non-optimized, non-totally-correct pseudo-ish code...

self => 

def take(n: Int): Iterator = {
  if (self.knownSize != -1 && self.knownSize < n) {
    // nothing to be gained from heapifying. Just sort it
    return Iterator.empty.concat(self.toArray.sortInPlace.iterator)
  }
  
  new Iterator[A] {
    var heap: MinMaxHeap[A] = null
    def hasNext: Boolean = (heap != null && heap.nonEmpty) || self.hasNext
    def next(): A = {
      if (heap != null) return heap.popMin()
      
      // must force `self` now
      self.foreach { a => 
         if (heap.size < n) heap.addOne(a)
         else {
           if (a < heap.peekLast()) {
             heap.popLast()
             heap.addOne(a)
           }
         }
      }

       heap.popMin()
    }
   
  }
}

@NthPortal
Copy link
Contributor

NthPortal commented Oct 24, 2020

regardless of whether or not we have something for Iterator/View with a min-max heap, I think it would be valuable to have sortedTo (and friends), because of their:

  • low conceptual weight
  • simple implementation
  • fast performance

while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection. additionally, it has low/non-obvious discoverability (we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it).

@NthPortal NthPortal added the enhancement New feature or request label Oct 24, 2020
@BalmungSan
Copy link
Contributor Author

What about having sortedTo and sortedTake (and friends)?

I may be mistaken, but I do not think that a sorted method on Iterator would be efficient in the general sense, for example what would happen if then you chain many more transformations like map and filter? I think you would need to have an intermediate collection which is basically the point to avoid.

It seems (but I may be mistaken) that it would be easier and better to implement a lazy sort in the context of a consumption of the Iterator.

@joshlemer
Copy link

joshlemer commented Oct 24, 2020

@NthPortal

low conceptual weight; simple implementation; fast performance

Just my opinion but I think it's lighter to just have .sorted than to have .sortedTo(List). We would have to have a new iterator implementation at any rate, SortedIterator[A], we can just override the def to:

class SortedIterator[A](underlying: Iterator[A]) { 

  // optionally, we can optimize these methods, which one cannot do if you take the `sortedTo` approach 

  override def take(n: Int): Iterator[A] = ???
  override def drop(n: Int): Iterator[A] = ???
  override def takeRight(n: Int): Iterator[A] = ???
  override def dropRight(n: Int): Iterator[A] = ???
  override def slice(i: Int, j: Int): Iterator[A] = ???


  // whatever optimizations you would have put into sortedTo, just put in here
 override def to[C](f: Factory[A, C]): C = ???
}

This is no less simple than making a new sortedTo method. It's much less intrusiv than duplicating all of the take style methods as @BalmungSan with sortedTake sortedDrop sortedTakeRight sortedDropRight sortedSlice sortedTakeWhile sortedDropWhile and whichever others. .sorted.to is no less optimizeable/performant than .sortedTo (well, save the iterator allocation I guess?)

while a min-max heap could drastically improve head, last, take or drop operations after a sort on an iterator, it comes with a significant performance cost if you want to use most or all of the collection

We could have a heuristic like "only kick in the heap once we confirm that the iterator is larger than (say, for example) 5 or 10 times n. Then we're being very conservative with the heap, only falling back to it when we know for certain that the iterator is enormous compared to n.

we already have .view.sorted.to(...) with essentially the same implementation as sortedTo(...) would have, but people don't know about it)

there isn't a .view.sorted.to(...), on anything but SeqView. I was actually going to write up some of this motivation in the recent Sorted-is-slightly-less-lazy PR but got sidetracked 😅

@NthPortal
Copy link
Contributor

NthPortal commented Oct 24, 2020

with sortedTake sortedDrop sortedTakeRight sortedDropRight sortedSlice sortedTakeWhile sortedDropWhile and whichever others

I'm pretty sure they just meant variants for sortWith and sortBy (i.e. sortToWith and sortToBy or something)

@SethTisue
Copy link
Member

found a previous overlapping discussion: scala/bug#11711

@joshlemer
Copy link

joshlemer commented Oct 24, 2020

@SethTisue got me, I am realizing I've been beating this drum for ages 🤣

@NthPortal NthPortal added the status:pending This enhancement request lacks consensus on whether or not to include it label Nov 21, 2020
@neontorrent
Copy link

I am for adding lazy sort for view and iterator

@NthPortal
Copy link
Contributor

there's only so lazily you can sort. at the end of the day, you need to at the very least iterate over all elements

@julienrf
Copy link
Contributor

If I understand correctly, the value of such an operation would be to be able to sort the X first elements in O(N) rather than in O(N*log(N)), right? (when X is small and in particular X << N)

@joshlemer
Copy link

The benefits would be:

  1. Ability to add optimization that avoids sorting most elements in situations like
View.fill(1000000)(Random.nextInt()).sorted.take(5).toList 

this goes for take and friends (slice, takeRight, drop, dropRight, ...)

  1. avoid forcing at all when you don't need to. This is for performance but also for surprise-avoidance and consistency. The view and iterator apis are in general always lazy, so users can generally build up a tree of views and count on them not being forced until the terminal operation at the end:
val v: View[Int] = ???

View.fill(10)(1).concat(v.sorted).take(5).toList 

^ Under the strict approach, v would have been forced, but it turned out we never even needed it at all.

@neontorrent
Copy link

Either way it works for me. It is not a lazy operation but I think it just conforms to the new way of collection conversion, which is using view, and avoiding adding a new API method

@NthPortal
Copy link
Contributor

currently, View#sorted is only partially lazy - it computes the size eagerly to ensure that the collection is finite

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request library:collections status:pending This enhancement request lacks consensus on whether or not to include it
Projects
None yet
Development

No branches or pull requests

6 participants