Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SortedSet, SortedDictionary #1

Open
lorentey opened this issue Apr 5, 2021 · 12 comments
Open

Implement SortedSet, SortedDictionary #1

lorentey opened this issue Apr 5, 2021 · 12 comments
Assignees
Labels
enhancement New feature or request
Milestone

Comments

@lorentey
Copy link
Member

lorentey commented Apr 5, 2021

OrderedSet and OrderedDictionary work great when we need to keep their elements in the order they were inserted, or if we only need to infrequently reorder/sort them. However, inserting (or removing) an element from the middle of the collection takes linear time in both cases, which makes these generic types less suitable for maintaining a sorted collection. It would be very much desirable for Swift developers to have access to efficient sorted collection types.

Self-balancing search trees naturally keep their elements sorted, and they implement efficient (O(log(n))) insertions and removals from anywhere in the collection -- so they seem an ideal choice for the implementation of a standard suite of sorted collections.

Binary search trees and similar low-fanout search tree variants (such as AVL trees or red-black trees) need to maintain a multitude of internal references, so they come with an inappropriately large memory overhead. High fanout search trees such as in-memory B-trees organize their elements into a tree of small contiguous buffers (up to a couple thousand items or so), and this makes them far more efficient in practice -- in terms of both space and time.

Unlike collections that store their elements in a single, contiguous buffer, tree-based collections also allow different versions of the same tree to share some of their storage, by simply referring to the same nodes within the tree. This makes them potentially more efficient than the existing standard collection types when implementing the copy-on-write optimization. (In B-tree's case, we can maintain a nice balance between lookup and CoW performance by having the inner nodes of the tree have a lower maximum fanout number than leaves.)

We'd like this package to implement a great set of sorted collection types based on an in-memory B-tree implementation.

struct SortedSet<Element: Comparable>: BidirectionalCollection, SetAlgebra {....}
struct SortedDictionary<Key: Comparable, Value> { ... }

struct SortedBag<Element: Comparable> { ... }
struct SortedMultimap<Key: Comparable, Value> { ... }
@lorentey lorentey added the enhancement New feature or request label Apr 5, 2021
@lorentey lorentey self-assigned this Apr 5, 2021
@codeman7
Copy link

codeman7 commented Apr 7, 2021

Hey @lorentey have you kicked this work off? I'd love to contribute and help out if possible.

@Jeehut
Copy link

Jeehut commented Apr 10, 2021

Just want to add that the types SortedSet and SortedDictionary would also allow for finally adding binary search to Swift. The API could look very familiar, but under the hoods it would be much faster by using binary search:

let sortedNumbers: SortedSet = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
if let firstMatchingIndex = sortedNumbers.index(where: { $0 > 5 }) {
  let firstMatch = sortedNumbers[firstMatchingIndex]
}

The index(where:) method would then include the binary search algorithm.

Also, @lorentey I don't quite understand why you're restricting adding sorted variants to Set and Dictionary only, I think Array deserves one just as much. The only reason there's no OrderedArray in this library is because Array is order by definition. But it's not sorted, so we should also have a SortedArray.

See also my own SortedArray implementation that I have been already using in production for years.

@kylemacomber
Copy link

I think the SortedBag that @lorentey mentions at the end of the issue would effectively be the same as a SortedArray. It's interesting to consider what the more approachable name is.

@Jeehut
Copy link

Jeehut commented Apr 11, 2021

@kylemacomber I was already wondering what the SortedBag is supposed to mean. Never heard Bag as a type name before, maybe it's just me but I find it confusing. We should use known terms, I would say. Array seems an obvious one.

@lorentey
Copy link
Member Author

"Bag" is a well-established name for a set that allows duplicates; it's less clinical sounding as "multiset", but that is definitely a viable name for it. (For what it's worth, multiset seems to be more popular these days; see e.g. Guava, LLVM, or Boost. However, contrast this with Eclipse, Apache Commons, and others. Knuth also prefers the term multiset, and that probably seals the deal for me. 😉)

At this early in the implementation stage though, I think it would probably be better to think about the abstract construct and the operations it would provide, rather than the exact name we would end up labeling it under.

(E.g., note that there is a distinction between a multiset that stores each individual duplicate, and one that merely counts duplicates. For a sorted collection intended to serve as a replacement for O(n^2) sorted array implementations, I expect we'd need the former variant.)

The word "array" has some strong connotations with random-access indexing to me. I don't know if I'd like to use the term SortedArray to mean anything other than an Array that is sorted. (And to be clear, I don't think adding a SortedArray in this sense would be a high priority for this package.)

<aside>(My problem with names for less universally established mathematical (or programming) concepts is the same as my problem with indentation styles and football clubs -- I find people tend to stick to whatever they got first exposed to, and they sometimes form overly strong attachments to them. This includes myself, of course.)</aside>

@lorentey lorentey mentioned this issue Apr 14, 2021
@codeman7
Copy link

codeman7 commented May 3, 2021

What would the process generally look like for someone to contribute? Both from a personal aspect as well as from a professional one if possible?

@CTMacUser
Copy link
Contributor

Just want to add that the types SortedSet and SortedDictionary would also allow for finally adding binary search to Swift.

That should probably be in the Algorithms library, so any collection that stores its elements in-order can take advantage of it, not just specific container types. (In fact, I think it has one right now, and I've proposed at least another in the past.)

@vihanb vihanb mentioned this issue Jul 21, 2021
7 tasks
@lorentey lorentey added this to the 1.1.0 milestone Sep 19, 2022
@lorentey
Copy link
Member Author

Provisionally putting on the 1.1.0 milestone. This may get bumped off if we take too much time on persistent collections, but let's assume otherwise for now.

@lorentey lorentey modified the milestones: 1.1.0, 1.2.0 Oct 11, 2022
@lorentey
Copy link
Member Author

Bumping to 1.2.0 acknowledging just how much work the rest of the new features in 1.1.0 is taking.
(I'm hoping 1.2.0 will come soon after 1.1.0 though.)

@mdznr
Copy link

mdznr commented Dec 3, 2022

It would be nice if SortedSet didn’t require Element to conform to Comparable, and instead could take a comparator closure ((Element, Element) -> Bool), like the sort(by:) family of functions. This would be useful in cases where you might want to have the same type sorted in different ways in different collections. Of course, if Element is Comparable, there’d be a convenience initializer for that.

For example, if you wanted a sorted list of people (SortedSet<Person>), you may wish to have them sorted by either their given name or family name. This different valid mechanisms for sorting isn’t something that could really easily be defined in a Comparable implementation on a Person type.

Before:

class Person: Equatable, Hashable {
  let givenName: String
  let familyName: String
  
  init(givenName: String, familyName: String) {
    self.givenName = givenName
    self.familyName = familyName
  }
  
  static func == (lhs: Person, rhs: Person) -> Bool {
    return (lhs.givenName == rhs.givenName) &&
           (lhs.familyName == rhs.familyName)
  }
  
  func hash(into hasher: inout Hasher) {
    hasher.combine(self.givenName)
    hasher.combine(self.familyName)
  }
}

class GivenNameSortedFirstPerson: Person, Comparable {
  static func < (lhs: GivenNameSortedFirstPerson, rhs: GivenNameSortedFirstPerson) -> Bool {
    if lhs.givenName < rhs.givenName {
      return true
    } else if lhs.givenName > rhs.givenName {
      return false
    } else {
      return lhs.familyName < rhs.familyName
    }
  }
}

let givenNameSortedPeople = SortedSet<GivenNameSortedFirstPerson>()

class FamilyNameSortedFirstPerson: Person, Comparable {
  static func < (lhs: FamilyNameSortedFirstPerson, rhs: FamilyNameSortedFirstPerson) -> Bool {
    if lhs.familyName < rhs.familyName {
      return true
    } else if lhs.familyName > rhs.familyName {
      return false
    } else {
      return lhs.givenName < rhs.givenName
    }
  }
}

let familyNameSortedPeople = SortedSet<FamilyNameSortedFirstPerson>()

After:

struct Person: Equatable, Hashable {
  let givenName: String
  let familyName: String
}

extension Person {
  static func compareByGivenNameFirst(_ lhs: Self, _ rhs: Self) -> Bool {
    if lhs.givenName < rhs.givenName {
      return true
    } else if lhs.givenName > rhs.givenName {
      return false
    } else {
      return lhs.familyName < rhs.familyName
    }
  }
  
  static func compareByFamilyNameFirst(_ lhs: Self, _ rhs: Self) -> Bool {
    if lhs.familyName < rhs.familyName {
      return true
    } else if lhs.familyName > rhs.familyName {
      return false
    } else {
      return lhs.givenName < rhs.givenName
    }
  }
}

let givenNameSortedPeople = SortedSet<Person>(comparator: Person.compareByGivenNameFirst(_:_:))
let familyNameSortedPeople = SortedSet<Person>(comparator: Person.compareByFamilyNameFirst(_:_:))

@lorentey
Copy link
Member Author

lorentey commented Dec 6, 2022

It would be nice if SortedSet didn’t require Element to conform to Comparable, and instead could take a comparator closure ((Element, Element) -> Bool), like the sort(by:) family of functions.

This would have too many undesirable consequences to accept. The core sorted collection types will definitely not work this way.

Closure values are inherently not Equatable. Therefore, there is no way to prove that a collection that's sorted by a closure is using a particular ordering -- the order is neither reflected in the type system, nor it is verifiable at runtime without manually scanning the entire contents of the collection. This makes closure-based collection designs inherently weaker than ones that make use of protocol conformances.

In my (very, very strongly held) opinion, two SortedSet values of the same type must be guaranteed to sort their members precisely the same way. This is a crucial feature of the core sorted collection types; violating it is not an option.

A function that takes a SortedSet must encode its ordering expectation within the Swift type system, and the natural way to do this is to rely on the Comparable protocol. SortedSet and SortedDictionary will therefore follow Set and Dictionary in making full use of Swift's protocols. Instead of Hashable, they will naturally require their Element type to conform to Comparable.

Beyond the type-safety/expressivity issue of not actually expressing the ordering in the type system, using a closure-based ordering would also have dire technical consequences.

  1. It would make it impossible to efficiently implement crucial methods for combining sets, such as intersection or union: there would be no way to quickly verify if two sets use the same ordering relation, so combining two sets would require element-wise processing with O(n*log(n)) worst-case performance. This would not be acceptable behavior for the basic sorted collection types. Instead, the statically typed SortedSet will guarantee a linear upper bound for merges. Additionally, it will be able to recognize identical subtrees in its inputs and process them in O(1) time -- which is going to be a crucial feature for discovering differences between diverged values.

  2. In addition, while the Swift compiler is able to specialize the implementation of generic types for specific type arguments, it is nowhere near capable of specializing functions for specific closure arguments -- so a closure-based API is expected to be generally slower (by a constant, but potentially significant factor) than ones that make good use of the type system.

Having said all this, though, I would be also open to shipping additional collection types that take closures -- once we have shipped the core sorted collection types. I would very strongly object to shipping closure-based collections before we ship the type-safe ones, or to implement type-safe sorted collections on top of lesser, closure-based variants.

Instead of designing closure-taking APIs, we also have the option of using mixin type parameters instead:

enum ComparisonResult {
  case orderedAscending
  case orderedSame
  case orderedDescending
}

protocol ComparatorProtocol {
  associatedtype Value
  static func compare(_ first: Value, _ second: Value) -> ComparisonResult
}

struct GeneralizedSortedSet<Element, Comparator: ComparatorProtocol> {
 ...
}

struct Person: Equatable, Hashable {
  let givenName: String
  let familyName: String

  struct ComparatorByGivenNameFirst: ComparatorProtocol {
    typealias Value = Person
    static func compare(_ first: Value, _ second: Value) -> ComparisonResult {
      ...
    }
  }

  struct ComparatorByFamilyNameFirst: ComparatorProtocol {
    typealias Value = Person
    static func compare(_ first: Value, _ second: Value) -> ComparisonResult {
      ...
    }
  }
}

let givenNameSortedPeople = GeneralizedSortedSet<Person, Person.ComparatorByGivenNameFirst>()
let familyNameSortedPeople = GeneralizedSortedSet<Person, Person.ComparatorByFamilyNameFirst>()

However, this is by no means necessary -- it's simply a mechanical transformation of the far more pedestrian option of creating trivial wrapper types over Person:

struct PersonOrderedByGivenName: Comparable {
  var value: Person

  func ==(left: Self, right: Self) -> Bool { left.value == right.value }
  func <(left: Self, right: Self) -> Bool { ... }
}

struct PersonOrderedByGivenName: Comparable {
  var value: Person

  func ==(left: Self, right: Self) -> Bool { left.value == right.value }
  func <(left: Self, right: Self) -> Bool { ... }
}

let givenNameSortedPeople = SortedSet<PersonOrderedByGivenName>()
let familyNameSortedPeople = SortedSet<PersonOrderedByFamilyName>()

GeneralizedSortedSet can be built on top of SortedSet, or vice versa.

Note that GeneralizedSortedSet would share the type safety of the core SortedSet -- you cannot pass a set that's sorted by given name to a function that expects to get a set that's ordered by family name. I generally consider this a feature, not a bug.

(Like closure-based data structures, GeneralizedSortedSet is also firmly in the "possible future direction" box -- the fundamental problem we need to tackle first is shipping the SortedSet type that requires its Element to be Comparable. Once we have done that, we'll have a sound technical foundation on which we can build other things.)

@pbk20191
Copy link

In current implementation. It is not possible to look up for non existing Key.

It would great if SortedDictionary and SortedSet provides this kind of feature.
Like java.util.NavigableMap and java.util.NavigableSet

extension SortedDictionary {
// returns the existing Index, if key exist, else return Index which contains bigger key than the given key.
   func ceilingIndex(for key:Key) -> Index
// returns the existing Index, if key exist, else return Index which contains smaller key than the given key.
   func floorIndex(for key:Key) -> Index
}

extension SortedSet {
   func ceilingIndex(for item:Element) -> Index
   func floorIndex(for item:Element) -> Index
}

I hope it won't be too hard, since _BTree already has lastIndex(forKey:) and firstIndex(forKey:)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

8 participants