Update benchmarks and README
GarrisonJ committed Apr 26, 2024
1 parent d1c7a8b commit 43db188
Showing 17 changed files with 171 additions and 42 deletions.
24 changes: 15 additions & 9 deletions README.md
@@ -1,32 +1,38 @@
# SortedContainers

SortedContainers is a _fast_ implementation of sorted lists, sets, and dictionaries in pure Ruby. It is based on the [sortedcontainers](https://grantjenks.com/docs/sortedcontainers/) Python library by Grant Jenks.
SortedContainers is a fast implementation of sorted lists, sets, and dictionaries in pure Ruby. It is based on the [sortedcontainers](https://grantjenks.com/docs/sortedcontainers/) Python library by Grant Jenks.

SortedContainers provides three main classes: `SortedArray`, `SortedSet`, and `SortedHash`. Each class is a drop-in replacement for the corresponding Ruby class, but with the added benefit of maintaining the elements in sorted order.

SortedContainers exploits the fact that modern computers are really good at shifting elements around in memory. We sacrifice theoretical time complexity for practical performance. In practice, SortedContainers is _fast_.
SortedContainers exploits the fact that modern computers are really good at shifting arrays in memory. We sacrifice theoretical time complexity for practical performance. In practice, SortedContainers is fast.

## How it works

Computers are really good at shifting arrays around. For that reason, in practice it's often faster to keep an array sorted than to use the usual tree-based data structures.
Computers are good at shifting arrays. For that reason, it's often faster to keep an array sorted than to use the usual tree-based data structures.

For example, if you have the array `[1, 2, 4, 5]` and you want to insert the element `3`, you can simply shift `4, 5` to the right and insert `3` in the correct position. This is an `O(n)` operation, but it's fast.
For example, if you have the array `[1, 2, 4, 5]` and want to insert the element `3`, you can shift `4, 5` to the right and insert `3` in the correct position. This is an `O(n)` operation, but in practice it's fast.

But if we have a lot of elements we can do better by breaking up the array into smaller arrays. That way we don't have to shift so many elements whenever we insert. For example, if you have the array `[[1, 2], [4, 5]]` and you want to insert the element `3`, you can simply insert `3` into the first array.
But we can do better if we have a lot of elements. We can break up the array into smaller arrays so the shifts don't have to move so many elements. For example, if you have the array `[[1,2,4], [5,6,7]]` and you want to insert the element `3`, you can insert `3` into the first array to get `[[1,2,3,4], [5,6,7]]` and only the element `4` has to be shifted.
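
A minimal sketch of the idea above, not the gem's actual implementation: `insert_into_sublists` is a hypothetical helper that uses binary search (`Array#bsearch_index`) to pick the subarray and then the position inside it, so only the elements after the insertion point in that one subarray are shifted.

```ruby
# Hypothetical helper for illustration only; not part of the SortedContainers API.
# `lists` is an array of sorted, non-empty subarrays whose concatenation is sorted.
def insert_into_sublists(lists, value)
  lists << [] if lists.empty?

  # First subarray whose largest element is >= value; otherwise use the last one.
  index = lists.bsearch_index { |sub| !sub.empty? && sub.last >= value } || lists.size - 1
  sub = lists[index]

  # Position of the first element >= value inside that subarray (or the end).
  pos = sub.bsearch_index { |x| x >= value } || sub.size

  # Array#insert shifts only the tail of this one subarray.
  sub.insert(pos, value)
  lists
end

lists = [[1, 2, 4], [5, 6, 7]]
p insert_into_sublists(lists, 3) # => [[1, 2, 3, 4], [5, 6, 7]]
```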

This often outperforms the more common tree-based data structures like red-black trees and AVL trees, which have `O(log n)` insertions and deletions. In practice, the `O(n)` insertions and deletions of SortedContainers are faster.
This often outperforms the more common tree-based data structures like red-black trees with `O(log n)` insertions and deletions. In practice, the `O(n)` insertions and deletions of SortedContainers are faster.

How big these smaller arrays should be is a trade-off. The default is set DEFAULT_LOAD_FACTOR = 1000. There is no perfect value and the ideal value will depend on your use case.
How big the subarrays are is a trade-off. You can control how big the subarrays are by setting the `load_factor`. The default is `DEFAULT_LOAD_FACTOR = 1000`. A subarray is split when its size is `2*load_factor`. There is no perfect value. The ideal value will depend on your use case and may require some experimentation.
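
To illustrate the split rule, here is a hedged sketch. `split_if_needed` and its arguments are hypothetical names, and the exact threshold check (`>=` here) and the choice to split at the midpoint are assumptions rather than the library's code.

```ruby
# Hypothetical helper for illustration only; not part of the SortedContainers API.
# Splits lists[index] in half once it has grown to 2 * load_factor elements.
def split_if_needed(lists, index, load_factor)
  sub = lists[index]
  return lists if sub.size < 2 * load_factor

  half = sub.size / 2
  # slice! removes the upper half from `sub` and returns it as a new array.
  tail = sub.slice!(half..)
  lists.insert(index + 1, tail)
  lists
end

lists = [[1, 2, 3, 4], [5, 6, 7]]
p split_if_needed(lists, 0, 2) # => [[1, 2], [3, 4], [5, 6, 7]]
```

A small `load_factor` keeps each shift cheap but creates more subarrays to search through; a large one does the opposite, which is why the best value depends on the workload.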

## Benchmarks

Performance comparison against [SortedSet](https://github.com/knu/sorted_set), a C extension red-black tree implementation (lower is better).
Performance comparison against [SortedSet](https://github.com/knu/sorted_set), a C extension red-black tree implementation. Every test was run 5 times and the average was taken.
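
The averaging described above can be sketched with Ruby's standard `Benchmark.realtime`. `average_realtime` is a hypothetical helper introduced here for illustration; the benchmark scripts in this commit accumulate `Benchmark.measure { ... }.real` into a hash instead.

```ruby
require "benchmark"

# Hypothetical helper for illustration only: runs the given block `runs` times
# and returns the mean wall-clock time in seconds.
def average_realtime(runs)
  total = 0.0
  runs.times { total += Benchmark.realtime { yield } }
  total / runs
end

# Example: average time to sort a shuffled array over 5 runs.
data = (1..10_000).to_a.shuffle
avg = average_realtime(5) { data.sort }
puts format("average: %.6f s", avg)
```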

You can see that SortedContainers has comparable performance for add and delete, and much better performance for iteration, initialization, and include.

Note: I do not know why initialization is faster for 4 million than 3 million elements. This was consistent across multiple runs.

- MacBook Pro (16-inch, 2019)
- 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4
- Ruby 3.2.2
- SortedContainers 0.1.0
- SortedSet 1.0.3

### Results (Lower is better)
<img src="benchmark/initialize_performance_comparison.png" width="50%">
<img src="benchmark/add_performance_comparison.png" width="50%">
<img src="benchmark/delete_performance_comparison.png" width="50%">
<img src="benchmark/iteration_performance_comparison.png" width="50%">
Binary file added benchmark/.DS_Store
Binary file not shown.
Binary file modified benchmark/add_performance_comparison.png
66 changes: 41 additions & 25 deletions benchmark/benchmark.rb
@@ -6,47 +6,63 @@
require "csv"

sizes = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
#sizes = [100_000, 200_000, 300_000, 400_000, 500_000]
#sizes = [10_000, 20_000, 30_000, 40_000, 50_000]
results = []
runs = 5

Benchmark.bm(15) do |bm|
  sizes.each do |n|
    list = Array.new(n) { rand(0..n) }
    # The items to be added to the set
    list_adds = (1..n).to_a.shuffle
    results_for_n = { size: n }

    # Benchmarking original SortedSet
    bm.report("SortedSet #{n}:") do
      sorted_set = SortedSet.new
      results_for_n[:add_sorted_set] = Benchmark.measure { list.each { |i| sorted_set.add(i) } }.real
      results_for_n[:include_sorted_set] = Benchmark.measure { list.each { |i| sorted_set.include?(i) } }.real
      # rubocop:disable Lint/EmptyBlock
      results_for_n[:loop_sorted_set] = Benchmark.measure { sorted_set.each { |i| } }.real
      # rubocop:enable Lint/EmptyBlock
      results_for_n[:delete_sorted_set] = Benchmark.measure { list.shuffle.each { |i| sorted_set.delete(i) } }.real
      total_time = {add: 0, include: 0, loop: 0, delete: 0}
      runs.times do
        sorted_set = SortedSet.new
        total_time[:add] += Benchmark.measure { list_adds.each { |i| sorted_set.add(i) } }.real
        total_time[:include] += Benchmark.measure { (1..n).map { rand((-0.5*n).to_i..(n*1.5).to_i) }.each { |i| sorted_set.include?(i) } }.real
        total_time[:loop] += Benchmark.measure { sorted_set.each { |i| } }.real
        total_time[:delete] += Benchmark.measure do
          list_adds.shuffle.each do |i|
            sorted_set.delete(i)
          end
        end.real
      end
      results_for_n[:add_sorted_set] = total_time[:add] / runs
      results_for_n[:include_sorted_set] = total_time[:include] / runs
      results_for_n[:loop_sorted_set] = total_time[:loop] / runs
      results_for_n[:delete_sorted_set] = total_time[:delete] / runs
    end

    # Benchmarking custom SortedSet
    bm.report("SortedContainers #{n}:") do
      sorted_set = SortedContainers::SortedSet.new
      results_for_n[:add_sorted_containers] = Benchmark.measure { list.each { |i| sorted_set.add(i) } }.real
      results_for_n[:include_sorted_containers] = Benchmark.measure { list.each { |i| sorted_set.include?(i) } }.real
      # rubocop:disable Lint/EmptyBlock
      results_for_n[:loop_sorted_containers] = Benchmark.measure { sorted_set.each { |i| } }.real
      # rubocop:enable Lint/EmptyBlock
      results_for_n[:delete_sorted_containers] = Benchmark.measure do
        list.shuffle.each do |i|
          sorted_set.delete(i)
        end
      end.real
      total_time = {add: 0, include: 0, loop: 0, delete: 0}
      runs.times do
        sorted_set = SortedContainers::SortedSet.new
        total_time[:add] += Benchmark.measure { list_adds.each { |i| sorted_set.add(i) } }.real
        total_time[:include] += Benchmark.measure { (1..n).map { rand((-0.5*n).to_i..(n*1.5).to_i) }.each { |i| sorted_set.include?(i) } }.real
        total_time[:loop] += Benchmark.measure { sorted_set.each { |i| } }.real
        total_time[:delete] += Benchmark.measure do
          list_adds.shuffle.each do |i|
            sorted_set.delete(i)
          end
        end.real
      end
      results_for_n[:add_sorted_containers] = total_time[:add] / runs
      results_for_n[:include_sorted_containers] = total_time[:include] / runs
      results_for_n[:loop_sorted_containers] = total_time[:loop] / runs
      results_for_n[:delete_sorted_containers] = total_time[:delete] / runs
    end

    results << results_for_n
  end
end

# Export results to CSV for visualization
CSV.open("benchmark_results.csv", "wb") do |csv|
csv << results.first.keys # Adds the headers
results.each do |data|
csv << data.values
csv << results.first.keys
results.each do |result|
csv << result.values
end
end
end
46 changes: 46 additions & 0 deletions benchmark/benchmark_init_only.rb
@@ -0,0 +1,46 @@
# frozen_string_literal: true

require "benchmark"
require "sorted_set"
require_relative "../lib/sorted_containers/sorted_set"
require "csv"

sizes = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
#sizes = [100_000, 200_000, 300_000, 400_000, 500_000]
#sizes = [10_000, 20_000, 30_000, 40_000, 50_000]
results = []
runs = 5

Benchmark.bm(15) do |bm|
  sizes.each do |n|
    # The items to be added to the set
    list_adds = (1..n).to_a.shuffle
    results_for_n = { size: n }

    # Benchmarking original SortedSet
    bm.report("SortedSet #{n}:") do
      total_time = {init: 0}
      runs.times do
        total_time[:init] += Benchmark.measure { SortedSet.new(list_adds) }.real
      end
      results_for_n[:init_sorted_set] = total_time[:init] / runs
    end

    # Benchmarking custom SortedSet
    bm.report("SortedContainers #{n}:") do
      total_time = {init: 0}
      runs.times do
        total_time[:init] += Benchmark.measure { SortedContainers::SortedSet.new(list_adds) }.real
      end
      results_for_n[:init_sorted_containers] = total_time[:init] / runs
    end
    results << results_for_n
  end
end

CSV.open("benchmark_results_init.csv", "wb") do |csv|
  csv << results.first.keys
  results.each do |result|
    csv << result.values
  end
end
10 changes: 5 additions & 5 deletions benchmark/benchmark_results.csv
@@ -1,6 +1,6 @@
size,add_sorted_set,include_sorted_set,loop_sorted_set,delete_sorted_set,add_sorted_containers,include_sorted_containers,loop_sorted_containers,delete_sorted_containers
1000000,1.554820999968797,1.2833110000938177,0.09176800027489662,1.4769029999151826,1.350441999733448,0.25363999977707863,0.02894899994134903,1.317619999870658
2000000,3.537397999782115,2.99283599993214,0.19226500019431114,3.3960860003717244,2.897786000277847,0.5344350002706051,0.056071000173687935,2.8261299999430776
3000000,5.6855340003967285,4.681231999769807,0.29748000018298626,6.225711999926716,4.550105999689549,0.8690530001185834,0.0854059997946024,4.565742999780923
4000000,7.780092000029981,6.583335000090301,0.3906319998204708,8.723869000095874,6.202872999943793,1.1050410000607371,0.11117599997669458,6.186546999961138
5000000,10.018141000065953,8.859457000158727,0.526653999928385,11.896155999973416,8.67450200021267,1.503161999862641,0.13894600002095103,8.176713000051677
1000000,1.2325135999359191,11.639327200129628,0.14237359995022417,1.3688056000508368,1.564646599907428,0.6316141999326647,0.044612800050526855,1.8568651999346912
2000000,3.1588088000193237,42.194053599890324,0.3135467999614775,3.3180113999173044,3.7178223999217153,1.5509135999716819,0.08306660009548068,4.205694200005382
3000000,5.286830999981612,95.30711920000613,0.47653339989483356,5.443703200109303,6.175909599941224,2.700068200007081,0.11789239989593625,6.770565199945122
4000000,7.538741199858487,153.70711900005116,0.6546002000570297,7.559051800053567,8.882256799936295,4.182955399993807,0.1547184000723064,9.676217600051313
5000000,10.20743679990992,241.74679239979014,0.8260514000430703,9.85318099996075,11.755913200229406,5.6500453999266025,0.18556239986792206,12.258404000196606
6 changes: 6 additions & 0 deletions benchmark/benchmark_results_init.csv
@@ -0,0 +1,6 @@
size,init_sorted_set,init_sorted_containers
1000000,1.626862599980086,0.5701855999417603
2000000,3.699613399989903,2.0838129998184742
3000000,6.832762000150979,3.794983599986881
4000000,9.501148399990052,3.467436399962753
5000000,13.889628200046719,4.997745399922133
6 changes: 3 additions & 3 deletions benchmark/build_graphs.rb
@@ -27,8 +27,8 @@ def create_graph(title, _sizes, data1, data2, labels, file_name)
  g.colors = %w[#ff6600 #3333ff]

  # Define data
  g.data("C-extension RB Tree", data1)
  g.data("SortedContainers SortedSet", data2)
  g.data("C Implemented RB Tree", data1)
  g.data("SortedContainers::SortedSet", data2)

  # X-axis labels
  g.x_axis_label = "Number of operations"
@@ -44,7 +44,7 @@ def create_graph(title, _sizes, data1, data2, labels, file_name)
end
# rubocop:enable Metrics/ParameterLists

# Generate labels for x_axis
# Generate labels for x_axis, format numbers with commas
labels = {}
sizes.each_with_index do |size, index|
  labels[index] = size.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
55 changes: 55 additions & 0 deletions benchmark/build_graphs_init_only.rb
@@ -0,0 +1,55 @@
# frozen_string_literal: true

require "gruff"
require "csv"

# Read data from CSV
data = CSV.read("benchmark_results_init.csv", headers: true, converters: :numeric)

# Prepare data arrays
sizes = data["size"]
operations = {
  "initialize" => %w[init_sorted_set init_sorted_containers]
}

# Method to create and save a graph
# rubocop:disable Metrics/ParameterLists
def create_graph(title, _sizes, data1, data2, labels, file_name)
  g = Gruff::Line.new
  g.title = "#{title} performance"

  g.theme = Gruff::Themes::THIRTYSEVEN_SIGNALS

  # Set line colors
  g.colors = %w[#ff6600 #3333ff]

  # Define data
  g.data("C Implemented RB Tree", data1)
  g.data("SortedContainers::SortedSet", data2)

  # X-axis labels
  g.x_axis_label = "Number of operations"

  # Labels for x_axis
  g.labels = labels

  # Formatting y-axis to ensure no scientific notation, may need to adjust if log scale creates issues
  g.y_axis_label = "Time (seconds)"

  # Write the graph to a file
  g.write(file_name)
end
# rubocop:enable Metrics/ParameterLists

# Generate labels for x_axis, format numbers with commas
labels = {}
sizes.each_with_index do |size, index|
  labels[index] = size.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
end

# Generate a graph for each operation
operations.each do |operation, keys|
  puts "#{operation} #{keys}"
  create_graph(operation, sizes, data[keys[0]], data[keys[1]], labels,
               "#{operation.downcase}_performance_comparison.png")
end
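
The comma formatting used for the x-axis labels above can be tried in isolation. This is a small sketch: `with_commas` is a name introduced here for illustration, wrapping the same reverse/`gsub`/reverse expression from the script. Reversing the digits lets the regex group them in threes from the right, and the lookahead `(?=\d)` avoids adding a comma after the final group.

```ruby
# Hypothetical wrapper for illustration; the script inlines this expression.
def with_commas(number)
  # Reverse the digits, add a comma after every complete group of three that is
  # followed by another digit, then reverse back.
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
end

puts with_commas(1_000_000) # prints 1,000,000
puts with_commas(1000)      # prints 1,000
puts with_commas(100)       # prints 100
```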
Binary file modified benchmark/delete_performance_comparison.png
Binary file modified benchmark/include_performance_comparison.png
Binary file added benchmark/init_performance_comparison.png
Binary file added benchmark/initialize_performance_comparison.png
Binary file added benchmark/initiation_performance_comparison.png
Binary file modified benchmark/iteration_performance_comparison.png
Binary file removed benchmark/looping_performance_comparison.png
Binary file not shown.
Binary file removed benchmark/performance_comparison.png
Binary file not shown.
