Update benchmarks and README

GarrisonJ · Apr 26, 2024 · 43db188
1 parent d1c7a8b

Showing 17 changed files with 171 additions and 42 deletions.
diff --git a/README.md b/README.md
@@ -1,32 +1,38 @@
 # SortedContainers
 
-SortedContainers is a _fast_ implementation of sorted lists, sets, and dictionaries in pure Ruby. It is based on the [sortedcontainers](https://grantjenks.com/docs/sortedcontainers/) Python library by Grant Jenks.
+SortedContainers is a fast implementation of sorted lists, sets, and dictionaries in pure Ruby. It is based on the [sortedcontainers](https://grantjenks.com/docs/sortedcontainers/) Python library by Grant Jenks.
 
 SortedContainers provides three main classes: `SortedArray`, `SortedSet`, and `SortedHash`. Each class is a drop-in replacement for the corresponding Ruby class, but with the added benefit of maintaining the elements in sorted order.
 
-SortedContainers exploits the fact that modern computers are really good at shifting elements around in memory. We sacrifice theroetical time complexity for practical performance. In practice, SortedContainers is _fast_.
+SortedContainers exploits the fact that modern computers are really good at shifting arrays in memory. We sacrifice theoretical time complexity for practical performance. In practice, SortedContainers is fast.
 
 ## How it works
 
-Computers are really good at shifting arrays around. For that reason, in practice it's often faster to keep an array sorted than to use the usual tree-based data structures.
+Computers are good at shifting arrays. For that reason, it's often faster to keep an array sorted than to use the usual tree-based data structures.
 
-For example, if you have the array `[1, 2, 4, 5]` and you want to insert the element `3`, you can simply shift `4, 5` to the right and insert `3` in the correct position. This is a `O(n)` operation, but it's fast.
+For example, if you have the array `[1, 2, 4, 5]` and want to insert the element `3`, you can shift `4, 5` to the right and insert `3` in the correct position. This is an `O(n)` operation, but in practice it's fast.
 
-But if we have a lot of elements we can do better by breaking up the array into smaller arrays. That way we don't have to shift so many elements whenever we insert. For example, if you have the array `[[1, 2], [4, 5]]` and you want to insert the element `3`, you can simply insert `3` into the first array. 
+But we can do better if we have a lot of elements. We can break up the array into smaller arrays so the shifts don't have to move so many elements. For example, if you have the array `[[1,2,4], [5,6,7]]` and you want to insert the element `3`, you can insert `3` into the first array to get `[[1,2,3,4], [5,6,7]]` and only the element `4` has to be shifted.
 
-This often outperforms the more common tree-based data structures like red-black trees and AVL trees, which have `O(log n)` insertions and deletions. In practice, the `O(n)` insertions and deletions of SortedContainers are faster.
+This often outperforms the more common tree-based data structures like red-black trees with `O(log n)` insertions and deletions. In practice, the `O(n)` insertions and deletions of SortedContainers are faster.
 
-How big these smaller arrays should be is a trade-off. The default is set DEFAULT_LOAD_FACTOR = 1000. There is no perfect value and the ideal value will depend on your use case.
+How big the subarrays are is a trade-off. You can control the subarray size by setting the `load_factor`. The default is `DEFAULT_LOAD_FACTOR = 1000`. A subarray is split when its size is `2*load_factor`. There is no perfect value. The ideal value will depend on your use case and may require some experimentation.
 
 ## Benchmarks
 
-Performance comparison against [SortedSet](https://github.com/knu/sorted_set) a C extension red-black tree implementation (lower is better).
+Performance comparison against [SortedSet](https://github.com/knu/sorted_set), a C extension red-black tree implementation. Every test was run 5 times and the average was taken.
 
+You can see that SortedContainers has comparable performance for add and delete, and much better performance for iteration, initialization, and include.
+
+Note: I do not know why initialization is faster for 4 million than 3 million elements. This was consistent across multiple runs.
+
+- MacBook Pro (16-inch, 2019)
 - 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4
 - Ruby 3.2.2
 - SortedContainers 0.1.0
 - SortedSet 1.0.3
-
+### Results (Lower is better)
+<img src="benchmark/initialize_performance_comparison.png" width="50%">
 <img src="benchmark/add_performance_comparison.png" width="50%">
 <img src="benchmark/delete_performance_comparison.png" width="50%">
 <img src="benchmark/iteration_performance_comparison.png" width="50%">
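
Note on the README's "How it works" section: the subarray idea can be sketched in a few lines of Ruby. This is a minimal illustration under stated assumptions, not the gem's implementation. The class name `BucketedSortedList`, its `add` and `to_a` methods, and the `buckets` reader are hypothetical. Only the `load_factor` name and the split at `2*load_factor` come from the README text.

```ruby
# Hypothetical sketch of a sorted list stored as an array of sorted subarrays.
# It is NOT the SortedContainers implementation, only an illustration.
class BucketedSortedList
  attr_reader :buckets

  def initialize(load_factor: 1000)
    @load_factor = load_factor
    @buckets = [] # each bucket is a sorted array
  end

  def add(value)
    if @buckets.empty?
      @buckets << [value]
      return self
    end

    # Binary search for the first bucket whose last element is >= value.
    # If there is none, the value belongs at the end of the last bucket.
    bucket_index = @buckets.bsearch_index { |bucket| bucket.last >= value } || (@buckets.size - 1)
    bucket = @buckets[bucket_index]

    # Binary search inside the bucket, then shift only that bucket's tail.
    position = bucket.bsearch_index { |item| item >= value } || bucket.size
    bucket.insert(position, value)

    # Split the bucket once it holds 2 * load_factor elements.
    split(bucket_index) if bucket.size >= 2 * @load_factor
    self
  end

  def to_a
    @buckets.flatten(1)
  end

  private

  # Move the upper half of a bucket into a new bucket right after it.
  def split(bucket_index)
    bucket = @buckets[bucket_index]
    upper_half = bucket.pop(bucket.size / 2)
    @buckets.insert(bucket_index + 1, upper_half)
  end
end
```

With a small `load_factor`, inserting `3` only touches the one bucket it lands in, so the shift cost is bounded by the bucket size rather than by the total number of elements.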

diff --git a/benchmark/.DS_Store b/benchmark/.DS_Store
diff --git a/benchmark/add_performance_comparison.png b/benchmark/add_performance_comparison.png
diff --git a/benchmark/benchmark.rb b/benchmark/benchmark.rb
@@ -6,47 +6,63 @@
 require "csv"
 
 sizes = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
+#sizes = [100_000, 200_000, 300_000, 400_000, 500_000]
+#sizes = [10_000, 20_000, 30_000, 40_000, 50_000]
 results = []
+runs = 5
 
 Benchmark.bm(15) do |bm|
   sizes.each do |n|
-    list = Array.new(n) { rand(0..n) }
+    # The items to be added to the set
+    list_adds = (1..n).to_a.shuffle
     results_for_n = { size: n }
 
     # Benchmarking original SortedSet
     bm.report("SortedSet #{n}:") do
-      sorted_set = SortedSet.new
-      results_for_n[:add_sorted_set] = Benchmark.measure { list.each { |i| sorted_set.add(i) } }.real
-      results_for_n[:include_sorted_set] = Benchmark.measure { list.each { |i| sorted_set.include?(i) } }.real
-      # rubocop:disable Lint/EmptyBlock
-      results_for_n[:loop_sorted_set] = Benchmark.measure { sorted_set.each { |i| } }.real
-      # rubocop:enable Lint/EmptyBlock
-      results_for_n[:delete_sorted_set] = Benchmark.measure { list.shuffle.each { |i| sorted_set.delete(i) } }.real
+      total_time = {add: 0, include: 0, loop: 0, delete: 0}
+      runs.times do
+        sorted_set = SortedSet.new
+        total_time[:add] += Benchmark.measure { list_adds.each { |i| sorted_set.add(i) } }.real
+        total_time[:include] += Benchmark.measure { (1..n).map { rand((-0.5*n).to_i..(n*1.5).to_i) }.each { |i| sorted_set.include?(i) } }.real
+        total_time[:loop] += Benchmark.measure { sorted_set.each { |i| } }.real
+        total_time[:delete] += Benchmark.measure do 
+          list_adds.shuffle.each do |i| 
+            sorted_set.delete(i) 
+          end
+        end.real
+      end
+      results_for_n[:add_sorted_set] = total_time[:add] / runs
+      results_for_n[:include_sorted_set] = total_time[:include] / runs
+      results_for_n[:loop_sorted_set] = total_time[:loop] / runs
+      results_for_n[:delete_sorted_set] = total_time[:delete] / runs
     end
 
     # Benchmarking custom SortedSet
     bm.report("SortedContainers #{n}:") do
-      sorted_set = SortedContainers::SortedSet.new
-      results_for_n[:add_sorted_containers] = Benchmark.measure { list.each { |i| sorted_set.add(i) } }.real
-      results_for_n[:include_sorted_containers] = Benchmark.measure { list.each { |i| sorted_set.include?(i) } }.real
-      # rubocop:disable Lint/EmptyBlock
-      results_for_n[:loop_sorted_containers] = Benchmark.measure { sorted_set.each { |i| } }.real
-      # rubocop:enable Lint/EmptyBlock
-      results_for_n[:delete_sorted_containers] = Benchmark.measure do
-        list.shuffle.each do |i|
-          sorted_set.delete(i)
-        end
-      end.real
+      total_time = {add: 0, include: 0, loop: 0, delete: 0}
+      runs.times do
+        sorted_set = SortedContainers::SortedSet.new
+        total_time[:add] += Benchmark.measure { list_adds.each { |i| sorted_set.add(i) } }.real
+        total_time[:include] += Benchmark.measure { (1..n).map { rand((-0.5*n).to_i..(n*1.5).to_i) }.each { |i| sorted_set.include?(i) } }.real
+        total_time[:loop] += Benchmark.measure { sorted_set.each { |i| } }.real
+        total_time[:delete] += Benchmark.measure do 
+          list_adds.shuffle.each do |i| 
+            sorted_set.delete(i) 
+          end
+        end.real
+      end
+      results_for_n[:add_sorted_containers] = total_time[:add] / runs
+      results_for_n[:include_sorted_containers] = total_time[:include] / runs
+      results_for_n[:loop_sorted_containers] = total_time[:loop] / runs
+      results_for_n[:delete_sorted_containers] = total_time[:delete] / runs
     end
-
     results << results_for_n
   end
 end
 
-# Export results to CSV for visualization
 CSV.open("benchmark_results.csv", "wb") do |csv|
-  csv << results.first.keys # Adds the headers
-  results.each do |data|
-    csv << data.values
+  csv << results.first.keys
+  results.each do |result|
+    csv << result.values
   end
-end
+end
diff --git a/benchmark/benchmark_init_only.rb b/benchmark/benchmark_init_only.rb
@@ -0,0 +1,46 @@
+# frozen_string_literal: true
+
+require "benchmark"
+require "sorted_set"
+require_relative "../lib/sorted_containers/sorted_set"
+require "csv"
+
+sizes = [1_000_000, 2_000_000, 3_000_000, 4_000_000, 5_000_000]
+#sizes = [100_000, 200_000, 300_000, 400_000, 500_000]
+#sizes = [10_000, 20_000, 30_000, 40_000, 50_000]
+results = []
+runs = 5
+
+Benchmark.bm(15) do |bm|
+  sizes.each do |n|
+    # The items to be added to the set
+    list_adds = (1..n).to_a.shuffle
+    results_for_n = { size: n }
+
+    # Benchmarking original SortedSet
+    bm.report("SortedSet #{n}:") do
+      total_time = {init: 0}
+      runs.times do
+        total_time[:init] += Benchmark.measure { SortedSet.new(list_adds) }.real
+      end
+      results_for_n[:init_sorted_set] = total_time[:init] / runs
+    end
+
+    # Benchmarking custom SortedSet
+    bm.report("SortedContainers #{n}:") do
+      total_time = {init: 0}
+      runs.times do
+        total_time[:init] += Benchmark.measure { SortedContainers::SortedSet.new(list_adds) }.real
+      end
+      results_for_n[:init_sorted_containers] = total_time[:init] / runs
+    end
+    results << results_for_n
+  end
+end
+
+CSV.open("benchmark_results_init.csv", "wb") do |csv|
+  csv << results.first.keys
+  results.each do |result|
+    csv << result.values
+  end
+end
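
Note on the two benchmark scripts: both repeat the same accumulate-and-divide pattern for each measured operation. As a possible simplification, the averaging could live in one small helper. The name `average_real_time` is hypothetical and does not appear in this commit. Only `Benchmark.measure` and its `real` field are taken from the scripts above.

```ruby
require "benchmark"

# Hypothetical helper: run the given block `runs` times and return the
# average wall-clock time in seconds, as reported by Benchmark.measure.
def average_real_time(runs)
  total = 0.0
  runs.times { total += Benchmark.measure { yield }.real }
  total / runs
end

# Example usage: average time to shuffle and sort a small array.
average = average_real_time(5) { (1..1_000).to_a.shuffle.sort }
puts "average seconds: #{average}"
```

Anything that should not be timed, such as generating random lookup keys, would need to be prepared outside the block passed to the helper.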
diff --git a/benchmark/benchmark_results.csv b/benchmark/benchmark_results.csv
@@ -1,6 +1,6 @@
 size,add_sorted_set,include_sorted_set,loop_sorted_set,delete_sorted_set,add_sorted_containers,include_sorted_containers,loop_sorted_containers,delete_sorted_containers
-1000000,1.554820999968797,1.2833110000938177,0.09176800027489662,1.4769029999151826,1.350441999733448,0.25363999977707863,0.02894899994134903,1.317619999870658
-2000000,3.537397999782115,2.99283599993214,0.19226500019431114,3.3960860003717244,2.897786000277847,0.5344350002706051,0.056071000173687935,2.8261299999430776
-3000000,5.6855340003967285,4.681231999769807,0.29748000018298626,6.225711999926716,4.550105999689549,0.8690530001185834,0.0854059997946024,4.565742999780923
-4000000,7.780092000029981,6.583335000090301,0.3906319998204708,8.723869000095874,6.202872999943793,1.1050410000607371,0.11117599997669458,6.186546999961138
-5000000,10.018141000065953,8.859457000158727,0.526653999928385,11.896155999973416,8.67450200021267,1.503161999862641,0.13894600002095103,8.176713000051677
+1000000,1.2325135999359191,11.639327200129628,0.14237359995022417,1.3688056000508368,1.564646599907428,0.6316141999326647,0.044612800050526855,1.8568651999346912
+2000000,3.1588088000193237,42.194053599890324,0.3135467999614775,3.3180113999173044,3.7178223999217153,1.5509135999716819,0.08306660009548068,4.205694200005382
+3000000,5.286830999981612,95.30711920000613,0.47653339989483356,5.443703200109303,6.175909599941224,2.700068200007081,0.11789239989593625,6.770565199945122
+4000000,7.538741199858487,153.70711900005116,0.6546002000570297,7.559051800053567,8.882256799936295,4.182955399993807,0.1547184000723064,9.676217600051313
+5000000,10.20743679990992,241.74679239979014,0.8260514000430703,9.85318099996075,11.755913200229406,5.6500453999266025,0.18556239986792206,12.258404000196606
diff --git a/benchmark/benchmark_results_init.csv b/benchmark/benchmark_results_init.csv
@@ -0,0 +1,6 @@
+size,init_sorted_set,init_sorted_containers
+1000000,1.626862599980086,0.5701855999417603
+2000000,3.699613399989903,2.0838129998184742
+3000000,6.832762000150979,3.794983599986881
+4000000,9.501148399990052,3.467436399962753
+5000000,13.889628200046719,4.997745399922133
diff --git a/benchmark/build_graphs.rb b/benchmark/build_graphs.rb
@@ -27,8 +27,8 @@ def create_graph(title, _sizes, data1, data2, labels, file_name)
   g.colors = %w[#ff6600 #3333ff]
 
   # Define data
-  g.data("C-extension RB Tree", data1)
-  g.data("SortedContainers SortedSet", data2)
+  g.data("C Implemented RB Tree", data1)
+  g.data("SortedContainers::SortedSet", data2)
 
   # X-axis labels
   g.x_axis_label = "Number of operations"
@@ -44,7 +44,7 @@ def create_graph(title, _sizes, data1, data2, labels, file_name)
 end
 # rubocop:enable Metrics/ParameterLists
 
-# Generate labels for x_axis
+# Generate labels for x_axis, format numbers with commas
 labels = {}
 sizes.each_with_index do |size, index|
   labels[index] = size.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
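
Note on the label formatting: the expression reverses the digit string, appends a comma after every group of three digits that is followed by another digit, and reverses the result back. A small sketch makes this easier to check. The wrapper name `with_commas` is hypothetical. The expression inside it is copied from the diff.

```ruby
# Hypothetical wrapper around the expression used in build_graphs.rb.
def with_commas(number)
  # Reverse so groups of three are counted from the least significant digit,
  # add a comma after each full group that still has a digit after it,
  # then reverse back to the normal reading order.
  number.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
end

puts with_commas(1_000_000) # prints 1,000,000
puts with_commas(50_000)    # prints 50,000
puts with_commas(100)       # prints 100
```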

diff --git a/benchmark/build_graphs_init_only.rb b/benchmark/build_graphs_init_only.rb
@@ -0,0 +1,55 @@
+# frozen_string_literal: true
+
+require "gruff"
+require "csv"
+
+# Read data from CSV
+data = CSV.read("benchmark_results_init.csv", headers: true, converters: :numeric)
+
+# Prepare data arrays
+sizes = data["size"]
+operations = {
+  "initialize" => %w[init_sorted_set init_sorted_containers]
+}
+
+# Method to create and save a graph
+# rubocop:disable Metrics/ParameterLists
+def create_graph(title, _sizes, data1, data2, labels, file_name)
+  g = Gruff::Line.new
+  g.title = "#{title} performance"
+
+  g.theme = Gruff::Themes::THIRTYSEVEN_SIGNALS
+
+  # Set line colors
+  g.colors = %w[#ff6600 #3333ff]
+
+  # Define data
+  g.data("C Implemented RB Tree", data1)
+  g.data("SortedContainers::SortedSet", data2)
+
+  # X-axis labels
+  g.x_axis_label = "Number of operations"
+
+  # Labels for x_axis
+  g.labels = labels
+
+  # Y-axis label
+  g.y_axis_label = "Time (seconds)"
+
+  # Write the graph to a file
+  g.write(file_name)
+end
+# rubocop:enable Metrics/ParameterLists
+
+# Generate labels for x_axis, format numbers with commas
+labels = {}
+sizes.each_with_index do |size, index|
+  labels[index] = size.to_s.reverse.gsub(/(\d{3})(?=\d)/, '\\1,').reverse
+end
+
+# Generate a graph for each operation
+operations.each do |operation, keys|
+  puts "#{operation} #{keys}"
+  create_graph(operation, sizes, data[keys[0]], data[keys[1]], labels,
+               "#{operation.downcase}_performance_comparison.png")
+end
diff --git a/benchmark/delete_performance_comparison.png b/benchmark/delete_performance_comparison.png
diff --git a/benchmark/include_performance_comparison.png b/benchmark/include_performance_comparison.png
diff --git a/benchmark/init_performance_comparison.png b/benchmark/init_performance_comparison.png
diff --git a/benchmark/initialize_performance_comparison.png b/benchmark/initialize_performance_comparison.png
diff --git a/benchmark/initiation_performance_comparison.png b/benchmark/initiation_performance_comparison.png
diff --git a/benchmark/iteration_performance_comparison.png b/benchmark/iteration_performance_comparison.png
diff --git a/benchmark/looping_performance_comparison.png b/benchmark/looping_performance_comparison.png
diff --git a/benchmark/performance_comparison.png b/benchmark/performance_comparison.png