Extract GroupValues (#6969) #7016

tustvold · 2023-07-19T00:11:28Z

Which issue does this PR close?

Part of #6969

Rationale for this change

Extracts the group key storage into a type-erased trait object so that it can then be specialized for different column types.
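For readers unfamiliar with the pattern, here is a minimal sketch of what "type-erased trait object" means here. It is not the code in this PR: the `String` keys, the `GenericGroupValues` type, the `new_group_values` constructor, and the exact method signatures are illustrative assumptions. The real trait works on Arrow arrays.

```rust
use std::collections::HashMap;

/// Simplified, hypothetical stand-in for the trait introduced by this PR:
/// it maps each input key to a stable group index.
trait GroupValues: Send {
    /// Interns every key in `keys`, writing one group index per key into `groups`.
    fn intern(&mut self, keys: &[String], groups: &mut Vec<usize>);

    /// Number of distinct groups seen so far.
    fn len(&self) -> usize;
}

/// General-purpose implementation (assumed name), standing in for a
/// row-format based store.
#[derive(Default)]
struct GenericGroupValues {
    index: HashMap<String, usize>,
}

impl GroupValues for GenericGroupValues {
    fn intern(&mut self, keys: &[String], groups: &mut Vec<usize>) {
        groups.clear();
        for key in keys {
            // A new key receives the next unused group index.
            let next = self.index.len();
            let idx = *self.index.entry(key.clone()).or_insert(next);
            groups.push(idx);
        }
    }

    fn len(&self) -> usize {
        self.index.len()
    }
}

/// The caller only ever sees `Box<dyn GroupValues>`, so a specialized
/// implementation could be returned here without touching the caller.
fn new_group_values() -> Box<dyn GroupValues> {
    Box::new(GenericGroupValues::default())
}
```

Because the aggregation stream would hold only the boxed trait object, choosing a different implementation per column type becomes a decision made once at construction time.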

I have not been able to run the benchmarks on this yet, but I don't expect it to have a material performance impact.

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

tustvold · 2023-07-19T00:12:53Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

-        }
-
-        // account for memory growth in scratch space
-        *allocated += self.scratch_space.size();


I couldn't see a reason to keep track of the delta rather than just calling try_resize at the end.
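A rough sketch of the "resize once at the end" approach, under stated assumptions: `Reservation`, `State`, and `process_batch` are hypothetical stand-ins written for this illustration, not DataFusion types. Only the name `try_resize` comes from the comment above.

```rust
/// Hypothetical stand-in for a memory reservation with a fixed budget.
struct Reservation {
    reserved: usize,
    limit: usize,
}

impl Reservation {
    /// Sets the reservation to exactly `capacity` bytes, or fails if that
    /// would exceed the budget.
    fn try_resize(&mut self, capacity: usize) -> Result<(), String> {
        if capacity > self.limit {
            return Err(format!(
                "cannot reserve {} bytes, limit is {}",
                capacity, self.limit
            ));
        }
        self.reserved = capacity;
        Ok(())
    }
}

/// Hypothetical aggregation state whose buffers grow while a batch is processed.
struct State {
    group_keys: Vec<u64>,
    scratch: Vec<u8>,
}

impl State {
    /// Total bytes currently allocated by all buffers.
    fn size(&self) -> usize {
        self.group_keys.capacity() * std::mem::size_of::<u64>() + self.scratch.capacity()
    }
}

/// Processes a batch, then reports the total size once at the end instead of
/// accumulating a per-buffer delta along the way.
fn process_batch(
    state: &mut State,
    reservation: &mut Reservation,
    batch: &[u64],
) -> Result<(), String> {
    state.group_keys.extend_from_slice(batch);
    state.scratch.resize(batch.len(), 0);
    reservation.try_resize(state.size())
}
```

With this shape there is no running `allocated` counter to keep in sync: whatever the buffers hold after the batch is what gets reserved.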

Related response: #6932 (comment)

alamb · 2023-07-19T09:38:32Z

I am running benchmarks on this PR

alamb · 2023-07-19T10:42:00Z

My benchmark runs confirm what @tustvold suspected: there is no major performance change for this PR.

--------------------
Benchmark tpch_mem.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ main_base ┃ extract-group-values ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │  570.35ms │             567.53ms │     no change │
│ QQuery 2     │  166.07ms │             162.08ms │     no change │
│ QQuery 3     │  174.89ms │             170.65ms │     no change │
│ QQuery 4     │  115.82ms │             113.91ms │     no change │
│ QQuery 5     │  390.04ms │             388.49ms │     no change │
│ QQuery 6     │   39.12ms │              40.99ms │     no change │
│ QQuery 7     │  956.00ms │             925.58ms │     no change │
│ QQuery 8     │  239.53ms │             249.49ms │     no change │
│ QQuery 9     │  586.23ms │             586.65ms │     no change │
│ QQuery 10    │  331.74ms │             339.71ms │     no change │
│ QQuery 11    │  166.03ms │             155.13ms │ +1.07x faster │
│ QQuery 12    │  173.20ms │             174.27ms │     no change │
│ QQuery 13    │  305.51ms │             306.92ms │     no change │
│ QQuery 14    │   48.78ms │              51.24ms │  1.05x slower │
│ QQuery 15    │   52.68ms │              53.09ms │     no change │
│ QQuery 16    │  165.92ms │             172.41ms │     no change │
│ QQuery 17    │  981.34ms │             924.09ms │ +1.06x faster │
│ QQuery 18    │ 1644.03ms │            1696.96ms │     no change │
│ QQuery 19    │  169.66ms │             168.50ms │     no change │
│ QQuery 20    │  334.96ms │             329.97ms │     no change │
│ QQuery 21    │ 1161.63ms │            1189.21ms │     no change │
│ QQuery 22    │   87.75ms │              90.87ms │     no change │
└──────────────┴───────────┴──────────────────────┴───────────────┘

alamb

This looks great to me -- thank you @tustvold

I am very excited to see what this looks like with a special case for single columns (I expect it to be screaming fast)

alamb · 2023-07-19T10:43:24Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

@@ -60,6 +60,151 @@ pub(crate) enum ExecutionState {

 use super::AggregateExec;

+/// An interning store for group keys


Suggested change

-/// An interning store for group keys
+/// Stores group key values and their mapping to group_index
+///
+/// This is a trait to allow special casing certain kinds of keys
+/// like single column primitive arrays

alamb · 2023-07-19T10:43:50Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

+    fn flush(&mut self) -> Result<Vec<ArrayRef>>;
+}
+
+/// A [`GroupValues`] making use of [`Rows`]


Suggested change

-/// A [`GroupValues`] making use of [`Rows`]
+/// A [`GroupValues`] which stores group values using the arrow_row format [`Rows`]

alamb · 2023-07-19T10:45:08Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

+    ///
+    /// keys: u64 hashes of the GroupValue
+    /// values: (hash, group_index)
+    map: RawTable<(u64, usize)>,


I am surprised to see the map in this structure -- I was expecting only the group values.

Is your idea to store the group keys inside the map, as shown to be so effective by @yahoNanJing in apache/arrow-rs#4524 (comment)?

I was also thinking this structure will allow for easy insertion of the FixedWidthRowFormat if that turns out to be better as well. So I think this PR is a step in the right direction regardless of which approach we choose to take.
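To make the layout under discussion concrete, here is a small sketch of a map that stores only (hash, group_index) while the group values live in a separate store. It is an illustration under assumptions: a std `HashMap<u64, Vec<usize>>` stands in for the `RawTable<(u64, usize)>` in the diff, `String` keys stand in for row-encoded values, and `Interner` is a hypothetical name.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical interner: the map holds hashes and group indices only;
/// the actual key values are kept in `values`, indexed by group_index.
struct Interner {
    /// hash -> indices of groups whose key has that hash
    map: HashMap<u64, Vec<usize>>,
    /// group values, indexed by group_index
    values: Vec<String>,
}

impl Interner {
    fn new() -> Self {
        Interner {
            map: HashMap::new(),
            values: Vec::new(),
        }
    }

    /// Returns the group index for `key`, creating a new group if needed.
    fn intern(&mut self, key: &str) -> usize {
        let mut hasher = DefaultHasher::new();
        key.hash(&mut hasher);
        let hash = hasher.finish();

        let values = &mut self.values;
        let candidates = self.map.entry(hash).or_default();

        // A matching hash is not enough: confirm equality against the stored
        // value so that hash collisions do not merge distinct groups.
        if let Some(&idx) = candidates.iter().find(|&&idx| values[idx] == key) {
            return idx;
        }

        // New group: append the value and remember its index under this hash.
        let idx = values.len();
        values.push(key.to_string());
        candidates.push(idx);
        idx
    }
}
```

Keeping the map inside the same structure as the values is what lets the equality check on a hash hit stay an implementation detail; storing the keys directly in the map, as the question above raises, would be a different trade-off that this sketch does not attempt.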

alamb · 2023-07-19T10:45:58Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

@@ -60,6 +60,151 @@ pub(crate) enum ExecutionState {

 use super::AggregateExec;

+/// An interning store for group keys
+trait GroupValues: Send {


Eventually I would love to see this in its own module, aggregates/group_values.rs, but that can be a follow-on PR for sure.

alamb · 2023-07-19T10:47:52Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

+        batch_hashes.resize(n_rows, 0);
+        create_hashes(cols, &self.random_state, batch_hashes)?;
+
+        for (row, &hash) in batch_hashes.iter().enumerate() {


This change will likely conflict with #6932 (review), but I think it should be straightforward to update.

alamb · 2023-07-19T10:48:31Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

-///                      └────────────┘    ││ └────────┘ ││       ││ └────────┘ ││
-///                                        │└────────────┘│       │└────────────┘│
-///                                        └──────────────┘       └──────────────┘
+///         ┌────────────┐              ┌──────────────┐       ┌──────────────┐


If we choose to go with this approach, I can make some pictures for RowGroupValues as well

alamb · 2023-07-19T10:49:19Z

datafusion/core/src/physical_plan/aggregates/row_hash.rs

-        }
-
-        // account for memory growth in scratch space
-        *allocated += self.scratch_space.size();


Related response: #6932 (comment)

Extract GroupValues (apache#6969)

a961ad0

github-actions bot added the core Core DataFusion crate label Jul 19, 2023

tustvold commented Jul 19, 2023

alamb mentioned this pull request Jul 19, 2023

Consolidate BoundedAggregateStream #6932

Merged

alamb approved these changes Jul 19, 2023

tustvold mentioned this pull request Jul 19, 2023

Don't store hashes in GroupOrdering #7029

Merged

tustvold added 2 commits July 19, 2023 11:37

Merge remote-tracking branch 'upstream/main' into extract-group-values

ad3a4aa

Merge remote-tracking branch 'upstream/main' into extract-group-values

ab2db82

tustvold merged commit a6dcd94 into apache:main Jul 19, 2023
