Improve GpuExpand by pre-projecting some columns [databricks] #10247

Merged: 6 commits into branch-24.04 on Feb 13, 2024

Conversation

@firestarman (Collaborator) commented on Jan 23, 2024:

closes #10249

Some rules in Spark (e.g. RewriteDistinctAggregates) put non-leaf expressions into Expand projections, so the GPU tiered projection cannot be leveraged across the projection lists.

So this PR tries to factor out these expressions and evaluate them before expanding, to avoid duplicate evaluation of semantically equal (sub)expressions.

E.g. given the projections:

      [if((a+b)>0) 1 else 0, null],
      [null, if((a+b)=0) "no" else "yes"]

without pre-projection, a+b is evaluated twice. With pre-projection, this becomes:

    # pre-projection list:
            [if((a+b)>0) 1 else 0, if((a+b)=0) "no" else "yes"]
    # pre-projected projections for expanding:
            [_pre-project-c1#0, null], [null, _pre-project-c3#1]
    # where
            _pre-project-c1#0 refers to: if((a+b)>0) 1 else 0
            _pre-project-c3#1 refers to: if((a+b)=0) "no" else "yes"

By leveraging the tiered projection on the pre-projection list, a+b is evaluated only once.
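
To make the rewrite concrete, here is a minimal sketch of the idea (the helper name and the `_pre-project-cN` naming scheme are illustrative assumptions, not the PR's exact code): every non-leaf, non-literal expression in the Expand projections is replaced by a reference to a column computed once in a pre-projection list, de-duplicated by Catalyst's semantic (canonicalized) equality.

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.{
  Alias, Attribute, Expression, Literal, NamedExpression}

// Sketch only: factor non-leaf, non-literal expressions out of the Expand
// projection lists into a single pre-projection list. Expressions that are
// semantically equal (same canonicalized form) share one pre-projected column.
def preprojectForExpand(projections: Seq[Seq[Expression]])
    : (Seq[NamedExpression], Seq[Seq[Expression]]) = {
  val preProjected = mutable.LinkedHashMap.empty[Expression, NamedExpression]
  val rewritten = projections.map(_.map {
    // Attributes and literals are cheap; keep them inline in the projections.
    case leaf @ (_: Attribute | _: Literal) => leaf
    case e =>
      val named = preProjected.getOrElseUpdate(
        e.canonicalized, Alias(e, s"_pre-project-c${preProjected.size}")())
      // Refer to the pre-projected column instead of re-evaluating e.
      named.toAttribute
  })
  (preProjected.values.toSeq, rewritten)
}
```

Running the tiered projection over the returned pre-projection list then evaluates shared subexpressions like a+b once, and the Expand itself only shuffles cheap attribute and literal references.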

@@ -1079,6 +1079,17 @@ val GPU_COREDUMP_PIPE_PATTERN = conf("spark.rapids.gpu.coreDump.pipePattern")
  .booleanConf
  .createWithDefault(true)

val ENABLE_EXPAND_PREPROJECT = conf("spark.rapids.sql.expandPreproject.enabled")
@winningsix (Collaborator) commented on Jan 23, 2024:

Introducing an internal configuration adds some extra burden for users. Could we have a check utility method to figure out whether a common expression exists? If we do have a common subexpression, we introduce a pre-project ahead of the Expand node; otherwise, we do nothing there.

It seems we can do this via a dry run of the tiered project: if a duplicated ref exists in the Seq[Seq[Expression]], we introduce a pre-project node to evaluate those duplicated refs.

/**
* Do projections in a tiered fashion, where earlier tiers contain sub-expressions that are
* referenced in later tiers. Each tier adds columns to the original batch corresponding
* to the output of the sub-expressions. It also removes columns that are no longer needed,
* based on inputAttrTiers for the current tier and the next tier.
* Example of how this is processed:
* Original projection expressions:
* (((a + b) + c) * e), (((a + b) + d) * f), (a + e), (c + f)
* Input columns for tier 1: a, b, c, d, e, f (original projection inputs)
* Tier 1: (a + b) as ref1
* Input columns for tier 2: a, c, d, e, f, ref1
* Tier 2: (ref1 + c) as ref2, (ref1 + d) as ref3
* Input columns for tier 3: a, c, e, f, ref2, ref3
* Tier 3: (ref2 * e), (ref3 * f), (a + e), (c + f)
*/
case class GpuTieredProject(exprTiers: Seq[Seq[GpuExpression]]) {
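
A minimal sketch of the suggested dry-run style check (a hypothetical helper, not code from this PR): scan the projection lists for two non-leaf subexpressions with the same canonicalized form, which is exactly the case where a pre-project would pay off.

```scala
import scala.collection.mutable
import org.apache.spark.sql.catalyst.expressions.Expression

// Hypothetical check: does any non-leaf subexpression occur (semantically)
// more than once across the Expand projection lists?
def hasCommonNonLeafSubexpr(projections: Seq[Seq[Expression]]): Boolean = {
  val seen = mutable.HashSet.empty[Expression]
  projections.flatten.exists { root =>
    // collect walks the whole expression tree; keep only non-leaf nodes.
    root.collect { case e if e.children.nonEmpty => e.canonicalized }
      .exists(c => !seen.add(c)) // add returns false if already present
  }
}
```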

@firestarman (Collaborator, Author) replied on Jan 23, 2024:

We can change the default value to true to eliminate the extra burden on users. But I do not want to do that until we have run enough tests to prove this will not lead to any regressions, especially for cases with high GPU memory pressure.

> figure out the existence of common expression. If we do have a common subexpression, we will intro a pre-project ahead of Expand node. Otherwise, we do nothing there.

I guess the check `boundPreprojections.exprTiers.size > 1` will do what you suggested.

@revans2 (Collaborator) commented on Jan 23, 2024:

For me it all comes down to heuristics and our confidence in them.

If we don't have high confidence in the feature, then we should have a config so a customer can disable it if they run into problems, but there should be a follow-on issue to remove the config once we feel confident in it. If we have really high confidence in the solution, then we don't need any kind of config. If we have such low confidence in it that we don't want it on yet, then it is probably not ready to be checked in. If this is for a feature that is still a WIP, that is fine; we will enable it when the feature is done. But for a small feature like this, it is not OK to have it off by default.

A runtime heuristic that decides when and how to apply this feature is separate from the config, because the config is there to help us after the feature has been released, while the heuristic is a part of the feature itself. From a computation standpoint I don't see how this could ever be worse than what we have today. But I can envision cases where the total memory usage might be much higher with this feature than without it, though I don't know how realistic those cases are in practice. For example, lots and lots of distinct operations on computed columns, with only a very small amount of overlap between them:

select COUNT(distinct a + b) as ab, COUNT(distinct a + b + c) as abc, COUNT(distinct CAST(a as STRING)), COUNT(distinct CAST(b as STRING)), ...

I am happy to see that literal values are not materialized until later. Also, from what I can tell, the initial project throws away any columns that are not going to be materialized in the final output.

There are a few more things that we might be able to do to reduce the memory, but I think it ends up with us guessing at the size of various outputs and then doing some form of a bin packing problem to reduce the amount of memory used.

@winningsix added the performance (A performance related task/issue) label on Jan 23, 2024.
@firestarman (Collaborator, Author) commented:

build

s"enable this.")
.internal()
.booleanConf
.createWithDefault(false)
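
For reference (not part of the diff), an internal flag like this can still be toggled explicitly while it is being validated; e.g., in spark-shell, where `spark` is the active SparkSession:

```scala
// Opt in to the Expand pre-projection while the feature is off by default.
spark.conf.set("spark.rapids.sql.expandPreproject.enabled", "true")
```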

A Collaborator commented:

Please enable this by default.

@firestarman (Collaborator, Author) replied:

updated


assert_gpu_and_cpu_are_equal_sql(get_df,
"pre_pro",
"select count(distinct (a+b)), count(distinct if((a+b)>100, c, null)) from pre_pro group by a",
A Collaborator commented:

Offline synced with @revans2: it would be great to have some test coverage around cube and rollup besides count distinct.

@firestarman (Collaborator, Author) replied:

added

@firestarman (Collaborator, Author) commented:

build

@firestarman (Collaborator, Author) commented:

build

@firestarman changed the base branch from branch-24.02 to branch-24.04 on January 26, 2024.
@winningsix (Collaborator) commented:
LGTM. Just need to confirm whether this can improve performance in our targeted workload.

@revans2 (Collaborator) left a review:

Approved pending performance checks

@revans2 (Collaborator) commented on Feb 13, 2024:

I ran some performance tests myself using a set of queries that represent a specific customer who was seeing slowness. Even though it is not a perfect representation, I saw an improvement with this patch from a median time of 10.867 seconds to 7.005 seconds, or about 50% faster than it is today. I am going to merge this in.

I also rebuilt it after up-merging and it looks good.

@revans2 merged commit 0b9e134 into NVIDIA:branch-24.04 on Feb 13, 2024. All 40 checks passed.
@winningsix deleted the expand-prepro branch on February 19, 2024.
@firestarman (Collaborator, Author) replied:

> I ran some performance tests myself ... I am going to merge this in.

Thx a lot for the perf tests.

Labels: performance (A performance related task/issue)

Successfully merging this pull request may close these issues:
[FEA] Support common subexpression elimination for expand operator (#10249)