
feat(spark): add Window support #307

Merged 1 commit into substrait-io:main on Oct 25, 2024
Conversation

@andrew-coleman (Contributor) commented Oct 16, 2024

To support the OVER clause in SQL

This fixes some of the TPC-DS tests.

@vbarua (Member) commented Oct 16, 2024

> This will also fix several of the TPC-DS tests, but I will enable those in a future PR to try and avoid merge conflicts with other PRs that are currently open.

I've gone ahead and reviewed and merged the other PRs so we shouldn't have any issues with conflict there, beyond the one that exists now. I can review this PR tomorrow.

@vbarua (Member) left a comment:

Left some questions.

WindowSpecDefinition(_, _, SpecifiedWindowFrame(frameType, lower, upper))) =>
(fromSpark(frameType), fromSparkPreceding(lower), fromSparkFollowing(upper))
case WindowExpression(_, WindowSpecDefinition(_, _, UnspecifiedFrame)) =>
(WindowBoundsType.ROWS, UNBOUNDED, CURRENT_ROW)
@vbarua (Member):

I was curious about the default behaviour when the frame is unspecified, because for Postgres it is RANGE UNBOUNDED PRECEDING; for Spark, however, it depends on whether an ordering is defined:

 * @note
 *   When ordering is not defined, an unbounded window frame (rowFrame, unboundedPreceding,
 *   unboundedFollowing) is used by default. When ordering is defined, a growing window frame
 *   (rangeFrame, unboundedPreceding, currentRow) is used by default.

https://github.com/apache/spark/blob/f8d92224b9af4ffffbb83ca2c9dd3c3b909b135d/sql/api/src/main/scala/org/apache/spark/sql/expressions/Window.scala#L35-L38
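That doc comment can be modelled as a tiny standalone sketch. The type names below are illustrative stand-ins, not Spark's actual classes:

```scala
// Illustrative stand-ins for Spark's frame types and special bounds.
sealed trait FrameType
case object RowFrame extends FrameType
case object RangeFrame extends FrameType

sealed trait Bound
case object UnboundedPreceding extends Bound
case object UnboundedFollowing extends Bound
case object CurrentRow extends Bound

// Spark's documented default: a fully unbounded ROWS frame when no ordering
// is defined, and a growing RANGE frame (unbounded preceding to current row)
// when an ordering is defined.
def defaultFrame(hasOrdering: Boolean): (FrameType, Bound, Bound) =
  if (hasOrdering) (RangeFrame, UnboundedPreceding, CurrentRow)
  else (RowFrame, UnboundedPreceding, UnboundedFollowing)
```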

@andrew-coleman (Contributor, Author):

Good spot, I hadn't noticed that comment.


val (frameType, lower, upper) = sparkExp match {
case WindowExpression(_: OffsetWindowFunction, _) =>
(WindowBoundsType.ROWS, UNBOUNDED, CURRENT_ROW)
@vbarua (Member):

What is an OffsetWindowFunction? Does it somehow override the frame bounds?

@andrew-coleman (Contributor, Author):

That's my understanding. TBH I'm not sure exactly how it should translate to Substrait, but it's ignored by Spark (it matches the lag/lead functions):
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala#L404

@Blizzara (Contributor):

You should actually handle the frame here properly as well. Otherwise the bounds will be incorrect for other consumers (unbounded preceding to current row will not work for Lead, for example) and Spark will throw. You'll see that if you add logicalPlan2.show() into SubstraitPlanTestBase::assertSqlToSubstraitRoundTrip and re-run the lag/lead test. (We should also make the tests always evaluate the converted plans to ensure they are actually valid; I'll try to get to that.)

Can be fixed by just removing this case.

@andrew-coleman (Contributor, Author):

OK, I've removed it for now.

@Blizzara (Contributor) left a comment:
Thanks! Overall this looks good, but there are some things in the bounds handling that I think aren't always correct.

.append("sorts=")
.append(window.getSorts)
})
}
@Blizzara (Contributor):

FWIW, in our internal fork I just deleted the whole RelToVerboseString thing. I don't see it bringing much value over pretty-printing the protobuf, and there's a fair amount of work to maintain it.

@@ -73,9 +73,22 @@ class FunctionMappings {
s[HyperLogLogPlusPlus]("approx_count_distinct")
)

val WINDOW_SIGS: Seq[Sig] = Seq(
s[RowNumber]("row_number"),
s[Rank]("rank"),
@Blizzara (Contributor):

From looking at our internal fork's window impl, the rank functions in Spark have some child column but Substrait doesn't define any. I wonder how it works here, i.e. how do we still get a match?


s[CumeDist]("cume_dist"),
s[NTile]("ntile"),
s[Lead]("lead"),
s[Lag]("lag"),
@Blizzara (Contributor):

I also have a note about Lead and Lag having a NullType null as a child, which I think Substrait doesn't support. Did you run into anything like that?

@andrew-coleman (Contributor, Author):

I haven't seen this - do you have a test case?

@Blizzara (Contributor) commented Oct 25, 2024:
@andrew-coleman (Contributor, Author):

Oh yes - not known for my memory! 😂

case other => throw new UnsupportedOperationException(s"Unsupported bounds type: $other.")
}

def fromSparkPreceding(bound: Expression): WindowBound = bound match {
@Blizzara (Contributor):

Substrait defines the bounds as strictly positive integers, but IIRC Spark may have, for example, a "preceding -1 row". This is what I used:

    expr match {
      case UnboundedPreceding => WindowBound.UNBOUNDED
      case UnboundedFollowing => WindowBound.UNBOUNDED
      case CurrentRow => WindowBound.CURRENT_ROW
      case e: Literal =>
        e.dataType match {
          case IntegerType => {
            val offset = e.eval().asInstanceOf[Int]
            if (offset < 0) WindowBound.Preceding.of(-offset)
            else if (offset == 0) WindowBound.CURRENT_ROW
            else WindowBound.Following.of(offset)
          }
        }
      case _ => throw new UnsupportedOperationException(s"Unexpected bound: $expr")
    }
  }
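The sign-normalization rule in that snippet (Spark's signed offsets versus Substrait's strictly positive bounds) can be shown standalone. The types below are hypothetical stand-ins, not the substrait-java API:

```scala
// Hypothetical stand-ins for Substrait-style window bounds; Substrait
// requires the preceding/following offsets to be strictly positive.
sealed trait WindowBound
case class Preceding(offset: Int) extends WindowBound { require(offset > 0) }
case class Following(offset: Int) extends WindowBound { require(offset > 0) }
case object CurrentRow extends WindowBound

// A negative Spark offset means "preceding", zero means the current row,
// and a positive offset means "following"; the stored offset is positive.
def normalize(sparkOffset: Int): WindowBound =
  if (sparkOffset < 0) Preceding(-sparkOffset)
  else if (sparkOffset == 0) CurrentRow
  else Following(sparkOffset)
```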

@andrew-coleman (Contributor, Author):

Thanks, I've changed it to this.

case UNBOUNDED => UnboundedPreceding
case CURRENT_ROW => CurrentRow
case p: Preceding => Literal(p.offset())
case _ => throw new UnsupportedOperationException(s"Unsupported bounds expression $bound")
@Blizzara (Contributor):

I think this also doesn't work, for the same reason as above: func.lowerBound() may very well be a Following (and vice versa below), so you should handle both cases in both places.

Our version was a bit different; yours might be nicer (once fixed):

def toSparkFrame(
      boundsType: WindowBoundsType,
      lowerBound: WindowBound,
      upperBound: WindowBound): WindowFrame = {
    val frameType = boundsType match {
      case WindowBoundsType.ROWS => RowFrame
      case WindowBoundsType.RANGE => RangeFrame
      case WindowBoundsType.UNSPECIFIED => return UnspecifiedFrame
    }
    SpecifiedWindowFrame(
      frameType,
      toSparkBound(lowerBound, isLower = true),
      toSparkBound(upperBound, isLower = false))
  }

  private def toSparkBound(bound: WindowBound, isLower: Boolean): Expression = {
    bound.accept(new WindowBoundVisitor[Expression, Exception] {

      override def visit(preceding: WindowBound.Preceding): Expression =
        Literal(-preceding.offset().intValue())

      override def visit(following: WindowBound.Following): Expression =
        Literal(following.offset().intValue())

      override def visit(currentRow: WindowBound.CurrentRow): Expression = CurrentRow

      override def visit(unbounded: WindowBound.Unbounded): Expression =
        if (isLower) UnboundedPreceding else UnboundedFollowing
    })
  }

@andrew-coleman (Contributor, Author):

Yours is nicer :).

|
|""".stripMargin
assertSqlSubstraitRelRoundTrip(query)
}
@Blizzara (Contributor):

Maybe add a test with two different partitions in the same select? I think it should pass fine, but just in case.
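A non-runnable sketch of such a test, in the style of the surrounding suite; the table and column names are placeholders, not taken from the PR:

```scala
// Hypothetical round-trip test: two window expressions with different
// PARTITION BY clauses in the same select.
test("two partitions in one select") {
  val query =
    """SELECT
      |  sum(l_tax)   OVER (PARTITION BY l_suppkey ORDER BY l_discount),
      |  row_number() OVER (PARTITION BY l_partkey ORDER BY l_discount)
      |FROM lineitem
      |""".stripMargin
  assertSqlSubstraitRelRoundTrip(query)
}
```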

assertSqlSubstraitRelRoundTrip(query)
}

test("min") {
@Blizzara (Contributor):

Nit: maybe name this "aggregate" or similar, as I think that's what it's testing (the fact that it's min specifically is less relevant)?

Suggested change:
-test("min") {
+test("aggregate") {

@andrew-coleman (Contributor, Author):

Thanks @Blizzara, there’s some really useful feedback here. I hadn’t realised you had already been implementing this, it would be awesome if you want to collaborate :)

I’d only implemented as much as necessary to enable the windowing tests in the TPC-DS suite to pass. If you’ve got other tests that require additional logic in the converter, then it would be great to add these. But perhaps in a future PR in the interest of moving things forward incrementally?

@Blizzara (Contributor):

> Thanks @Blizzara, there’s some really useful feedback here. I hadn’t realised you had already been implementing this, it would be awesome if you want to collaborate :)

Yea, shame on me: it's been on my todo list to pull our changes up ever since you added this into substrait-java. Some of those changes are non-trivial refactorings which might be annoying, but I'll try to get to it!

> I’d only implemented as much as necessary to enable the windowing tests in the TPC-DS suite to pass. If you’ve got other tests that require additional logic in the converter, then it would be great to add these. But perhaps in a future PR in the interest of moving things forward incrementally?

I'd like to see the bounds thing fixed, since currently this implementation doesn't adhere to the bounds (preceding, following) being strictly positive. That's not a problem for the round-trip tests, but it means a plan generated here may not be valid Substrait that can be consumed by other consumers.

@Blizzara (Contributor):

> Thanks @Blizzara, there’s some really useful feedback here. I hadn’t realised you had already been implementing this, it would be awesome if you want to collaborate :)

Starting with #311! After that I'll add structs (+ fix the name handling), and then I have some more complicated changes in how FunctionMappings works to allow for more complicated mappings.

To support the OVER clause in SQL

Signed-off-by: Andrew Coleman <[email protected]>
WindowSpecDefinition(_, _, SpecifiedWindowFrame(frameType, lower, upper))) =>
(fromSpark(frameType), fromSpark(lower), fromSpark(upper))
case WindowExpression(_, WindowSpecDefinition(_, orderSpec, UnspecifiedFrame)) =>
if (orderSpec.isEmpty) {
val successfulSQL: Set[String] = Set("q1", "q3", "q4", "q5", "q7", "q8",
"q21", "q22", "q23a", "q23b", "q24a", "q24b", "q25", "q26", "q27", "q28", "q29",
"q30", "q31", "q32", "q33", "q37", "q38",
"q40", "q41", "q42", "q43", "q46", "q48",
@Blizzara (Contributor):

Given how this list is growing, it might be worth starting to list the ones that don't work instead 😄

@andrew-coleman (Contributor, Author):

Exactly. I was going to do this after my next PR, which will add support for more numeric functions; the number of failing tests will be relatively small then.

@Blizzara (Contributor) left a comment:

LGTM, thanks!

@Blizzara Blizzara merged commit b3f61a2 into substrait-io:main Oct 25, 2024
13 checks passed
@andrew-coleman andrew-coleman deleted the window branch October 25, 2024 16:11