[SPARK-48356][SQL] Support for FOR statement #48794
base: master
Conversation
case m: Map[_, _] =>
  // arguments of CreateMap are in the format: (key1, val1, key2, val2, ...)
  val mapArgs = m.keys.toSeq.flatMap { key =>
    Seq(createExpressionFromValue(key), createExpressionFromValue(m(key)))
  }
  CreateMap(mapArgs, false)
case s: GenericRowWithSchema =>
  // struct types match this case
  // arguments of CreateNamedStruct are in the format: (name1, val1, name2, val2, ...)
  val namedStructArgs = s.schema.names.toSeq.flatMap { colName =>
    val valueExpression = createExpressionFromValue(s.getAs(colName))
    Seq(Literal(colName), valueExpression)
  }
  CreateNamedStruct(namedStructArgs)
case _ => Literal(value)
For my knowledge, can you explain what the case with the Map means exactly, i.e. when will this happen?
Also, how did we check that this is the complete list of relevant cases?
When a Map or Struct is in the result set of the query, we can't use Literal(value) to convert it to an expression, because Literals don't support those types. So for example, for a Map we recursively convert both keys and values to expressions first, and then create a map expression using CreateMap. The process is similar for structs.
The way I checked is I went through all the Spark data types and, for each one, checked in the code of Literal whether it's supported. I only found these two that are not. However, I agree we can't be completely sure, and new types may be added to Spark in the future which Literals may or may not support. Probably I should add an error message for a currently unsupported type, in case it comes up. Does that make sense to you?
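For illustration, a minimal check of the limitation described above (a hypothetical snippet, not code from this PR):

import org.apache.spark.sql.catalyst.expressions.Literal

// Literal.apply handles atomic values directly, but a Scala Map collected
// from a query result is not supported and fails with an "unsupported
// literal type" error, which is why the CreateMap path above is needed.
Literal(42)            // ok: integer literal
Literal(Map("a" -> 1)) // throws: unsupported literal type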
Yeah, I would say an internal error is fine in this case (i.e. no need to introduce a new error for this), since it would mean that we have a bug.
Other than that, this sounds fine to me, but let's wait for Max and/or Wenchen to comment if they have any concerns.
override def next(): CompoundStatementExec = state match {

  case ForState.VariableAssignment =>
    variablesMap = createVariablesMapFromRow(cachedQueryResult()(currRow))
Why do we need to create this every time? Can we fill variablesMap once and then reuse it?
We need to create it every time because the map is different for every row in the result set. You can see we call it on the currRow.
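For context, a rough sketch of what that per-row construction could look like, assuming the createExpressionFromValue helper from the earlier snippet (illustrative, not verbatim PR code):

import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.Expression

// Build one (column name -> expression) entry per column of the current row.
def createVariablesMapFromRow(row: Row): Map[String, Expression] =
  row.schema.fieldNames.map { colName =>
    colName -> createExpressionFromValue(row.getAs[Any](colName))
  }.toMap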
Can we rebase first to include the already merged changes regarding the label checks, logical plans, etc.? I'll review again afterwards.
…ave/iterate/normal case: force-pushed from 446fc05 to 2e10f0b
@davidm-db @miland-db Rebased, you can review again.
assert(statements === Seq(
  "statement1",
  "lbl1"
))
We don't have drop var statements here because they are dropped in handleLeaveStatement?
Is this the thing we talked about that will be properly resolved once proper execution and scopes are introduced?
Yes, that's right. In this case the variables are dropped immediately when the leave statement is encountered, instead of the usual behavior which is to return the dropVariable exec nodes from the iterator.
private var isResultCacheValid = false
private def cachedQueryResult(): Array[Row] = {
  if (!isResultCacheValid) {
    // collect the full result to the driver on first access, then cache it
    queryResult = query.buildDataFrame(session).collect()
    isResultCacheValid = true
  }
  queryResult
}
Food for thought: does DataFrame have a mechanism to partially collect the data, so we don't collect all the results in memory? Since we are already using the caching concept, this would be easy to add to the logic of cachedQueryResult.
Quickly researching, we can do something like:
sliced_df = df.offset(starting_index).limit(ending_index - starting_index)
but there might be something better...
I wouldn't block the PR on this, but I think we definitely need to consider something like this for a follow-up, e.g. along the lines of the sketch below.
cc: @cloud-fan @MaxGekk
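A minimal sketch of that batched collection; collectInBatches, batchSize and process are illustrative names, not from this PR:

import org.apache.spark.sql.{DataFrame, Row}

// Collect and process the result in fixed-size batches instead of pulling
// everything to the driver at once. Note that each offset/limit batch
// re-runs the query unless the DataFrame is cached.
def collectInBatches(df: DataFrame, batchSize: Int)(process: Row => Unit): Unit = {
  var offset = 0
  var batch = df.offset(offset).limit(batchSize).collect()
  while (batch.nonEmpty) {
    batch.foreach(process)
    offset += batchSize
    batch = df.offset(offset).limit(batchSize).collect()
  }
}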
That makes sense; currently the entire result is collected to the driver, so it would be problematic if the result is too large. We should definitely follow up on this.
Let's see what Wenchen and Max have to say and maybe create a follow-up work item so we don't forget it.
new Iterator[CompoundStatementExec] {

  override def hasNext: Boolean = {
    val resultSize = cachedQueryResult().length
Maybe have a cacheSize that's a class member and set only once in cachedQueryResult()?
I don't think this is necessary: queryResult is a fixed-size array, so .length is just a field access, and we don't actually calculate the length here.
I've left comments, but in general the approach looks good to me!
// create and execute declare var statements
variablesMap.keys.toSeq
  .map(colName => createDeclareVarExec(colName, variablesMap(colName)))
  .foreach(declareVarExec => declareVarExec.buildDataFrame(session).collect())
Should we set isExecuted = true here as well? Or is it not important since we don't return it anywhere?
Same question for line 722.
I don't think it's important since the execs are not persisted or used anywhere else.
What changes were proposed in this pull request?
In this PR, support for the FOR statement in SQL scripting is introduced. Example (an illustrative sketch of the new syntax; the loop variable is named row, as referenced in the notes below):
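// Illustrative sketch only, not the PR's original example; assumes SQL
// scripting is enabled and that scripts are submitted through spark.sql.
spark.conf.set("spark.sql.scripting.enabled", "true")
spark.sql("""
  BEGIN
    FOR row AS SELECT id AS intCol FROM range(3) DO
      SELECT row.intCol;
    END FOR;
  END
""")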
Implementation notes:
As local variables for SQL scripting are currently a work in progress, session variables are used to simulate them.
When FOR begins executing, session variables are declared for each column in the result set, and optionally for the FOR variable, if present ("row" in the example above).
On each iteration, these variables are overwritten with the values from the row currently being iterated.
The variables are dropped upon loop completion.
This means that if a session variable already exists whose name matches a column in the result set, the FOR statement will drop that variable after completion. If that variable were referenced after the FOR statement, the script would fail because the variable no longer exists. This limitation is already present in the current iteration of SQL scripting and will be fixed once local variables are introduced. With local variables, the implementation of the FOR statement will also be much simpler. A sketch of the variable lifecycle is below.
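A rough illustration of the lifecycle described above; the statement text is illustrative, not verbatim what the interpreter generates:

// Before the loop: declare one session variable per result column
// (and the optional FOR variable).
spark.sql("DECLARE VARIABLE intCol INT")
// On each iteration: overwrite the variable with the current row's value.
spark.sql("SET VAR intCol = 1")
// After the loop completes: drop the variables again.
spark.sql("DROP TEMPORARY VARIABLE intCol")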
Grammar/parser changes:
- forStatement grammar rule
- visitForStatement rule visitor
- ForStatement logical operator
Why are the changes needed?
The FOR statement is a part of SQL scripting control flow logic.
Does this PR introduce any user-facing change?
No
How was this patch tested?
New tests are introduced to all three scripting test suites: SqlScriptingParserSuite, SqlScriptingExecutionNodeSuite and SqlScriptingInterpreterSuite.
Was this patch authored or co-authored using generative AI tooling?
No