
[SPARK-21238][SQL] allow nested SQL execution #18450

Closed
wants to merge 2 commits into from

Conversation

cloud-fan
Contributor

@cloud-fan cloud-fan commented Jun 28, 2017

What changes were proposed in this pull request?

This is kind of another follow-up for #18064 .

In #18064, we wrapped every SQL command with a SQL execution, which makes nested SQL execution very likely to happen. #18419 tried to improve this a little by introducing `SQLExecution.ignoreNestedExecutionId`. However, this is not friendly to data source developers, who may need to update their code to use the `ignoreNestedExecutionId` API.

This PR proposes a new solution: just allow nested execution. The downside is that we may have multiple executions for one query. We can improve this by updating the data organization in `SQLListener` to have a 1-n mapping from query to execution instead of a 1-1 mapping. This can be done in a follow-up.
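The bookkeeping change can be sketched with a tiny thread-local model. Note the names below (`ExecutionIdModel`, `withNewExecutionId`'s return type) are illustrative only; the real logic lives in `org.apache.spark.sql.execution.SQLExecution` and differs in detail:

```scala
// Simplified, self-contained model of execution-id bookkeeping.
// Not Spark's actual implementation -- just the shape of the idea.
object ExecutionIdModel {
  // Tracks the execution id of the query currently running on this thread.
  private val currentId = new ThreadLocal[Option[Long]] {
    override def initialValue(): Option[Long] = None
  }
  private var nextId = 0L

  // With nesting allowed, a nested call simply gets a fresh execution id
  // (so one query may map to several executions) instead of throwing or
  // requiring callers to opt out via an ignore-nested-execution API.
  def withNewExecutionId[T](body: => T): (Long, T) = {
    val id = synchronized { nextId += 1; nextId }
    val saved = currentId.get()
    currentId.set(Some(id))
    try (id, body) finally currentId.set(saved)
  }
}
```

Under this scheme a nested `withNewExecutionId` call just produces a second execution id for the same query, which is exactly the 1-n mapping the follow-up would surface in the UI.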

How was this patch tested?

Existing tests.

@cloud-fan
Contributor Author

cc @rdblue @zsxwing

@@ -26,22 +26,9 @@ import org.apache.spark.sql.SparkSession
class SQLExecutionSuite extends SparkFunSuite {

test("concurrent query execution (SPARK-10548)") {
// Try to reproduce the issue with the old SparkContext
Contributor Author


Now that we allow nested execution, we can't reproduce this bug anymore.

@SparkQA

SparkQA commented Jun 28, 2017

Test build #78783 has finished for PR 18450 at commit d47433e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue
Contributor

rdblue commented Jun 28, 2017

I think it is good that this would no longer throw exceptions at runtime. Is the purpose of not allowing nested executions to minimize the queries shown in the UI? If that's the only purpose then I agree that just eliminating the empty ones is a good strategy.

Member

@zsxwing zsxwing left a comment


Just one suggestion. Otherwise, LGTM

@@ -314,6 +314,8 @@ class SQLListener(conf: SparkConf) extends SparkListener with Logging {
if (executionUIData.isFailed) {
failedExecutions += executionUIData
trimExecutionsIfNecessary(failedExecutions)
} else if (executionUIData.jobs.isEmpty && executionUIData.accumulatorMetrics.isEmpty) {
Member

@zsxwing zsxwing Jun 28, 2017


I'd prefer not to do this. Otherwise, such a query appears in the running queries but disappears from the UI after it finishes, which seems pretty weird.

@SparkQA

SparkQA commented Jun 29, 2017

Test build #78853 has finished for PR 18450 at commit f8e9901.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor Author

thanks for the review, merging to master!

@asfgit asfgit closed this in 9f6b3e6 Jun 29, 2017
robert3005 pushed a commit to palantir/spark that referenced this pull request Jun 29, 2017
Author: Wenchen Fan <[email protected]>

Closes apache#18450 from cloud-fan/execution-id.
ghost pushed a commit to dbtsai/spark that referenced this pull request Oct 12, 2017
…hema

## What changes were proposed in this pull request?

In apache#18064, we allowed `RunnableCommand` to have children in order to fix some UI issues. Then we made the `InsertIntoXXX` commands take the input `query` as a child; when we do the actual writing, we just pass the physical plan to the writer (`FileFormatWriter.write`).

However, this is problematic. In Spark SQL, the optimizer and planner are allowed to change schema names a little. E.g. the `ColumnPruning` rule will remove no-op `Project`s, like `Project("A", Scan("a"))`, and thus change the output schema from `<A: int>` to `<a: int>`. When it comes to writing, especially for a self-describing data format like Parquet, we may write the wrong schema to the file and cause null values at the read path.

Fortunately, in apache#18450, we decided to allow nested execution, so one query can map to multiple executions in the UI. This lifts the major restriction in apache#18604, and now we don't have to take the input `query` as a child of the `InsertIntoXXX` commands.

So the fix is simple: this PR partially reverts apache#18064 and makes the `InsertIntoXXX` commands leaf nodes again.
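The schema-name hazard described above can be illustrated with a minimal, self-contained sketch. The `Plan`, `Scan`, and `Project` types below are hypothetical stand-ins, not Spark's actual classes; they only model how a `ColumnPruning`-style rule can drop a rename-only `Project` and silently lose the requested capitalization:

```scala
// Miniature model of the issue (illustrative types, not Spark's API).
sealed trait Plan { def output: Seq[String] }
case class Scan(output: Seq[String]) extends Plan
case class Project(output: Seq[String], child: Plan) extends Plan

// A ColumnPruning-style rule: a Project that selects the same columns
// (compared case-insensitively) is treated as a no-op and removed --
// taking the caller's capitalization with it.
def prune(plan: Plan): Plan = plan match {
  case Project(out, child)
      if out.map(_.toLowerCase) == child.output.map(_.toLowerCase) => child
  case other => other
}
```

Here `prune(Project(Seq("A"), Scan(Seq("a"))))` yields a plan whose output is `a`, not the requested `A`; if that optimized schema is what gets written to a self-describing file while readers expect `A`, the read path finds no matching column and produces nulls.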

## How was this patch tested?

new regression test

Author: Wenchen Fan <[email protected]>

Closes apache#19474 from cloud-fan/bug.