[SPARK-16866][SQL] Infrastructure for file-based SQL end-to-end tests #14472
Conversation
private def listTestCases(): Seq[TestCase] = {
  listFilesRecursively(new File(inputFilePath)).map { file =>
this might not work for Maven - I will look into this later.
Now it should work for Maven.
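For context, a minimal sketch of what a recursive lister like `listFilesRecursively` could look like (a hypothetical helper, not necessarily what the PR implements):

```scala
import java.io.File

// Sketch: depth-first listing of all regular files under a directory.
// Assumes `path` is an existing directory (listFiles() would return null otherwise).
private def listFilesRecursively(path: File): Seq[File] = {
  val (dirs, files) = path.listFiles().partition(_.isDirectory)
  files.toSeq ++ dirs.flatMap(listFilesRecursively)
}
```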
cc @cloud-fan and @rxin for feedback.
Test build #63148 has finished for PR 14472 at commit
QueryOutput(
  sql = sql,
  schema = df.schema.map(_.dataType.simpleString).mkString(", "),
  output = df.showString(_numRows = 10000, truncate = 10000).trim)
Using showString might not be the friendliest option when there is a mismatch and the output is huge, but it should work very well with smaller outputs.
org.apache.spark.sql.catalyst.util.sideBySide can show which line is mismatched. Can we borrow this idea?
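For illustration, a small sketch of how sideBySide flags mismatched lines (the exact padding is approximate):

```scala
import org.apache.spark.sql.catalyst.util.sideBySide

// Pairs up the lines of two strings; a pair that differs is prefixed
// with "!", so the first mismatched row is easy to spot in a long diff.
val diff = sideBySide("a\nb\nc", "a\nx\nc").mkString("\n")
println(diff)
//  a   a
// !b   x
//  c   c
```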
Seems like a good idea. @cloud-fan, can you review?
Test build #63149 has finished for PR 14472 at commit
/** A single SQL query's output. */
private case class QueryOutput(sql: String, schema: String, output: String) {
  def toXML: String = {
Is it a convention to use XML for golden files? It's really hard to read...
For the output file, one other way I thought about was to use the format shown above. We can do string parsing there, but it'd be less rigorous than XML.
Is there any case where people need to write a golden file themselves? If we want to use a standard format, I prefer JSON over XML.
I don't think these golden files should be manually created; they should always be generated. I tried JSON earlier and it was not very friendly either (worse than XML in this case). Do you like the format I showed above?
Yeah, it looks much better. Although golden files are always generated, we should make them easy to read so their correctness can be verified.
I have updated this to use a custom format that is more readable.
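For reference, the resulting golden file looks roughly like this (an illustrative sketch; the header comment and the example query are assumptions, but the `-- !query` markers match what the parser splits on):

```sql
-- Automatically generated by SQLQueryTestSuite

-- !query 0
select 1 + 1
-- !query 0 schema
int
-- !query 0 output
2
```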
Test build #63197 has finished for PR 14472 at commit
Test build #3201 has finished for PR 14472 at commit
@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.xml
Should we add a rule for .out?
Test build #63297 has finished for PR 14472 at commit
@@ -99,4 +99,5 @@ spark-deps-.*
.*tsv
org.apache.spark.scheduler.ExternalClusterManager
.*\.sql
.*\.sql\.out
This is not needed; we already have a rule: .*out
I have updated this based on review feedback. The harness now uses hiveResultString() rather than show(), similar to the existing Hive compatibility test.
Test build #63471 has finished for PR 14472 at commit
Test build #63476 has finished for PR 14472 at commit
// val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
// note: this is not a robust way to split queries using semicolon, but works for now.
cleaned.split("(?<=[^\\\\]);").map(_.trim).filterNot(q => q == "").toSeq
.filter(_.nonEmpty)
Isn't it strictly less clear? It is more obvious that this is a string this way.
I changed it to filter(_ != "")
// List of SQL queries to run
val queries: Seq[String] = {
  // val cleaned = input.split("\n").filterNot(_.matches("--.*(?<=[^\\\\]);")).mkString("\n")
remove this line?
done
val queries: Seq[String] = {
  val cleaned = input.split("\n").filterNot(_.startsWith("--")).mkString("\n")
  // note: this is not a robust way to split queries using semicolon, but works for now.
  cleaned.split("(?<=[^\\\\]);").map(_.trim).filter(_ != "").toSeq
Can you explain a bit more about this regex? Why can't we just use ; here?
I copied this from HiveComparisonTest. I think the lookbehind is there so that an escaped semicolon ("\;") is not treated as a query terminator. It is really not very robust, but it seems fine for now.
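To illustrate the behavior (a minimal sketch with made-up queries):

```scala
// "(?<=[^\\]);" matches a ";" only when the preceding character is not a
// backslash, so an escaped "\;" inside a query survives the split.
val input = "select 'a\\;b';\nselect 2;"
val queries = input.split("(?<=[^\\\\]);").map(_.trim).filter(_ != "").toSeq
// queries == Seq("select 'a\\;b'", "select 2")
```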
Mostly LGTM. Can you try some error cases and put the error message in the PR description? Then we can get a better understanding of how this framework reports errors. Thanks!
Updated the description. I think it might make sense to switch over to the way HiveCompatibilitySuite reports mismatches, but we should first port a few tests that have larger outputs and then decide.
Do you have any idea why the test is so slow?
It was the first time any query was run, so the JIT hadn't really kicked in yet, and many classes still needed to be loaded into the JVM.
val expectedOutputs: Seq[QueryOutput] = {
  val goldenOutput = fileToString(new File(testCase.resultFile))
  val segments = goldenOutput.split("-- !query.+\n")
  assert(segments.size == outputs.size * 3 + 1) // each query has 3 segments, plus the header
assert((segments.size - 1) % 3 == 0)? Or we will never hit this: https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164
Is it a problem that some asserts are never hit? Some logic might change and then one of the asserts could fail. I prefer to be more conservative with asserts.
But this assert doesn't match the logic in this branch. What we do here is skip the first segment and group the remaining segments by 3; each group corresponds to a QueryOutput. We are not comparing the real output with the expected output in this branch.
But then why would % 3 be better? Are you arguing for a better message when it fails?
I don't get "this assert doesn't match the logic in this branch". There is no logic that dictates we cannot verify the number of blocks here.
Even if segments.size == outputs.size * 3 + 1 fails, we can still finish this branch, right? However, if (segments.size - 1) % 3 == 0 fails, we will throw an ArrayIndexOutOfBoundsException in this branch. Anyway, it's bad to have dead code; if we want to keep this assert, we should remove https://github.com/apache/spark/pull/14472/files#diff-432455394ca50800d5de508861984ca5R164 and move its error message here.
I will add a better error message, but I'm afraid I disagree with you on removing the other assert. It is not dead code, because it is exercised at runtime. The two asserts make different assumptions at different places in the code; we could change the way blocks are arranged in the future, and then the other assert would be useful.
Anyway, I am not sure why you are nitpicking on this. It seems very minor and we are simply wasting time.
Basically asserts are used as defensive guards against program errors. By your definition almost all asserts are "dead code".
I think I get where you are coming from. You think of the assert as something that verifies correctness of a test case (in the ScalaTest sense). I was using the assert as a defensive guard to catch errors (as in basic invariants that shouldn't be violated in this tiny block).
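For context, a sketch of the regrouping logic under discussion (hypothetical index arithmetic, assuming a header segment followed by sql/schema/output triples):

```scala
// Skip the header (segments(0)), then read each query's three segments.
// The assert above guarantees these indices stay in bounds.
val expected: Seq[QueryOutput] = Seq.tabulate(outputs.size) { i =>
  QueryOutput(
    sql = segments(i * 3 + 1).trim,
    schema = segments(i * 3 + 2).trim,
    output = segments(i * 3 + 3).trim)
}
```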
Test build #63506 has finished for PR 14472 at commit
Test build #63507 has finished for PR 14472 at commit
Thanks, merging to master!
Test build #63515 has finished for PR 14472 at commit
Thanks, @cloud-fan!
## What changes were proposed in this pull request?

This patch introduces SQLQueryTestSuite, a basic framework for end-to-end SQL test cases defined in spark/sql/core/src/test/resources/sql-tests. This is a more standard way to test SQL queries end-to-end in different open source database systems, because it is more manageable to work with files.

This is inspired by HiveCompatibilitySuite, but simplified for general Spark SQL tests. Once this is merged, I can work towards porting SQLQuerySuite over, and eventually also move the existing HiveCompatibilitySuite to use this framework.

Unlike HiveCompatibilitySuite, SQLQueryTestSuite compares both the output schema and the output data (in string form). When there is a mismatch, the error message looks like the following:

```
[info] - blacklist.sql !!! IGNORED !!!
[info] - number-format.sql *** FAILED *** (2 seconds, 405 milliseconds)
[info]   Expected "...147483648 -214748364[8]", but got "...147483648 -214748364[9]" Result should match for query #1 (SQLQueryTestSuite.scala:171)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions$class.newAssertionFailedException(Assertions.scala:495)
[info]   at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1555)
[info]   at org.scalatest.Assertions$class.assertResult(Assertions.scala:1171)
```

## How was this patch tested?

This is a test infrastructure change.

Author: petermaxlee <[email protected]>

Closes #14472 from petermaxlee/SPARK-16866.

(cherry picked from commit b9f8a11)
Signed-off-by: Wenchen Fan <[email protected]>
Backport to 2.0!
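For reference, a hypothetical input file under sql-tests/inputs might look like this (illustrative queries, not taken from the PR):

```sql
-- Lines starting with "--" are comments and are stripped before the
-- remaining text is split into queries on unescaped semicolons.
select 1 + 1;
select -2147483648;
```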