[SPARK-23023][SQL] Cast field data to strings in showString #20214
Test build #85900 has finished for PR 20214 at commit
Test build #85905 has finished for PR 20214 at commit
retest this please
@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
  }

  test("SPARK-XXXXX Cast rows to strings in showString") {
nit: need to update jira id.
oh... thx..
Test build #85910 has finished for PR 20214 at commit
Test build #85911 has finished for PR 20214 at commit
retest this please
Test build #85933 has finished for PR 20214 at commit
Test build #85942 has finished for PR 20214 at commit
retest this please
Test build #85948 has finished for PR 20214 at commit
Test build #85951 has finished for PR 20214 at commit
retest this please
Test build #85956 has finished for PR 20214 at commit
Test build #85972 has finished for PR 20214 at commit
LGTM except for some comments.
val castExprs = newDf.schema.map { f => f.dataType match {
  // Since binary types in top-level schema fields have a specific format to print,
  // so we do not cast them to strings here.
  case BinaryType => s"${f.name}"
Do we need to surround `f.name` with `s""`? Or do we need to add backticks around `f.name`?
oops, I forgot `
  // Since binary types in top-level schema fields have a specific format to print,
  // so we do not cast them to strings here.
  case BinaryType => s"${f.name}"
  case udt: UserDefinedType[_] => s"${f.name}"
nit: `_: UserDefinedType[_]`.
  case BinaryType => s"${f.name}"
  case udt: UserDefinedType[_] => s"${f.name}"
  case _ => s"CAST(`${f.name}` AS STRING)"

nit: remove an extra line.
  // Since binary types in top-level schema fields have a specific format to print,
  // so we do not cast them to strings here.
  case BinaryType => s"`${f.name}`"
  case _: UserDefinedType[_] => s"`${f.name}`"
I added this entry so that the existing tests in pyspark pass, but we still hit weird behaviours when casting user-defined types into strings:
>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0] |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+
This cast shows the internal data structure of user-defined types.
I tried to fix this, but I don't think we can do so easily because, in the codegen path, Spark has no way to tell how to convert the given internal data into a user-defined string;
master...maropu:CastUDTtoString#diff-258b71121d8d168e4d53cb5b6dc53ffeR844
WDYT? cc: @cloud-fan @ueshin
How about something like:
case udt: UserDefinedType[_] =>
(c, evPrim, evNull) => {
val udtTerm = ctx.addReferenceObj("udt", udt)
s"$evPrim = UTF8String.fromString($udtTerm.deserialize($c).toString());"
}
oh, yea. I missed that. Thanks, I'll make a separate pr.
Test build #86012 has finished for PR 20214 at commit
Please check #20246 first? Thanks! @ueshin @cloud-fan
Test build #86028 has finished for PR 20214 at commit
val castExprs = newDf.schema.map { f => f.dataType match {
  // Since binary types in top-level schema fields have a specific format to print,
  // so we do not cast them to strings here.
  case BinaryType => s"`${f.name}`"
Can we use the DataFrame API? It looks more reliable here:
newDf.logicalPlan.output.map { col =>
if (col.dataType == BinaryType) {
Column(col)
} else {
Column(col).cast(StringType)
}
}
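For readers outside the Spark codebase, the same idea can be sketched with the public DataFrame API only. This is an illustration rather than the code that was merged: the object `CastForShow` and the helper `castToStrings` are hypothetical names, and the sketch assumes Spark 2.3 or later on the classpath and a local master.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{BinaryType, StringType}

object CastForShow {
  // Hypothetical helper: casts every non-binary top-level column to a string
  // and leaves binary columns untouched.
  def castToStrings(df: DataFrame): DataFrame = {
    val cols = df.schema.fields.map { f =>
      // Backticks keep a name that contains dots from being parsed as a nested field.
      val c = df.col(s"`${f.name}`")
      if (f.dataType == BinaryType) c else c.cast(StringType)
    }
    df.select(cols: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("cast-for-show")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a")
    castToStrings(df).show(false)

    spark.stop()
  }
}
```

Working from `logicalPlan.output`, as suggested above, avoids quoting column names altogether, which is likely why it was described as more reliable; the public-API variant has to quote names and would still break on a name that itself contains a backtick.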
@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
  }

  test("SPARK-23023 Cast rows to strings in showString") {
    val df1 = Seq(Seq(1, 2, 3, 4)).toDF("a")
    assert(df1.showString(10) ===
Do you know why it showed `WrappedArray` before?
Since the `RowEncoder` deserializer converts nested arrays into `WrappedArray`, `toString` prints them that way in `showString`.
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala
Line 304 in 55dbfbc
scala.collection.mutable.WrappedArray.getClass,
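The effect described above can be reproduced without Spark. The sketch below is plain Scala and is not Spark's implementation: `NestedRender` and `render` are hypothetical names, used only to contrast a collection's own `toString` with a bracketed rendering like the one the string cast produces.

```scala
object NestedRender {
  // Hypothetical helper, not Spark code: renders nested sequences with
  // square brackets instead of relying on the collection's own toString.
  def render(value: Any): String = value match {
    case null => "null"
    case seq: scala.collection.Seq[_] => seq.map(render).mkString("[", ", ", "]")
    case other => other.toString
  }

  def main(args: Array[String]): Unit = {
    val nested = Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))
    // The default toString includes the collection class names.
    println(nested.toString)
    // Prints [[1, 2], [3], [4, 5, 6]]
    println(render(nested))
  }
}
```

The default `toString` exposes whichever collection class the deserializer happened to choose, which is the leak this PR removes by casting to strings before the rows are formatted.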
Test build #86069 has finished for PR 20214 at commit
retest this please.
Test build #86127 has finished for PR 20214 at commit
thanks, merging to master/2.3
## What changes were proposed in this pull request?

The current `Dataset.showString` prints rows through `RowEncoder` deserializers like:

```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```

This result is incorrect; the correct output is:

```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```

So, this PR fixes the code in `showString` to cast field data to strings before printing.

## How was this patch tested?

Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes #20214 from maropu/SPARK-23023.

(cherry picked from commit b598083)
Signed-off-by: Wenchen Fan <[email protected]>