[SPARK-23023][SQL] Cast field data to strings in showString #20214

maropu · 2018-01-10T03:12:26Z

What changes were proposed in this pull request?

The current Dataset.showString prints rows through RowEncoder deserializers, like this:

scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+

This result is incorrect; the correct output is:

scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+

So this PR fixes showString to cast field data to strings before printing.

How was this patch tested?

Added tests in DataFrameSuite.

SparkQA · 2018-01-10T05:39:19Z

Test build #85900 has finished for PR 20214 at commit eb56aff.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-10T08:05:02Z

Test build #85905 has finished for PR 20214 at commit e393c63.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-10T08:09:59Z

retest this please

ueshin · 2018-01-10T08:21:05Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
  }

+  test("SPARK-XXXXX Cast rows to strings in showString") {


nit: need to update jira id.

oh... thx..

SparkQA · 2018-01-10T10:52:24Z

Test build #85910 has finished for PR 20214 at commit e393c63.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-10T11:12:40Z

Test build #85911 has finished for PR 20214 at commit ae7a807.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-10T23:11:09Z

retest this please

SparkQA · 2018-01-11T01:47:00Z

Test build #85933 has finished for PR 20214 at commit ae7a807.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-11T02:45:35Z

Test build #85942 has finished for PR 20214 at commit cbccb1b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-11T03:14:32Z

Is org.apache.spark.sql.streaming.StreamingOuterJoinSuite flaky? (This PR does not seem to be related to that test.)

maropu · 2018-01-11T03:14:38Z

retest this please

SparkQA · 2018-01-11T05:55:05Z

Test build #85948 has finished for PR 20214 at commit cbccb1b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-11T08:05:01Z

Test build #85951 has finished for PR 20214 at commit 9cf9954.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-11T08:06:33Z

retest this please

SparkQA · 2018-01-11T10:48:34Z

Test build #85956 has finished for PR 20214 at commit 9cf9954.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-01-11T18:48:02Z

Test build #85972 has finished for PR 20214 at commit 66b06c3.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

ueshin

LGTM except for some comments.

ueshin · 2018-01-12T01:50:22Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    val castExprs = newDf.schema.map { f => f.dataType match {
+      // Since binary types in top-level schema fields have a specific format to print,
+      // so we do not cast them to strings here.
+      case BinaryType => s"${f.name}"


Do we need to surround f.name with s""? Or do we need to add ` around f.name?

oops, I forgot `
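
For readers following the quoting question above, here is a minimal sketch of building the per-field expression strings with backtick-quoted names. It mirrors the shape of the diff under review but is not the patch itself: `FieldInfo`, `quote`, and `castExpr` are hypothetical names, and doubling embedded backticks is an assumed escaping rule.

```scala
// Illustrative sketch only: FieldInfo, quote and castExpr are hypothetical
// names, not part of Spark or of this patch.
object CastExprSketch {
  final case class FieldInfo(name: String, isBinary: Boolean)

  // Wrap the name in backticks so a name such as "a.b" is read as a single
  // column name rather than field "b" of column "a".
  // Assumption: an embedded backtick is escaped by doubling it.
  def quote(name: String): String = "`" + name.replace("`", "``") + "`"

  // Binary fields are passed through unchanged; everything else is cast.
  def castExpr(f: FieldInfo): String =
    if (f.isBinary) quote(f.name) else s"CAST(${quote(f.name)} AS STRING)"
}
```

With this sketch, a field named `a.b` yields an expression in which the whole name stays quoted, which is the case an unquoted `f.name` would get wrong.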

ueshin · 2018-01-12T01:50:43Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+      // Since binary types in top-level schema fields have a specific format to print,
+      // so we do not cast them to strings here.
+      case BinaryType => s"${f.name}"
+      case udt: UserDefinedType[_] => s"${f.name}"


nit: _: UserDefinedType[_].

ueshin · 2018-01-12T01:51:06Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+      case BinaryType => s"${f.name}"
+      case udt: UserDefinedType[_] => s"${f.name}"
+      case _ => s"CAST(`${f.name}` AS STRING)"
+


nit: remove an extra line.

maropu · 2018-01-12T03:01:00Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+      // Since binary types in top-level schema fields have a specific format to print,
+      // so we do not cast them to strings here.
+      case BinaryType => s"`${f.name}`"
+      case _: UserDefinedType[_] => s"`${f.name}`"


I added this entry to pass the existing PySpark tests, but we still hit weird behaviour when casting user-defined types to strings:

>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+

This cast shows the internal data structure of user-defined types.
I tried to fix this, but I think we can't do so easily because, in the codegen path, Spark has no way to know how to convert the given internal data into a user-defined string:
master...maropu:CastUDTtoString#diff-258b71121d8d168e4d53cb5b6dc53ffeR844

WDYT? cc: @cloud-fan @ueshin

How about something like:

case udt: UserDefinedType[_] =>
  (c, evPrim, evNull) => {
    val udtTerm = ctx.addReferenceObj("udt", udt)
    s"$evPrim = UTF8String.fromString($udtTerm.deserialize($c).toString());"
  }

oh, yea. I missed that. Thanks, I'll make a separate pr.

SparkQA · 2018-01-12T06:03:25Z

Test build #86012 has finished for PR 20214 at commit afe0af5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-12T08:09:17Z

Please check #20246 first? Thanks! @ueshin @cloud-fan

SparkQA · 2018-01-12T12:22:43Z

Test build #86028 has finished for PR 20214 at commit 18552a4.

This patch fails from timeout after a configured wait of `250m`.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-01-12T15:44:57Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

+    val castExprs = newDf.schema.map { f => f.dataType match {
+      // Since binary types in top-level schema fields have a specific format to print,
+      // so we do not cast them to strings here.
+      case BinaryType => s"`${f.name}`"


Can we use the DataFrame API? It looks more reliable here.

newDf.logicalPlan.output.map { col =>
  if (col.dataType == BinaryType) {
    Column(col)
  } else {
    Column(col).cast(StringType)
  }
}

cloud-fan · 2018-01-12T15:46:14Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala

@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
    assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
  }

+  test("SPARK-23023 Cast rows to strings in showString") {
+    val df1 = Seq(Seq(1, 2, 3, 4)).toDF("a")
+    assert(df1.showString(10) ===


Do you know why it shows WrappedArray before?

Because the RowEncoder deserializer converts nested arrays into WrappedArray, calling toString on them in showString prints them that way.

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/RowEncoder.scala

Line 304 in 55dbfbc

scala.collection.mutable.WrappedArray.getClass,
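
The difference between relying on a collection's default `toString` and rendering it explicitly can be sketched in plain Scala. This is only an illustration of the formatting issue, not Spark's cast implementation; `RenderSketch` and `render` are hypothetical names, and the bracketed, comma-separated layout is taken from the expected output in the PR description.

```scala
// Illustrative sketch only: RenderSketch.render is a hypothetical helper,
// not Spark code. It formats nested sequences as "[[1, 2], [3]]" instead of
// relying on the collection's own toString, which leaks the class name.
object RenderSketch {
  def render(value: Any): String = value match {
    case null => "null"
    // Recurse into nested sequences so inner collections are bracketed too.
    case seq: scala.collection.Seq[_] =>
      seq.iterator.map(render).mkString("[", ", ", "]")
    case other => other.toString
  }
}
```

Calling `RenderSketch.render(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6)))` produces the bracketed form shown in the description, whereas the default `toString` of the same value includes the concrete collection class name.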

SparkQA · 2018-01-13T04:02:49Z

Test build #86069 has finished for PR 20214 at commit 022ed32.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-01-15T03:03:59Z

retest this please.

SparkQA · 2018-01-15T06:16:40Z

Test build #86127 has finished for PR 20214 at commit 022ed32.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-01-15T08:28:51Z

thanks, merging to master/2.3

## What changes were proposed in this pull request?

The current `Datset.showString` prints rows thru `RowEncoder` deserializers like;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect because the correct one is;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this pr fixed code in `showString` to cast field data to strings before printing.

## How was this patch tested?

Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro

Closes #20214 from maropu/SPARK-23023.

(cherry picked from commit b598083)
Signed-off-by: Wenchen Fan

Cast data to strings in showString

e393c63

maropu force-pushed the SPARK-23023 branch from eb56aff to e393c63 Compare January 10, 2018 06:12

ueshin reviewed Jan 10, 2018

Fix

ae7a807

maropu force-pushed the SPARK-23023 branch from cbccb1b to 9cf9954 Compare January 11, 2018 06:06

Fix

66b06c3

maropu force-pushed the SPARK-23023 branch from 9cf9954 to 66b06c3 Compare January 11, 2018 15:29

ueshin reviewed Jan 12, 2018

Fix

afe0af5

maropu commented Jan 12, 2018

Drop entry for UserDefinedType

18552a4

cloud-fan reviewed Jan 12, 2018

Fix

022ed32

asfgit closed this in b598083 Jan 15, 2018

kevinjmh mentioned this pull request Aug 14, 2019

[CARBONDATA-3491] Return updated/deleted rows count when execute update/delete sql apache/carbondata#3357

