Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-23023][SQL] Cast field data to strings in showString #20214

Closed
wants to merge 6 commits into from

Conversation

maropu
Copy link
Member

@maropu maropu commented Jan 10, 2018

What changes were proposed in this pull request?

The current Datset.showString prints rows thru RowEncoder deserializers like;

scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+

This result is incorrect because the correct one is;

scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+

So, this pr fixed code in showString to cast field data to strings before printing.

How was this patch tested?

Added tests in DataFrameSuite.

@SparkQA
Copy link

SparkQA commented Jan 10, 2018

Test build #85900 has finished for PR 20214 at commit eb56aff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 10, 2018

Test build #85905 has finished for PR 20214 at commit e393c63.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 10, 2018

retest this please

@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
}

test("SPARK-XXXXX Cast rows to strings in showString") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: need to update jira id.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh... thx..

@SparkQA
Copy link

SparkQA commented Jan 10, 2018

Test build #85910 has finished for PR 20214 at commit e393c63.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 10, 2018

Test build #85911 has finished for PR 20214 at commit ae7a807.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 10, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85933 has finished for PR 20214 at commit ae7a807.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85942 has finished for PR 20214 at commit cbccb1b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 11, 2018

org.apache.spark.sql.streaming.StreamingOuterJoinSuite is flaky? (It seems this pr is not related to the test).

@maropu
Copy link
Member Author

maropu commented Jan 11, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85948 has finished for PR 20214 at commit cbccb1b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85951 has finished for PR 20214 at commit 9cf9954.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 11, 2018

retest this please

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85956 has finished for PR 20214 at commit 9cf9954.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jan 11, 2018

Test build #85972 has finished for PR 20214 at commit 66b06c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some comments.

val castExprs = newDf.schema.map { f => f.dataType match {
// Since binary types in top-level schema fields have a specific format to print,
// so we do not cast them to strings here.
case BinaryType => s"${f.name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to surround f.name with s""? Or we need to add ` around f.name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oops, I forgot `

// Since binary types in top-level schema fields have a specific format to print,
// so we do not cast them to strings here.
case BinaryType => s"${f.name}"
case udt: UserDefinedType[_] => s"${f.name}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: _: UserDefinedType[_].

case BinaryType => s"${f.name}"
case udt: UserDefinedType[_] => s"${f.name}"
case _ => s"CAST(`${f.name}` AS STRING)"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove an extra line.

// Since binary types in top-level schema fields have a specific format to print,
// so we do not cast them to strings here.
case BinaryType => s"`${f.name}`"
case _: UserDefinedType[_] => s"`${f.name}`"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this entry for passing the existing tests in pyspark though, we still hit wired behaviours when casting user-defined types into strings;

>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+

This cast shows the internal data structure of user-define types.
I just tried to fix this though, I think we easily can't because, in codegen path, spark can't tell a way to convert a given internal data into an user-defined string;
master...maropu:CastUDTtoString#diff-258b71121d8d168e4d53cb5b6dc53ffeR844

WDYT? cc: @cloud-fan @ueshin

Copy link
Member

@ueshin ueshin Jan 12, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like:

case udt: UserDefinedType[_] =>
  (c, evPrim, evNull) => {
    val udtTerm = ctx.addReferenceObj("udt", udt)
    s"$evPrim = UTF8String.fromString($udtTerm.deserialize($c).toString());"
  }

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, yea. I missed that. Thanks, I'll make a separate pr.

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86012 has finished for PR 20214 at commit afe0af5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 12, 2018

Please check #20246 first? Thanks! @ueshin @cloud-fan

@SparkQA
Copy link

SparkQA commented Jan 12, 2018

Test build #86028 has finished for PR 20214 at commit 18552a4.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

val castExprs = newDf.schema.map { f => f.dataType match {
// Since binary types in top-level schema fields have a specific format to print,
// so we do not cast them to strings here.
case BinaryType => s"`${f.name}`"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we use dataframe API? which looks more reliable here

newDf.logicalPlan.output.map { col =>
  if (col.dataType == BinaryType) {
    Column(col)
  } else {
    Column(col).cast(StringType)
  }
}

@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
}

test("SPARK-23023 Cast rows to strings in showString") {
val df1 = Seq(Seq(1, 2, 3, 4)).toDF("a")
assert(df1.showString(10) ===
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why it shows WrappedArray before?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since RowEncoder deserializer converts nested arrays into WrappedArray, toString shows do so in showString.

@SparkQA
Copy link

SparkQA commented Jan 13, 2018

Test build #86069 has finished for PR 20214 at commit 022ed32.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jan 15, 2018

retest this please.

@SparkQA
Copy link

SparkQA commented Jan 15, 2018

Test build #86127 has finished for PR 20214 at commit 022ed32.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

thanks, merging to master/2.3

asfgit pushed a commit that referenced this pull request Jan 15, 2018
## What changes were proposed in this pull request?
The current `Datset.showString` prints rows thru `RowEncoder` deserializers like;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------------------------------------------+
|a                                                           |
+------------------------------------------------------------+
|[WrappedArray(1, 2), WrappedArray(3), WrappedArray(4, 5, 6)]|
+------------------------------------------------------------+
```
This result is incorrect because the correct one is;
```
scala> Seq(Seq(Seq(1, 2), Seq(3), Seq(4, 5, 6))).toDF("a").show(false)
+------------------------+
|a                       |
+------------------------+
|[[1, 2], [3], [4, 5, 6]]|
+------------------------+
```
So, this pr fixed code in `showString` to cast field data to strings before printing.

## How was this patch tested?
Added tests in `DataFrameSuite`.

Author: Takeshi Yamamuro <[email protected]>

Closes #20214 from maropu/SPARK-23023.

(cherry picked from commit b598083)
Signed-off-by: Wenchen Fan <[email protected]>
@asfgit asfgit closed this in b598083 Jan 15, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants