[SPARK-23023][SQL] Cast field data to strings in showString #20214

Closed · wants to merge 6 commits (changes shown from 4 commits)

32 changes: 16 additions & 16 deletions python/pyspark/sql/functions.py
@@ -1849,14 +1849,14 @@ def explode_outer(col):
+---+----------+----+-----+

>>> df.select("id", "a_map", explode_outer("an_array")).show()
+---+-------------+----+
| id| a_map| col|
+---+-------------+----+
| 1|Map(x -> 1.0)| foo|
| 1|Map(x -> 1.0)| bar|
| 2| Map()|null|
| 3| null|null|
+---+-------------+----+
+---+----------+----+
| id| a_map| col|
+---+----------+----+
| 1|[x -> 1.0]| foo|
| 1|[x -> 1.0]| bar|
| 2| []|null|
| 3| null|null|
+---+----------+----+
"""
sc = SparkContext._active_spark_context
jc = sc._jvm.functions.explode_outer(_to_java_column(col))
@@ -1881,14 +1881,14 @@ def posexplode_outer(col):
| 3| null|null|null| null|
+---+----------+----+----+-----+
>>> df.select("id", "a_map", posexplode_outer("an_array")).show()
+---+-------------+----+----+
| id| a_map| pos| col|
+---+-------------+----+----+
| 1|Map(x -> 1.0)| 0| foo|
| 1|Map(x -> 1.0)| 1| bar|
| 2| Map()|null|null|
| 3| null|null|null|
+---+-------------+----+----+
+---+----------+----+----+
| id| a_map| pos| col|
+---+----------+----+----+
| 1|[x -> 1.0]| 0| foo|
| 1|[x -> 1.0]| 1| bar|
| 2| []|null|null|
| 3| null|null|null|
+---+----------+----+----+
"""
sc = SparkContext._active_spark_context
jc = sc._jvm.functions.posexplode_outer(_to_java_column(col))
19 changes: 9 additions & 10 deletions sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -237,13 +237,18 @@ class Dataset[T] private[sql](
private[sql] def showString(
_numRows: Int, truncate: Int = 20, vertical: Boolean = false): String = {
val numRows = _numRows.max(0).min(Int.MaxValue - 1)
val takeResult = toDF().take(numRows + 1)
val newDf = toDF()
val castExprs = newDf.schema.map { f => f.dataType match {
// Since binary types in top-level schema fields have a specific format to print,
// we do not cast them to strings here.
case BinaryType => s"`${f.name}`"
Contributor

Can we use the DataFrame API? It looks more reliable here:

newDf.logicalPlan.output.map { col =>
  if (col.dataType == BinaryType) {
    Column(col)
  } else {
    Column(col).cast(StringType)
  }
}
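
One possible way to wire that into showString (a sketch only; it assumes the code lives inside Dataset.scala, where Column(expr), logicalPlan, and the numRows value from above are in scope):

// Sketch: build typed Columns instead of SQL expression strings, leaving
// binary columns untouched and casting everything else to string.
val castCols = newDf.logicalPlan.output.map { attr =>
  if (attr.dataType == BinaryType) Column(attr) else Column(attr).cast(StringType)
}
val takeResult = newDf.select(castCols: _*).take(numRows + 1)
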

case _: UserDefinedType[_] => s"`${f.name}`"
Member Author

I added this entry to keep the existing PySpark tests passing, but we still hit weird behaviours when casting user-defined types to strings;

>>> from pyspark.ml.classification import MultilayerPerceptronClassifier
>>> from pyspark.ml.linalg import Vectors
>>> df = spark.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])), (1.0, Vectors.dense([0.0, 1.0]))], ["label", "features"])
>>> df.selectExpr("CAST(features AS STRING)").show(truncate = False)
+-------------------------------------------+
|features                                   |
+-------------------------------------------+
|[6,1,0,0,2800000020,2,0,0,0]               |
|[6,1,0,0,2800000020,2,0,0,3ff0000000000000]|
+-------------------------------------------+

This cast exposes the internal data structure of user-defined types.
I tried to fix this, but I don't think we easily can: in the codegen path, Spark has no way to convert the internal data into a user-defined string;
master...maropu:CastUDTtoString#diff-258b71121d8d168e4d53cb5b6dc53ffeR844

WDYT? cc: @cloud-fan @ueshin

Member
@ueshin ueshin Jan 12, 2018

How about something like:

case udt: UserDefinedType[_] =>
  (c, evPrim, evNull) => {
    val udtTerm = ctx.addReferenceObj("udt", udt)
    s"$evPrim = UTF8String.fromString($udtTerm.deserialize($c).toString());"
  }

Member Author

Oh, yeah, I missed that. Thanks, I'll make a separate PR.

case _ => s"CAST(`${f.name}` AS STRING)"
}}
val takeResult = newDf.selectExpr(castExprs: _*).take(numRows + 1)
val hasMoreData = takeResult.length > numRows
val data = takeResult.take(numRows)

lazy val timeZone =
DateTimeUtils.getTimeZone(sparkSession.sessionState.conf.sessionLocalTimeZone)

// For array values, replace Seq and Array with square brackets
// For cells that are beyond `truncate` characters, replace it with the
// first `truncate-3` and "..."
@@ -252,12 +257,6 @@
val str = cell match {
case null => "null"
case binary: Array[Byte] => binary.map("%02X".format(_)).mkString("[", " ", "]")
case array: Array[_] => array.mkString("[", ", ", "]")
case seq: Seq[_] => seq.mkString("[", ", ", "]")
case d: Date =>
DateTimeUtils.dateToString(DateTimeUtils.fromJavaDate(d))
case ts: Timestamp =>
DateTimeUtils.timestampToString(DateTimeUtils.fromJavaTimestamp(ts), timeZone)
case _ => cell.toString
}
if (truncate > 0 && str.length > truncate) {
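
For reference, a rough, self-contained sketch of the behaviour this change produces (not part of the diff; the object name, session setup, and sample data are illustrative): every non-binary, non-UDT column is wrapped in CAST(`name` AS STRING) before rows are collected, so complex values are rendered by the SQL cast instead of Scala's toString.

import org.apache.spark.sql.SparkSession

object ShowStringSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("showString sketch").getOrCreate()
    import spark.implicits._

    val df = Seq((Seq(1, 2, 3), Map("x" -> 1.0))).toDF("an_array", "a_map")

    // Roughly what showString now runs internally before formatting the cells.
    df.selectExpr("CAST(`an_array` AS STRING)", "CAST(`a_map` AS STRING)").show()

    spark.stop()
  }
}

With this patch the cells render as [1, 2, 3] and [x -> 1.0], matching the doctest updates above, rather than WrappedArray(1, 2, 3) and Map(x -> 1.0); binary columns are excluded from the cast and keep the hex formatting applied later in showString.
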
28 changes: 28 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
@@ -1255,6 +1255,34 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
assert(testData.select($"*").showString(1, vertical = true) === expectedAnswer)
}

test("SPARK-23023 Cast rows to strings in showString") {
val df1 = Seq(Seq(1, 2, 3, 4)).toDF("a")
assert(df1.showString(10) ===
Contributor

Do you know why it showed WrappedArray before?

Member Author

Since the RowEncoder deserializer converts nested arrays into WrappedArray, calling toString on those cells printed WrappedArray(...) in showString.
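
For illustration only (not part of the PR; assumes Scala 2.11/2.12), the old rendering came from the wrapped array's default toString:

// An array cell in a deserialized Row is a scala.collection.mutable.WrappedArray.
val cell: Seq[Int] = Array(1, 2, 3, 4)
println(cell.toString)                  // WrappedArray(1, 2, 3, 4)  <- old showString rendering
println(cell.mkString("[", ", ", "]"))  // [1, 2, 3, 4]              <- what the new CAST(... AS STRING) path yields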

s"""+------------+
|| a|
|+------------+
||[1, 2, 3, 4]|
|+------------+
|""".stripMargin)
val df2 = Seq(Map(1 -> "a", 2 -> "b")).toDF("a")
assert(df2.showString(10) ===
s"""+----------------+
|| a|
|+----------------+
||[1 -> a, 2 -> b]|
|+----------------+
|""".stripMargin)
val df3 = Seq(((1, "a"), 0), ((2, "b"), 0)).toDF("a", "b")
assert(df3.showString(10) ===
s"""+------+---+
|| a| b|
|+------+---+
||[1, a]| 0|
||[2, b]| 0|
|+------+---+
|""".stripMargin)
}

test("SPARK-7327 show with empty dataFrame") {
val expectedAnswer = """+---+-----+
||key|value|
12 changes: 6 additions & 6 deletions sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
@@ -958,12 +958,12 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
).toDS()

val expected =
"""+-------+
|| f|
|+-------+
||[foo,1]|
||[bar,2]|
|+-------+
"""+--------+
|| f|
|+--------+
||[foo, 1]|
||[bar, 2]|
|+--------+
|""".stripMargin

checkShowString(ds, expected)