[SPARK-22981][SQL] Fix incorrect results of Casting Struct to String #20176

Closed
wants to merge 2 commits into from

Conversation

maropu
Member

@maropu maropu commented Jan 7, 2018

What changes were proposed in this pull request?

This PR fixes incorrect results when casting structs to strings:

scala> val df = Seq(((1, "a"), 0), ((2, "b"), 0)).toDF("a", "b")
scala> df.write.saveAsTable("t")                                        
scala> sql("SELECT CAST(a AS STRING) FROM t").show
+-------------------+
|                  a|
+-------------------+
|[0,1,1800000001,61]|
|[0,2,1800000001,62]|
+-------------------+

This PR changes the result to:

+------+
|     a|
+------+
|[1, a]|
|[2, b]|
+------+

How was this patch tested?

Added tests in CastSuite.

buildCast[InternalRow](_, row => {
  val builder = new UTF8StringBuilder
  builder.append("[")
  if (row.numFields > 0) {
Member Author

We probably never hit row.numFields == 0 here, but I left the check in to be strict.
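
As a rough illustration of the loop this hunk begins, here is a minimal plain-Scala sketch. It does not use Spark's InternalRow or UTF8StringBuilder; `StructToStringSketch` and `structToString` are hypothetical names, and null fields are deliberately left out. It only shows the bracketed, comma-separated shape from the PR description, and that the empty-struct guard costs nothing.

```scala
object StructToStringSketch {
  // Hypothetical helper: renders already-extracted field values as "[v1, v2, ...]".
  // Assumption: every field is non-null; the real cast has to deal with nulls.
  def structToString(fields: Seq[Any]): String = {
    val builder = new StringBuilder
    builder.append("[")
    // Mirrors the `row.numFields > 0` guard discussed in the comment above.
    if (fields.nonEmpty) {
      builder.append(fields.head.toString)
      fields.tail.foreach { field =>
        builder.append(", ")
        builder.append(field.toString)
      }
    }
    builder.append("]")
    builder.toString
  }
}
```

With this sketch, `StructToStringSketch.structToString(Seq(1, "a"))` yields `[1, a]`, matching the expected output shown in the description, and an empty field list yields `[]`.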

@SparkQA

SparkQA commented Jan 7, 2018

Test build #85767 has finished for PR 20176 at commit 10285d0.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Jan 7, 2018

Test build #85770 has finished for PR 20176 at commit 10285d0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member Author

maropu commented Jan 8, 2018

@cloud-fan ping

val structToStringCode = st.zipWithIndex.map { case (ft, i) =>
  val fieldToStringCode = castToStringCode(ft, ctx)
  val funcName = ctx.freshName("fieldToString")
  ctx.addNewFunction(funcName,
Contributor

no need to create a function, it's called only once.

Contributor

Actually, we can create functions for data types that appear more than once among the struct fields.
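
One way to read this suggestion is to generate a helper once per distinct data type and reuse its name for later fields of the same type. The sketch below is plain Scala and only models the bookkeeping; `FieldFunctionRegistry` and the `fieldToString_N` naming are hypothetical and are not part of Spark's codegen context.

```scala
import scala.collection.mutable

// Hypothetical registry, not Spark API: one helper name per distinct data type.
final class FieldFunctionRegistry {
  // Insertion-ordered so helper names are assigned deterministically.
  private val namesByType = mutable.LinkedHashMap.empty[String, String]

  // Returns the helper name for a data type, creating it on first use only.
  def functionFor(dataType: String): String =
    namesByType.getOrElseUpdate(dataType, s"fieldToString_${namesByType.size}")

  // Number of helpers that would actually be generated.
  def generatedCount: Int = namesByType.size
}
```

For a struct with fields of types int, string, int, this registry hands out two helper names rather than three.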

Contributor

@cloud-fan cloud-fan Jan 9, 2018

BTW, we may hit the 64KB method-size compile error here if the struct has a lot of fields. We should use ctx.splitExpressions.

Member Author

OK, I'll update soon.
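
The idea behind splitting can be sketched in plain Scala: group the per-field statements into chunks, emit one helper method per chunk, and call the helpers in order so that no single generated method grows without bound. This is not the implementation of ctx.splitExpressions; `SplitCodeSketch`, `writeFields_N`, and chunking by statement count are assumptions made for illustration.

```scala
object SplitCodeSketch {
  // helpers: generated Java method sources; calls: the statements that invoke them.
  final case class SplitResult(helpers: Seq[String], calls: Seq[String])

  // Hypothetical splitter: at most `maxPerFunction` statements per helper method.
  def split(statements: Seq[String], maxPerFunction: Int): SplitResult = {
    require(maxPerFunction > 0, "maxPerFunction must be positive")
    val chunks = statements.grouped(maxPerFunction).toVector
    val helpers = chunks.zipWithIndex.map { case (chunk, i) =>
      val body = chunk.map("  " + _).mkString("\n")
      s"private void writeFields_$i(InternalRow row, UTF8StringBuilder builder) {\n$body\n}"
    }
    val calls = chunks.indices.map(i => s"writeFields_$i(row, builder);")
    SplitResult(helpers, calls)
  }
}
```

Splitting three statements with `maxPerFunction = 2` produces two helpers and two calls, so the caller's body stays small however many fields the struct has.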

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85834 has finished for PR 20176 at commit 3dcbcc2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85838 has finished for PR 20176 at commit 6f5b080.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

retest this please

@SparkQA

SparkQA commented Jan 9, 2018

Test build #85844 has finished for PR 20176 at commit 6f5b080.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request Jan 9, 2018
Author: Takeshi Yamamuro <[email protected]>

Closes #20176 from maropu/SPARK-22981.

(cherry picked from commit 2250cb7)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan
Contributor

LGTM, merging to master/2.3!

@asfgit asfgit closed this in 2250cb7 Jan 9, 2018
@maropu
Member Author

maropu commented Jan 9, 2018

Thanks! Next, I'll fix showString, but I have one question: Cast and showString currently convert binary to string differently. Which behavior is expected?
(I prefer the PostgreSQL behavior because it makes it easy to tell strings and binary apart.)

FYI: this casting behavior differs between MySQL and PostgreSQL:

postgres=# create table t (a bytea);
CREATE TABLE
postgres=# insert into t values('abc');
INSERT 0 1
postgres=# SELECT CAST(a AS TEXT) from t;
    a     
----------
 \x616263

mysql> create table t(a blob);
Query OK, 0 rows affected (0.02 sec)

mysql> insert into t values('abc');
Query OK, 1 row affected (0.00 sec)

mysql> select CAST(a AS char) From t;
+-----------------+
| CAST(a AS char) |
+-----------------+
| abc             |
+-----------------+
1 row in set (0.00 sec)

@maropu
Member Author

maropu commented Jan 9, 2018

Hive behaves the same as MySQL:

hive> create table t(a BINARY);
OK
hive> INSERT INTO t values('abc');
OK
hive> select CAST(a AS STRING) from t;
OK
abc
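
The two behaviors shown in these sessions can be modeled in a few lines of plain Scala. This is only a sketch of the visible outputs, not how any of these databases implement the cast; `BinaryToStringSketch`, `decodeAsUtf8`, and `hexEscape` are hypothetical names, and UTF-8 is an assumed encoding.

```scala
import java.nio.charset.StandardCharsets

object BinaryToStringSketch {
  // MySQL/Hive-like result in the sessions above: the bytes are read as text.
  // Assumption: UTF-8 encoding.
  def decodeAsUtf8(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8)

  // PostgreSQL-like result in the session above: "\x" followed by two hex digits per byte.
  def hexEscape(bytes: Array[Byte]): String =
    bytes.map(b => f"${b & 0xff}%02x").mkString("\\x", "", "")
}
```

For the bytes of `abc`, `decodeAsUtf8` gives `abc` while `hexEscape` gives `\x616263`, which is why the hex form makes binary values easy to tell apart from strings.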

@cloud-fan
Contributor

Showing binary as a string and casting binary to a string seem like different things to me, so let's stick with the current behavior. BTW, it's pretty dangerous to make the behavior of cast differ from Hive's.

@maropu
Member Author

maropu commented Jan 10, 2018

ok, I'll make a follow-up for showString later.

case StructType(fields) =>
  buildCast[InternalRow](_, row => {
    val builder = new UTF8StringBuilder
    builder.append("[")
Member

Are there any reasons for using the same brackets [] for Struct, Map and Array?

How about [] for arrays, {} for structs and <> for maps or somehow distinguish them. At least, it is not consistent with toHiveString which prints {} for structs and maps.

Member Author

@maropu maropu Jul 29, 2020

I think there is no strong reason. In this PR, I just followed the existing format. It looks like we can change it to { for consistency (but we might need to update the migration doc because it changes the casting behavior). cc: @cloud-fan

Contributor

I'm OK with the new format.

Member

Here is the PR: #29308. Please review it.
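
To make the bracket proposal from this thread concrete, here is a small plain-Scala sketch that renders nested values with [] for arrays, {} for structs, and <> for maps, as first suggested above. It is not the format Spark adopted and not Spark code; the `Value` types, `render`, and the `->` separator between map keys and values are all assumptions for illustration.

```scala
object BracketStyleSketch {
  // Hypothetical value model, unrelated to Spark's internal row types.
  sealed trait Value
  final case class Prim(value: Any) extends Value
  final case class Arr(items: Seq[Value]) extends Value
  final case class Struct(fields: Seq[Value]) extends Value
  final case class MapV(entries: Seq[(Value, Value)]) extends Value

  // Renders each container kind with its own delimiters so they can be told apart.
  def render(v: Value): String = v match {
    case Prim(x)        => x.toString
    case Arr(items)     => items.map(render).mkString("[", ", ", "]")
    case Struct(fields) => fields.map(render).mkString("{", ", ", "}")
    case MapV(entries)  =>
      entries
        .map { case (key, value) => s"${render(key)} -> ${render(value)}" }
        .mkString("<", ", ", ">")
  }
}
```

Under this sketch the struct from the description renders as `{1, a}`, while an array of the same values would still render as `[1, a]`.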
