
[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R #17785

Closed
wants to merge 4 commits

Conversation

HyukjinKwon
Member

@HyukjinKwon HyukjinKwon commented Apr 27, 2017

What changes were proposed in this pull request?

It seems we are using SQLUtils.getSQLDataType for the type string in structField. It looks like we can replace this with CatalystSqlParser.parseDataType.
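For context, here is a rough sketch of how structField on the R side reaches this Scala helper through SparkR's JVM bridge (shown with the public sparkR.callJStatic wrapper for illustration; the real structField goes through the internal bridge directly):

library(SparkR)
sparkR.session()

# Roughly what structField("_col", "tinyint") does underneath: it invokes the
# static JVM helper whose type-string parsing this PR changes.
sf <- sparkR.callJStatic("org.apache.spark.sql.api.r.SQLUtils",
                         "createStructField",
                         "_col", "tinyint", TRUE)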

Both accept similar DDL-like type definitions, as below:

scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
+---+
| _1|
+---+
|[a]|
+---+
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
+---+
| _1|
+---+
|[a]|
+---+

Such type strings look identical to R’s, as below:

> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
  struct
1      a

R’s is stricter because we check the types via regular expressions on the R side beforehand.

The actual parsing logic differs a bit, but since we validate the type string on the R side first, it looks like replacing it would not (I think) introduce any behaviour changes. To make sure of this, dedicated tests were added in SPARK-20105. (It looks like structField is the only place that calls this method.)

How was this patch tested?

Existing tests at https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.

@HyukjinKwon
Member Author

(I will cc related reviewers after the tests pass.)

@SparkQA

SparkQA commented Apr 27, 2017

Test build #76228 has finished for PR 17785 at commit 8874de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Apr 27, 2017

Test build #76230 has finished for PR 17785 at commit b802e36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon reopened this Apr 27, 2017
@HyukjinKwon
Member Author

cc @felixcheung, @hvanhovell and @gatorsmile. Could you take a look please?

Member

@felixcheung felixcheung left a comment

I guess we target master (and not branch-2.2) for this change?

R/pkg/R/utils.R Outdated
@@ -864,6 +864,14 @@ captureJVMException <- function(e, method) {
     # Extract the first message of JVM exception.
     first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
     stop(paste0(rmsg, "no such table - ", first), call. = FALSE)
+  } else if (any(grep("org.apache.spark.sql.catalyst.parser.ParseException: ", stacktrace))) {
+    msg <- strsplit(stacktrace, "org.apache.spark.sql.catalyst.parser.ParseException: ",
+                    fixed = TRUE)[[1]]
Member

indent

def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
-  val dtObj = getSQLDataType(dataType)
+  val dtObj = CatalystSqlParser.parseDataType(dataType)
Member

haven't checked myself, what are the differences if any between getSQLDataType and CatalystSqlParser.parseDataType?

Member

is it this?

R’s is stricter because we check the types via regular expressions on the R side beforehand.

Member Author

@HyukjinKwon HyukjinKwon Apr 28, 2017

To my knowledge, getSQLDataType supports the types below:

binary
boolean
byte
character
date
double
float
integer
logical
numeric
raw
string
timestamp
array<...>
struct<...>
map<...>

and these look like they need to be case-sensitive, whereas parseDataType supports ...

bigint
binary
boolean
byte
char
date
decimal
double
float
int
integer
long
short
smallint
string
timestamp
tinyint
varchar
array<...>
struct<...>
map<...>

and these look case-insensitive.
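For illustration, one way to see that case-insensitivity from R is through cast, which takes the same kind of DDL string and (as far as I can tell) goes straight to the Catalyst parser without the R-side pre-check (this assumes an active SparkR session):

df <- createDataFrame(data.frame(x = 1L))
# Both spellings parse, because the Catalyst parser matches type names
# case-insensitively.
head(select(df, cast(df$x, "tinyint")))
head(select(df, cast(df$x, "TINYINT")))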

I think the initial intention for getSQLDataType was to support R type string conversions too, but that code looks unreachable now because we check the type strings in checkType before actually calling getSQLDataType.

If a type does not pass !is.null(PRIMITIVE_TYPES[[type]]) (case-sensitive), it looks like an error is thrown. The accepted keys are below (a rough sketch of this check follows the list):

bigint
binary
boolean
byte
date
decimal
double
float
int
integer
smallint
string
timestamp
tinyint
array<...>
map<...>
struct<...>
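A minimal sketch of that check (illustrative only, not SparkR's exact code; the real checkType also validates array/map/struct element types via regular expressions):

# Case-sensitive lookup: unknown or wrongly-cased names never reach the JVM.
PRIMITIVE_TYPES <- list(tinyint = "ByteType", smallint = "ShortType",
                        int = "IntegerType", bigint = "LongType",
                        decimal = "DecimalType", string = "StringType")
checkTypeSketch <- function(type) {
  if (is.null(PRIMITIVE_TYPES[[type]])) {
    stop(paste("Unsupported type for SparkDataframe:", type))
  }
}
checkTypeSketch("tinyint")  # passes silently
checkTypeSketch("TINYINT")  # Error: Unsupported type for SparkDataframe: TINYINT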

In short, I think there should not be a behaviour change for the types below (the intersection between getSQLDataType and parseDataType) ...

binary
string
double
float
boolean
timestamp
date
integer
byte
array<...>
map<...>
struct<...>

and these should be case-sensitive.
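For example (a hedged illustration, following the checkType behaviour above):

> structField("_col", "double")
StructField(name = "_col", type = "DoubleType", nullable = TRUE)
> structField("_col", "Double")
Error in checkType(type) : Unsupported type for SparkDataframe: Double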

Additionally, we will support the types below (which are listed in R's PRIMITIVE_TYPES but which getSQLDataType did not support before):

tinyint
smallint
int
bigint
decimal

Before

> structField("_col", "tinyint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type tinyint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "smallint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type smallint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "int")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type int
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "bigint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type bigint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "decimal")
  ...
  java.lang.IllegalArgumentException: Invalid type decimal
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...

After

> structField("_col", "tinyint")
StructField(name = "_col", type = "ByteType", nullable = TRUE)>
> structField("_col", "smallint")
StructField(name = "_col", type = "ShortType", nullable = TRUE)>
> structField("_col", "int")
StructField(name = "_col", type = "IntegerType", nullable = TRUE)>
> structField("_col", "bigint")
StructField(name = "_col", type = "LongType", nullable = TRUE)>
> structField("_col", "decimal")
StructField(name = "_col", type = "DecimalType(10,0)", nullable = TRUE)>

Member Author

I wrote up the details above as best I could. Yes, I think this should target master, not 2.2.

Member

thanks for looking into it. if I take the diff,

character
logical
numeric
raw

these are actually R native type names, though; if I have to guess, it was intentional to support R native types in structField as well as Scala/Spark types.

I'm not sure how much coverage we have for something like this, but is that still going to work with this change?

Member Author

@HyukjinKwon HyukjinKwon Apr 28, 2017

Yea; however, for those types we can't create such a field, because the check in checkType fails as they are not among the keys of PRIMITIVE_TYPES, as below:

> structField("_col", "character")
Error in checkType(type) : Unsupported type for SparkDataframe: character
> structField("_col", "logical")
Error in checkType(type) : Unsupported type for SparkDataframe: logical
> structField("_col", "numeric")
Error in checkType(type) : Unsupported type for SparkDataframe: numeric
> structField("_col", "raw")
Error in checkType(type) : Unsupported type for SparkDataframe: raw

I double-checked that this is the only place where we call getSQLDataType, so those branches look unreachable (I hope you can double-check this when you have some time, just in case I missed something).

Member

I see - then I suppose this was "broken" when checkType was added, over the past 2 years or so.
I'm OK with this change then - could you please check that we have an R test for each of the cases in checkType?

Member Author

Sure. Probably the 12 cases in test_sparkSQL.R#L143-L194 plus the 5 cases in #17785 (comment) will cover all 17 cases in checkType above.

Let me double-check them and leave a comment here today.

Member Author

Let me add positive cases for ...

bigint
tinyint
int
smallint
decimal

and negative cases (a rough sketch of these tests follows the lists) for ...

short
varchar
long
char
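Something like this, roughly (a sketch of the shape of those testthat cases; the exact assertions added to test_sparkSQL.R may differ):

test_that("structField DDL type strings", {
  # Positive cases: these should now produce a valid structField.
  for (t in c("bigint", "tinyint", "int", "smallint", "decimal")) {
    expect_is(structField("_col", t), "structField")
  }
  # Negative cases: these are rejected by the R-side checkType.
  for (t in c("short", "varchar", "long", "char")) {
    expect_error(structField("_col", t))
  }
})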

@SparkQA

SparkQA commented Apr 28, 2017

Test build #76255 has finished for PR 17785 at commit 257e625.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

re: your title De-deuplicate -> De-duplicate

@HyukjinKwon HyukjinKwon changed the title [SPARK-20493][R] De-deuplicate parse logics for DDL-like type strings in R [SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R Apr 28, 2017
Member

@felixcheung felixcheung left a comment

LGTM if we check as per #17785 (comment)

And AppVeyor passes

@SparkQA

SparkQA commented Apr 28, 2017

Test build #76270 has finished for PR 17785 at commit e9d672a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@felixcheung felixcheung left a comment

LGTM

@felixcheung
Member

merged to master, thanks!

@asfgit asfgit closed this in 70f1bcd Apr 29, 2017
@HyukjinKwon
Member Author

Thank you for checking this closely with me and merging it @felixcheung.

@HyukjinKwon HyukjinKwon deleted the SPARK-20493 branch January 2, 2018 03:43