[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R #17785
Conversation
(I will cc related reviewers after the tests pass.)
Test build #76228 has finished for PR 17785 at commit
Test build #76230 has finished for PR 17785 at commit
cc @felixcheung, @hvanhovell and @gatorsmile. Could you take a look please?
I guess we target master (and not branch-2.2) for this change?
R/pkg/R/utils.R
Outdated
@@ -864,6 +864,14 @@ captureJVMException <- function(e, method) {
# Extract the first message of JVM exception.
first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
stop(paste0(rmsg, "no such table - ", first), call. = FALSE)
} else if (any(grep("org.apache.spark.sql.catalyst.parser.ParseException: ", stacktrace))) {
msg <- strsplit(stacktrace, "org.apache.spark.sql.catalyst.parser.ParseException: ",
fixed = TRUE)[[1]]
indent
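For context, a minimal standalone sketch of the extraction in the snippet above; this is not the PR code itself, and the stacktrace string is made up for illustration.
# A made-up JVM stacktrace string, for illustration only.
stacktrace <- paste0(
  "org.apache.spark.sql.catalyst.parser.ParseException: ",
  "DataType invalidtype is not supported.(line 1, pos 0)\n",
  "\tat org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala)")

# Drop everything up to and including the exception class name.
msg <- strsplit(stacktrace, "org.apache.spark.sql.catalyst.parser.ParseException: ",
                fixed = TRUE)[[1]]

# Keep only the first message line, cutting off the "\n\tat ..." stack frames.
first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
first
# [1] "DataType invalidtype is not supported.(line 1, pos 0)"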
def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
-  val dtObj = getSQLDataType(dataType)
+  val dtObj = CatalystSqlParser.parseDataType(dataType)
I haven't checked myself; what are the differences, if any, between getSQLDataType and CatalystSqlParser.parseDataType?
Is it?
"R's one is stricter because we check the types via regular expressions on the R side ahead of time."
To my knowledge, getSQLDataType supports the types below:
binary
boolean
byte
character
date
double
float
integer
logical
numeric
raw
string
timestamp
array<...>
struct<...>
map<...>
and these appear to be case-sensitive, whereas parseDataType supports ...
bigint
binary
boolean
byte
char
date
decimal
double
float
int
integer
long
short
smallint
string
timestamp
tinyint
varchar
array<...>
struct<...>
map<...>
and these look case-insensitive.
I think the initial intention for getSQLDataType was to support R type string conversions too, but that code looks unreachable now because we check the type strings in checkType before actually calling getSQLDataType. If a type is not in !is.null(PRIMITIVE_TYPES[[type]]) (case-sensitive), checkType throws an error, so only the types below get through (see the lookup sketch after this list):
bigint
binary
boolean
byte
date
decimal
double
float
int
integer
smallint
string
timestamp
tinyint
array<...>
map<...>
struct<...>
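As a rough illustration of that case-sensitive lookup; this is not the actual SparkR code, and both the entries and the helper name checkPrimitiveType are made up for the sketch.
# Stand-in for the PRIMITIVE_TYPES mapping defined in SparkR; entries are illustrative.
PRIMITIVE_TYPES <- list(tinyint = "byte", smallint = "short", int = "integer",
                        bigint = "long", string = "string", double = "double")

checkPrimitiveType <- function(type) {
  # For a list, `[[` with an unknown name returns NULL, so this check is
  # case-sensitive: "Int" or "BIGINT" would be rejected here.
  if (is.null(PRIMITIVE_TYPES[[type]])) {
    stop(paste("Unsupported type for SparkDataframe:", type))
  }
  invisible(type)
}

checkPrimitiveType("bigint")   # passes silently
checkPrimitiveType("BigInt")   # Error: Unsupported type for SparkDataframe: BigInt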
In short, I think there should be no behaviour change for the types below (the intersection between getSQLDataType and parseDataType) ...
binary
string
double
float
boolean
timestamp
date
integer
byte
array<...>
map<...>
struct<...>
and these should be case-sensitive.
Additionally, we will support the types below (which are listed in R's PRIMITIVE_TYPES but were not supported by getSQLDataType before):
tinyint
smallint
int
bigint
decimal
Before
> structField("_col", "tinyint")
...
Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Invalid type tinyint
at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
...
> structField("_col", "smallint")
...
Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Invalid type smallint
at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
...
> structField("_col", "int")
...
Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Invalid type int
at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
...
> structField("_col", "bigint")
...
Error in handleErrors(returnStatus, conn) :
java.lang.IllegalArgumentException: Invalid type bigint
at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
...
> structField("_col", "decimal")
...
java.lang.IllegalArgumentException: Invalid type decimal
at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
...
After
> structField("_col", "tinyint")
StructField(name = "_col", type = "ByteType", nullable = TRUE)>
> structField("_col", "smallint")
StructField(name = "_col", type = "ShortType", nullable = TRUE)>
> structField("_col", "int")
StructField(name = "_col", type = "IntegerType", nullable = TRUE)>
> structField("_col", "bigint")
StructField(name = "_col", type = "LongType", nullable = TRUE)>
> structField("_col", "decimal")
StructField(name = "_col", type = "DecimalType(10,0)", nullable = TRUE)>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just wrote up the details about this as best I could. Yes, I think this should target master, not branch-2.2.
Thanks for looking into it. If I take the diff,
character
logical
numeric
raw
these are actually R native type names, though. If I had to guess, it was intentional that structField supports R native types as well as Scala/Spark types.
I'm not sure how much test coverage we have for something like this, but will that still work with this change?
Yea, but for those types we can't create the field anyway, because the check via checkType fails since they are not among the keys in PRIMITIVE_TYPES, as below:
> structField("_col", "character")
Error in checkType(type) : Unsupported type for SparkDataframe: character
> structField("_col", "logical")
Error in checkType(type) : Unsupported type for SparkDataframe: logical
> structField("_col", "numeric")
Error in checkType(type) : Unsupported type for SparkDataframe: numeric
> structField("_col", "raw")
Error in checkType(type) : Unsupported type for SparkDataframe: raw
I double-checked that this is the only place where we call getSQLDataType, so those code paths look unreachable (I hope you can double-check this when you have some time, just in case I missed something).
I see. Then I suppose this was "broken" when checkType was added, over the past 2 years or so.
I'm OK with this change then. Could you please check that we have an R test for each of the cases in checkType?
Sure. Probably the 12 cases in test_sparkSQL.R#L143-L194 plus the 5 cases in #17785 (comment), 17 cases in total, will cover checkType above.
Let me double-check them and leave a comment here today.
Let me add positive cases (a rough sketch of such tests follows the lists below) for ...
bigint
tinyint
int
smallint
decimal
and negative cases for ...
short
varchar
long
char
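Roughly along these lines; this is only a sketch in the testthat style of test_sparkSQL.R, it assumes a running SparkR session, and the expected error message mirrors the checkType message shown earlier in this thread rather than the final patch.
library(testthat)
library(SparkR)

test_that("structField type strings", {
  # Positive cases: these DDL type strings should now be accepted.
  for (type in c("tinyint", "smallint", "int", "bigint", "decimal")) {
    expect_true(inherits(structField("_col", type), "structField"))
  }
  # Negative cases: valid for the Catalyst parser, but still rejected by the
  # R-side checkType, so structField should error out.
  for (type in c("short", "varchar", "long", "char")) {
    expect_error(structField("_col", type),
                 "Unsupported type for SparkDataframe")
  }
})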
Test build #76255 has finished for PR 17785 at commit
re: your title
LGTM if we check as per #17785 (comment) and AppVeyor passes.
Test build #76270 has finished for PR 17785 at commit
LGTM
merged to master, thanks!
Thank you for checking this closely with me and merging it, @felixcheung.
What changes were proposed in this pull request?
It seems we are using SQLUtils.getSQLDataType for type strings in structField. It looks like we can replace this with CatalystSqlParser.parseDataType. They handle similar DDL-like type definitions, and such type strings look essentially identical to R's (see the usage sketch below).
R's one is stricter because we check the types via regular expressions on the R side ahead of time.
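For instance, a quick usage sketch of such type strings from the R side; the commented output line is paraphrased rather than copied from a real run.
library(SparkR)
sparkR.session()

# Primitive DDL-like type strings are parsed on the JVM side after this change.
structField("id", "bigint")
# StructField(name = "id", type = "LongType", nullable = TRUE)

# Complex type strings use the same DDL-like syntax on the R and Scala sides.
structField("scores", "array<double>")
structField("tags", "map<string,string>")
structField("point", "struct<x:double,y:double>")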
The actual logic there looks a bit different, but since we check the types ahead on the R side, replacing it should not (I think) introduce any behaviour changes. To make sure of this, dedicated tests were added in SPARK-20105. (It looks like structField is the only place that calls this method.)
How was this patch tested?
Existing tests - https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.