-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement #17901
Conversation
R/pkg/R/functions.R
Outdated
#' according to the rules in: | ||
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}. | ||
#' If the string cannot be parsed according to the specified format (or default), | ||
#' the value of the column will be null. | ||
#' The default format is 'yyyy-MM-dd'. | ||
#' By default, it follows casting rules to a date if the format is omitted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks the default format is ...
yyyy
,yyyy-[m]m
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d
yyyy-[m]m-[d]d *
yyyy-[m]m-[d]dT*
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
Lines 430 to 435 in 2269155
* `yyyy`, | |
* `yyyy-[m]m` | |
* `yyyy-[m]m-[d]d` | |
* `yyyy-[m]m-[d]d ` | |
* `yyyy-[m]m-[d]d *` | |
* `yyyy-[m]m-[d]dT*` |
which looks used in the casting rule to a date type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there more info we could provide for R users, who might not know where to look for this "casting rules to a date"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am not sure if i should write out all the contents above ... these format above look actually a bit informal to me (dose anyone know if I understood this correctly?) for a use of documentation. Do you have any good idea for a better description maybe ... ? Let me leave another comment while addressing the comments if I come up with a better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, let me give a shot with adding an example - cast(df$x, "date")
.
'to_timestamp': 'Converts a string timestamp into a timestamp type using the ' + | ||
'(optionally) specified format.', | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, instead of deleting this we should keep it and we should add
for _name, _doc in _functions_2_2.items():
globals()[_name] = since(2.2)(_create_function(_name, _doc))
this is for doc tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in fact, we might need this as a standalone fix for 2.2
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to my knowledge, these were for defining single argumented function that takes a column conveniently but we are defining them below already and both look taking additional format argument. Finally both look having the annotatioms correctly.
Let me double check and address this comment if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, that's right - we don't need them - not sure if these are left behind from before format
parameter was added or something.
Examples: | ||
> SELECT _FUNC_('2009-07-30 04:17:52'); | ||
2009-07-30 | ||
""") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems not used.
9fcf107
to
9f75b4c
Compare
9f75b4c
to
45bf353
Compare
usage = "_FUNC_(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.", | ||
usage = """ | ||
_FUNC_(timestamp[, fmt]) - Parses the `timestamp` expression with the `format` expression to | ||
a timestamp. Returns null with invalid input. Default `fmt` is 'yyyy-MM-dd HH:mm:ss'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, you are assuming the default format is ISO. There are at least four common default formats, EUR, ISO, JIS and USA. Normally, the database has a register to store the default format. We cannot simply assume ISO is favorite for all the users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We already use this as default format in the same APIs -
new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss")) |
Do you have any suggestion that I could try?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Postgres has a single-argument to_timestamp function, but that is used to convert Unix epoch to timestamp.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, it seems we documented them here and there
Line 1835 in f21897f
#' The default format is 'yyyy-MM-dd HH:mm:ss'. |
spark/python/pyspark/sql/functions.py
Line 1014 in 63d90e7
using the optionally specified format. Default format is 'yyyy-MM-dd HH:mm:ss'. Specify |
If the suggestion can be simply done with the format in SimpleDateFormat
, I am willing to do this but in a quick look it looks not. Do you mind if we try this later in a separate PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is the default values we choose. I am not sure whether we should simply choose ISO as the default value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do understand your concern but I am not introducing the default value. It is already there in coressponding APIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we follow to_date
and using the casting rules if the format is not specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, let me give a shot.
Test build #76583 has finished for PR 17901 at commit
|
Test build #76584 has finished for PR 17901 at commit
|
Test build #76586 has finished for PR 17901 at commit
|
Test build #76585 has finished for PR 17901 at commit
|
@gatorsmile and @hvanhovell, I am not introducing the default value but using the same default value already there that we documented here and there. This PR simply supports single-argument in Changing the default value sounds orthogonal to me and does not look blocking this PR. |
R/pkg/R/functions.R
Outdated
#' according to the rules in: | ||
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}. | ||
#' If the string cannot be parsed according to the specified format (or default), | ||
#' the value of the column will be null. | ||
#' The default format is 'yyyy-MM-dd'. | ||
#' By default, it follows casting rules to a date if the format is omitted. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there more info we could provide for R users, who might not know where to look for this "casting rules to a date"?
R/pkg/R/functions.R
Outdated
#' | ||
#' @param x Column to parse. | ||
#' @param format string to use to parse x Column to DateType. (optional) | ||
#' @param format string to use to parse x column to a date column. (optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Column
is a type in R, so it's intentional to captialize it
R/pkg/R/functions.R
Outdated
@@ -1827,15 +1827,15 @@ setMethod("to_json", signature(x = "Column"), | |||
|
|||
#' to_timestamp | |||
#' | |||
#' Converts the column into a TimestampType. You may optionally specify a format | |||
#' Converts the column into a timestamp column. You may optionally specify a format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there more info on timestamp column
? as it was, TimestampType
we could say or reference the Scala/Spark type
'to_timestamp': 'Converts a string timestamp into a timestamp type using the ' + | ||
'(optionally) specified format.', | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually, instead of deleting this we should keep it and we should add
for _name, _doc in _functions_2_2.items():
globals()[_name] = since(2.2)(_create_function(_name, _doc))
this is for doc tag
`SimpleDateFormats <http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html>`_. | ||
By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, not sure if it's clear to python user with casting rules to :class:pyspark.sql.types.DateType
Test build #76603 has finished for PR 17901 at commit
|
you can rebase to pick up the fix for the R tests |
Thank you everybody. Let me ty to address the comments. |
Test build #76636 has started for PR 17901 at commit |
Test build #76638 has started for PR 17901 at commit |
R/pkg/R/functions.R
Outdated
@@ -1757,7 +1757,8 @@ setMethod("toRadians", | |||
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}. | |||
#' If the string cannot be parsed according to the specified format (or default), | |||
#' the value of the column will be null. | |||
#' The default format is 'yyyy-MM-dd'. | |||
#' By default, it follows casting rules to a DateType if the format is omitted | |||
#' (equivalent with \code{cast(df$x, "date")}). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@felixcheung, I added an example here. Would this be enough?
Test build #76641 has started for PR 17901 at commit |
retest this please |
Test build #76659 has finished for PR 17901 at commit
|
Test build #76673 has finished for PR 17901 at commit
|
* @group datetime_funcs | ||
* @since 2.2.0 | ||
*/ | ||
def to_timestamp(s: Column): Column = withExpr { | ||
new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here we change the default value of the format string to be locale sensitive(same as cast(col, timestamp)
), is it ok? cc @rxin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin and @cloud-fan, I would rather take out the change here if this holds off this PR. This is essentially orthogonal with this PR.
thanks, merging to master! @rxin if you have any concern about the default value, we can change it later. |
Thank you so much. |
…L with documentation improvement ## What changes were proposed in this pull request? This PR proposes three things as below: - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`). - Support single argument for `to_timestamp` similarly with APIs in other languages. For example, the one below works ``` import org.apache.spark.sql.functions._ Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show() ``` prints ``` +----------------------------------------+ |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')| +----------------------------------------+ | 2016-12-31 00:12:00| +----------------------------------------+ ``` whereas this does not work in SQL. **Before** ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7 ``` **After** ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 ``` - Related document improvement for SQL function descriptions and other API descriptions accordingly. **Before** ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input. Extended Usage: Examples: > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00.0 ``` **After** ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_date('2009-07-30 04:17:52'); 2009-07-30 > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00 ``` ## How was this patch tested? Added tests in `datetime.sql`. Author: hyukjinkwon <[email protected]> Closes apache#17901 from HyukjinKwon/to_timestamp_arg.
…L with documentation improvement ## What changes were proposed in this pull request? This PR proposes three things as below: - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`). - Support single argument for `to_timestamp` similarly with APIs in other languages. For example, the one below works ``` import org.apache.spark.sql.functions._ Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show() ``` prints ``` +----------------------------------------+ |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')| +----------------------------------------+ | 2016-12-31 00:12:00| +----------------------------------------+ ``` whereas this does not work in SQL. **Before** ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7 ``` **After** ``` spark-sql> SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 ``` - Related document improvement for SQL function descriptions and other API descriptions accordingly. **Before** ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input. Extended Usage: Examples: > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00.0 ``` **After** ``` spark-sql> DESCRIBE FUNCTION extended to_date; ... Usage: to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to a date. Returns null with invalid input. By default, it follows casting rules to a date if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_date('2009-07-30 04:17:52'); 2009-07-30 > SELECT to_date('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 ``` ``` spark-sql> DESCRIBE FUNCTION extended to_timestamp; ... Usage: to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to a timestamp. Returns null with invalid input. By default, it follows casting rules to a timestamp if the `fmt` is omitted. Extended Usage: Examples: > SELECT to_timestamp('2016-12-31 00:12:00'); 2016-12-31 00:12:00 > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd'); 2016-12-31 00:00:00 ``` ## How was this patch tested? Added tests in `datetime.sql`. Author: hyukjinkwon <[email protected]> Closes apache#17901 from HyukjinKwon/to_timestamp_arg.
What changes were proposed in this pull request?
This PR proposes three things as below:
Use casting rules to a timestamp in
to_timestamp
by default (it wasyyyy-MM-dd HH:mm:ss
).Support single argument for
to_timestamp
similarly with APIs in other languages.For example, the one below works
prints
whereas this does not work in SQL.
Before
After
Related document improvement for SQL function descriptions and other API descriptions accordingly.
Before
After
How was this patch tested?
Added tests in
datetime.sql
.