-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement #17901
Changes from 3 commits
45bf353
f8921f4
b2d3b0a
497a229
b6f867c
fc02460
b038927
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1752,15 +1752,15 @@ setMethod("toRadians", | |
|
||
#' to_date | ||
#' | ||
#' Converts the column into a DateType. You may optionally specify a format | ||
#' Converts the column into a date column. You may optionally specify a format | ||
#' according to the rules in: | ||
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}. | ||
#' If the string cannot be parsed according to the specified format (or default), | ||
#' the value of the column will be null. | ||
#' The default format is 'yyyy-MM-dd'. | ||
#' By default, it follows casting rules to a date if the format is omitted. | ||
#' | ||
#' @param x Column to parse. | ||
#' @param format string to use to parse x Column to DateType. (optional) | ||
#' @param format string to use to parse x column to a date column. (optional) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
|
||
#' | ||
#' @rdname to_date | ||
#' @name to_date | ||
|
@@ -1827,15 +1827,15 @@ setMethod("to_json", signature(x = "Column"), | |
|
||
#' to_timestamp | ||
#' | ||
#' Converts the column into a TimestampType. You may optionally specify a format | ||
#' Converts the column into a timestamp column. You may optionally specify a format | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is there more info on |
||
#' according to the rules in: | ||
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}. | ||
#' If the string cannot be parsed according to the specified format (or default), | ||
#' the value of the column will be null. | ||
#' The default format is 'yyyy-MM-dd HH:mm:ss'. | ||
#' | ||
#' @param x Column to parse. | ||
#' @param format string to use to parse x Column to DateType. (optional) | ||
#' @param format string to use to parse x column to a timestamp column. (optional) | ||
#' | ||
#' @rdname to_timestamp | ||
#' @name to_timestamp | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -144,12 +144,6 @@ def _(): | |
'measured in radians.', | ||
} | ||
|
||
_functions_2_2 = { | ||
'to_date': 'Converts a string date into a DateType using the (optionally) specified format.', | ||
'to_timestamp': 'Converts a string timestamp into a timestamp type using the ' + | ||
'(optionally) specified format.', | ||
} | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems not used. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually, instead of deleting this we should keep it and we should add
this is for doc tag There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. in fact, we might need this as a standalone fix for 2.2 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Up to my knowledge, these were for defining single argumented function that takes a column conveniently but we are defining them below already and both look taking additional format argument. Finally both look having the annotatioms correctly. Let me double check and address this comment if possible. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. actually, that's right - we don't need them - not sure if these are left behind from before |
||
# math functions that take two arguments as input | ||
_binary_mathfunctions = { | ||
'atan2': 'Returns the angle theta from the conversion of rectangular coordinates (x, y) to' + | ||
|
@@ -987,9 +981,10 @@ def months_between(date1, date2): | |
def to_date(col, format=None): | ||
"""Converts a :class:`Column` of :class:`pyspark.sql.types.StringType` or | ||
:class:`pyspark.sql.types.TimestampType` into :class:`pyspark.sql.types.DateType` | ||
using the optionally specified format. Default format is 'yyyy-MM-dd'. | ||
Specify formats according to | ||
using the optionally specified format. Specify formats according to | ||
`SimpleDateFormats <http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html>`_. | ||
By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. ditto, not sure if it's clear to python user with |
||
is omitted. | ||
|
||
>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t']) | ||
>>> df.select(to_date(df.t).alias('date')).collect() | ||
|
Original file line number | Diff line number | Diff line change | ||||||
---|---|---|---|---|---|---|---|---|
|
@@ -1148,13 +1148,6 @@ case class ToUTCTimestamp(left: Expression, right: Expression) | |||||||
/** | ||||||||
* Returns the date part of a timestamp or string. | ||||||||
*/ | ||||||||
@ExpressionDescription( | ||||||||
usage = "_FUNC_(expr) - Extracts the date part of the date or timestamp expression `expr`.", | ||||||||
extended = """ | ||||||||
Examples: | ||||||||
> SELECT _FUNC_('2009-07-30 04:17:52'); | ||||||||
2009-07-30 | ||||||||
""") | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems not used. |
||||||||
case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastInputTypes { | ||||||||
|
||||||||
// Implicit casting of spark will accept string in both date and timestamp format, as | ||||||||
|
@@ -1175,15 +1168,19 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn | |||||||
/** | ||||||||
* Parses a column to a date based on the given format. | ||||||||
*/ | ||||||||
// scalastyle:off line.size.limit | ||||||||
@ExpressionDescription( | ||||||||
usage = "_FUNC_(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.", | ||||||||
usage = """ | ||||||||
_FUNC_(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to | ||||||||
a date. Returns null with invalid input. By default, it follows casting rules to a date if | ||||||||
the `fmt` is omitted. | ||||||||
""", | ||||||||
extended = """ | ||||||||
Examples: | ||||||||
> SELECT _FUNC_('2009-07-30 04:17:52'); | ||||||||
2009-07-30 | ||||||||
> SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd'); | ||||||||
2016-12-31 | ||||||||
""") | ||||||||
// scalastyle:on line.size.limit | ||||||||
case class ParseToDate(left: Expression, format: Option[Expression], child: Expression) | ||||||||
extends RuntimeReplaceable { | ||||||||
|
||||||||
|
@@ -1212,22 +1209,27 @@ case class ParseToDate(left: Expression, format: Option[Expression], child: Expr | |||||||
/** | ||||||||
* Parses a column to a timestamp based on the supplied format. | ||||||||
*/ | ||||||||
// scalastyle:off line.size.limit | ||||||||
@ExpressionDescription( | ||||||||
usage = "_FUNC_(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.", | ||||||||
usage = """ | ||||||||
_FUNC_(timestamp[, fmt]) - Parses the `timestamp` expression with the `format` expression to | ||||||||
a timestamp. Returns null with invalid input. Default `fmt` is 'yyyy-MM-dd HH:mm:ss'. | ||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Here, you are assuming the default format is ISO. There are at least four common default formats, EUR, ISO, JIS and USA. Normally, the database has a register to store the default format. We cannot simply assume ISO is favorite for all the users. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We already use this as default format in the same APIs -
Do you have any suggestion that I could try? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Postgres has a single-argument to_timestamp function, but that is used to convert Unix epoch to timestamp. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also, it seems we documented them here and there Line 1835 in f21897f
spark/python/pyspark/sql/functions.py Line 1014 in 63d90e7
If the suggestion can be simply done with the format in There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The problem is the default values we choose. I am not sure whether we should simply choose ISO as the default value. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do understand your concern but I am not introducing the default value. It is already there in coressponding APIs. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. shall we follow There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure, let me give a shot. |
||||||||
""", | ||||||||
extended = """ | ||||||||
Examples: | ||||||||
> SELECT _FUNC_('2016-12-31 00:12:00'); | ||||||||
2016-12-31 00:12:00 | ||||||||
> SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd'); | ||||||||
2016-12-31 00:00:00.0 | ||||||||
2016-12-31 00:00:00 | ||||||||
""") | ||||||||
// scalastyle:on line.size.limit | ||||||||
case class ParseToTimestamp(left: Expression, format: Expression, child: Expression) | ||||||||
extends RuntimeReplaceable { | ||||||||
|
||||||||
def this(left: Expression, format: Expression) = { | ||||||||
this(left, format, Cast(UnixTimestamp(left, format), TimestampType)) | ||||||||
} | ||||||||
|
||||||||
def this(left: Expression) = this(left, Literal("yyyy-MM-dd HH:mm:ss")) | ||||||||
|
||||||||
override def flatArguments: Iterator[Any] = Iterator(left, format) | ||||||||
override def sql: String = s"$prettyName(${left.sql}, ${format.sql})" | ||||||||
|
||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2689,7 +2689,7 @@ object functions { | |
* @since 2.2.0 | ||
*/ | ||
def to_timestamp(s: Column): Column = withExpr { | ||
new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss")) | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. here we change the default value of the format string to be locale sensitive(same as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @rxin and @cloud-fan, I would rather take out the change here if this holds off this PR. This is essentially orthogonal with this PR. |
||
new ParseToTimestamp(s.expr) | ||
} | ||
|
||
/** | ||
|
@@ -2704,15 +2704,15 @@ object functions { | |
} | ||
|
||
/** | ||
* Converts the column into DateType. | ||
* Converts the column into `DateType` by casting rules to `DateType`. | ||
* | ||
* @group datetime_funcs | ||
* @since 1.5.0 | ||
*/ | ||
def to_date(e: Column): Column = withExpr { ToDate(e.expr) } | ||
def to_date(e: Column): Column = withExpr { new ParseToDate(e.expr) } | ||
|
||
/** | ||
* Converts the column into a DateType with a specified format | ||
* Converts the column into a `DateType` with a specified format | ||
* (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html]) | ||
* return null if fail. | ||
* | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks the default format is ...
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala
Lines 430 to 435 in 2269155
which looks used in the casting rule to a date type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there more info we could provide for R users, who might not know where to look for this "casting rules to a date"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I am not sure if i should write out all the contents above ... these format above look actually a bit informal to me (dose anyone know if I understood this correctly?) for a use of documentation. Do you have any good idea for a better description maybe ... ? Let me leave another comment while addressing the comments if I come up with a better idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, let me give a shot with adding an example -
cast(df$x, "date")
.