Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement #17901

Closed
wants to merge 7 commits into from
Closed
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
10 changes: 5 additions & 5 deletions R/pkg/R/functions.R
Original file line number Diff line number Diff line change
Expand Up @@ -1752,15 +1752,15 @@ setMethod("toRadians",

#' to_date
#'
#' Converts the column into a DateType. You may optionally specify a format
#' Converts the column into a date column. You may optionally specify a format
#' according to the rules in:
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}.
#' If the string cannot be parsed according to the specified format (or default),
#' the value of the column will be null.
#' The default format is 'yyyy-MM-dd'.
#' By default, it follows casting rules to a date if the format is omitted.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks the default format is ...

  • yyyy,
  • yyyy-[m]m
  • yyyy-[m]m-[d]d
  • yyyy-[m]m-[d]d
  • yyyy-[m]m-[d]d *
  • yyyy-[m]m-[d]dT*

* `yyyy`,
* `yyyy-[m]m`
* `yyyy-[m]m-[d]d`
* `yyyy-[m]m-[d]d `
* `yyyy-[m]m-[d]d *`
* `yyyy-[m]m-[d]dT*`

which looks used in the casting rule to a date type.

Copy link
Member

@felixcheung felixcheung May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there more info we could provide for R users, who might not know where to look for this "casting rules to a date"?

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 9, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I am not sure if i should write out all the contents above ... these format above look actually a bit informal to me (dose anyone know if I understood this correctly?) for a use of documentation. Do you have any good idea for a better description maybe ... ? Let me leave another comment while addressing the comments if I come up with a better idea.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, let me give a shot with adding an example - cast(df$x, "date").

#'
#' @param x Column to parse.
#' @param format string to use to parse x Column to DateType. (optional)
#' @param format string to use to parse x column to a date column. (optional)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Column is a type in R, so it's intentional to captialize it

#'
#' @rdname to_date
#' @name to_date
Expand Down Expand Up @@ -1827,15 +1827,15 @@ setMethod("to_json", signature(x = "Column"),

#' to_timestamp
#'
#' Converts the column into a TimestampType. You may optionally specify a format
#' Converts the column into a timestamp column. You may optionally specify a format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there more info on timestamp column? as it was, TimestampType we could say or reference the Scala/Spark type

#' according to the rules in:
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}.
#' If the string cannot be parsed according to the specified format (or default),
#' the value of the column will be null.
#' The default format is 'yyyy-MM-dd HH:mm:ss'.
#'
#' @param x Column to parse.
#' @param format string to use to parse x Column to DateType. (optional)
#' @param format string to use to parse x column to a timestamp column. (optional)
#'
#' @rdname to_timestamp
#' @name to_timestamp
Expand Down
11 changes: 3 additions & 8 deletions python/pyspark/sql/functions.py
Original file line number Diff line number Diff line change
Expand Up @@ -144,12 +144,6 @@ def _():
'measured in radians.',
}

_functions_2_2 = {
'to_date': 'Converts a string date into a DateType using the (optionally) specified format.',
'to_timestamp': 'Converts a string timestamp into a timestamp type using the ' +
'(optionally) specified format.',
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, instead of deleting this we should keep it and we should add

for _name, _doc in _functions_2_2.items():
    globals()[_name] = since(2.2)(_create_function(_name, _doc))

this is for doc tag

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in fact, we might need this as a standalone fix for 2.2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to my knowledge, these were for defining single argumented function that takes a column conveniently but we are defining them below already and both look taking additional format argument. Finally both look having the annotatioms correctly.

Let me double check and address this comment if possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually, that's right - we don't need them - not sure if these are left behind from before format parameter was added or something.

# math functions that take two arguments as input
_binary_mathfunctions = {
'atan2': 'Returns the angle theta from the conversion of rectangular coordinates (x, y) to' +
Expand Down Expand Up @@ -987,9 +981,10 @@ def months_between(date1, date2):
def to_date(col, format=None):
"""Converts a :class:`Column` of :class:`pyspark.sql.types.StringType` or
:class:`pyspark.sql.types.TimestampType` into :class:`pyspark.sql.types.DateType`
using the optionally specified format. Default format is 'yyyy-MM-dd'.
Specify formats according to
using the optionally specified format. Specify formats according to
`SimpleDateFormats <http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html>`_.
By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto, not sure if it's clear to python user with casting rules to :class:pyspark.sql.types.DateType

is omitted.

>>> df = spark.createDataFrame([('1997-02-28 10:30:00',)], ['t'])
>>> df.select(to_date(df.t).alias('date')).collect()
Expand Down
2 changes: 1 addition & 1 deletion python/pyspark/sql/tests.py
Original file line number Diff line number Diff line change
Expand Up @@ -2518,7 +2518,7 @@ def test_datetime_functions(self):
from datetime import date, datetime
df = self.spark.range(1).selectExpr("'2017-01-22' as dateCol")
parse_result = df.select(functions.to_date(functions.col("dateCol"))).first()
self.assertEquals(date(2017, 1, 22), parse_result['to_date(dateCol)'])
self.assertEquals(date(2017, 1, 22), parse_result['to_date(`dateCol`)'])

@unittest.skipIf(sys.version_info < (3, 3), "Unittest < 3.3 doesn't support mocking")
def test_unbounded_frames(self):
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1148,13 +1148,6 @@ case class ToUTCTimestamp(left: Expression, right: Expression)
/**
* Returns the date part of a timestamp or string.
*/
@ExpressionDescription(
usage = "_FUNC_(expr) - Extracts the date part of the date or timestamp expression `expr`.",
extended = """
Examples:
> SELECT _FUNC_('2009-07-30 04:17:52');
2009-07-30
""")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems not used.

case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastInputTypes {

// Implicit casting of spark will accept string in both date and timestamp format, as
Expand All @@ -1175,15 +1168,19 @@ case class ToDate(child: Expression) extends UnaryExpression with ImplicitCastIn
/**
* Parses a column to a date based on the given format.
*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.",
usage = """
_FUNC_(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
a date. Returns null with invalid input. By default, it follows casting rules to a date if
the `fmt` is omitted.
""",
extended = """
Examples:
> SELECT _FUNC_('2009-07-30 04:17:52');
2009-07-30
> SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd');
2016-12-31
""")
// scalastyle:on line.size.limit
case class ParseToDate(left: Expression, format: Option[Expression], child: Expression)
extends RuntimeReplaceable {

Expand Down Expand Up @@ -1212,22 +1209,27 @@ case class ParseToDate(left: Expression, format: Option[Expression], child: Expr
/**
* Parses a column to a timestamp based on the supplied format.
*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.",
usage = """
_FUNC_(timestamp[, fmt]) - Parses the `timestamp` expression with the `format` expression to
a timestamp. Returns null with invalid input. Default `fmt` is 'yyyy-MM-dd HH:mm:ss'.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, you are assuming the default format is ISO. There are at least four common default formats, EUR, ISO, JIS and USA. Normally, the database has a register to store the default format. We cannot simply assume ISO is favorite for all the users.

Copy link
Member

@gatorsmile gatorsmile May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already use this as default format in the same APIs -

new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss"))

Do you have any suggestion that I could try?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Postgres has a single-argument to_timestamp function, but that is used to convert Unix epoch to timestamp.

Copy link
Member Author

@HyukjinKwon HyukjinKwon May 8, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, it seems we documented them here and there

#' The default format is 'yyyy-MM-dd HH:mm:ss'.
in Python and
using the optionally specified format. Default format is 'yyyy-MM-dd HH:mm:ss'. Specify
in R

If the suggestion can be simply done with the format in SimpleDateFormat, I am willing to do this but in a quick look it looks not. Do you mind if we try this later in a separate PR?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is the default values we choose. I am not sure whether we should simply choose ISO as the default value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do understand your concern but I am not introducing the default value. It is already there in coressponding APIs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we follow to_date and using the casting rules if the format is not specified?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, let me give a shot.

""",
extended = """
Examples:
> SELECT _FUNC_('2016-12-31 00:12:00');
2016-12-31 00:12:00
> SELECT _FUNC_('2016-12-31', 'yyyy-MM-dd');
2016-12-31 00:00:00.0
2016-12-31 00:00:00
""")
// scalastyle:on line.size.limit
case class ParseToTimestamp(left: Expression, format: Expression, child: Expression)
extends RuntimeReplaceable {

def this(left: Expression, format: Expression) = {
this(left, format, Cast(UnixTimestamp(left, format), TimestampType))
}

def this(left: Expression) = this(left, Literal("yyyy-MM-dd HH:mm:ss"))

override def flatArguments: Iterator[Any] = Iterator(left, format)
override def sql: String = s"$prettyName(${left.sql}, ${format.sql})"

Expand Down
8 changes: 4 additions & 4 deletions sql/core/src/main/scala/org/apache/spark/sql/functions.scala
Original file line number Diff line number Diff line change
Expand Up @@ -2689,7 +2689,7 @@ object functions {
* @since 2.2.0
*/
def to_timestamp(s: Column): Column = withExpr {
new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here we change the default value of the format string to be locale sensitive(same as cast(col, timestamp)), is it ok? cc @rxin

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin and @cloud-fan, I would rather take out the change here if this holds off this PR. This is essentially orthogonal with this PR.

new ParseToTimestamp(s.expr)
}

/**
Expand All @@ -2704,15 +2704,15 @@ object functions {
}

/**
* Converts the column into DateType.
* Converts the column into `DateType` by casting rules to `DateType`.
*
* @group datetime_funcs
* @since 1.5.0
*/
def to_date(e: Column): Column = withExpr { ToDate(e.expr) }
def to_date(e: Column): Column = withExpr { new ParseToDate(e.expr) }

/**
* Converts the column into a DateType with a specified format
* Converts the column into a `DateType` with a specified format
* (see [http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html])
* return null if fail.
*
Expand Down
2 changes: 2 additions & 0 deletions sql/core/src/test/resources/sql-tests/inputs/datetime.sql
Original file line number Diff line number Diff line change
Expand Up @@ -2,3 +2,5 @@

-- [SPARK-16836] current_date and current_timestamp literals
select current_date = current_date(), current_timestamp = current_timestamp();

select to_timestamp('2016-12-31 00:12:00'), to_timestamp('2016-12-31', 'yyyy-MM-dd');
10 changes: 9 additions & 1 deletion sql/core/src/test/resources/sql-tests/results/datetime.sql.out
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 1
-- Number of queries: 2


-- !query 0
Expand All @@ -8,3 +8,11 @@ select current_date = current_date(), current_timestamp = current_timestamp()
struct<(current_date() = current_date()):boolean,(current_timestamp() = current_timestamp()):boolean>
-- !query 0 output
true true


-- !query 1
select to_timestamp('2016-12-31 00:12:00'), to_timestamp('2016-12-31', 'yyyy-MM-dd')
-- !query 1 schema
struct<to_timestamp('2016-12-31 00:12:00', 'yyyy-MM-dd HH:mm:ss'):timestamp,to_timestamp('2016-12-31', 'yyyy-MM-dd'):timestamp>
-- !query 1 output
2016-12-31 00:12:00 2016-12-31 00:00:00