
[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement #17901

Closed
wants to merge 7 commits

Conversation

@HyukjinKwon (Member) commented May 8, 2017

What changes were proposed in this pull request?

This PR proposes the following three changes:

  • Use the casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).

  • Support a single argument for `to_timestamp`, similarly to the APIs in the other languages.

    For example, the one below works

    import org.apache.spark.sql.functions._
    Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
    

    prints

    +----------------------------------------+
    |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
    +----------------------------------------+
    |                     2016-12-31 00:12:00|
    +----------------------------------------+
    

    whereas this does not work in SQL.

    Before

    spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
    Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
    

    After

    spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
    2016-12-31 00:12:00
    
  • Improve the related documentation for the SQL function descriptions and the other API descriptions accordingly.

    Before

    spark-sql> DESCRIBE FUNCTION extended to_date;
    ...
    Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
    Extended Usage:
        Examples:
          > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
           2016-12-31
    
    spark-sql> DESCRIBE FUNCTION extended to_timestamp;
    ...
    Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
    Extended Usage:
        Examples:
          > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
           2016-12-31 00:00:00.0
    

    After

    spark-sql> DESCRIBE FUNCTION extended to_date;
    ...
    Usage:
        to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
          a date. Returns null with invalid input. By default, it follows casting rules to a date if
          the `fmt` is omitted.
    
    Extended Usage:
        Examples:
          > SELECT to_date('2009-07-30 04:17:52');
           2009-07-30
          > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
           2016-12-31
    
    spark-sql> DESCRIBE FUNCTION extended to_timestamp;
    ...
     Usage:
        to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
          a timestamp. Returns null with invalid input. By default, it follows casting rules to
          a timestamp if the `fmt` is omitted.
    
    Extended Usage:
        Examples:
          > SELECT to_timestamp('2016-12-31 00:12:00');
           2016-12-31 00:12:00
          > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
           2016-12-31 00:00:00
    

How was this patch tested?

Added tests in `datetime.sql`.
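
For reference, here is a minimal spark-shell sketch of the behaviour those tests exercise (an illustrative sketch only, assuming a session where `spark` is in scope - not the literal contents of `datetime.sql`):

```
// Single argument: follows the casting rules to a timestamp (added by this PR).
spark.sql("SELECT to_timestamp('2016-12-31 00:12:00')").show(false)
// Two arguments: unchanged, parses with the given format.
spark.sql("SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd')").show(false)
```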

#' according to the rules in:
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}.
#' If the string cannot be parsed according to the specified format (or default),
#' the value of the column will be null.
#' The default format is 'yyyy-MM-dd'.
#' By default, it follows casting rules to a date if the format is omitted.
Member Author

It looks like the default formats accepted are:

  • `yyyy`
  • `yyyy-[m]m`
  • `yyyy-[m]m-[d]d`
  • `yyyy-[m]m-[d]d `
  • `yyyy-[m]m-[d]d *`
  • `yyyy-[m]m-[d]dT*`

which appear to be the patterns used in the casting rule to a date type.
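
As a quick illustration of those shapes, a hedged spark-shell sketch (assuming a session where `spark` is in scope):

```
// Each of these strings matches one of the shapes above, so the cast succeeds;
// strings matching none of them are expected to come back as null.
spark.sql("SELECT CAST('2016' AS DATE), CAST('2016-12' AS DATE), CAST('2016-12-31' AS DATE)").show(false)
```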

@felixcheung (Member) commented May 9, 2017

is there more info we could provide for R users, who might not know where to look for this "casting rules to a date"?

@HyukjinKwon (Member Author) commented May 9, 2017

Actually, I am not sure if I should write out all of the contents above ... the formats above look a bit informal to me for use in documentation (does anyone know whether I understood this correctly?). Do you have any ideas for a better description? Let me leave another comment while addressing the review comments if I come up with a better idea.

Member Author

Ah, let me give it a shot by adding an example - `cast(df$x, "date")`.

'to_timestamp': 'Converts a string timestamp into a timestamp type using the ' +
'(optionally) specified format.',
}

Member Author

This does not seem to be used.

Member

actually, instead of deleting this we should keep it and we should add

for _name, _doc in _functions_2_2.items():
    globals()[_name] = since(2.2)(_create_function(_name, _doc))

this is for doc tag

Member

in fact, we might need this as a standalone fix for 2.2

Member Author

To my knowledge, these were for conveniently defining single-argument functions that take a column, but we already define them below and both appear to take an additional format argument. Also, both appear to have the annotations set correctly.

Let me double check and address this comment if possible.

Member

actually, that's right - we don't need them. Not sure if these were left behind from before the format parameter was added or something.

Examples:
> SELECT _FUNC_('2009-07-30 04:17:52');
2009-07-30
""")
Member Author

This does not seem to be used.

usage = "_FUNC_(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.",
usage = """
_FUNC_(timestamp[, fmt]) - Parses the `timestamp` expression with the `format` expression to
a timestamp. Returns null with invalid input. Default `fmt` is 'yyyy-MM-dd HH:mm:ss'.
Member

Here, you are assuming the default format is ISO. There are at least four common default formats: EUR, ISO, JIS and USA. Normally, a database has a register to store the default format. We cannot simply assume ISO is the preferred format for all users.


@HyukjinKwon (Member Author) commented May 8, 2017

We already use this as the default format in the same APIs:

new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss"))

Do you have any suggestions that I could try?
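
To make the quoted default concrete, a minimal sketch (assuming a spark-shell session) of the equivalence it implies - before this change, the single-argument overload simply filled in that fixed pattern:

```
import org.apache.spark.sql.functions._
val df = Seq("2016-12-31 00:12:00").toDF("a")
// These two calls were equivalent with the old fixed default.
df.select(to_timestamp(col("a"))).show(false)
df.select(to_timestamp(col("a"), "yyyy-MM-dd HH:mm:ss")).show(false)
```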

Member

Postgres has a single-argument to_timestamp function, but it is used to convert a Unix epoch to a timestamp.

@HyukjinKwon (Member Author) commented May 8, 2017

Also, it seems we documented them here and there:

    #' The default format is 'yyyy-MM-dd HH:mm:ss'.

in Python and

    using the optionally specified format. Default format is 'yyyy-MM-dd HH:mm:ss'. Specify

in R.

If the suggestion could simply be done with a SimpleDateFormat format, I would be willing to do it, but at a quick look it seems it cannot. Do you mind if we try this later in a separate PR?


Member

The problem is the default value we choose. I am not sure whether we should simply choose ISO as the default.

Member Author

I do understand your concern, but I am not introducing the default value. It is already there in the corresponding APIs.

Contributor

shall we follow to_date and use the casting rules if the format is not specified?

Member Author

Sure, let me give it a shot.
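
Roughly, that suggestion means the single-argument form should accept whatever a plain cast accepts, instead of one fixed pattern. A hedged sketch of the intended equivalence (assuming a spark-shell session; a sketch of intent, not of the implementation):

```
// After the change, both are expected to parse, because the string matches the
// casting rules to a timestamp even though it does not match 'yyyy-MM-dd HH:mm:ss'.
spark.sql("SELECT CAST('2016-12-31T00:12:00' AS TIMESTAMP)").show(false)
spark.sql("SELECT to_timestamp('2016-12-31T00:12:00')").show(false)
```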

@SparkQA commented May 8, 2017

Test build #76583 has finished for PR 17901 at commit 9fcf107.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 8, 2017

Test build #76584 has finished for PR 17901 at commit 9f75b4c.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 8, 2017

Test build #76586 has finished for PR 17901 at commit f8921f4.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 8, 2017

Test build #76585 has finished for PR 17901 at commit 45bf353.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member Author)

@gatorsmile and @hvanhovell, I am not introducing the default value; I am using the same default value that is already there and that we have documented in several places.

This PR simply supports the single-argument to_timestamp function in SQL, consistent with the corresponding to_timestamp functions elsewhere in Spark.

Changing the default value sounds orthogonal to me and does not look like something that should block this PR.

#'
#' @param x Column to parse.
#' @param format string to use to parse x Column to DateType. (optional)
#' @param format string to use to parse x column to a date column. (optional)
Member

Column is a type in R, so it's intentional to capitalize it.

@@ -1827,15 +1827,15 @@ setMethod("to_json", signature(x = "Column"),

#' to_timestamp
#'
#' Converts the column into a TimestampType. You may optionally specify a format
#' Converts the column into a timestamp column. You may optionally specify a format
Member

Is there more info on the timestamp column? As it was, we could say TimestampType or reference the Scala/Spark type.


`SimpleDateFormats <http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html>`_.
By default, it follows casting rules to :class:`pyspark.sql.types.DateType` if the format
Member

ditto, not sure if it's clear to a Python user what "casting rules to :class:`pyspark.sql.types.DateType`" means

@SparkQA commented May 9, 2017

Test build #76603 has finished for PR 17901 at commit b2d3b0a.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung (Member)

you can rebase to pick up the fix for the R tests

@HyukjinKwon (Member Author)

Thank you, everybody. Let me try to address the comments.

@SparkQA commented May 9, 2017

Test build #76636 has started for PR 17901 at commit 497a229.

@SparkQA commented May 9, 2017

Test build #76638 has started for PR 17901 at commit b6f867c.

@@ -1757,7 +1757,8 @@ setMethod("toRadians",
#' \url{http://docs.oracle.com/javase/tutorial/i18n/format/simpleDateFormat.html}.
#' If the string cannot be parsed according to the specified format (or default),
#' the value of the column will be null.
#' The default format is 'yyyy-MM-dd'.
#' By default, it follows casting rules to a DateType if the format is omitted
#' (equivalent with \code{cast(df$x, "date")}).
Member Author

@felixcheung, I added an example here. Would this be enough?

@SparkQA commented May 9, 2017

Test build #76641 has started for PR 17901 at commit fc02460.

@HyukjinKwon (Member Author)

retest this please

@SparkQA commented May 9, 2017

Test build #76659 has finished for PR 17901 at commit fc02460.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 9, 2017

Test build #76673 has finished for PR 17901 at commit b038927.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* @group datetime_funcs
* @since 2.2.0
*/
def to_timestamp(s: Column): Column = withExpr {
new ParseToTimestamp(s.expr, Literal("yyyy-MM-dd HH:mm:ss"))
Contributor

here we change the default value of the format string to be locale-sensitive (the same as cast(col, timestamp)). Is that OK? cc @rxin
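
For comparison, a hedged sketch (assuming a spark-shell session) of the two paths being discussed here - the casting-rules path that the single-argument form now follows, and the explicit-format path that is unchanged:

```
import org.apache.spark.sql.functions._
val df = Seq("2016-12-31 00:12:00.00").toDF("a")
// Casting-rules path (what cast to timestamp does; now also the default for to_timestamp(col)).
df.select(col("a").cast("timestamp")).show(false)
// Explicit-format path, unaffected by this change.
df.select(to_timestamp(col("a"), "yyyy-MM-dd HH:mm:ss")).show(false)
```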

Member Author

@rxin and @cloud-fan, I would rather take this change out if it holds up this PR. It is essentially orthogonal to this PR.

@cloud-fan (Contributor)

thanks, merging to master! @rxin, if you have any concerns about the default value, we can change it later.

@asfgit closed this in 720708c on May 12, 2017
@HyukjinKwon (Member Author)

Thank you so much.

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017

[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement

Author: hyukjinkwon <[email protected]>

Closes apache#17901 from HyukjinKwon/to_timestamp_arg.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017

[SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement

Author: hyukjinkwon <[email protected]>

Closes apache#17901 from HyukjinKwon/to_timestamp_arg.
@HyukjinKwon deleted the to_timestamp_arg branch January 2, 2018 03:42