Support for specifying custom date format for date and timestamp types. #280
Conversation
@falaki Just to let you know, the original functions,
@@ -128,6 +128,8 @@ class DefaultSource
     val charset = parameters.getOrElse("charset", TextFile.DEFAULT_CHARSET.name())
     // TODO validate charset?

+    val dataFormat = parameters.getOrElse("charset", TextFile.DEFAULT_CHARSET.name())
This line needs to be removed.
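(Side note for context: the flagged line reads the charset option into a dataFormat value, so it looks like a copy-paste leftover. If a date-format option were read here at all, one would expect something closer to the sketch below; the option name and the null default are assumptions, not taken from this diff.)

    val dateFormat = parameters.getOrElse("dateFormat", null)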
-      nullValue: String = ""): DataType = {
+      nullValue: String = "",
+      dateFormatter: SimpleDateFormat = null): DataType = {
       def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
Indent is off for this entire block
Um.. Do you mean the indentation correction as below?
- from

  private[csv] def inferField(typeSoFar: DataType,
      field: String,
      nullValue: String = "",
      dateFormatter: SimpleDateFormat = null): DataType = {
    def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
      IntegerType
    } else {
      tryParseLong(field)
    }
    ...

- to

  private[csv] def inferField(typeSoFar: DataType,
      field: String,
      nullValue: String = "",
      dateFormatter: SimpleDateFormat = null): DataType = {
    def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
      IntegerType
    } else {
      tryParseLong(field)
    }
    ...
Oh I see. The problem is with the lines above:
  private[csv] def inferField(typeSoFar: DataType,
      field: String,
      nullValue: String = "",
      dateFormatter: SimpleDateFormat = null): DataType = {
    def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
      IntegerType
    } else {
      tryParseLong(field)
    }
I see. Thanks!
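(For readers skimming this thread: the diff above threads a dateFormatter through the schema-inference code. The sketch below shows the general idea of how such a parameter is typically used; the helper name and the StringType fallback are illustrative assumptions, not the library's actual code.)

import java.sql.Timestamp
import java.text.SimpleDateFormat
import scala.util.control.Exception.allCatch
import org.apache.spark.sql.types.{DataType, StringType, TimestampType}

def tryParseTimestamp(field: String, dateFormatter: SimpleDateFormat = null): DataType = {
  val parsed =
    if (dateFormatter != null) allCatch opt dateFormatter.parse(field)  // user-supplied pattern
    else allCatch opt Timestamp.valueOf(field)  // default: yyyy-mm-dd hh:mm:ss[.fffffffff]
  if (parsed.isDefined) TimestampType else StringType  // StringType stands in for the real fallback
}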
@HyukjinKwon left one more comment. Otherwise looks good. I can merge this before I cut the branch tonight.
@falaki (although this is not the right thread to say this), what do you think about the feature merged in Spark in apache/spark#11464? I was working on it for this library as well. However, I just realised that this might not be something we must do identically here. Do you think it is good to support it in this library too?
Let's open an issue for it.
Thank you for adding this! I will pull and build a local snapshot until 1.4.0 officially releases.
@barrybecker4 Would you maybe create a PR for that typo (and for any other typos you know of)?
I still face an error in Python; maybe I did not use it correctly. Please kindly advise. The schema has a timestamp type, and the string in the CSV file is "25/02/2014 00:00:00". Exception: Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
The docs are not clear on how to do this, so hopefully this can help: https://stackoverflow.com/questions/43259485/how-to-load-csvs-with-timestamps-in-custom-format
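(A minimal sketch in Scala of what that answer boils down to for spark-csv 1.4+, which is where the dateFormat option from this PR lives. Without dateFormat, timestamp strings go through Timestamp.valueOf(), which only accepts yyyy-mm-dd hh:mm:ss[.fffffffff]; that is exactly the exception quoted above. The file and column names here are made up.)

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StringType, StructField, StructType, TimestampType}

// assuming a spark-shell session where `sc` is already defined
val sqlContext = new SQLContext(sc)

val schema = StructType(Seq(
  StructField("event", StringType, nullable = true),
  StructField("ts", TimestampType, nullable = true)))

// "25/02/2014 00:00:00" matches the SimpleDateFormat pattern dd/MM/yyyy HH:mm:ss
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy HH:mm:ss")
  .schema(schema)
  .load("events.csv")

The same idea applies from Python: pass the dateFormat option to the com.databricks.spark.csv reader (or, on Spark 2.x's built-in CSV source, the timestampFormat option described in the linked answer).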
This is not working as expected: https://stackoverflow.com/questions/55965978/how-to-set-jdbc-partitioncolumn-type-to-date-in-spark-2-4-1/55966481#55966481
https://github.com/databricks/spark-csv/issues/279
https://github.com/databricks/spark-csv/issues/262
https://github.com/databricks/spark-csv/issues/266
This PR adds support for specifying a custom date format for DateType and TimestampType.

For TimestampType, this uses the given format to infer the schema and also to convert the values. For DateType, this uses the given format to convert the values.

If dateFormat is not given, then it falls back to Timestamp.valueOf() and Date.valueOf() for backwards compatibility. When it is given, it uses SimpleDateFormat for parsing the data.

In addition, IntegerType, DoubleType and LongType have a higher priority than TimestampType in type inference. This means that even if the given format is yyyy or yyyy.MM, the column will be inferred as IntegerType or DoubleType. Since it is type inference, I think it is okay to give such precedence.
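(To make the described behaviour concrete, here is a small usage sketch against spark-csv 1.4+; the file name, column contents and master URL are illustrative assumptions.)

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("dateFormat-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)

// Because IntegerType, LongType and DoubleType take precedence over TimestampType during
// inference, a column of values like "2015" or "2015.10" stays numeric even under a
// dateFormat of "yyyy" or "yyyy.MM"; a column of values like "25/02/2014 00:00:00" with
// dateFormat "dd/MM/yyyy HH:mm:ss" is inferred as TimestampType.
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("dateFormat", "dd/MM/yyyy HH:mm:ss")
  .load("events.csv")

df.printSchema()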