Support for specifying custom date format for date and timestamp types. #280

Closed

Conversation

HyukjinKwon (Member)

https://github.com/databricks/spark-csv/issues/279
https://github.com/databricks/spark-csv/issues/262
https://github.com/databricks/spark-csv/issues/266

This PR adds support for specifying a custom date format for DateType and TimestampType.

For TimestampType, the given format is used both to infer the schema and to convert the values.
For DateType, the given format is used to convert the values.
If dateFormat is not given, parsing falls back to Timestamp.valueOf() and Date.valueOf() for backwards compatibility.
When it is given, SimpleDateFormat is used for parsing.
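
For reference, a minimal sketch of the fallback described above (not the PR's actual code; parseTimestamp is an illustrative helper):

  import java.sql.Timestamp
  import java.text.SimpleDateFormat

  // With no dateFormat, keep the old Timestamp.valueOf() behaviour;
  // with one, parse through SimpleDateFormat instead.
  def parseTimestamp(field: String, dateFormat: String = null): Timestamp =
    if (dateFormat == null) {
      Timestamp.valueOf(field)
    } else {
      new Timestamp(new SimpleDateFormat(dateFormat).parse(field).getTime)
    }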

In addition, IntegerType, DoubleType and LongType have a higher priority than TimestampType in type inference. This means that even if the given format is yyyy or yyyy.MM, the column will be inferred as IntegerType or DoubleType. Since this is only type inference, I think such precedence is acceptable.
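
Put together, reading with the new option would look roughly like this (a sketch; the schema, column names, and path are illustrative, and sqlContext is assumed to exist):

  import org.apache.spark.sql.types._

  val customSchema = StructType(Seq(
    StructField("id", IntegerType, nullable = true),
    StructField("created", TimestampType, nullable = true)))

  val df = sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("dateFormat", "dd/MM/yyyy HH:mm:ss") // any SimpleDateFormat pattern
    .schema(customSchema)
    .load("events.csv")

Note that with schema inference enabled, a column of values like 2014 would still be inferred as integers even with dateFormat set to yyyy, per the precedence above.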

@HyukjinKwon (Member Author)

@falaki Just to let you know, the original tryParse...() functions were mostly not modified, just moved into inferField() as inner functions in src/main/scala/com/databricks/spark/csv/util/InferSchema.scala.

@codecov-io

Current coverage is 86.66%

Merging #280 into master will increase coverage by +0.19% as of 89d6e92

@@            master    #280   diff @@
======================================
  Files           12      12       
  Stmts          525     540    +15
  Branches       155     160     +5
  Methods          0       0       
======================================
+ Hit            454     468    +14
  Partial          0       0       
- Missed          71      72     +1

Review entire Coverage Diff as of 89d6e92

Powered by Codecov. Updated on successful CI builds.

@@ -128,6 +128,8 @@ class DefaultSource
     val charset = parameters.getOrElse("charset", TextFile.DEFAULT_CHARSET.name())
     // TODO validate charset?
 
+    val dataFormat = parameters.getOrElse("charset", TextFile.DEFAULT_CHARSET.name())
Member:

This line needs to be removed.
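
(Presumably the intended line reads the new option instead; a hypothetical reconstruction, since the corrected line is not shown in this thread:)

  // Hypothetical fix: read the dateFormat option with a null default, where
  // null falls back to Timestamp.valueOf()/Date.valueOf() per the PR description.
  val dateFormat = parameters.getOrElse("dateFormat", null)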

-    nullValue: String = ""): DataType = {
+    nullValue: String = "",
+    dateFormatter: SimpleDateFormat = null): DataType = {
     def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
Member:

Indent is off for this entire block

HyukjinKwon (Member Author):

Um... do you mean the indentation correction below?

  • from
  private[csv] def inferField(typeSoFar: DataType,
    field: String,
    nullValue: String = "",
    dateFormatter: SimpleDateFormat = null): DataType = {
    def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
      IntegerType
    } else {
      tryParseLong(field)
    }
...
  • to
  private[csv] def inferField(typeSoFar: DataType,
    field: String,
    nullValue: String = "",
    dateFormatter: SimpleDateFormat = null): DataType = {
      def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
        IntegerType
      } else {
        tryParseLong(field)
      }
...

Member:

Oh I see. The problem is with the lines above:

  private[csv] def inferField(typeSoFar: DataType,
      field: String,
      nullValue: String = "",
      dateFormatter: SimpleDateFormat = null): DataType = {
    def tryParseInteger(field: String): DataType = if ((allCatch opt field.toInt).isDefined) {
      IntegerType
    } else {
      tryParseLong(field)
    }

HyukjinKwon (Member Author):

I see. Thanks!

@falaki (Member) commented Mar 4, 2016:

@HyukjinKwon left one more comment. Otherwise looks good. I can merge this before I cut the branch tonight.

@HyukjinKwon (Member Author)

@falaki (although this is not the right thread for this), what do you think about supporting uncompressed and none options for the compression codec, to explicitly set no compression?

This was merged in Spark (apache/spark#11464), and I was working on the same for this library.

However, I just realised that we might not need to mirror Spark exactly here; for example, Spark's CSV data source supports a compression option as an alias for codec, but compression is not accepted by this library.

If you think it is worth supporting uncompressed and none, I will submit a PR for it today.
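
(For reference, a sketch of how the proposal would look on the write path; the codec option already exists in this library, while none and uncompressed are only proposed here:)

  // Sketch only: "none" is the proposed value under discussion, not yet supported.
  df.write
    .format("com.databricks.spark.csv")
    .option("codec", "none") // explicitly request no compression
    .save("/tmp/output")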

@falaki (Member) commented Mar 4, 2016:

Let's open an issue for it.

falaki closed this in aa32be0 on Mar 4, 2016.

@barrybecker4

Thank you for adding this! I will pull and build a local snapshot until 1.4.0 officially releases.
(Minor: there is a typo in the doc. "specificy" should be "specify".)

@HyukjinKwon (Member Author)

@barrybecker4 Would you mind creating a PR for that typo (and any other typos you know of)?

@5ean commented May 4, 2016:

I still face an error in Python. Maybe I did not use it correctly. Please advise.
df = sqlContext.read.format('com.databricks.spark.csv').options(header='true', dateFormat='dd/MM/yyyy hh:mm:ss').load('test.csv', schema=schema)

The schema has a timestamp type.

And the string in the CSV file is "25/02/2014 00:00:00".

Exception:

Caused by: java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

@raam86 commented Jul 6, 2017:

The docs are not clear on how to do this, so hopefully this can help:

https://stackoverflow.com/questions/43259485/how-to-load-csvs-with-timestamps-in-custom-format

spark.read
      .option("header", true)
      .option("inferSchema", true)
      .option("timestampFormat", "MM/dd/yyyy h:mm:ss a")
      .csv("PATH_TO_CSV")
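
(For comparison, that snippet uses Spark 2.x's built-in CSV reader, where the option is named timestampFormat. With this library, spark-csv 1.x, the equivalent would use the dateFormat option added by this PR; a sketch, with the path as a placeholder:)

  sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .option("dateFormat", "MM/dd/yyyy h:mm:ss a")
    .load("PATH_TO_CSV")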

@shatestest

This is not working as expected: https://stackoverflow.com/questions/55965978/how-to-set-jdbc-partitioncolumn-type-to-date-in-spark-2-4-1/55966481#55966481
