Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON #22951

Closed
wants to merge 12 commits into from

Conversation

MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Nov 5, 2018

What changes were proposed in this pull request?

In the PR, I propose to add new option locale into CSVOptions/JSONOptions to make parsing date/timestamps in local languages possible. Currently the locale is hard coded to Locale.US.

How was this patch tested?

Added two tests for parsing a date from CSV/JSON - ноя 2018.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 5, 2018

I will update docs soon.

@SparkQA
Copy link

SparkQA commented Nov 5, 2018

Test build #98489 has finished for PR 22951 at commit 41154bd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 6, 2018

@HyukjinKwon @dongjoon-hyun Please, review the changes.

@SparkQA
Copy link

SparkQA commented Nov 6, 2018

Test build #98507 has finished for PR 22951 at commit 93da760.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Looks good. I or someone else should take a closer look before getting this in.

@dongjoon-hyun
Copy link
Member

Could you rebase this once again, @MaxGekk ?

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
@SparkQA
Copy link

SparkQA commented Nov 7, 2018

Test build #98541 has finished for PR 22951 at commit 6ab8501.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Nov 7, 2018

Test build #98543 has finished for PR 22951 at commit 6ab8501.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 7, 2018

Test build #98567 has finished for PR 22951 at commit 759bca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM.

@dongjoon-hyun
Copy link
Member

Could you take a look once more, @HyukjinKwon ?

@HyukjinKwon
Copy link
Member

OMG, what does ноя 2018 mean BTW? haha

@@ -446,6 +450,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
If None is set, it uses the default value, ``1.0``.
:param emptyValue: sets the string representation of an empty value. If None is set, it uses
the default value, empty string.
:param locale: sets a locale as language tag in IETF BCP 47 format. If None is set,
it uses the default value, ``en-US``. For instance, ``locale`` is used while
parsing dates and timestamps.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ideally we should apply to decimal parsing too actually. But yea we can leave it separate.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems parsing decimals using locale will be slightly tricky in JSON case because we leave this to Jackson by calling its method getCurrentToken and getDecimalValue, and I haven't found how to pass locale to it. Probably we will need a custom deserialiser?

In the CSV case, it should be easier since we convert strings ourselves. I will try to do that for CSV first of all when this PR be merged.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the PR for parsing decimals from CSV: #22979

HyukjinKwon
HyukjinKwon approved these changes Nov 8, 2018
@@ -349,7 +353,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
samplingRatio=None, enforceSchema=None, emptyValue=None):
samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add emptyValue in streaming.py in the same separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems it exists in streaming.py:

enforceSchema=None, emptyValue=None):

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 8, 2018

OMG, what does ноя 2018 mean BTW? haha

It is 3 letters prefix of Ноябрь which is November in Russian.

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
#	sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98583 has finished for PR 22951 at commit 8834b4b.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 8, 2018

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98587 has finished for PR 22951 at commit 8834b4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Nov 8, 2018

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98591 has finished for PR 22951 at commit 8834b4b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

retest this please

@SparkQA
Copy link

SparkQA commented Nov 8, 2018

Test build #98598 has finished for PR 22951 at commit 8834b4b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

Merged to master.

@HyukjinKwon
Copy link
Member

Actually let me leave a cc for @srowen. I remember we talked about it before.

@asfgit asfgit closed this in 79551f5 Nov 9, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…SV/JSON

## What changes were proposed in this pull request?

In the PR, I propose to add new option `locale` into CSVOptions/JSONOptions to make parsing date/timestamps in local languages possible. Currently the locale is hard coded to `Locale.US`.

## How was this patch tested?

Added two tests for parsing a date from CSV/JSON - `ноя 2018`.

Closes apache#22951 from MaxGekk/locale.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@MaxGekk MaxGekk deleted the locale branch August 17, 2019 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants