[SPARK-25945][SQL] Support locale while parsing date/timestamp from CSV/JSON #22951
I will update docs soon.
Test build #98489 has finished for PR 22951 at commit
@HyukjinKwon @dongjoon-hyun Please, review the changes.
Test build #98507 has finished for PR 22951 at commit
Looks good. I or someone else should take a closer look before getting this in.
Could you rebase this once again, @MaxGekk?
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
Test build #98541 has finished for PR 22951 at commit
retest this please
Test build #98543 has finished for PR 22951 at commit
sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala
sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/CsvExpressionsSuite.scala
Test build #98567 has finished for PR 22951 at commit
+1, LGTM.
Could you take a look once more, @HyukjinKwon?
OMG, what does
@@ -446,6 +450,9 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
                                If None is set, it uses the default value, ``1.0``.
         :param emptyValue: sets the string representation of an empty value. If None is set, it uses
                            the default value, empty string.
+        :param locale: sets a locale as language tag in IETF BCP 47 format. If None is set,
+                       it uses the default value, ``en-US``. For instance, ``locale`` is used while
+                       parsing dates and timestamps.
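For illustration, a minimal sketch of how the new `locale` keyword might be passed from PySpark, based only on the `csv(...)` signature shown in this diff. The helper names, the column name `d`, and the choice of `ru-RU` with the pattern `MMM yyyy` are assumptions made for the example; whether a given month abbreviation parses depends on the locale data of the JVM in use.

```python
def locale_date_options(locale="ru-RU", date_format="MMM yyyy"):
    """Build CSV reader options; `locale` is an IETF BCP 47 language tag.

    Hypothetical helper, not part of PySpark.
    """
    return {"locale": locale, "dateFormat": date_format}


def read_localized_dates(spark, path):
    """Read a one-column CSV of dates such as 'ноя 2018' (hypothetical helper)."""
    # Imported lazily so locale_date_options() works without PySpark installed.
    from pyspark.sql.types import DateType, StructField, StructType

    schema = StructType([StructField("d", DateType(), True)])
    # `locale` and `dateFormat` are keyword arguments of DataFrameReader.csv
    # once this change is in; here they are supplied from the options dict.
    return spark.read.csv(path, schema=schema, **locale_date_options())
```

Leaving `locale` unset keeps the documented default, `en-US`.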
I think ideally we should apply to decimal parsing too actually. But yea we can leave it separate.
It seems parsing decimals using `locale` will be slightly tricky in the JSON case because we leave this to Jackson by calling its methods `getCurrentToken` and `getDecimalValue`, and I haven't found how to pass a locale to it. Probably we will need a custom deserialiser?
In the CSV case, it should be easier since we convert strings ourselves. I will try to do that for CSV first, once this PR is merged.
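To make the CSV point concrete, here is a rough sketch of converting a decimal string when the separators are known. It is plain Python for illustration only, not the Spark code; `parse_decimal` and its parameters are hypothetical, and a real implementation would derive the separators from the locale rather than take them as arguments.

```python
from decimal import Decimal


def parse_decimal(text, group_sep=",", decimal_sep="."):
    """Parse a decimal string that uses locale-specific separators.

    Hypothetical helper: the separators are passed in explicitly instead of
    being looked up from a locale.
    """
    # Drop grouping separators first, e.g. "1.000,5" -> "1000,5" for de-DE style.
    cleaned = text.strip().replace(group_sep, "")
    # Normalise the decimal separator to the "." that Decimal expects.
    if decimal_sep != ".":
        cleaned = cleaned.replace(decimal_sep, ".")
    return Decimal(cleaned)
```

The order matters: grouping separators are removed before the decimal separator is normalised, so a locale that groups with `.` does not clash with the `.` expected by `Decimal`.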
Here is the PR for parsing decimals from CSV: #22979
@@ -349,7 +353,7 @@ def csv(self, path, schema=None, sep=None, encoding=None, quote=None, escape=Non
             negativeInf=None, dateFormat=None, timestampFormat=None, maxColumns=None,
             maxCharsPerColumn=None, maxMalformedLogPerPartition=None, mode=None,
             columnNameOfCorruptRecord=None, multiLine=None, charToEscapeQuoteEscaping=None,
-            samplingRatio=None, enforceSchema=None, emptyValue=None):
+            samplingRatio=None, enforceSchema=None, emptyValue=None, locale=None):
Let's add `emptyValue` in `streaming.py` in the same separate PR.
It seems it exists in `streaming.py`:
spark/python/pyspark/sql/streaming.py
Line 567 in 08c76b5
enforceSchema=None, emptyValue=None):
It is a 3-letter prefix of
# Conflicts: # sql/core/src/test/scala/org/apache/spark/sql/CsvFunctionsSuite.scala # sql/core/src/test/scala/org/apache/spark/sql/JsonFunctionsSuite.scala
Test build #98583 has finished for PR 22951 at commit
jenkins, retest this, please
Test build #98587 has finished for PR 22951 at commit
jenkins, retest this, please
Test build #98591 has finished for PR 22951 at commit
retest this please
Test build #98598 has finished for PR 22951 at commit
Merged to master.
Actually let me leave a cc for @srowen. I remember we talked about it before.
…SV/JSON

## What changes were proposed in this pull request?

In the PR, I propose to add new option `locale` into CSVOptions/JSONOptions to make parsing date/timestamps in local languages possible. Currently the locale is hard coded to `Locale.US`.

## How was this patch tested?

Added two tests for parsing a date from CSV/JSON - `ноя 2018`.

Closes apache#22951 from MaxGekk/locale.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
## What changes were proposed in this pull request?

In the PR, I propose to add new option `locale` into CSVOptions/JSONOptions to make parsing date/timestamps in local languages possible. Currently the locale is hard coded to `Locale.US`.

## How was this patch tested?

Added two tests for parsing a date from CSV/JSON - `ноя 2018`.
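
As a rough illustration of why the locale matters for a value like `ноя 2018`, the sketch below parses a "month year" string against a per-locale table of month abbreviations. This is plain Python with a deliberately tiny, hypothetical table (`MONTH_ABBREVIATIONS`) and a hypothetical helper (`parse_month_year`); Spark itself relies on the JVM's locale data, not on anything like this.

```python
from datetime import date

# Hypothetical, intentionally incomplete tables keyed by BCP 47 language tag.
MONTH_ABBREVIATIONS = {
    "en-US": {"Nov": 11, "Dec": 12},
    "ru-RU": {"ноя": 11, "дек": 12},
}


def parse_month_year(text, locale="en-US"):
    """Parse strings like 'Nov 2018' or 'ноя 2018' into the first day of that month."""
    month_name, year = text.split()
    months = MONTH_ABBREVIATIONS[locale]
    if month_name not in months:
        # With the wrong locale the month name is simply not recognised.
        raise ValueError(f"cannot parse {text!r} with locale {locale}")
    return date(int(year), months[month_name], 1)
```

With the locale fixed to `en-US`, the Russian abbreviation is rejected, which mirrors the limitation of a hard-coded `Locale.US` described above.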