-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions #17693
Conversation
@ewasserman Could you fix the title to point out the JIRA In terms of behaviour, I see @marmbrus and @srowen in that JIRA. Please let me cc. |
@@ -149,7 +149,7 @@ case class GetJsonObject(json: Expression, path: Expression) | |||
|
|||
if (parsed.isDefined) { | |||
try { | |||
Utils.tryWithResource(jsonFactory.createParser(jsonStr.getBytes)) { parser => | |||
Utils.tryWithResource(jsonFactory.createParser(jsonStr.toString)) { parser => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should catch the exception below rather than introducing unnecessary string instance apparently. (Sorry, I left a comment and removed back. I thought it was from_json
/to_json
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you check whether the other exceptions could be thrown by the illegal outputs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Keep the jsonStr.getBytes
unchanged.
/**
* Method for constructing parser for parsing
* the contents of given byte array.
*
* @since 2.1
*/
public JsonParser createParser(byte[] data) throws IOException, JsonParseException {
IOContext ctxt = _createContext(data, true);
if (_inputDecorator != null) {
InputStream in = _inputDecorator.decorate(ctxt, data, 0, data.length);
if (in != null) {
return _createParser(in, ctxt);
}
}
return _createParser(data, 0, data.length, ctxt);
}
I think we should capture both IOException
and JsonParseException
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought JsonParseException
extends IOException
.
Could you add a test case in |
Not suggesting doing it in this PR but maybe adding a SQL option to let the users choose the error handling strategy of all the JSON functions probably makes more sense here? The Spark JSON data source allows users to choose a parsing mode among:
|
@liancheng Good suggestion! Just like what we did for |
I like the idea but I am not sure of I think we don't explicitly support parse modes in both spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala Line 551 in 4658183
It sets |
@@ -393,7 +393,7 @@ case class JsonTuple(children: Seq[Expression]) | |||
} | |||
|
|||
try { | |||
Utils.tryWithResource(jsonFactory.createParser(json.getBytes)) { | |||
Utils.tryWithResource(jsonFactory.createParser(json.toString)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this change has performance penalty, we should just catch more exception below.
Reverted from use of toString on the org.apache.spark.unsafe.types.UTF8String by running the byte array through a java.io.Reader. This still fixes the bug and is also more efficient on the JSON parser side so it is a net performance win as well. |
@@ -149,7 +149,8 @@ case class GetJsonObject(json: Expression, path: Expression) | |||
|
|||
if (parsed.isDefined) { | |||
try { | |||
Utils.tryWithResource(jsonFactory.createParser(jsonStr.getBytes)) { parser => | |||
Utils.tryWithResource(jsonFactory.createParser(new InputStreamReader( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please add some comments to say that, this is to avoid a bug in encoding detection, and we explicitly specify the encoding(UTF8) here.
LGTM |
ok to test |
Test build #76068 has finished for PR 17693 at commit
|
Test build #76153 has finished for PR 17693 at commit
|
…nctions ## What changes were proposed in this pull request? change to using Jackson's `com.fasterxml.jackson.core.JsonFactory` public JsonParser createParser(String content) ## How was this patch tested? existing unit tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Eric Wasserman <[email protected]> Closes #17693 from ewasserman/SPARK-20314. (cherry picked from commit 57e1da3) Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.2 |
…nToStructs ## What changes were proposed in this pull request? A fix for the same problem was made in #17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`. ## How was this patch tested? Regression test Author: Burak Yavuz <[email protected]> Closes #17826 from brkyvz/SPARK-20549. (cherry picked from commit 86174ea) Signed-off-by: Wenchen Fan <[email protected]>
…nToStructs ## What changes were proposed in this pull request? A fix for the same problem was made in apache#17693 but ignored `JsonToStructs`. This PR uses the same fix for `JsonToStructs`. ## How was this patch tested? Regression test Author: Burak Yavuz <[email protected]> Closes apache#17826 from brkyvz/SPARK-20549.
…nctions ## What changes were proposed in this pull request? change to using Jackson's `com.fasterxml.jackson.core.JsonFactory` public JsonParser createParser(String content) ## How was this patch tested? existing unit tests Please review http://spark.apache.org/contributing.html before opening a pull request. Author: Eric Wasserman <[email protected]> Closes apache#17693 from ewasserman/SPARK-20314. (cherry picked from commit 57e1da3) Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit c8803c0)
What changes were proposed in this pull request?
change to using Jackson's
com.fasterxml.jackson.core.JsonFactory
How was this patch tested?
existing unit tests
Please review http://spark.apache.org/contributing.html before opening a pull request.