-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize get_json_object Spark function using simdjson #5179
Conversation
✅ Deploy Preview for meta-velox canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we provide some background for this PR? Please also include Spark's implementation in the PR description.
} | ||
|
||
} // namespace | ||
} // namespace facebook::velox::functions::sparksql::test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add empty line at the end.
|
8ec0ed4
to
ebab33c
Compare
8ef9856
to
27ae51d
Compare
@rui-mo, please take a review. Thanks! |
FOLLY_ALWAYS_INLINE std::string getFormattedJsonPath( | ||
const arg_type<Varchar>& jsonPath) { | ||
// Makes a conversion from spark's json path, e.g., | ||
// converts "$.a.b" to "/a/b". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move this comment above getFormattedJsonPath
, and list all the rules for conversion? E.g. '$' will be ignored, '[' -> '/' etc.
} | ||
} | ||
case ondemand::json_type::boolean: { | ||
bool boolResult = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The init value seems to be not needed.
} | ||
case ondemand::json_type::boolean: { | ||
bool boolResult = false; | ||
rawResult.get_bool().get(boolResult); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to acquire the error code?
// This is a simple validation by checking whether the obtained result is | ||
// followed by valid char. Because ondemand parsing we are using ignores json | ||
// format validation for characters following the current parsing position. | ||
bool isValidEndingCharacter(const char* currentPos) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ensured all valid characters are covered?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least we covered all valid characters in spark UT & customer workloads.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this Spark-specific? Do we not need same logic for Presto?
CC: @Yuhta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just check document::at_end
to make sure there is no trailing content except whitespace, that is enough for Presto and other apps in Meta. I think Spark could do the same.
} else { | ||
rawResult = | ||
ctx.jsonDoc.at_pointer(getFormattedJsonPath(jsonPath).data()); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
simdjson_resultondemand::value rawResult = formattedJsonPath_.has_value() ? one : the other
protected: | ||
std::optional<std::string> getJsonObject( | ||
std::optional<std::string> json, | ||
std::optional<std::string> jsonPath) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
const &
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also update the documentation for get_json_object function?
https://github.com/facebookincubator/velox/blob/main/velox/docs/functions/spark/json.rst#json-functions-1
00c9c0a
to
65fcace
Compare
65fcace
to
f7d83cd
Compare
f7d83cd
to
6da58d6
Compare
Hi @mbasmanova, could you spare some time to review this pr? Thanks! |
@PHILO-HE CI is red. Would you check? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE Is this function different from Presto? Curious, what are the differences?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE What's the motivation for this PR? Please, update PR description to clarify.
@@ -124,7 +124,7 @@ void registerFunctions(const std::string& prefix) { | |||
// Register size functions | |||
registerSize(prefix + "size"); | |||
|
|||
registerFunction<JsonExtractScalarFunction, Varchar, Varchar, Varchar>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these any remaining usage of JsonExtractScalarFunction? If not, let's remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found it is still used by JsonExprBenchmark.cpp
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is JsonExprBenchmark.cpp for? If it is just for benchmarking this function, then since function is no longer used, the benchmark is no longer needed and can be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova, I found JsonExprBenchmark.cpp is used to benchmark a set of functions. Maybe, we can keep it in the source code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE We can keep the benchmark, but there is no point in benchmarking unused JsonExtractScalarFunction, hence, let's remove both the function and benchmark code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alternatively, we can move JsonExtractScalarFunction into the benchmark, assuming it is used to compare folly-based and simd-based implementations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just removed JsonExtractScalarFunction
and the related usage in benchmark. Thanks!
template <typename T> | ||
struct SIMDGetJsonObjectFunction { | ||
VELOX_DEFINE_FUNCTION_TYPES(T); | ||
std::optional<std::string> formattedJsonPath_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this variable private.
|
||
// Makes a conversion from spark's json path, e.g., converts | ||
// "$.a.b" to "/a/b". | ||
FOLLY_ALWAYS_INLINE std::string getFormattedJsonPath( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make this private. Do you really need to inline this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just made it private.
This function can be frequently called for handling non-constant input path. May be better to make it inlined.
} | ||
} | ||
|
||
FOLLY_ALWAYS_INLINE simdjson::error_code extractStringResult( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
private; ditto other places
if (error) { | ||
return false; | ||
} | ||
} catch (simdjson_error& e) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not have any exception here, every error should be represented in the return code
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your comment!
I found simdjson lib can throw exceptions for some cases.
For example, if we don't keep try/catch here, the below test will fail with an exception thrown:
"The JSON document has an improper structure: missing or superfluous commas, braces, missing key"
EXPECT_EQ(getJsonObject(R"({"hello"-3.5})", "$.hello"), std::nullopt);
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta Jimmy, I wonder if that's why we are seeing unhandled "Unexpected trailing content in the JSON input" errors in some queries: T175957555 . Since throwing exceptions is expensive (under TRY), I wonder if there is a way to tell simdjson to not throw but rather return error code. Maybe we can open an issue in simdjson GitHub repo and ask about that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE @mbasmanova Certain simdjson functions are throwing, you can tell from the signature, but all throwing functions have an alternative version that is returning error code. So we need to find out which function is throwing and replace them, then we can remove the catch here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Yuhta, thanks for your suggestion! I found we can directly check the returned error code. The try/catch was just removed.
6da58d6
to
b4e49e9
Compare
Hi @mbasmanova, I just updated PR description and fixed the comments. Thanks for your review! |
@mbasmanova, could you take a review further? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add simdjson based get_json_object Spark function
The PR title suggests that this PR adds a new Spark function. However, the description suggests that the changes are to optimize an existing function not to add a new function. Which is correct? It would be nice to align PR title, PR description, and code changes.
velox/docs/functions/spark/json.rst
Outdated
@@ -22,6 +22,8 @@ JSON Functions | |||
|
|||
.. spark:function:: get_json_object(json, path) -> varchar | |||
|
|||
Extracts a json object from path:: | |||
Extracts a json object from ``path``. Returns NULL if it finds json string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be nice to document what values for 'path' are supported? Maybe provide a link to some existing documentation for path?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova, I didn't find official document for valid json path. Just documented the pattern and gave some examples.
velox/docs/functions/spark/json.rst
Outdated
SELECT get_json_object('{"a":"b"}', '$.a'); -- 'b' | ||
SELECT get_json_object('{"a":{"b":"c"}}', '$.a'); -- '{"b":"c"}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this function take an argument of type VARCHAR and return result of type JSON? (this is what signature on L23 suggests)? Or, does this function return VARCHAR result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All inputs are VARCHAR and output is also VARCHAR. Just clarified in the doc. Thanks!
@mbasmanova, could you review this pr again? Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@PHILO-HE Some comments. Please, rebase.
|
||
#include "velox/functions/prestosql/SIMDJsonFunctions.h" | ||
|
||
using namespace simdjson; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no 'using namespace' in header files, please
namespace facebook::velox::functions::sparksql { | ||
|
||
template <typename T> | ||
struct SIMDGetJsonObjectFunction { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove 'SIMD' prefix. Presto functions have it now, because, originally there were 2 sets of functions, but we should rename these as well.
} | ||
pairEnd = result.find("]", pairBegin); | ||
if (pairEnd == std::string::npos || result[pairEnd - 1] != '\'') { | ||
return "-1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems a bit hacky. Are you relying on simdjson to reject this path later? It would be cleaner to raise an exception here and provide a clear error message.
This function seems to be working around a limitation on simdjson. Have you already asked in simdjson project if that limitation can be removed? Anyway, please, update PR description to explain this limitation and how you are working around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you relying on simdjson to reject this path later? It would be cleaner to raise an exception here and provide a clear error message.
@mbasmanova, yes, we depend on simdjson to return error code for illegal path, then this function will return NULL result at last, instead of throwing exception.
Have you already asked in simdjson project if that limitation can be removed?
I just left a comment in simdjson community to discuss this limitation:
simdjson/simdjson#2070 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API that returns NULL when json or path are invalid is challenging. Throwing an error would be better.
Note that simdjson has limited support for JSONPath, hence, it rejects both invalid and valid-but-not-supported paths. If Spark supports full JSONPath spec or a wider subset of the spec than simdjson, then this implementation will produce incorrect results.
Hence, some questions.
- What subset of JSONPath is supported in Spark?
- Is there any particular reason to not use SIMDJsonExtract[Scalar]Function directly or implement this in a similar way by leveraging velox/functions/prestosql/json/SIMDJsonExtractor.h ?
- Does Spark validate JSON document fully and return null if it is not valid? simdjson doesn't do that and may succeed at extracting the value from an invalid document. This will then cause Gluten to produce wrong results.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API that returns NULL when json or path are invalid is challenging. Throwing an error would be better.
Note that simdjson has limited support for JSONPath, hence, it rejects both invalid and valid-but-not-supported paths. If Spark supports full JSONPath spec or a wider subset of the spec than simdjson, then this implementation will produce incorrect results.
Hence, some questions.
- What subset of JSONPath is supported in Spark?
@mbasmanova, thanks for your comment!
Spark has its own code to parse Json path. I have addressed all inconsistency found in Spark tests. For example, ['name']['id'] is supported by Spark, but not by Simdjson. This pr fixed this inconsistency by just pre-processing the path. So I feel it's better to return NULL instead of throwing exception for invalid path.
- Is there any particular reason to not use SIMDJsonExtract[Scalar]Function directly or implement this in a similar way by leveraging velox/functions/prestosql/json/SIMDJsonExtractor.h ?
The proposed pr has some handlings to align with Spark, e.g., check the validity of JSON document. And with this proposed patch, all Spark test cases can pass.
- Does Spark validate JSON document fully and return null if it is not valid? simdjson doesn't do that and may succeed at extracting the value from an invalid document. This will then cause Gluten to produce wrong results.
Yes, Spark fully validates JSON document. Simdjson only checks the validity before the position of extracted result. That's why this pr checked the characters after this position. Though it's not a full validation, it can meet our Spark users' requirement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Spark has its own code to parse Json path.
Do you have a pointer? Which subset (or superset) of JsonPath is supported in Spark?
I have addressed all inconsistency found in Spark tests.
It is good to have all tests pass, but it doesn't guarantee the the semantics match 100%. How do you know that tests cover all cases? We may need to read Spark's code for handing JsonPath to understand what it supports and how.
Also, it seems that simdjson would be parsing JsonPath repeatedly for every row. Is this desired? If we need to parse JsonPath ourselves anyway, why not handle that path as well?
// Spark's json path requires field name surrounded by single quotes if it is | ||
// specified in "[]". But simdjson lib requires not. This method just removes | ||
// such single quotes, e.g., converts "['a']['b']" to "[a][b]". | ||
FOLLY_ALWAYS_INLINE std::string removeSingleQuotes( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need for FOLLY_ALWAYS_INLINE here; please, remove
const arg_type<Varchar>& json, | ||
const arg_type<Varchar>& jsonPath) { | ||
// Spark requires the first char in jsonPath is '$'. | ||
if (jsonPath.size() < 1 || jsonPath.data()[0] != '$') { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this check is repeated in L36; perhaps, add a small helper function to avoid copy-paste
FOLLY_ALWAYS_INLINE simdjson::error_code extractStringResult( | ||
simdjson_result<ondemand::value> rawResult, | ||
out_type<Varchar>& result) { | ||
simdjson::error_code error; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this variable is not needed; this function can simply return a boolean
// can make simdjson's internal parsing position moved and then we | ||
// can check the validity of ending character. | ||
case ondemand::json_type::number: { | ||
switch (rawResult.get_number_type()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if this code repeats somewhere else. Would you check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checked again. I didn't find the repeat code anywhere else.
* See the License for the specific language governing permissions and | ||
* limitations under the License. | ||
*/ | ||
#include "velox/functions/prestosql/types/JsonType.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this include? We shouldn't use Presto types in Spark code.
"$[1].other[1]"), | ||
"v2"); | ||
|
||
// Field not found. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test is rather long. Consider extracting invalid cases into a separate test method.
class JsonFunctionTest : public SparkFunctionBaseTest { | ||
protected: | ||
std::optional<std::string> getJsonObject( | ||
const std::optional<std::string>& json, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to use optional here as this method doesn't allow unset inputs.
9d9ffbe
to
2d3c983
Compare
@@ -189,8 +189,6 @@ class JsonBenchmark : public velox::functions::test::FunctionBenchmarkBase { | |||
{"folly_json_array_length"}); | |||
registerFunction<SIMDJsonArrayLengthFunction, int64_t, Json>( | |||
{"simd_json_array_length"}); | |||
registerFunction<JsonExtractScalarFunction, Varchar, Json, Varchar>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova, I note you have moved JsonExtractScalarFunction
to this file. Do I need to put the deleted code back for this benchmark test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think changes to this file can be reverted. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mbasmanova, just reverted. Thanks!
7d4685c
to
bd51642
Compare
} | ||
pairEnd = result.find("]", pairBegin); | ||
if (pairEnd == std::string::npos || result[pairEnd - 1] != '\'') { | ||
return "-1"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's not pass knowngly invalid path to simdjson, but instead cut processing short and return NULL.
I found some info about get_json_path implementation in Spark. https://issues.apache.org/jira/browse/SPARK-37857
A limited version of JSONPath is supported:
Syntax not supported that's worth noticing:
|
bd51642
to
0c63bac
Compare
This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions! |
fd14df7
to
69ffe07
Compare
This PR proposes an implementation for Spark
get_json_object
function based on simdjson lib. This function returns a json object, represented by VARCHAR, from json string by searching user-specified path.Spark source code link.