Optimize get_json_object Spark function using simdjson #5179

Closed

Conversation

Contributor

@PHILO-HE PHILO-HE commented Jun 7, 2023

This PR proposes a simdjson-based implementation of Spark's get_json_object function. The function searches a JSON string for a user-specified path and returns the matching JSON object, represented as VARCHAR.

Spark source code link.

@facebook-github-bot facebook-github-bot added the CLA Signed label Jun 7, 2023

Collaborator

@rui-mo rui-mo left a comment

Can we provide some background for this PR? Please also include Spark's implementation in the PR description.

}

} // namespace
} // namespace facebook::velox::functions::sparksql::test
Collaborator

Add empty line at the end.

@wanweiqiangintel
Contributor

_ No description provided. _
@PHILO-HE, Could you please add some description to make the PR clear?

@PHILO-HE PHILO-HE force-pushed the get-json-object-upstream branch 2 times, most recently from 8ec0ed4 to ebab33c Compare November 15, 2023 08:00
@PHILO-HE PHILO-HE changed the title Support get_json_object function for spark based on simdjson lib Add simdjson based get_json_object Spark function Nov 15, 2023
@PHILO-HE PHILO-HE marked this pull request as ready for review November 15, 2023 08:13
@PHILO-HE
Contributor Author

@rui-mo, please take a review. Thanks!

FOLLY_ALWAYS_INLINE std::string getFormattedJsonPath(
const arg_type<Varchar>& jsonPath) {
// Makes a conversion from spark's json path, e.g.,
// converts "$.a.b" to "/a/b".
Collaborator

Can we move this comment above getFormattedJsonPath, and list all the rules for conversion? E.g. '$' will be ignored, '[' -> '/' etc.
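To make the requested rule list concrete, here is a minimal sketch of such a conversion. It is not the PR's getFormattedJsonPath: the name toJsonPointer and the exact rule set (leading '$' dropped, '.' and '[' mapped to '/', ']' dropped) are assumptions drawn from the examples in this thread, and it does not escape '~' or '/' inside keys.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper illustrating the conversion rules discussed above.
// Assumed rules:
//   leading '$' -> dropped (a path without it is rejected)
//   '.'         -> '/'
//   '['         -> '/'
//   ']'         -> dropped
// Escaping of '~' and '/' inside keys is intentionally not handled.
std::optional<std::string> toJsonPointer(std::string_view sparkPath) {
  if (sparkPath.empty() || sparkPath.front() != '$') {
    return std::nullopt;
  }
  std::string pointer;
  pointer.reserve(sparkPath.size());
  for (char c : sparkPath.substr(1)) {
    switch (c) {
      case '.':
      case '[':
        pointer.push_back('/');
        break;
      case ']':
        break;
      default:
        pointer.push_back(c);
    }
  }
  return pointer;
}
```

With these assumed rules, "$.a.b" becomes "/a/b" and "$[1].other[1]" becomes "/1/other/1", while a path that does not start with '$' yields std::nullopt.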

}
}
case ondemand::json_type::boolean: {
bool boolResult = false;
Collaborator

The init value seems to be not needed.

velox/functions/sparksql/SIMDJsonFunctions.h Outdated
}
case ondemand::json_type::boolean: {
bool boolResult = false;
rawResult.get_bool().get(boolResult);
Collaborator

Do we need to acquire the error code?

// This is a simple validation by checking whether the obtained result is
// followed by valid char. Because ondemand parsing we are using ignores json
// format validation for characters following the current parsing position.
bool isValidEndingCharacter(const char* currentPos) {
Collaborator

Are we ensured all valid characters are covered?

Contributor Author

At least we covered all valid characters in spark UT & customer workloads.

Contributor

Is this Spark-specific? Do we not need same logic for Presto?

CC: @Yuhta

Contributor

@Yuhta Yuhta Jan 30, 2024

We just check document::at_end to make sure there is no trailing content except whitespace; that is enough for Presto and other apps in Meta. I think Spark could do the same.
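The "no trailing content except whitespace" rule can be illustrated without simdjson. The sketch below is a standard-library stand-in, not the document::at_end implementation: onlyWhitespaceAfter is a hypothetical name, and pos is assumed to be the offset just past the extracted value or document.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper: returns true when everything in `json` from offset
// `pos` onward is JSON whitespace (space, tab, line feed, carriage return).
// An offset at or past the end of the input counts as "nothing trailing".
bool onlyWhitespaceAfter(std::string_view json, std::size_t pos) {
  for (std::size_t i = pos; i < json.size(); ++i) {
    const char c = json[i];
    if (c != ' ' && c != '\t' && c != '\n' && c != '\r') {
      return false;
    }
  }
  return true;
}
```

Compared with checking only the single character after the result, this scans the whole remainder, so any non-whitespace byte after the parsed content is treated as trailing content.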

} else {
rawResult =
ctx.jsonDoc.at_pointer(getFormattedJsonPath(jsonPath).data());
}
Collaborator

simdjson_result<ondemand::value> rawResult = formattedJsonPath_.has_value() ? one : the other

protected:
std::optional<std::string> getJsonObject(
std::optional<std::string> json,
std::optional<std::string> jsonPath) {
Collaborator

const &

Collaborator

@rui-mo rui-mo left a comment

@PHILO-HE
Contributor Author

Hi @mbasmanova, could you spare some time to review this pr? Thanks!

@mbasmanova
Contributor

@PHILO-HE CI is red. Would you check?

@mbasmanova mbasmanova requested a review from Yuhta January 23, 2024 15:47
Contributor

@mbasmanova mbasmanova left a comment

@PHILO-HE Is this function different from Presto? Curious, what are the differences?

Contributor

@mbasmanova mbasmanova left a comment

@PHILO-HE What's the motivation for this PR? Please, update PR description to clarify.

@@ -124,7 +124,7 @@ void registerFunctions(const std::string& prefix) {
// Register size functions
registerSize(prefix + "size");

registerFunction<JsonExtractScalarFunction, Varchar, Varchar, Varchar>(
Contributor

Are there any remaining usages of JsonExtractScalarFunction? If not, let's remove it.

Contributor Author

I found it is still used by JsonExprBenchmark.cpp.

Contributor

What is JsonExprBenchmark.cpp for? If it is just for benchmarking this function, then since function is no longer used, the benchmark is no longer needed and can be removed.

Contributor Author

@mbasmanova, I found JsonExprBenchmark.cpp is used to benchmark a set of functions. Maybe, we can keep it in the source code?

Contributor

@PHILO-HE We can keep the benchmark, but there is no point in benchmarking unused JsonExtractScalarFunction, hence, let's remove both the function and benchmark code.

Contributor

Alternatively, we can move JsonExtractScalarFunction into the benchmark, assuming it is used to compare folly-based and simd-based implementations.

Contributor Author

Just removed JsonExtractScalarFunction and the related usage in benchmark. Thanks!

velox/functions/sparksql/SIMDJsonFunctions.h Outdated
template <typename T>
struct SIMDGetJsonObjectFunction {
VELOX_DEFINE_FUNCTION_TYPES(T);
std::optional<std::string> formattedJsonPath_;
Contributor

Make this variable private.


// Makes a conversion from spark's json path, e.g., converts
// "$.a.b" to "/a/b".
FOLLY_ALWAYS_INLINE std::string getFormattedJsonPath(
Contributor

Make this private. Do you really need to inline this method?

Contributor Author

Just made it private.
This function can be frequently called for handling non-constant input path. May be better to make it inlined.

}
}

FOLLY_ALWAYS_INLINE simdjson::error_code extractStringResult(
Contributor

private; ditto other places

if (error) {
return false;
}
} catch (simdjson_error& e) {
Contributor

We should not have any exception here, every error should be represented in the return code

Contributor Author

Thanks for your comment!
I found simdjson lib can throw exceptions for some cases.
For example, if we don't keep try/catch here, the below test will fail with an exception thrown:
"The JSON document has an improper structure: missing or superfluous commas, braces, missing key"

EXPECT_EQ(getJsonObject(R"({"hello"-3.5})", "$.hello"), std::nullopt);

Contributor

@Yuhta Jimmy, I wonder if that's why we are seeing unhandled "Unexpected trailing content in the JSON input" errors in some queries: T175957555 . Since throwing exceptions is expensive (under TRY), I wonder if there is a way to tell simdjson to not throw but rather return error code. Maybe we can open an issue in simdjson GitHub repo and ask about that.

Contributor

@Yuhta Yuhta Jan 26, 2024

@PHILO-HE @mbasmanova Certain simdjson functions throw; you can tell from the signature. However, every throwing function has an alternative version that returns an error code. So we need to find out which functions are throwing and replace them; then we can remove the catch here.

Contributor Author

@Yuhta, thanks for your suggestion! I found we can directly check the returned error code. The try/catch was just removed.
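The pattern settled on here, reporting every failure through a return value instead of an exception, can be shown with a standard-library analogy. This is not simdjson's API: std::from_chars stands in for an error-code-returning call where std::stoi would throw, and parseIntNoThrow is a hypothetical name. Each failure maps to std::nullopt, the same shape as returning NULL for malformed input.

```cpp
#include <charconv>
#include <optional>
#include <string_view>
#include <system_error>

// Hypothetical helper: parses a whole string as a base-10 int without
// throwing. Any failure (empty input, non-digit characters, overflow,
// leftover characters) is reported as std::nullopt.
std::optional<int> parseIntNoThrow(std::string_view text) {
  int value = 0;
  const char* begin = text.data();
  const char* end = begin + text.size();
  const auto [ptr, ec] = std::from_chars(begin, end, value);
  // Check the error code first, then require that all input was consumed.
  if (ec != std::errc() || ptr != end) {
    return std::nullopt;
  }
  return value;
}
```

The caller inspects the result instead of wrapping the call in try/catch, which keeps the failure path cheap when malformed rows are common.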

@PHILO-HE
Contributor Author

@PHILO-HE What's the motivation for this PR? Please, update PR description to clarify.

Hi @mbasmanova, I just updated PR description and fixed the comments. Thanks for your review!

@PHILO-HE
Contributor Author

@mbasmanova, could you take a review further? Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@PHILO-HE

Add simdjson based get_json_object Spark function

The PR title suggests that this PR adds a new Spark function. However, the description suggests that the changes are to optimize an existing function not to add a new function. Which is correct? It would be nice to align PR title, PR description, and code changes.

@@ -22,6 +22,8 @@ JSON Functions

.. spark:function:: get_json_object(json, path) -> varchar

Extracts a json object from path::
Extracts a json object from ``path``. Returns NULL if it finds json string
Contributor

It would be nice to document what values for 'path' are supported? Maybe provide a link to some existing documentation for path?

Contributor Author

@mbasmanova, I didn't find official documentation for valid JSON paths. I just documented the pattern and gave some examples.

SELECT get_json_object('{"a":"b"}', '$.a'); -- 'b'
SELECT get_json_object('{"a":{"b":"c"}}', '$.a'); -- '{"b":"c"}'
Contributor

Does this function take an argument of type VARCHAR and return result of type JSON? (this is what signature on L23 suggests)? Or, does this function return VARCHAR result?

Contributor Author

All inputs are VARCHAR and output is also VARCHAR. Just clarified in the doc. Thanks!

@mbasmanova mbasmanova changed the title Add simdjson based get_json_object Spark function Optimize get_json_object Spark function using simdjson Jan 30, 2024
@PHILO-HE
Contributor Author

@mbasmanova, could you review this pr again? Thanks!

Contributor

@mbasmanova mbasmanova left a comment

@PHILO-HE Some comments. Please, rebase.


#include "velox/functions/prestosql/SIMDJsonFunctions.h"

using namespace simdjson;
Contributor

no 'using namespace' in header files, please

namespace facebook::velox::functions::sparksql {

template <typename T>
struct SIMDGetJsonObjectFunction {
Contributor

Let's remove 'SIMD' prefix. Presto functions have it now, because, originally there were 2 sets of functions, but we should rename these as well.

}
pairEnd = result.find("]", pairBegin);
if (pairEnd == std::string::npos || result[pairEnd - 1] != '\'') {
return "-1";
Contributor

This seems a bit hacky. Are you relying on simdjson to reject this path later? It would be cleaner to raise an exception here and provide a clear error message.

This function seems to be working around a limitation on simdjson. Have you already asked in simdjson project if that limitation can be removed? Anyway, please, update PR description to explain this limitation and how you are working around it.

Contributor Author

Are you relying on simdjson to reject this path later? It would be cleaner to raise an exception here and provide a clear error message.

@mbasmanova, yes, we depend on simdjson to return error code for illegal path, then this function will return NULL result at last, instead of throwing exception.

Have you already asked in simdjson project if that limitation can be removed?

I just left a comment in simdjson community to discuss this limitation:
simdjson/simdjson#2070 (comment)

Contributor

@PHILO-HE

The API that returns NULL when json or path are invalid is challenging. Throwing an error would be better.

Note that simdjson has limited support for JSONPath, hence, it rejects both invalid and valid-but-not-supported paths. If Spark supports full JSONPath spec or a wider subset of the spec than simdjson, then this implementation will produce incorrect results.

Hence, some questions.

  • What subset of JSONPath is supported in Spark?
  • Is there any particular reason to not use SIMDJsonExtract[Scalar]Function directly or implement this in a similar way by leveraging velox/functions/prestosql/json/SIMDJsonExtractor.h ?
  • Does Spark validate JSON document fully and return null if it is not valid? simdjson doesn't do that and may succeed at extracting the value from an invalid document. This will then cause Gluten to produce wrong results.

Contributor Author

@PHILO-HE

The API that returns NULL when json or path are invalid is challenging. Throwing an error would be better.

Note that simdjson has limited support for JSONPath, hence, it rejects both invalid and valid-but-not-supported paths. If Spark supports full JSONPath spec or a wider subset of the spec than simdjson, then this implementation will produce incorrect results.

Hence, some questions.

  • What subset of JSONPath is supported in Spark?

@mbasmanova, thanks for your comment!
Spark has its own code to parse Json path. I have addressed all inconsistency found in Spark tests. For example, ['name']['id'] is supported by Spark, but not by Simdjson. This pr fixed this inconsistency by just pre-processing the path. So I feel it's better to return NULL instead of throwing exception for invalid path.

  • Is there any particular reason to not use SIMDJsonExtract[Scalar]Function directly or implement this in a similar way by leveraging velox/functions/prestosql/json/SIMDJsonExtractor.h ?

The proposed pr has some handlings to align with Spark, e.g., check the validity of JSON document. And with this proposed patch, all Spark test cases can pass.

  • Does Spark validate JSON document fully and return null if it is not valid? simdjson doesn't do that and may succeed at extracting the value from an invalid document. This will then cause Gluten to produce wrong results.

Yes, Spark fully validates JSON document. Simdjson only checks the validity before the position of extracted result. That's why this pr checked the characters after this position. Though it's not a full validation, it can meet our Spark users' requirement.

Contributor

@PHILO-HE

Spark has its own code to parse Json path.

Do you have a pointer? Which subset (or superset) of JsonPath is supported in Spark?

I have addressed all inconsistency found in Spark tests.

It is good to have all tests pass, but it doesn't guarantee that the semantics match 100%. How do you know that tests cover all cases? We may need to read Spark's code for handling JsonPath to understand what it supports and how.

Also, it seems that simdjson would be parsing JsonPath repeatedly for every row. Is this desired? If we need to parse JsonPath ourselves anyway, why not handle that path as well?

// Spark's json path requires field name surrounded by single quotes if it is
// specified in "[]". But simdjson lib requires not. This method just removes
// such single quotes, e.g., converts "['a']['b']" to "[a][b]".
FOLLY_ALWAYS_INLINE std::string removeSingleQuotes(
Contributor

no need for FOLLY_ALWAYS_INLINE here; please, remove

const arg_type<Varchar>& json,
const arg_type<Varchar>& jsonPath) {
// Spark requires the first char in jsonPath is '$'.
if (jsonPath.size() < 1 || jsonPath.data()[0] != '$') {
Contributor

this check is repeated in L36; perhaps, add a small helper function to avoid copy-paste

FOLLY_ALWAYS_INLINE simdjson::error_code extractStringResult(
simdjson_result<ondemand::value> rawResult,
out_type<Varchar>& result) {
simdjson::error_code error;
Contributor

this variable is not needed; this function can simply return a boolean

// can make simdjson's internal parsing position moved and then we
// can check the validity of ending character.
case ondemand::json_type::number: {
switch (rawResult.get_number_type()) {
Contributor

I wonder if this code repeats somewhere else. Would you check?

Contributor Author

Just checked again. I didn't find the repeat code anywhere else.

* See the License for the specific language governing permissions and
* limitations under the License.
*/
#include "velox/functions/prestosql/types/JsonType.h"
Contributor

Why is this include? We shouldn't use Presto types in Spark code.

"$[1].other[1]"),
"v2");

// Field not found.
Contributor

This test is rather long. Consider extracting invalid cases into a separate test method.

class JsonFunctionTest : public SparkFunctionBaseTest {
protected:
std::optional<std::string> getJsonObject(
const std::optional<std::string>& json,
Contributor

No need to use optional here as this method doesn't allow unset inputs.

@@ -189,8 +189,6 @@ class JsonBenchmark : public velox::functions::test::FunctionBenchmarkBase {
{"folly_json_array_length"});
registerFunction<SIMDJsonArrayLengthFunction, int64_t, Json>(
{"simd_json_array_length"});
registerFunction<JsonExtractScalarFunction, Varchar, Json, Varchar>(
Contributor Author

@PHILO-HE PHILO-HE Apr 26, 2024

@mbasmanova, I note you have moved JsonExtractScalarFunction to this file. Do I need to put the deleted code back for this benchmark test?

Contributor

Yes, I think changes to this file can be reverted. Thanks.

Contributor Author

@mbasmanova, just reverted. Thanks!

@PHILO-HE PHILO-HE force-pushed the get-json-object-upstream branch 2 times, most recently from 7d4685c to bd51642 Compare May 6, 2024 14:05
}
pairEnd = result.find("]", pairBegin);
if (pairEnd == std::string::npos || result[pairEnd - 1] != '\'') {
return "-1";
Contributor

Let's not pass a knowingly invalid path to simdjson, but instead cut processing short and return NULL.
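One way to cut processing short is to make the quote-stripping step report failure itself rather than return a sentinel path such as "-1". The sketch below is a hypothetical variant of removeSingleQuotes, not the PR's code: it returns std::nullopt for a malformed quoted subscript so the caller can produce NULL without calling simdjson. As in the snippet under review, a key that itself contains ']' is treated as malformed.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical variant: converts "['a']['b']" to "[a][b]" and returns
// std::nullopt when a "['" is not closed by a matching "']".
std::optional<std::string> removeSingleQuotes(std::string_view path) {
  std::string result;
  result.reserve(path.size());
  std::size_t pos = 0;
  while (pos < path.size()) {
    const std::size_t open = path.find("['", pos);
    if (open == std::string_view::npos) {
      result.append(path.substr(pos));  // no more quoted subscripts
      break;
    }
    const std::size_t close = path.find(']', open + 2);
    // `close >= open + 3` ensures the closing quote is a different
    // character from the opening quote, so "[']" is rejected.
    if (close == std::string_view::npos || close < open + 3 ||
        path[close - 1] != '\'') {
      return std::nullopt;
    }
    result.append(path.substr(pos, open - pos));  // text before "['"
    result.push_back('[');
    result.append(path.substr(open + 2, close - 1 - (open + 2)));  // key
    result.push_back(']');
    pos = close + 1;
  }
  return result;
}
```

A caller would check the optional first and set the result to NULL when it is empty, so simdjson only ever sees paths that passed this preprocessing.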

@mbasmanova
Contributor

I found some info about get_json_path implementation in Spark.

https://issues.apache.org/jira/browse/SPARK-37857

It seems the first implementation of Spark get_json_object does replicate Hive behaviour (https://github.com/apache/spark/pull/7901)

In the Hive documentation (https://cwiki.apache.org/confluence/display/hive/languagemanual+udf) it is stated that recursive descent (i.e., the ".." notation) is not supported.

A limited version of JSONPath is supported:

  • $ : Root object
  • . : Child operator
  • [] : Subscript operator for array
  • * : Wildcard for []

Syntax not supported that's worth noticing:

  • : Zero length string as key
  • .. : Recursive descent
  • @ : Current object/element
  • () : Script expression
  • ?() : Filter (script) expression.
  • [,] : Union operator
  • [start:end.step] : array slice operator
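The limited subset above can be sketched as a small validator. This is an illustration only, not Spark's or Hive's parser: isSupportedPath and isSupportedSubscript are hypothetical names, and the accepted grammar ('$' followed by ".name", "[digits]", "[*]", or "['name']" steps) is an assumption based on the list above and the paths used elsewhere in this thread. Spark's real parser may accept more, or less.

```cpp
#include <cctype>
#include <cstddef>
#include <string_view>

// Hypothetical: accepts the contents of one "[...]" subscript when it is a
// wildcard, a single-quoted name, or a non-negative integer index.
bool isSupportedSubscript(std::string_view inner) {
  if (inner == "*") {
    return true;
  }
  if (inner.size() >= 3 && inner.front() == '\'' && inner.back() == '\'') {
    // Quoted, non-empty name with no further quotes inside.
    return inner.substr(1, inner.size() - 2).find('\'') ==
        std::string_view::npos;
  }
  if (inner.empty()) {
    return false;
  }
  for (char c : inner) {
    if (!std::isdigit(static_cast<unsigned char>(c))) {
      return false;  // rejects unions "0,1", slices "0:2", filters "?(...)"
    }
  }
  return true;
}

// Hypothetical: validates a whole path against the assumed subset.
bool isSupportedPath(std::string_view path) {
  if (path.empty() || path.front() != '$') {
    return false;
  }
  std::size_t i = 1;
  while (i < path.size()) {
    if (path[i] == '.') {
      // Child operator: the name must be non-empty, so ".." is rejected.
      const std::size_t start = ++i;
      while (i < path.size() && path[i] != '.' && path[i] != '[') {
        ++i;
      }
      if (i == start) {
        return false;
      }
    } else if (path[i] == '[') {
      const std::size_t close = path.find(']', i);
      if (close == std::string_view::npos ||
          !isSupportedSubscript(path.substr(i + 1, close - i - 1))) {
        return false;
      }
      i = close + 1;
    } else {
      return false;
    }
  }
  return true;
}
```

Under these assumptions "$.a.b", "$[1].other[1]", and "$['name']['id']" are accepted, while "$..a", "$.a[0,1]", "$.a[0:2]", and "$.a[?(@.b)]" are rejected.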

FelixYBW pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 25, 2024
zhztheplayer pushed a commit to oap-project/velox that referenced this pull request Jul 26, 2024
zhztheplayer pushed a commit to zhztheplayer/velox that referenced this pull request Jul 27, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 29, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 30, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Jul 31, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 1, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 2, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 3, 2024
GlutenPerfBot pushed a commit to oap-project/velox that referenced this pull request Aug 4, 2024

stale bot commented Oct 2, 2024

This pull request has been automatically marked as stale because it has not had recent activity. If you'd still like this PR merged, please comment on the PR, make sure you've addressed reviewer comments, and rebase on the latest main. Thank you for your contributions!

@stale stale bot added the stale label Oct 2, 2024