Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should we disable Simdjson's UTF-8 validation? #10639

Closed
PHILO-HE opened this issue Aug 1, 2024 · 8 comments
Closed

Should we disable Simdjson's UTF-8 validation? #10639

PHILO-HE opened this issue Aug 1, 2024 · 8 comments
Labels
enhancement New feature or request

Comments

@PHILO-HE
Copy link
Contributor

PHILO-HE commented Aug 1, 2024

Description

Simdjson has a build option called SIMDJSON_SKIPUTF8VALIDATION to control whether to check UTF-8 encoding validity for JSON input. It is OFF by default to not allow illegal UTF-8 encoding. But we recently found Spark disregards illegal UTF-8 encoding in JSON string. If Presto behaves same as Spark on this, we can set SIMDJSON_SKIPUTF8VALIDATION=ON at build time, then Simdjson will skip this check at runtime.

@mbasmanova

@PHILO-HE PHILO-HE added the enhancement New feature or request label Aug 1, 2024
@mbasmanova
Copy link
Contributor

Spark disregards illegal UTF-8 encoding in JSON string.

@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

@PHILO-HE
Copy link
Contributor Author

PHILO-HE commented Aug 2, 2024

Spark disregards illegal UTF-8 encoding in json string.

@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

@mbasmanova, yes, Spark allows invalid UTF-8 encoding. If input contains invalid utf-8 encoded data, NULL is returned when Gluten + Velox is used. Only if I set SIMDJSON_SKIPUTF8VALIDATION=ON for simdjson build, the result is consistent with Spark.

I tried testing Presto with the same parquet file containing invalid utf-8 encoding. Looks Presto also allows invalid utf-8 encoding in json parsing, like Spark.

select json_extract(c1, '$.c') from tbl;

Could you help verify the result of Presto + Velox? You can use the parquet file unpacked from test.tar.gz. Thanks!

@mbasmanova
Copy link
Contributor

@PHILO-HE In general, I believe there are no guarantees in either Spark or Presto on behavior when input is invalid UTF-8. Hence, I doesn't think we need (or should) try to match such behavior. It might be better to simply say that input is expected to be valid UTF-8 and if not, there will an error or undefined behavior.

See https://prestodb.io/docs/current/functions/string.html#string-functions

CC: @FelixYBW @rui-mo

@PHILO-HE
Copy link
Contributor Author

PHILO-HE commented Aug 2, 2024

@mbasmanova, thanks for your comment! I got your point. Now that we know there is no guaranteed behavior for invalid utf-8 input, should we just let Simdjson skip the check? See code link. I think JSON parser's performance can be improved if we do that.

@mbasmanova
Copy link
Contributor

JSON parser's performance can be improved if we do that.

I feel it is safer to keep the validation and fail loudly if it fails as opposed to returning some "strange" result and have the user guess what's going on. How much of a perf boost do you observe for valid use cases?

@PHILO-HE
Copy link
Contributor Author

PHILO-HE commented Aug 7, 2024

How much of a perf boost do you observe for valid use cases?

@mbasmanova, I just tested. But no evident perf. gain in my test cases after the validation is disabled. Maybe, only some certain cases (e.g., long json input) can be benefited. We have made some change in Gluten. Let's close this issue. Thanks!

@PHILO-HE PHILO-HE closed this as completed Aug 7, 2024
@mbasmanova
Copy link
Contributor

We have made some change in Gluten.

@PHILO-HE Curious, what is this change.

@PHILO-HE
Copy link
Contributor Author

PHILO-HE commented Aug 7, 2024

We have made some change in Gluten.

@PHILO-HE Curious, what is this change.

@mbasmanova, we have some users require the output of Gluten + Velox is consistent with Spark. So we just set SIMDJSON_SKIPUTF8VALIDATION=ON to build simdjson lib and then install it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants