Should we disable Simdjson's UTF-8 validation? #10639
Comments
@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.
@mbasmanova, yes, Spark allows invalid UTF-8 encoding. If the input contains invalid UTF-8 data, NULL is returned when Gluten + Velox is used. Only if I set [...] I tried testing Presto with the same Parquet file containing invalid UTF-8 data. It looks like Presto also allows invalid UTF-8 in JSON parsing, like Spark. select json_extract(c1, '$.c') from tbl; Could you help verify the result of Presto + Velox? You can use the Parquet file unpacked from test.tar.gz. Thanks!
@PHILO-HE In general, I believe there are no guarantees in either Spark or Presto on behavior when input is invalid UTF-8. Hence, I don't think we need to (or should) try to match such behavior. It might be better to simply say that input is expected to be valid UTF-8 and, if it is not, there will be an error or undefined behavior. See https://prestodb.io/docs/current/functions/string.html#string-functions
@mbasmanova, thanks for your comment! I get your point. Now that we know there is no guaranteed behavior for invalid UTF-8 input, should we just let Simdjson skip the check? See code link. I think the JSON parser's performance could improve if we do that.
I feel it is safer to keep the validation and fail loudly when it fails, as opposed to returning some "strange" result and having the user guess what's going on. How much of a perf boost do you observe for valid use cases?
@mbasmanova, I just tested, and I see no evident perf gain in my test cases after the validation is disabled. Maybe only certain cases (e.g., long JSON input) would benefit. We have made a change in Gluten instead. Let's close this issue. Thanks!
@PHILO-HE Curious, what is this change?
@mbasmanova, some of our users require the output of Gluten + Velox to be consistent with Spark. So we just set SIMDJSON_SKIPUTF8VALIDATION=ON when building the simdjson library and then install it.
Description
Simdjson has a build option called SIMDJSON_SKIPUTF8VALIDATION that controls whether JSON input is checked for UTF-8 encoding validity. It is OFF by default, so illegal UTF-8 encoding is rejected. But we recently found that Spark disregards illegal UTF-8 encoding in JSON strings. If Presto behaves the same as Spark on this, we can set SIMDJSON_SKIPUTF8VALIDATION=ON at build time, and Simdjson will then skip this check at runtime. @mbasmanova
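To make the trade-off concrete, here is a minimal Python sketch of the two behaviors being discussed: rejecting input that is not valid UTF-8 versus accepting it anyway. It is only an analogy and does not use Simdjson or Velox. The function names parse_strict and parse_lenient are hypothetical, and the lenient variant substitutes a replacement character for bad bytes, which is an assumption made for illustration and not a description of what Simdjson does when the check is skipped.

```python
import json


def parse_strict(raw: bytes):
    """Validate the bytes as UTF-8 first, then parse them as JSON."""
    # bytes.decode raises UnicodeDecodeError on any invalid sequence,
    # which plays the role of the validation step here.
    text = raw.decode("utf-8")
    return json.loads(text)


def parse_lenient(raw: bytes):
    """Hypothetical lenient mode: tolerate invalid bytes, then parse."""
    # Assumption for illustration only: invalid bytes become U+FFFD.
    text = raw.decode("utf-8", errors="replace")
    return json.loads(text)


valid = b'{"c": "ok"}'
invalid = b'{"c": "a\xffb"}'  # the byte 0xFF never occurs in valid UTF-8

print(parse_strict(valid))

try:
    parse_strict(invalid)
except UnicodeDecodeError as exc:
    # Strict mode fails loudly instead of returning a surprising value.
    print("rejected:", exc.reason)

print(parse_lenient(invalid))
```

The strict path surfaces the problem at the point of parsing, while the lenient path returns a value the caller may not expect, which is the concern raised in the comments about having the user guess what is going on.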