Should we disable Simdjson's UTF-8 validation? #10639

PHILO-HE · 2024-08-01T07:31:28Z

Description

Simdjson has a build option, SIMDJSON_SKIPUTF8VALIDATION, that controls whether JSON input is checked for valid UTF-8 encoding. It is OFF by default, so invalid UTF-8 is rejected. However, we recently found that Spark ignores invalid UTF-8 encoding in JSON strings. If Presto behaves the same as Spark here, we can set SIMDJSON_SKIPUTF8VALIDATION=ON at build time so that Simdjson skips this check at runtime.

@mbasmanova

mbasmanova · 2024-08-01T08:39:45Z

> Spark disregards illegal UTF-8 encoding in JSON string.

@PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

PHILO-HE · 2024-08-02T08:35:03Z

> > Spark disregards illegal UTF-8 encoding in json string.
>
> @PHILO-HE Would you clarify what that means exactly? Does Spark allow invalid UTF-8 encoding? Do you have an example? We can try the same example in Presto.

@mbasmanova, yes, Spark allows invalid UTF-8 encoding. If the input contains invalid UTF-8 encoded data, NULL is returned when Gluten + Velox is used. The result is consistent with Spark only if I set SIMDJSON_SKIPUTF8VALIDATION=ON for the simdjson build.

I tested Presto with the same Parquet file containing invalid UTF-8 encoding. It looks like Presto, like Spark, also allows invalid UTF-8 encoding in JSON parsing.

select json_extract(c1, '$.c') from tbl;

Could you help verify the result of Presto + Velox? You can use the parquet file unpacked from test.tar.gz. Thanks!
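To make the two behaviors concrete, here is a minimal Python sketch. It is only an analogy, not simdjson, Velox, or Spark code: the function names `extract_strict` and `extract_lenient` are hypothetical, and Python's `errors="replace"` substitutes U+FFFD for bad bytes, whereas a parser that skips validation would simply pass the bytes through unchecked.

```python
import json
from typing import Any, Optional


def extract_strict(raw: bytes, key: str) -> Optional[Any]:
    """Validate UTF-8 first; return None (standing in for SQL NULL) if invalid."""
    try:
        text = raw.decode("utf-8")  # strict mode raises on invalid byte sequences
    except UnicodeDecodeError:
        return None
    return json.loads(text).get(key)


def extract_lenient(raw: bytes, key: str) -> Optional[Any]:
    """Do not reject invalid UTF-8; bad bytes become U+FFFD and parsing proceeds."""
    text = raw.decode("utf-8", errors="replace")
    return json.loads(text).get(key)


# The byte 0xFF can never appear in well-formed UTF-8.
bad = b'{"a": "x\xffy", "c": 1}'
strict_result = extract_strict(bad, "c")
lenient_result = extract_lenient(bad, "c")
print(strict_result, lenient_result)  # prints: None 1
```

The strict path mirrors the NULL result seen with validation enabled; the lenient path mirrors getting a value back for `$.c` even though another field holds invalid bytes.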

mbasmanova · 2024-08-02T11:50:31Z

@PHILO-HE In general, I believe there are no guarantees in either Spark or Presto about behavior when the input is invalid UTF-8. Hence, I don't think we need to (or should) try to match such behavior. It might be better to simply say that input is expected to be valid UTF-8 and that, if it is not, there will be an error or undefined behavior.

See https://prestodb.io/docs/current/functions/string.html#string-functions

CC: @FelixYBW @rui-mo

PHILO-HE · 2024-08-02T14:12:43Z

@mbasmanova, thanks for your comment! I see your point. Now that we know there is no guaranteed behavior for invalid UTF-8 input, should we just let Simdjson skip the check? See code link. I think the JSON parser's performance can be improved if we do that.

mbasmanova · 2024-08-07T05:22:37Z

> JSON parser's performance can be improved if we do that.

I feel it is safer to keep the validation and fail loudly when it fails, as opposed to returning some "strange" result and having the user guess what's going on. How much of a perf boost do you observe for valid use cases?

PHILO-HE · 2024-08-07T13:22:19Z

> How much of a perf boost do you observe for valid use cases?

@mbasmanova, I just tested, and there is no evident perf gain in my test cases after the validation is disabled. Maybe only certain cases (e.g., long JSON input) would benefit. We have made a change in Gluten instead. Let's close this issue. Thanks!
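As a rough way to reason about when validation cost might matter, the sketch below compares the time of a standalone UTF-8 validation pass against a full JSON parse as the input grows. It is a Python analogy under stated assumptions, not a simdjson benchmark: `make_payload` and `time_validation_share` are hypothetical helpers, and the numbers say nothing about simdjson's SIMD validator.

```python
import json
import timeit
from typing import Tuple


def make_payload(n_items: int) -> bytes:
    """Build a valid UTF-8 JSON document containing non-ASCII text."""
    doc = {"items": ["héllo wörld"] * n_items}
    return json.dumps(doc, ensure_ascii=False).encode("utf-8")


def time_validation_share(n_items: int, repeat: int = 20) -> Tuple[float, float]:
    """Return (seconds spent only validating UTF-8, seconds spent on a full parse)."""
    raw = make_payload(n_items)
    # Strict decoding is a pure UTF-8 validation pass over the bytes.
    validate = timeit.timeit(lambda: raw.decode("utf-8"), number=repeat)
    # json.loads on bytes decodes (and therefore validates) and then parses.
    parse = timeit.timeit(lambda: json.loads(raw), number=repeat)
    return validate, parse


for n in (10, 1_000, 100_000):
    validate_s, parse_s = time_validation_share(n)
    print(f"{n:>7} items: validate={validate_s:.6f}s parse={parse_s:.6f}s")
```

If the validation time stays a small fraction of the parse time at every size, skipping it is unlikely to produce a visible speedup.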

mbasmanova · 2024-08-07T13:27:19Z

> We have made some change in Gluten.

@PHILO-HE Curious, what is this change?

PHILO-HE · 2024-08-07T13:33:31Z

> > We have made some change in Gluten.
>
> @PHILO-HE Curious, what is this change.

@mbasmanova, some of our users require the output of Gluten + Velox to be consistent with Spark. So we simply set SIMDJSON_SKIPUTF8VALIDATION=ON when building the simdjson library and then install it.

PHILO-HE added the enhancement (New feature or request) label on Aug 1, 2024

PHILO-HE mentioned this issue Aug 5, 2024

[VL] Skip UTF-8 validation in JSON parsing apache/incubator-gluten#6661

Merged

PHILO-HE closed this as completed Aug 7, 2024
