
[VL] Results are mismatch with vanilla Spark on release-1.1 when use get_json_object operator #5253

Open
NEUpanning opened this issue Apr 2, 2024 · 18 comments
Labels
bug Something isn't working triage

Comments

@NEUpanning
Contributor

NEUpanning commented Apr 2, 2024

Backend

VL (Velox)

Bug description

The following SQL can produce wrong results, though it's not yet certain whether other factors are involved. Vanilla Spark returns NULL but Gluten returns N.
select get_json_object(extend_attr,'$.11000022') from mart_catering.dim_deal_all_info_ss where mt_deal_id=922798418 and partition_date='2024-03-27';

The extend_attr field value is {"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}

Unfortunately, I cannot reproduce this issue using a simple SQL statement like select get_json_object(col1,'$.11000022') from values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}'); that query returns NULL, the same as vanilla Spark's result.
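A side note on the value shown above: it contains the two-character sequence \a inside a quoted string, which is not one of JSON's legal escapes, so any strict parser rejects the whole document. A minimal sketch in Python, using the standard json module purely as a stand-in for a strict parser (neither Spark nor Velox uses it):

```python
import json

# JSON only allows the escapes \" \\ \/ \b \f \n \r \t and \uXXXX,
# so the sequence \a makes the whole document unparseable for a
# strict parser.
doc = '{"name": "a\\a", "11000022": "N"}'  # JSON text: {"name": "a\a", ...}

try:
    json.loads(doc)
    valid = True
except json.JSONDecodeError:
    valid = False

print(valid)  # False: the invalid escape rejects the entire document
```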

Spark version

None

Spark configurations

No response

System information

No response

Relevant logs

No response

@NEUpanning NEUpanning added bug Something isn't working triage labels Apr 2, 2024
@NEUpanning
Contributor Author

cc @kecookier

@NEUpanning NEUpanning changed the title [VL] Results are mismatch with vanilla Spark when use get_json_object operator [VL] Results are mismatch with vanilla Spark on release-1.1 when use get_json_object operator Apr 2, 2024
@PHILO-HE
Contributor

PHILO-HE commented Apr 8, 2024

@NEUpanning, I tried testing both main & 1.1 branch, but the result is NULL, consistent with Spark. Here is my test. Could you help check again?

DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;

@NEUpanning
Contributor Author

@PHILO-HE Thanks for your reply. I can't reproduce it either; maybe there are other factors involved. I will take some time to figure it out later.

@wang-zhun
Contributor

wang-zhun commented Jul 26, 2024

@NEUpanning Is the issue I'm experiencing the same as this one?

with tb as (
select
    concat('{"a": "2", "b": "', string (unhex(hex(id))), '"}') as js,
    string(unhex(hex(id))) as st,
    id as ascii_num
FROM range(0, 255, 1, 10)
)
select 'gluten', ascii_num,st,get_json_object(js,  '$.a') from (select /*+ repartition(2) */ * from tb)
union all 
select 'vanilla', ascii_num,st,get_json_object(js,  '$.a')  from tb
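The query above pushes every byte value 0–255 into a JSON string, which exercises two separate validity rules. A rough sketch of both rules using Python's standard library, only as a stand-in for the Jackson/simdjson behavior discussed in this thread:

```python
import json

# Rule 1: bytes below 0x20 are control characters, and JSON requires
# them to be escaped; an unescaped one fails strict parsing.
try:
    json.loads('{"a": "2", "b": "' + chr(0x01) + '"}')
    control_char_ok = True
except json.JSONDecodeError:
    control_char_ok = False

# Rule 2: bytes 0x80-0xFF are not valid single-byte UTF-8 sequences,
# so a parser that validates UTF-8 (as simdjson does) rejects them.
try:
    bytes([0xC8]).decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(control_char_ok, utf8_ok)  # False False
```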

@NEUpanning
Contributor Author

@wang-zhun Could you show the result of this SQL?

@wang-zhun
Contributor

[screenshot of the query results]

@NEUpanning
Contributor Author

@wang-zhun They may be related.

@wang-zhun
Contributor

wang-zhun commented Jul 29, 2024

@PHILO-HE Judging from the commit history, you have the most expertise in this area. Could you help verify this issue?

// Spark
private[this] object SharedFactory {
  val jsonFactory = new JsonFactoryBuilder()
    // The two options below enabled for Hive compatibility
    .enable(JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS)
    .enable(JsonReadFeature.ALLOW_SINGLE_QUOTES)
    .build()
}
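For intuition, Jackson's ALLOW_UNESCAPED_CONTROL_CHARS has a rough analogue in the strict=False mode of Python's standard json module, which likewise tolerates raw control characters inside strings. A sketch for illustration only (Spark itself uses the Jackson factory above):

```python
import json

doc = '{"a": "\x01"}'  # a raw (unescaped) control character in a string

# Default (strict) parsing rejects unescaped control characters...
try:
    json.loads(doc)
    strict_ok = True
except json.JSONDecodeError:
    strict_ok = False

# ...while strict=False accepts them, similar in spirit to Jackson's
# JsonReadFeature.ALLOW_UNESCAPED_CONTROL_CHARS that Spark enables for
# Hive compatibility.
lenient = json.loads(doc, strict=False)

print(strict_ok, lenient["a"] == "\x01")  # False True
```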

@wang-zhun
Contributor

wang-zhun commented Jul 31, 2024

A simple modification to simdjson can resolve the issue, but it is uncertain whether there will be other impacts:

  • Disabling UTF-8 validation resolves the errors caused by single ASCII characters greater than 128
  • Disabling escape-character validation resolves the errors caused by single ASCII characters less than 32

[screenshot of the simdjson modification]

@PHILO-HE
Contributor

@wang-zhun, thanks for your investigation!
As extended ASCII is not compatible with UTF-8, simdjson's UTF-8 validation fails on those cases. I also did a few tests with invalid UTF-8 input, and it looks like vanilla Spark always disregards the input's UTF-8 validity. I will ask some users for feedback to decide whether we should disable the validation.

@NEUpanning
Contributor Author

@PHILO-HE Thanks for your effort in #6661. I've cherry-picked this PR onto our branch to test whether this issue is resolved, but the results still mismatch vanilla Spark's. The JSON value in this issue doesn't contain extended ASCII and isn't valid JSON. Therefore, this issue will remain open.

@PHILO-HE
Contributor

@NEUpanning, have you tried with branch-1.2 or main?

@NEUpanning
Contributor Author

@PHILO-HE We are using v1.2.0-rc; a higher version is not supported on our side for now.

@PHILO-HE
Contributor

@NEUpanning, can the below case reproduce this issue on your side? If not, could you provide a small reproducible case?

DROP table if EXISTS tbl;
CREATE TABLE tbl(a STRING) USING parquet;
INSERT INTO tbl values('{"142":"[{\"112\":{\"template\":{\"A\":\"a\",\"RS\":\"a\",\"NRS\":\"a\"},\"label\":{\"fromNumber\":\"\",\"rsToNumber\":\"\",\"rsFromNumber\":\"\",\"toNumber\":\"\"},\"key\":\"A\"},\"141\":[[{\"name\":\"a\a\"},{\"number\":\"1\"},{\"price\":\"218\"},{\"size\":\"6\"},{\"unit\":\"a\"},{\"form\":\"a\"},{\"type\":\"a\"},{\"thickness\":\"\"}]]}]","11000022": "N"}');
SELECT get_json_object(a,'$.11000022') FROM tbl;

@NEUpanning
Contributor Author

@PHILO-HE As in our earlier discussion, I still can't reproduce it without our table. When I find a small reproducible case, I'll get in touch with you ASAP. Thanks.

@jiangjiangtian
Contributor

@PHILO-HE @NEUpanning @kecookier
I found a SQL query that has a result mismatch:

select get_json_object('{"businessCode":"xxx","msgId":"12","msgTime":23,"assistantId":34,"assistantAccount":"xxx","friend":{"wwUserId":"xxx","nickname":"\\๑","unionId":"xxxx","avatar":"xxx","userType":1,"wwCorpId":"","wwCorpName":"","wwAccount":""},"message":{"contentType":1,"content":"xxx"}}', '$.friend.unionId');

Gluten returns xxxx, but Spark returns NULL.
The reason is that when Spark's JSON parser meets \, it further checks the character after the \ to see whether it forms a valid escape sequence. In this case, \๑ is not a valid escape sequence, so Spark returns NULL.

https://github.com/FasterXML/jackson-core/blob/8744bd42770c9e277d995ef00fb518940efef3ef/src/main/java/com/fasterxml/jackson/core/json/ReaderBasedJsonParser.java#L2648-#L2699

But simdjson performs no such extra check.
So to fix this issue, I think we need to add code in simdjson to do the check.
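The escape check described above can be reproduced with any strict JSON parser. A small sketch using Python's standard json module, only as a stand-in for Jackson's ReaderBasedJsonParser:

```python
import json

# The SQL literal '\\๑' puts the two characters backslash + ๑ into the
# JSON text; "\๑" is not a legal JSON escape, so a strict parser rejects
# the whole document, which is why Spark returns NULL for this path.
doc = '{"friend": {"nickname": "\\๑", "unionId": "xxxx"}}'

try:
    json.loads(doc)
    valid = True
except json.JSONDecodeError:
    valid = False

print(valid)  # False
```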

@PHILO-HE
Contributor

@jiangjiangtian, after some investigation, I found that simdjson's ondemand API only validates the structural correctness of the JSON document and the result for the given JSON path. This is for performance reasons.
If the given JSON path is $.friend.nickname, null is returned, since the result for that path contains an invalid escaped character. Do you feel strongly that we need alignment with Spark in such cases?

@jiangjiangtian
Contributor


@PHILO-HE Thanks for your investigations. We need to discuss internally to determine the requirement of the alignment.
