getJsonObject number normalization #1897


thirtiseven
Collaborator

Closes #1831

This PR supports number normalization in getJsonObject.

In Spark's getJsonObject, a floating-point number is parsed into a double and then converted back to a string when the result is returned. This PR uses stod in cudf and ftos_converter in JNI to simulate that behavior. Because ftos_converter is based on Ryu, its output may differ from Spark's in some cases, but the differences are very minor and acceptable. spark-rapids already uses the same approach for double-to-string conversion.
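
As a rough host-side illustration of the parse-then-print normalization described above, the sketch below parses the number text with `std::stod` and prints the shortest `%g` form that parses back to the same double. It is not this PR's device code: `normalize_float_text` is a hypothetical name, the precision loop is a simple stand-in for the Ryu-based ftos_converter, and the output follows printf `%g` formatting rather than Java's `Double.toString`.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical helper (not part of cudf or spark-rapids-jni).
// Parses the number text into a double, then prints the shortest "%g"
// representation that parses back to exactly the same double.
std::string normalize_float_text(const std::string& text)
{
  // std::stod throws std::invalid_argument / std::out_of_range on bad input.
  const double value = std::stod(text);

  char buf[64];
  // 17 significant digits are always enough to round-trip a finite double,
  // so the loop stops at 17 at the latest.
  for (int precision = 1; precision <= 17; ++precision) {
    std::snprintf(buf, sizeof(buf), "%.*g", precision, value);
    if (std::strtod(buf, nullptr) == value) { break; }
  }
  return std::string(buf);
}
```

With this sketch, the inputs "1.50" and "1.5" both come back as "1.5", which is the effect the normalization is after: different spellings of the same double produce one canonical string.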

For integers, the only special case I know of is that "-0" is normalized to "0". Note that "-0000000" is not valid input.
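
The integer rule can be sketched on the host as follows, assuming the JSON number grammar (an optional minus sign, then either a single "0" or a digit run that does not start with "0"). `normalize_int_text` is a hypothetical name, and the actual device implementation in this PR may be structured differently.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of cudf or spark-rapids-jni).
// Returns true and writes the normalized text to `out` when `text` is a
// valid JSON integer; returns false otherwise and leaves `out` untouched.
bool normalize_int_text(const std::string& text, std::string& out)
{
  std::size_t pos = 0;
  if (pos < text.size() && text[pos] == '-') { ++pos; }

  const std::size_t digits_begin = pos;
  while (pos < text.size() && std::isdigit(static_cast<unsigned char>(text[pos]))) { ++pos; }
  const std::size_t num_digits = pos - digits_begin;

  // Need at least one digit and nothing after the digits.
  if (num_digits == 0 || pos != text.size()) { return false; }

  // JSON forbids leading zeros: "0" is fine, "00" and "-0000000" are not.
  if (num_digits > 1 && text[digits_begin] == '0') { return false; }

  // The only rewrite: negative zero becomes plain zero.
  out = (text == "-0") ? "0" : text;
  return true;
}
```

Every other valid integer passes through unchanged, so "-12" stays "-12" while "-0000000" and "01" are rejected.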

This PR needs to be merged into the feature branch after the CPU tests are removed, because it calls device-only functions.

: output(nullptr), output_len(0), hide_outer_array_tokens(_hide_outer_array_tokens)
{
}
CUDF_HOST_DEVICE CUDF_HOST_DEVICE json_generator<>& operator=(const json_generator<>& other)
__device__ __device__ json_generator<>& operator=(const json_generator<>& other)

`__device__` appears twice here.

@res-life
Collaborator

Adding several JNI test cases would be good.

@res-life res-life requested a review from ttnghia March 26, 2024 06:40
@res-life
Collaborator

Yes, -0000000 is not a valid JSON number.

@thirtiseven thirtiseven merged commit 2ae268a into NVIDIA:get-json-object-feature Mar 26, 2024
2 checks passed
thirtiseven added a commit that referenced this pull request Mar 27, 2024
* get-json-object:  Add JSON parser and parser utility (#1836)

* Add Json Parser;
Add Json Parser utility;
Define internal interfaces;
Copy get-json-obj CUDA code from cuDF;

Signed-off-by: Chong Gao <[email protected]>

* Code format

---------

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: match current field name (#1857)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: add utility write_escaped_text for JSON generator (#1863)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Add JNI for GetJsonObject (#1862)

* Add JNI for GetJsonObject

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* Parse json path in plugin

Signed-off-by: Haoyang Li <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* Use table_view

Signed-off-by: Haoyang Li <[email protected]>

* Update java

Signed-off-by: Haoyang Li <[email protected]>

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* use matched enum for type

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* upmerge

Signed-off-by: Haoyang Li <[email protected]>

* format

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>

* get-json-object: main flow (#1868)

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Optimize memory usage in match_current_field_name (#1889)

* Optimize match_current_field_name using less memory

Signed-off-by: Chong Gao <[email protected]>

* Convert a function to device code

* Add a JNI test case

* Add JNI test case

* Change nesting depth to 4

* Change nesting depth to 8 to fix test

Signed-off-by: Haoyang Li <[email protected]>

* remove clang format change

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* get-json-object: Recursive to iterative (#1890)

* Change recursive to iterative

Signed-off-by: Chong Gao <[email protected]>

---------

Signed-off-by: Chong Gao <[email protected]>
Co-authored-by: Chong Gao <[email protected]>

* Fix bug

* Format

* Use uppercase for path_instruction_type

Signed-off-by: Haoyang Li <[email protected]>

* Add test cases from Baidu

* Fix escape char error; add test case

* getJsonObject number normalization (#1897)

* Support number normalization

Signed-off-by: Haoyang Li <[email protected]>

* delete cpp test and add a java test case

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>

* Add test case

* Fix a escape/unescape size bug

Signed-off-by: Haoyang Li <[email protected]>

* Fix bug: handle leading zeros for number; Refactor

* Apply suggestions from code review

Co-authored-by: Nghia Truong <[email protected]>

* Address comments

Signed-off-by: Haoyang Li <[email protected]>

* fix java test

Signed-off-by: Haoyang Li <[email protected]>

* Add test cases; Fix a bug

* follow up escape/unescape bug fix

Signed-off-by: Haoyang Li <[email protected]>

* Minor refactor

* Add a case; Fix bug

---------

Signed-off-by: Chong Gao <[email protected]>
Signed-off-by: Haoyang Li <[email protected]>
Co-authored-by: Chong Gao <[email protected]>
Co-authored-by: Haoyang Li <[email protected]>
Co-authored-by: Nghia Truong <[email protected]>