[BUG] cudf::io::json::detail::normalize_single_quotes outputs incorrect result when the input has \n character #17261

Closed
Tracked by #11630
ttnghia opened this issue Nov 7, 2024 · 0 comments · Fixed by #17266
Labels: bug, cuIO

ttnghia commented Nov 7, 2024

Reproducible with this input:

{\"a\": \"1\n2\"}
{\'a\': 12}

The output tokens, generated by cudf::io::json::detail::get_token_stream after preprocessing with cudf::io::json::detail::normalize_single_quotes, are:

Input:
{"a": "1
2"}{'a': 12}
Tokens:
0, 4, 6, 7, 8, 9, 5, 1, 0, 1
Token indices:
0, 1, 1, 3, 6, 10, 11, 11, 0, 0

If the \n character is removed, the output is correct:

Input:
{"a": "12"}{"a": 12}
Tokens:
0, 4, 6, 7, 8, 9, 5, 1, 0, 4, 6, 7, 10, 11, 5, 1
Token indices:
0, 1, 1, 3, 6, 9, 10, 10, 12, 13, 13, 15, 18, 20, 20, 20

Note:

  • Line delimiter between JSON objects is \0, not \n.
  • allow_unquoted_control is set to true.
  • Token indices are the positions of the tokens in the input string.
  • Token numbers are the integer values obtained by static_cast from enum token_t, declared as:
    enum token_t : PdaTokenT {
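
To make the token listings above easier to read, here is a minimal sketch that pairs each token number with the character found at its reported index. The TokenAt struct and pair_tokens_with_input function are hypothetical helpers written for this illustration. They are not part of cuDF, and the sketch works on raw integers, so it makes no assumption about the enumerator names in token_t.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type (not part of cuDF): one token together with the
// input character located at the token's reported index.
struct TokenAt {
  int token;          // raw integer value of the token
  std::size_t index;  // position of the token in the input string
  char symbol;        // input character at that position
};

// Pair every token with the character at its index so that a token listing
// can be compared against the input by eye.
std::vector<TokenAt> pair_tokens_with_input(std::string const& input,
                                            std::vector<int> const& tokens,
                                            std::vector<std::size_t> const& indices)
{
  assert(tokens.size() == indices.size());
  std::vector<TokenAt> out;
  out.reserve(tokens.size());
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    // Guard the lookup in case an index points past the end of the input.
    char const symbol = indices[i] < input.size() ? input[indices[i]] : '\0';
    out.push_back({tokens[i], indices[i], symbol});
  }
  return out;
}
```

Applied to the failing output, this shows the first eight tokens landing on the braces and quotes of the first object, while the last two tokens (0 and 1) both report index 0 rather than a position inside the second object.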

I suspect this is caused by the leftover \n character in the symbol groups below, but I'm not 100% sure:

std::array<std::vector<SymbolT>, NUM_SYMBOL_GROUPS - 1> const qna_sgs{
{{'\"'}, {'\''}, {'\\'}, {'\n'}}};
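
The suspicion can be explored with a toy host-side model. The sketch below is not cuDF's GPU finite-state transducer: the QuoteState enum, the normalize_quotes_toy function, the newline_resets_state flag, and the escaping rules are all assumptions made for illustration. It only shows how treating \n as a state-resetting symbol could, in principle, leave the quotes of a later object untouched.

```cpp
#include <string>

// Toy model only: hypothetical states, not the ones used inside cuDF.
enum class QuoteState { Outside, InDouble, EscapeInDouble, InSingle, EscapeInSingle };

// Convert single-quoted strings to double-quoted ones on the host.
// If newline_resets_state is true, '\n' forces the machine back to the
// "outside string" state, which models the suspected leftover behavior.
std::string normalize_quotes_toy(std::string const& in, bool newline_resets_state)
{
  std::string out;
  out.reserve(in.size());
  QuoteState state = QuoteState::Outside;
  for (char const c : in) {
    if (newline_resets_state && c == '\n') {
      // Assumed faulty path: the newline is copied and the state is reset,
      // even if the newline sits inside a quoted string.
      out.push_back(c);
      state = QuoteState::Outside;
      continue;
    }
    switch (state) {
      case QuoteState::Outside:
        if (c == '"') {
          state = QuoteState::InDouble;
          out.push_back('"');
        } else if (c == '\'') {
          state = QuoteState::InSingle;
          out.push_back('"');  // opening single quote becomes a double quote
        } else {
          out.push_back(c);
        }
        break;
      case QuoteState::InDouble:
        out.push_back(c);  // contents of double-quoted strings are copied as-is
        if (c == '\\') {
          state = QuoteState::EscapeInDouble;
        } else if (c == '"') {
          state = QuoteState::Outside;
        }
        break;
      case QuoteState::EscapeInDouble:
        out.push_back(c);
        state = QuoteState::InDouble;
        break;
      case QuoteState::InSingle:
        if (c == '\\') {
          state = QuoteState::EscapeInSingle;  // decide once the next char is seen
        } else if (c == '\'') {
          out.push_back('"');  // closing single quote becomes a double quote
          state = QuoteState::Outside;
        } else if (c == '"') {
          out += "\\\"";  // a bare double quote must be escaped in the output
        } else {
          out.push_back(c);
        }
        break;
      case QuoteState::EscapeInSingle:
        if (c != '\'') { out.push_back('\\'); }  // keep escapes other than \'
        out.push_back(c);
        state = QuoteState::InSingle;
        break;
    }
  }
  return out;
}
```

With the flag set, the \n inside "1\n2" ends the string early in this model, the following double quote opens a new string that swallows the \0 delimiter, and the second object's single quotes pass through unchanged, which resembles the unnormalized {'a': 12} seen in the failing output. With the flag cleared, both objects come out double-quoted.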

ttnghia added the bug and cuIO labels on Nov 7, 2024.
ttnghia linked a pull request on Nov 7, 2024 that will close this issue.
rapids-bot closed this as completed in 5cbdcd0 on Nov 9, 2024.