Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] JSON parsing is not handling escaped single quote the same as Spark #15303

Closed
Tracked by #9
revans2 opened this issue Mar 14, 2024 · 5 comments
Closed
Tracked by #9
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@revans2
Copy link
Contributor

revans2 commented Mar 14, 2024

Is your feature request related to a problem? Please describe.
This is really odd and documented at NVIDIA/spark-rapids#10596

But essentially if we want to enable normalization of single quoted values, we also then want to allow backslash escaping of single quoted values.

I am fine if we end up with a separate config for this, but it does make thinks a lot more difficult. Especially because this fits in with the validation too

#15222

@GregoryKimball
Copy link
Contributor

Thank you @revans2 for documenting this. I'm a bit surprised that libcudf is returning a null for this string.

"""{"data": "\'TESTING\'"}"""

I would expect this to be valid with and without single-quote normalization. Would you please help me understand what I'm missing?

@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Mar 14, 2024
@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Mar 14, 2024
@shrshi
Copy link
Contributor

shrshi commented Mar 15, 2024

Hello @revans2, escaping of double quotes is expected behaviour when single quote normalization is enabled to handle the case of double quotes being present within single quotes. For example

{'TE"ST'} -> {"TE\"ST"}

If there are escaped single quotes within a single quoted string in the input, then the current behaviour of the quote normalization FST is to remove the backslash escapes since we no longer need them. For example

{'TE\'ST'} -> {"TE'ST"}

If I understand correctly, the quote normalizer is expected to retain the backslash i.e. {'TE\'ST'} -> {"TE\'ST"}
Is this correct?

@revans2
Copy link
Contributor Author

revans2 commented Mar 15, 2024

@shrshi

I am sorry. I think I have been moving too quickly for my own good. I should have written some tests as repro cases for this before filing the issue.

I think there is a mismatch between the escape handling in the parser and the single quote pre-processing.

TEST_F(JsonReaderTest, EscapedSingleQuotes)
{
  const std::string fname = temp_env->get_temp_dir() + "EscapedSingleQuotes.json";
  std::ofstream outfile(fname, std::ofstream::out);
  outfile << "{\"data\": \"\\'TESTING\\'\"}\n{\"data\": \"'TESTING'\"}";
  outfile.close();

  cudf::io::json_reader_options in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
      .lines(true)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .mixed_types_as_string(true)
      .normalize_whitespace(true)
      .keep_quotes(true)
      .normalize_single_quotes(false);

  cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

  EXPECT_EQ(result.tbl->num_columns(), 1);
  EXPECT_EQ(result.tbl->num_rows(), 2);

  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
  EXPECT_EQ(result.metadata.schema_info[0].name, "data");

  cudf::test::print(result.tbl->get_column(0));

  cudf::test::strings_column_wrapper const strings(
      {"", "\"'TESTING'\""},
      {false, true});
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result.tbl->get_column(0), strings);
}

TEST_F(JsonReaderTest, EscapedSingleQuotesNormalize)
{
  const std::string fname = temp_env->get_temp_dir() + "EscapedSingleQuotesNormalize.json";
  std::ofstream outfile(fname, std::ofstream::out);
  outfile << "{\"data\": \"\\'TESTING\\'\"}\n{\"data\": \"'TESTING'\"}";
  outfile.close();

  cudf::io::json_reader_options in_options =
    cudf::io::json_reader_options::builder(cudf::io::source_info{fname})
      .lines(true)
      .recovery_mode(cudf::io::json_recovery_mode_t::RECOVER_WITH_NULL)
      .mixed_types_as_string(true)
      .normalize_whitespace(true)
      .keep_quotes(true)
      .normalize_single_quotes(true);

  cudf::io::table_with_metadata result = cudf::io::read_json(in_options);

  EXPECT_EQ(result.tbl->num_columns(), 1);
  EXPECT_EQ(result.tbl->num_rows(), 2);

  EXPECT_EQ(result.tbl->get_column(0).type().id(), cudf::type_id::STRING);
  EXPECT_EQ(result.metadata.schema_info[0].name, "data");

  cudf::test::print(result.tbl->get_column(0));

  cudf::test::strings_column_wrapper const strings{"\"'TESTING'\"", "\"'TESTING'\""};
  CUDF_TEST_EXPECT_COLUMNS_EQUAL(result.tbl->get_column(0), strings);
}

The first test passes, which is what Spark would expect to see when single quote support is disabled, but when it is enabled (which is the default in Spark) nothing changes. I think that the problem is that anything in double quotes is not being processed by the quote normalization code. {"TE\'ST"} -> {"TE\'ST"}, when what I think I want is {"TE\'ST"} -> {"TE'ST"}.

So this was a missed requirement on my part.

@shrshi
Copy link
Contributor

shrshi commented Mar 15, 2024

Thank you for the detailed explanation, @revans2.
I've opened #15324 to address this.

rapids-bot bot pushed a commit that referenced this issue Mar 19, 2024
…15324)

This PR addresses the inconsistency in processing single quotes within a quoted string in the single quote normalizer. 
In the current implementation, when we have an escaped single quote within a single quoted string, the normalizer removes the backslash escape on converting the string to double quotes. However, the normalizer retains the contents of double quoted strings as-is i.e. if there are escaped single quotes within a double quoted string, the backslash character is retained in the output. 

We address this inconsistency by removing the escape character for single quotes in all double quoted string in the output.

Tackles #15303 to mimic Spark behavior.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Elias Stehle (https://github.com/elstehle)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #15324
@GregoryKimball
Copy link
Contributor

I believe this is closed by #15324

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Archived in project
Development

No branches or pull requests

3 participants