JSON single quote normalization API #14729

Merged
merged 40 commits into from
Jan 24, 2024
Changes from 34 commits
a1120f6
single quote normalization api
shrshi Jan 9, 2024
c6b0ba3
test for normalization api
shrshi Jan 10, 2024
d9a8acf
fixes to test
shrshi Jan 11, 2024
cfe89e6
fix to tests
shrshi Jan 11, 2024
b2ce13b
pre-commit formatting fixes
shrshi Jan 11, 2024
2134cf8
finally, the test passes
shrshi Jan 11, 2024
04e9d82
try again with test stream
shrshi Jan 11, 2024
9f53d42
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 11, 2024
907aba9
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 11, 2024
fa11424
added option to normalize single quotes in read_json
shrshi Jan 12, 2024
adcbddf
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 12, 2024
2e86d89
formatting fixes
shrshi Jan 12, 2024
9925c10
adding testing_main
shrshi Jan 13, 2024
2838c74
java bindings
shrshi Jan 13, 2024
2313955
formatting fixes
shrshi Jan 13, 2024
a5bb42e
compile fix
shrshi Jan 13, 2024
3a6f267
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 13, 2024
0926a2f
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 16, 2024
e63bca0
Update java/src/test/java/ai/rapids/cudf/TableTest.java
shrshi Jan 16, 2024
005b5c2
Update java/src/test/java/ai/rapids/cudf/TableTest.java
shrshi Jan 16, 2024
1a8f5f3
added an error test for when normalize quotes is not enabled
shrshi Jan 16, 2024
a999ca4
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 16, 2024
6a151f5
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 17, 2024
2001866
addressing PR reviews; adding comments
shrshi Jan 18, 2024
b30e130
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 18, 2024
d0fefbd
moved tests; removed duplicated fst code
shrshi Jan 18, 2024
7520e03
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 18, 2024
55503e3
moved preprocess step to read_json
shrshi Jan 18, 2024
85e8053
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 18, 2024
a885277
PR reviews - modifiable input buffer in normalize quotes parameter
shrshi Jan 19, 2024
64135df
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 19, 2024
de1f1b3
don't need fully qualified name in enclosing namespace
shrshi Jan 20, 2024
df6c0f3
Merge branch 'fst-integration' of github.com:shrshi/cudf into fst-int…
shrshi Jan 20, 2024
8441b39
header files cleanup; more fully-qualified names cleanup
shrshi Jan 22, 2024
d5b9707
alphabetizing the new file in add_library
shrshi Jan 23, 2024
4e358fd
more header file cleanup
shrshi Jan 23, 2024
a79683d
guiding the consts eastwards
shrshi Jan 23, 2024
bcc2285
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 23, 2024
890d09b
formatting fix
shrshi Jan 23, 2024
ace46d3
Merge branch 'branch-24.02' into fst-integration
shrshi Jan 24, 2024
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -376,6 +376,7 @@ add_library(
src/io/json/legacy/json_gpu.cu
src/io/json/legacy/reader_impl.cu
src/io/json/write_json.cu
src/io/json/json_quote_normalization.cu
src/io/orc/aggregate_orc_metadata.cpp
src/io/orc/dict_enc.cu
src/io/orc/orc.cpp
14 changes: 13 additions & 1 deletion cpp/include/cudf/io/detail/json.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
* Copyright (c) 2020-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -51,4 +51,16 @@ void write_json(data_sink* sink,
json_writer_options const& options,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

/**
* @brief Normalize single quotes to double quotes using FST
*
* @param inbuf Input device buffer
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource to use for device memory allocation
*/
rmm::device_uvector<char> normalize_single_quotes(rmm::device_uvector<char>&& inbuf,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr);

} // namespace cudf::io::json::detail
33 changes: 32 additions & 1 deletion cpp/include/cudf/io/json.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2023, NVIDIA CORPORATION.
* Copyright (c) 2020-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -113,6 +113,9 @@ class json_reader_options {
// Whether to keep the quote characters of string values
bool _keep_quotes = false;

// Normalize single quotes
bool _normalize_single_quotes = false;

// Whether to recover after an invalid JSON line
json_recovery_mode_t _recovery_mode = json_recovery_mode_t::FAIL;

@@ -246,6 +249,13 @@ class json_reader_options {
*/
bool is_enabled_keep_quotes() const { return _keep_quotes; }

/**
* @brief Whether the reader should normalize single quotes around strings
*
* @returns true if the reader should normalize single quotes, false otherwise
*/
bool is_enabled_normalize_single_quotes() const { return _normalize_single_quotes; }

/**
* @brief Queries the JSON reader's behavior on invalid JSON lines.
*
@@ -324,6 +334,14 @@ class json_reader_options {
*/
void enable_keep_quotes(bool val) { _keep_quotes = val; }

/**
* @brief Set whether the reader should enable normalization of single quotes around strings.
*
* @param val Boolean value to indicate whether the reader should normalize single quotes around
* strings
*/
void enable_normalize_single_quotes(bool val) { _normalize_single_quotes = val; }

/**
* @brief Specifies the JSON reader's behavior on invalid JSON lines.
*
@@ -474,6 +492,19 @@ class json_reader_options_builder {
return *this;
}

/**
* @brief Set whether the reader should normalize single quotes around strings
*
* @param val Boolean value to indicate whether the reader should normalize single quotes
* of strings
* @return this for chaining
*/
json_reader_options_builder& normalize_single_quotes(bool val)
{
options._normalize_single_quotes = val;
return *this;
}

/**
* @brief Specifies the JSON reader's behavior on invalid JSON lines.
*
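The three accessors added to `json.hpp` follow the same pattern as `keep_quotes`: a flag that defaults to `false`, a getter, a setter, and a chaining builder method. Below is a minimal sketch of that pattern. `mock_reader_options`, `mock_reader_options_builder`, and the mock's `build()` are hypothetical stand-ins so the snippet compiles without cudf. Only the member names `is_enabled_normalize_single_quotes`, `enable_normalize_single_quotes`, and `normalize_single_quotes` are taken from the diff.

```cpp
// Hypothetical, stripped-down stand-in for cudf::io::json_reader_options.
// Only the flag touched by this PR is modelled.
class mock_reader_options {
  bool _normalize_single_quotes = false;  // off by default, as in the diff
  friend class mock_reader_options_builder;

 public:
  // Getter: mirrors is_enabled_normalize_single_quotes() from the diff.
  bool is_enabled_normalize_single_quotes() const { return _normalize_single_quotes; }

  // Setter: mirrors enable_normalize_single_quotes(bool) from the diff.
  void enable_normalize_single_quotes(bool val) { _normalize_single_quotes = val; }
};

// Hypothetical stand-in for cudf::io::json_reader_options_builder.
class mock_reader_options_builder {
  mock_reader_options options;

 public:
  // Returns *this so calls can be chained, as in the diff.
  mock_reader_options_builder& normalize_single_quotes(bool val)
  {
    options._normalize_single_quotes = val;
    return *this;
  }

  // Assumption: a simple copy-out accessor for the sketch.
  mock_reader_options build() const { return options; }
};
```

With the mock, `mock_reader_options_builder{}.normalize_single_quotes(true).build()` yields options for which `is_enabled_normalize_single_quotes()` returns `true`. A default-constructed `mock_reader_options` leaves the flag off.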
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2023-2024, NVIDIA CORPORATION.
Contributor:
why did the copyrights drop the year 2023?

Contributor Author:
This is a new file in this PR implementing the normalization FST, so I think only year 2024 is included in the copyright.

* Copyright (c) 2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -15,19 +15,13 @@
*/

#include <io/fst/lookup_tables.cuh>
#include <io/utilities/hostdevice_vector.hpp>

#include <cudf_test/base_fixture.hpp>
#include <cudf_test/cudf_gtest.hpp>
#include <cudf_test/testing_main.hpp>

#include <cudf/scalar/scalar_factories.hpp>
#include <cudf/strings/repeat_strings.hpp>
#include <cudf/io/detail/json.hpp>
#include <cudf/types.hpp>

#include <rmm/cuda_stream.hpp>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_buffer.hpp>
#include <rmm/device_scalar.hpp>
#include <rmm/device_uvector.hpp>

#include <thrust/iterator/discard_iterator.h>
@@ -36,17 +30,16 @@
#include <string>
#include <vector>

namespace {
namespace cudf::io::json {

using SymbolT = char;
using StateT = char;
using SymbolOffsetT = uint32_t;

// Type used to represent the atomic symbol type used within the finite-state machine
// TODO: type aliasing to be declared in a common header for better maintainability and
// pre-empt future bugs
using SymbolT = char;
using StateT = char;
namespace normalize_quotes {

// Type sufficiently large to index symbols within the input and output (may be unsigned)
using SymbolOffsetT = uint32_t;
enum class dfa_states : char { TT_OOS = 0U, TT_DQS, TT_SQS, TT_DEC, TT_SEC, TT_NUM_STATES };
enum class dfa_states : StateT { TT_OOS = 0U, TT_DQS, TT_SQS, TT_DEC, TT_SEC, TT_NUM_STATES };
enum class dfa_symbol_group_id : uint32_t {
DOUBLE_QUOTE_CHAR, ///< Quote character SG: "
SINGLE_QUOTE_CHAR, ///< Quote character SG: '
@@ -62,7 +55,7 @@ constexpr auto TT_DQS = dfa_states::TT_DQS;
constexpr auto TT_SQS = dfa_states::TT_SQS;
constexpr auto TT_DEC = dfa_states::TT_DEC;
constexpr auto TT_SEC = dfa_states::TT_SEC;
constexpr auto TT_NUM_STATES = static_cast<char>(dfa_states::TT_NUM_STATES);
constexpr auto TT_NUM_STATES = static_cast<StateT>(dfa_states::TT_NUM_STATES);
constexpr auto NUM_SYMBOL_GROUPS = static_cast<uint32_t>(dfa_symbol_group_id::NUM_SYMBOL_GROUPS);

// The i-th string representing all the characters of a symbol group
@@ -80,7 +73,7 @@ std::array<std::array<dfa_states, NUM_SYMBOL_GROUPS>, TT_NUM_STATES> const qna_s
}};

// The DFA's starting state
constexpr char start_state = static_cast<char>(TT_OOS);
constexpr char start_state = static_cast<StateT>(TT_OOS);

struct TransduceToNormalizedQuotes {
/**
@@ -177,156 +170,33 @@ struct TransduceToNormalizedQuotes {
}
};

} // namespace
} // namespace normalize_quotes

// Base test fixture for tests
struct FstTest : public cudf::test::BaseFixture {};
namespace detail {

void run_test(std::string& input, std::string& output)
rmm::device_uvector<SymbolT> normalize_single_quotes(rmm::device_uvector<SymbolT>&& inbuf,
rmm::cuda_stream_view stream,
rmm::mr::device_memory_resource* mr)
{
// Prepare cuda stream for data transfers & kernels
rmm::cuda_stream stream{};
rmm::cuda_stream_view stream_view(stream);

auto parser = cudf::io::fst::detail::make_fst(
cudf::io::fst::detail::make_symbol_group_lut(qna_sgs),
cudf::io::fst::detail::make_transition_table(qna_state_tt),
cudf::io::fst::detail::make_translation_functor(TransduceToNormalizedQuotes{}),
auto parser = fst::detail::make_fst(
fst::detail::make_symbol_group_lut(normalize_quotes::qna_sgs),
fst::detail::make_transition_table(normalize_quotes::qna_state_tt),
fst::detail::make_translation_functor(normalize_quotes::TransduceToNormalizedQuotes{}),
stream);

auto d_input_scalar = cudf::make_string_scalar(input, stream_view);
auto& d_input = static_cast<cudf::scalar_type_t<std::string>&>(*d_input_scalar);

// Prepare input & output buffers
constexpr std::size_t single_item = 1;
cudf::detail::hostdevice_vector<SymbolT> output_gpu(input.size() * 2, stream_view);
cudf::detail::hostdevice_vector<SymbolOffsetT> output_gpu_size(single_item, stream_view);

// Allocate device-side temporary storage & run algorithm
parser.Transduce(d_input.data(),
static_cast<SymbolOffsetT>(d_input.size()),
output_gpu.device_ptr(),
rmm::device_uvector<SymbolT> outbuf(inbuf.size() * 2, stream, mr);
rmm::device_scalar<SymbolOffsetT> outbuf_size(stream, mr);
parser.Transduce(inbuf.data(),
static_cast<SymbolOffsetT>(inbuf.size()),
outbuf.data(),
thrust::make_discard_iterator(),
output_gpu_size.device_ptr(),
start_state,
stream_view);

// Async copy results from device to host
output_gpu.device_to_host_async(stream_view);
output_gpu_size.device_to_host_async(stream_view);

// Make sure results have been copied back to host
stream.synchronize();

// Verify results
ASSERT_EQ(output_gpu_size[0], output.size());
CUDF_TEST_EXPECT_VECTOR_EQUAL(output_gpu, output, output.size());
}

TEST_F(FstTest, GroundTruth_QuoteNormalization1)
{
std::string input = R"({"A":'TEST"'})";
std::string output = R"({"A":"TEST\""})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization2)
{
std::string input = R"({'A':"TEST'"} ['OTHER STUFF'])";
std::string output = R"({"A":"TEST'"} ["OTHER STUFF"])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization3)
{
std::string input = R"(['{"A": "B"}',"{'A': 'B'}"])";
std::string output = R"(["{\"A\": \"B\"}","{'A': 'B'}"])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization4)
{
std::string input = R"({"ain't ain't a word and you ain't supposed to say it":'"""""""""""'})";
std::string output =
R"({"ain't ain't a word and you ain't supposed to say it":"\"\"\"\"\"\"\"\"\"\"\""})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization5)
{
std::string input = R"({"\"'\"'\"'\"'":'"\'"\'"\'"\'"'})";
std::string output = R"({"\"'\"'\"'\"'":"\"'\"'\"'\"'\""})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization6)
{
std::string input = R"([{"ABC':'CBA":'XYZ":"ZXY'}])";
std::string output = R"([{"ABC':'CBA":"XYZ\":\"ZXY"}])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization7)
{
std::string input = R"(["\t","\\t","\\","\\\'\"\\\\","\n","\b"])";
std::string output = R"(["\t","\\t","\\","\\\'\"\\\\","\n","\b"])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization8)
{
std::string input = R"(['\t','\\t','\\','\\\"\'\\\\','\n','\b','\u0012'])";
std::string output = R"(["\t","\\t","\\","\\\"'\\\\","\n","\b","\u0012"])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid1)
{
std::string input = R"(["THIS IS A TEST'])";
std::string output = R"(["THIS IS A TEST'])";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid2)
{
std::string input = R"(['THIS IS A TEST"])";
std::string output = R"(["THIS IS A TEST\"])";
run_test(input, output);
}
outbuf_size.data(),
normalize_quotes::start_state,
stream);

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid3)
{
std::string input = R"({"MORE TEST'N":'RESUL})";
std::string output = R"({"MORE TEST'N":"RESUL})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid4)
{
std::string input = R"({"NUMBER":100'0,'STRING':'SOMETHING'})";
std::string output = R"({"NUMBER":100"0,"STRING":"SOMETHING"})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid5)
{
std::string input = R"({'NUMBER':100"0,"STRING":"SOMETHING"})";
std::string output = R"({"NUMBER":100"0,"STRING":"SOMETHING"})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid6)
{
std::string input = R"({'a':'\\''})";
std::string output = R"({"a":"\\""})";
run_test(input, output);
}

TEST_F(FstTest, GroundTruth_QuoteNormalization_Invalid7)
{
std::string input = R"(}'a': 'b'{)";
std::string output = R"(}"a": "b"{)";
run_test(input, output);
outbuf.resize(outbuf_size.value(stream), stream);
return outbuf;
}

CUDF_TEST_PROGRAM_MAIN()
} // namespace detail
} // namespace cudf::io::json
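For readers following the FST tables, the behaviour exercised by the removed `GroundTruth_QuoteNormalization*` tests can be approximated on the host. This is a minimal sketch, not the code from this PR. `normalize_single_quotes_host` and `quote_state` are hypothetical names, and the transitions are inferred from the state names (`TT_OOS`, `TT_DQS`, `TT_SQS`, `TT_DEC`, `TT_SEC`) and the expected test outputs shown above rather than copied from `qna_state_tt`.

```cpp
#include <string>

// Hypothetical host-side mirror of the five DFA states named in the diff.
enum class quote_state {
  OOS,  // outside of any string
  DQS,  // inside a double-quoted string
  SQS,  // inside a single-quoted string
  DEC,  // just read a backslash inside a double-quoted string
  SEC   // just read a backslash inside a single-quoted string
};

// Sequential reference sketch: rewrites single-quoted strings as double-quoted ones.
std::string normalize_single_quotes_host(std::string const& in)
{
  std::string out;
  out.reserve(in.size() * 2);  // each input char emits at most two output chars
  quote_state state = quote_state::OOS;

  for (char const c : in) {
    switch (state) {
      case quote_state::OOS:
        if (c == '"') {
          out += c;
          state = quote_state::DQS;
        } else if (c == '\'') {
          out += '"';  // opening single quote becomes a double quote
          state = quote_state::SQS;
        } else {
          out += c;
        }
        break;
      case quote_state::DQS:
        out += c;  // double-quoted strings pass through unchanged
        if (c == '"') {
          state = quote_state::OOS;
        } else if (c == '\\') {
          state = quote_state::DEC;
        }
        break;
      case quote_state::SQS:
        if (c == '\'') {
          out += '"';  // closing single quote becomes a double quote
          state = quote_state::OOS;
        } else if (c == '"') {
          out += "\\\"";  // bare double quote must now be escaped
        } else if (c == '\\') {
          state = quote_state::SEC;  // emit nothing until the escaped char is known
        } else {
          out += c;
        }
        break;
      case quote_state::DEC:
        out += c;  // escaped char inside a double-quoted string is copied as-is
        state = quote_state::DQS;
        break;
      case quote_state::SEC:
        if (c == '\'') {
          out += '\'';  // \' no longer needs escaping inside double quotes
        } else {
          out += '\\';  // keep every other escape sequence intact
          out += c;
        }
        state = quote_state::SQS;
        break;
    }
  }
  return out;
}
```

Each input character yields at most two output characters in this sketch, which is consistent with the `inbuf.size() * 2` allocation for `outbuf` followed by the `resize` to the reported output size.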
21 changes: 19 additions & 2 deletions cpp/src/io/json/read_json.cu
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2022-2023, NVIDIA CORPORATION.
* Copyright (c) 2022-2024, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -22,6 +22,7 @@

#include <cudf/detail/nvtx/ranges.hpp>
#include <cudf/detail/utilities/vector_factories.hpp>
#include <cudf/io/detail/json.hpp>
#include <cudf/utilities/error.hpp>

#include <rmm/exec_policy.hpp>
@@ -45,6 +46,15 @@ size_t sources_size(host_span<std::unique_ptr<datasource>> const sources,
});
}

/**
* @brief Read from array of data sources into RMM buffer
*
* @param sources Array of data sources
* @param compression Compression format of source
* @param range_offset Number of bytes to skip from source start
* @param range_size Number of bytes to read from source
* @param stream CUDA stream used for device memory operations and kernel launches
*/
rmm::device_uvector<char> ingest_raw_input(host_span<std::unique_ptr<datasource>> sources,
compression_type compression,
size_t range_offset,
@@ -217,7 +227,14 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
"Multiple inputs are supported only for JSON Lines format");
}

auto const buffer = get_record_range_raw_input(sources, reader_opts, stream);
auto buffer = get_record_range_raw_input(sources, reader_opts, stream);

// If input JSON buffer has single quotes and option to normalize single quotes is enabled,
// invoke pre-processing FST
if (reader_opts.is_enabled_normalize_single_quotes()) {
buffer =
normalize_single_quotes(std::move(buffer), stream, rmm::mr::get_current_device_resource());
}

return device_parse_nested_json(buffer, reader_opts, stream, mr);
// For debug purposes, use host_parse_nested_json()
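The change in `read_json` drops `const` from `buffer` so that the raw input can be moved into the normalization pass and replaced by its result. A host-side sketch of that shape follows. `maybe_preprocess` and its `Pass` parameter are hypothetical and use `std::vector<char>` in place of `rmm::device_uvector<char>`. The real code calls `normalize_single_quotes` directly rather than taking a callable.

```cpp
#include <utility>
#include <vector>

// Hypothetical host-side analogue of the optional pre-processing step.
// `Pass` is any callable taking std::vector<char>&& and returning std::vector<char>.
template <typename Pass>
std::vector<char> maybe_preprocess(std::vector<char> buffer, bool enabled, Pass&& pass)
{
  if (enabled) {
    // The pass consumes the old buffer; its result replaces it.
    // This reassignment is why `buffer` cannot be const.
    buffer = pass(std::move(buffer));
  }
  return buffer;
}
```

When `enabled` is `false` the buffer is returned untouched, which matches the reader's default behaviour when the option is not set.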