add option to nullify empty lines #17028


karthikeyann

Description

This PR adds an option to nullify empty lines. The pandas JSON reader ignores empty lines, but Spark requires each empty line to produce a null row. The option therefore takes effect only when the recovery mode RECOVER_WITH_NULL is used.
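
To make the two behaviors concrete, here is a minimal host-side sketch. It is not the libcudf implementation; `empty_line_policy` and `split_json_lines` are hypothetical names used only for illustration, and a `std::nullopt` entry stands in for a null row.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical policy names for illustration; these are not libcudf identifiers.
enum class empty_line_policy { SKIP, NULLIFY };

// Split newline-delimited input into rows.
// SKIP drops empty lines (the pandas-like behavior described above).
// NULLIFY keeps them as null rows (the Spark-like behavior described above).
std::vector<std::optional<std::string>> split_json_lines(std::string const& input,
                                                         empty_line_policy policy)
{
  std::vector<std::optional<std::string>> rows;
  std::size_t begin = 0;
  while (begin < input.size()) {
    std::size_t end = input.find('\n', begin);
    if (end == std::string::npos) { end = input.size(); }
    std::string line = input.substr(begin, end - begin);
    if (!line.empty()) {
      rows.emplace_back(std::move(line));
    } else if (policy == empty_line_policy::NULLIFY) {
      rows.emplace_back(std::nullopt);  // an empty line becomes a null row
    }
    begin = end + 1;
  }
  return rows;
}
```

With the input `{"a":1}`, an empty line, then `{"a":2}`, the SKIP policy yields two rows and the NULLIFY policy yields three, the middle one null. The sketch treats only zero-length lines as empty; how whitespace-only lines should be handled is left open here.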

TODO: unit tests.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

github-actions bot added the "libcudf" label on Oct 9, 2024.
karthikeyann added the "feature request", "2 - In Progress", "cuIO", and "non-breaking" labels on Oct 9, 2024.
Comment on lines 242 to +243
struct TransduceToken {
bool nullify_empty_lines;

Does this add any performance overhead? Please run the benchmarks with this change. If there is a slowdown, we probably need to make this a template argument (at the cost of compile time) so that the code can be optimized out when it is false.
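
A rough sketch of what the template-argument approach could look like, under the assumption that the flag only guards a small branch. `count_rows_fn` and `count_rows` are hypothetical stand-ins, not the real `TransduceToken`; the point is only that `if constexpr` removes the branch from the `false` instantiation, while a thin wrapper maps the runtime option to one of the two instantiations.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for a functor that carries the flag as a template
// argument instead of a runtime member.
template <bool NullifyEmptyLines>
struct count_rows_fn {
  // Counts the rows produced from newline-delimited input.
  std::size_t operator()(std::string const& input) const
  {
    std::size_t rows      = 0;
    bool line_has_content = false;
    for (char c : input) {
      if (c != '\n') {
        line_has_content = true;
        continue;
      }
      if (line_has_content) {
        ++rows;
      } else {
        // This branch is discarded at compile time when the flag is false.
        if constexpr (NullifyEmptyLines) { ++rows; }
      }
      line_has_content = false;
    }
    if (line_has_content) { ++rows; }  // last line without a trailing newline
    return rows;
  }
};

// Map the runtime option to a compile-time instantiation once, outside the hot loop.
inline std::size_t count_rows(std::string const& input, bool nullify_empty_lines)
{
  return nullify_empty_lines ? count_rows_fn<true>{}(input) : count_rows_fn<false>{}(input);
}
```

The trade-off is the one mentioned above: two instantiations are compiled instead of one, so compile time and binary size grow in exchange for a branch-free `false` path. Whether that is worth it depends on the benchmark results.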

@@ -73,5 +74,9 @@ table_with_metadata read_json(host_span<std::unique_ptr<datasource>> sources,
rmm::cuda_stream_view stream,
rmm::device_async_resource_ref mr);

std::tuple<rmm::device_buffer, char> preprocess(cudf::strings_column_view const& input,

I see that this function is called only in tests. Is it needed anywhere else in the source code? If not, can we generate the test string directly without it?

karthikeyann (PR author) replied:

Yes, you can test without this function. The idea is that each string row is followed by one delimiter character that does not appear in any of the strings. @shrshi provided this function so that you can easily convert a strings column into an rmm buffer and its delimiter.
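
For reference, the idea can be sketched on the host as follows. This is only an illustration of "append one unused delimiter after every row"; `join_with_unused_delimiter` is a hypothetical name, the delimiter search order is an assumption, and the real `preprocess` works on a `cudf::strings_column_view` and returns an `rmm::device_buffer`.

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side analogue: concatenate rows into one buffer, appending
// after each row a delimiter byte that occurs in none of the rows.
// Returns std::nullopt if every candidate byte value is already in use.
std::optional<std::pair<std::string, char>> join_with_unused_delimiter(
  std::vector<std::string> const& rows)
{
  std::array<bool, 256> seen{};  // seen[b] is true if byte b occurs in any row
  std::size_t total = 0;
  for (auto const& row : rows) {
    total += row.size() + 1;  // +1 for the delimiter appended to this row
    for (unsigned char c : row) { seen[c] = true; }
  }

  // Assumption: prefer '\n', otherwise take the first unused non-zero byte.
  int delimiter = seen['\n'] ? -1 : '\n';
  for (int c = 1; delimiter < 0 && c < 256; ++c) {
    if (!seen[c]) { delimiter = c; }
  }
  if (delimiter < 0) { return std::nullopt; }

  std::string buffer;
  buffer.reserve(total);
  for (auto const& row : rows) {
    buffer += row;
    buffer.push_back(static_cast<char>(delimiter));
  }
  return std::make_pair(std::move(buffer), static_cast<char>(delimiter));
}
```

Because every row gets its own delimiter, an empty string row shows up in the buffer as two consecutive delimiters, which is exactly the empty-line case this PR is about.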

shrshi commented Oct 25, 2024

This PR is waiting for #17178 to be resolved.

shrshi commented Oct 28, 2024

Closed since the performance improvement of cub::BatchedMemcpy over join_strings is not significant. This renders the bug fix #17178 unnecessary.
