Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Audit_3.0.1: Test for not ignoring rows in CSV that start with null char. Test in cudf #958

Open
Tracked by #2063
razajafri opened this issue Oct 15, 2020 · 3 comments
Labels
audit_3.0.1 Audit related tasks for 3.0.1 feature request New feature or request P2 Not required for release

Comments

@razajafri
Copy link
Collaborator

razajafri commented Oct 15, 2020

commit 68ff809060c312fee1fbe8d46657e6c7aec62dba
Author: Sean Owen [email protected]
Date: Wed Aug 26 00:25:58 2020 +0900

[SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
@razajafri razajafri added feature request New feature or request ? - Needs Triage Need team to review and classify audit_3.0.1 Audit related tasks for 3.0.1 labels Oct 15, 2020
@razajafri razajafri changed the title [FEA] Test for not ignoring rows in CSV that start with null char. Test in cudf [FEA] Audit_3.0.1: Test for not ignoring rows in CSV that start with null char. Test in cudf Oct 15, 2020
@sameerz sameerz added P0 Must have for release and removed ? - Needs Triage Need team to review and classify labels Oct 20, 2020
@kuhushukla
Copy link
Collaborator

@razajafri why do we need to audit this any further? COuld you elaborate here? Thanks!

@razajafri
Copy link
Collaborator Author

To support this we will need to make change in cudf to expose an API for the user to set/unset whether to ignore the lines starting with commentPrefix.

@revans2
Copy link
Collaborator

revans2 commented Nov 11, 2020

@razajafri if I read through the cudf code '\0' appears to disable comments

https://github.com/rapidsai/cudf/blob/aaba25034c3b927fc1fb7a80ab4947eb06a6c6b5/cpp/src/io/csv/csv_gpu.cu#L1015
https://github.com/rapidsai/cudf/blob/aaba25034c3b927fc1fb7a80ab4947eb06a6c6b5/cpp/src/io/csv/csv_gpu.cu#L1115

So what we would need to do is to fall back to the CPU if someone actually set the comment character to '\0' instead of going off of the default. We probably also should update the docs in cudf to explain this.

Because this is CSV and only for a very small corner case I don't think it is that critical to fix right now so dropping the priority.

@revans2 revans2 added P2 Not required for release and removed P0 Must have for release labels Nov 11, 2020
@revans2 revans2 mentioned this issue Oct 27, 2022
38 tasks
tgravescs pushed a commit to tgravescs/spark-rapids that referenced this issue Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
audit_3.0.1 Audit related tasks for 3.0.1 feature request New feature or request P2 Not required for release
Projects
None yet
Development

No branches or pull requests

4 participants