[FEA] Explore ways to not use HadoopFileLinesReader for CSV parsing #6

Open
Tracked by #2063
revans2 opened this issue May 28, 2020 · 1 comment
Labels
feature request New feature or request P1 Nice to have for release performance A performance related task/issue SQL part of the SQL/Dataframe plugin

Comments

@revans2
Collaborator

revans2 commented May 28, 2020

Is your feature request related to a problem? Please describe.
When parsing CSV, the CPU currently reads through the data using the HadoopFileLinesReader and replaces the line endings. From a performance standpoint, it would be better to block-copy most of the data and skip the line-ending translation. This would require the cudf CSV reader to support line endings of '\r', '\n', or '\r\n'. This is not a simple task, but it could reduce CPU utilization significantly.
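
To make the requirement concrete, here is a minimal sketch of what "support all three line endings without translating them" could look like. It is plain Python for illustration only, not the plugin's or cudf's code. The function name `find_line_spans` is hypothetical. It assumes the whole block is in memory, and it ignores quoted fields that contain embedded newlines as well as a '\r\n' pair split across two blocks.

```python
def find_line_spans(block: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of each line in `block`.

    The terminators '\\r', '\\n', and '\\r\\n' are recognized in place and
    excluded from the spans, so the block itself is never rewritten.
    Hypothetical helper for illustration; not part of any real API.
    """
    spans = []
    start = 0
    i = 0
    n = len(block)
    while i < n:
        byte = block[i]
        if byte == 0x0A:  # '\n'
            spans.append((start, i))
            i += 1
            start = i
        elif byte == 0x0D:  # '\r', possibly followed by '\n'
            spans.append((start, i))
            if i + 1 < n and block[i + 1] == 0x0A:
                i += 2  # consume '\r\n' as a single terminator
            else:
                i += 1
            start = i
        else:
            i += 1
    if start < n:
        # Trailing line with no terminator.
        spans.append((start, n))
    return spans


# The same buffer mixes all three terminator styles.
data = b"a,b\r\nc,d\re,f\ng,h"
lines = [data[s:e] for s, e in find_line_spans(data)]
```

Because only offsets are produced, the original bytes can be copied as a block and handed to a reader that understands these terminators, which is the copy-without-translation idea described above.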

@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify SQL part of the SQL/Dataframe plugin performance A performance related task/issue labels May 28, 2020
@sameerz sameerz changed the title [FEA] explore ways not use HadoopFileLinesReader for CSV parseing [FEA] Explore ways to not use HadoopFileLinesReader for CSV parsing Oct 13, 2020
@sameerz sameerz removed the ? - Needs Triage Need team to review and classify label Oct 20, 2020
@revans2
Collaborator Author

revans2 commented Oct 21, 2020

I filed rapidsai/cudf#6572 in cudf to try and support this.

wjxiz1992 pushed a commit to wjxiz1992/spark-rapids that referenced this issue Oct 29, 2020
Update scala app version to 0.2.2
@mattahrens mattahrens added the P1 Nice to have for release label Apr 27, 2022
@revans2 revans2 mentioned this issue Oct 27, 2022
38 tasks
gerashegalov pushed a commit to gerashegalov/spark-rapids that referenced this issue Nov 18, 2022
…tampNTZEnabled

Fix errors caused by 340+ not working on DB
wjxiz1992 referenced this issue in nvliyuan/yuali-spark-rapids Apr 26, 2024
* A hacky approach for regexpr rewrite

Signed-off-by: Haoyang Li <[email protected]>

* Use contains instead for that case

Signed-off-by: Haoyang Li <[email protected]>

* add config to switch

Signed-off-by: Haoyang Li <[email protected]>

* Rewrite some rlike expression to StartsWith/EndsWith/Contains

Signed-off-by: Haoyang Li <[email protected]>

* clean up

Signed-off-by: Haoyang Li <[email protected]>

* wip

Signed-off-by: Haoyang Li <[email protected]>

* wip

Signed-off-by: Haoyang Li <[email protected]>

* add tests and config

Signed-off-by: Haoyang Li <[email protected]>

---------

Signed-off-by: Haoyang Li <[email protected]>
sperlingxx pushed a commit to sperlingxx/spark-rapids that referenced this issue May 16, 2024
3 participants