[FEA] Split batches from parquet that are too large, and try to guess better before decompressing #4968

Closed
revans2 opened this issue Mar 16, 2022 · 0 comments · Fixed by rapidsai/cudf#11867, rapidsai/cudf#11961 or #6934
Labels: P0 (Must have for release), reliability (Features to improve reliability or bugs that severely impact the reliability of the plugin), task (Work required that improves the product but is not user facing)

revans2 commented Mar 16, 2022

Is your feature request related to a problem? Please describe.
You might consider this a bug or an enterprise feature; I am fine either way. Occasionally we get extremely high compression ratios on ORC and Parquet files. Right now we have a config that limits the size of the data based on the size of the compressed input. But if the compression ratio is really good, the decompressed data can violate the batch size limit by a lot, especially if we are reading all of the columns in the file. It would be great if we could:

  1. Guess the output size of the data based on row group metrics (number of rows) and the schema. Most of the time we see this with fixed-width types, and if we can tell early on that we are about to decompress too much data, we should limit the amount sent to the GPU.
  2. If we do guess wrong (by a lot, and we need to decide what "a lot" means), split the batch and allow some of the resulting batches to be spillable while we process the first one.
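
To make the two requests concrete, here is a minimal sketch in Python. It is not the plugin's or cuDF's implementation. Every name in it (`RowGroup`, `plan_batches`, `split_if_too_large`, the byte-width table, the 32-byte guess for variable-width types, and the `tolerance` factor standing in for "a lot") is an assumption made for illustration. Step 1 estimates the decompressed size from row counts and the schema before anything is sent to the GPU. Step 2 splits a batch by row ranges when the real size overshoots the limit; making the extra pieces spillable is not modeled here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed widths for fixed-width types (illustrative, not taken from the plugin).
FIXED_WIDTH_BYTES: Dict[str, int] = {
    "bool": 1, "int32": 4, "int64": 8, "float32": 4, "float64": 8,
}
# Hypothetical per-value guess for strings and other variable-width types.
ASSUMED_VAR_WIDTH_BYTES = 32


@dataclass
class RowGroup:
    num_rows: int  # row count as reported by the file footer


def estimate_row_bytes(schema: List[str]) -> int:
    """Estimate the decompressed size of one row from the column types."""
    return sum(FIXED_WIDTH_BYTES.get(t, ASSUMED_VAR_WIDTH_BYTES) for t in schema)


def plan_batches(row_groups: List[RowGroup], schema: List[str],
                 batch_limit_bytes: int) -> List[List[RowGroup]]:
    """Step 1: group row groups so each batch's estimated size fits the limit."""
    row_bytes = estimate_row_bytes(schema)
    batches: List[List[RowGroup]] = []
    current: List[RowGroup] = []
    current_bytes = 0
    for rg in row_groups:
        rg_bytes = rg.num_rows * row_bytes
        # Start a new batch when adding this row group would exceed the limit.
        if current and current_bytes + rg_bytes > batch_limit_bytes:
            batches.append(current)
            current, current_bytes = [], 0
        current.append(rg)
        current_bytes += rg_bytes
    if current:
        batches.append(current)
    return batches


def split_if_too_large(num_rows: int, actual_bytes: int, batch_limit_bytes: int,
                       tolerance: float = 2.0) -> List[Tuple[int, int]]:
    """Step 2: if the real size overshoots by more than `tolerance`, return
    row ranges (start, end) that each hold roughly batch_limit_bytes."""
    if actual_bytes <= batch_limit_bytes * tolerance or num_rows <= 1:
        return [(0, num_rows)]
    pieces = min(num_rows, -(-actual_bytes // batch_limit_bytes))  # ceiling division
    step = -(-num_rows // pieces)
    return [(s, min(s + step, num_rows)) for s in range(0, num_rows, step)]
```

Splitting by row ranges assumes rows are roughly equal in size, which holds for the fixed-width case described above but is only approximate for strings or nested types.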
@revans2 revans2 added feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 16, 2022
@sameerz sameerz added task Work required that improves the product but is not user facing and removed feature request New feature or request ? - Needs Triage Need team to review and classify labels Mar 22, 2022
@revans2 revans2 mentioned this issue Apr 8, 2022
14 tasks
@revans2 revans2 added P0 Must have for release reliability Features to improve reliability or bugs that severely impact the reliability of the plugin labels Apr 12, 2022
@ttnghia ttnghia linked a pull request Nov 4, 2022 that will close this issue
@revans2 revans2 changed the title [FEA] Split batches from parquet/orc that are too large, and try to guess better before decompressing [FEA] Split batches from parquet that are too large, and try to guess better before decompressing Nov 21, 2022