Question: Can DataFusion handle larger than RAM datasets? #464

Closed
voltcode opened this issue Jun 1, 2021 · 4 comments
Labels
question (Further information is requested)


voltcode commented Jun 1, 2021

I browsed the README and slides but failed to grok this: can DataFusion handle larger-than-RAM datasets? In other words, if I register multiple Parquet files whose total size exceeds RAM, will they all get loaded into memory, or will DataFusion carefully manage memory buffers to avoid an out-of-memory error?

As an extension of this question, I'd like to ask for pointers on how one can tune DataFusion's resource usage if necessary.

Contributor

alamb commented Jun 4, 2021

@voltcode -- DataFusion is, at its core, an in-memory processing system.

That being said, depending on what the plan is doing, simply reading from a large number of parquet files does not necessarily mean they will be decompressed all at once into memory.

DataFusion has several features that keep the memory usage down:

  1. It reads only the columns required for the query ("projection pushdown").
  2. It attempts to prune row groups based on their metadata and skips them entirely where possible.
  3. It has a "streaming" model of computation, so it reads the Parquet files into memory in small batches.
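
To make the three points above concrete, here is a toy sketch in plain Python. It is not DataFusion's API or the Parquet format: the `make_row_group` and `scan` helpers, the dictionary layout, and the `min_id` predicate are all assumptions made up for illustration. It only shows the general idea of skipping row groups from min/max metadata, returning just the projected columns, and yielding small batches instead of one large table.

```python
from typing import Dict, Iterator, List

# Hypothetical in-memory stand-in for a columnar file: a list of row groups,
# each carrying min/max statistics for the "id" column.
RowGroup = Dict[str, dict]
Batch = Dict[str, list]


def make_row_group(ids: List[int], names: List[str], payloads: List[str]) -> RowGroup:
    """Build one toy row group with (min, max) statistics for "id"."""
    return {
        "stats": {"id": (min(ids), max(ids))},
        "columns": {"id": ids, "name": names, "payload": payloads},
    }


def scan(
    row_groups: List[RowGroup],
    projection: List[str],
    min_id: int,
    batch_size: int = 2,
) -> Iterator[Batch]:
    """Yield small batches of the projected columns for rows with id >= min_id."""
    for rg in row_groups:
        _, max_seen = rg["stats"]["id"]
        if max_seen < min_id:
            # Row-group pruning: the metadata proves no row can match,
            # so the column data is never touched.
            continue
        ids = rg["columns"]["id"]
        for start in range(0, len(ids), batch_size):
            stop = min(start + batch_size, len(ids))
            keep = [i for i in range(start, stop) if ids[i] >= min_id]
            if keep:
                # Projection: only the requested columns are materialized.
                # "id" is read to evaluate the predicate but is not returned
                # unless it was asked for.
                yield {col: [rg["columns"][col][i] for i in keep] for col in projection}


toy_file = [
    make_row_group([1, 2, 3], ["a", "b", "c"], ["x" * 10] * 3),
    make_row_group([10, 11, 12], ["d", "e", "f"], ["y" * 10] * 3),
]

for batch in scan(toy_file, projection=["name"], min_id=10):
    print(batch)
```

In this sketch the first row group is skipped outright, the `payload` column is never copied, and the consumer sees at most `batch_size` rows at a time.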

Certain operations in DataFusion are likely to consume large amounts of memory, notably "Sort" and "Join" (as well as grouping when there are large numbers of distinct groups).
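
As a rough illustration of why these operators behave differently from a plain scan, here is a toy sketch in plain Python. It does not use DataFusion; the function names and the `"key"` column are hypothetical. A simple count keeps constant state no matter how many batches stream through, a group-by keeps one entry per distinct key, and a sort in this naive form has to buffer every row before it can emit anything.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Batch = Dict[str, list]


def streaming_count(batches: Iterable[Batch]) -> int:
    """Constant state: a single integer, regardless of input size."""
    total = 0
    for batch in batches:
        total += len(batch["key"])
    return total


def group_count(batches: Iterable[Batch]) -> Dict[str, int]:
    """State grows with the number of distinct keys seen so far."""
    counts: Dict[str, int] = defaultdict(int)
    for batch in batches:
        for key in batch["key"]:
            counts[key] += 1
    return dict(counts)


def sort_keys(batches: Iterable[Batch]) -> List[str]:
    """State grows with the total number of rows: all input is buffered first."""
    buffered: List[str] = []
    for batch in batches:
        buffered.extend(batch["key"])
    return sorted(buffered)


toy_batches = [{"key": ["b", "a"]}, {"key": ["a", "c"]}]
print(streaming_count(toy_batches))
print(group_count(toy_batches))
print(sort_keys(toy_batches))
```

A join that builds a hash table on one input falls into the same category as the group-by here: its state scales with the data on the build side rather than with the batch size.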

Contributor

alamb commented Jun 4, 2021

I am not sure whether any documentation has been written about tuning DataFusion's resource usage -- perhaps @andygrove would know if such documentation exists.

jorgecarleitao added the "question" label Jun 6, 2021
Contributor

alamb commented Jun 18, 2021

Possibly related: #587 (feature to keep memory limit)

Contributor

alamb commented Aug 16, 2021

I think this question is answered, so I am closing this ticket.

alamb closed this as completed Aug 16, 2021