I browsed the README and slides but failed to grok this: can DataFusion handle larger-than-RAM datasets? In other words, if I register multiple parquet files whose total size exceeds RAM, will they all be loaded into memory, or will DataFusion carefully manage memory buffers to avoid an out-of-memory exception?
As an extension of this question, I'd like to ask for pointers on how one can tune DataFusion's resource usage if necessary.
@voltcode -- DataFusion is at its core an in-memory processing system.
That being said, depending on what the plan is doing, simply reading from a large number of parquet files does not necessarily mean they will be decompressed all at once into memory.
DataFusion has several features that keep the memory usage down:
It will only read the columns required for the query ("projection pushdown").
It will attempt to prune row groups (based on metadata) and skip them entirely if possible.
It has a "streaming" model of computation, so it reads the parquet files into memory in small batches.
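To make the three points above concrete, here is a toy sketch in plain Python. It is not DataFusion code and does not use its API: `RowGroup`, `scan`, and the sample columns are hypothetical names invented for illustration, and the min/max statistics are computed on the fly here, whereas a real parquet reader would take them from file metadata.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List


@dataclass
class RowGroup:
    """Toy stand-in for a parquet row group: column name -> values."""
    columns: Dict[str, List[int]]

    def max_of(self, column: str) -> int:
        # Assumption: a real reader gets this from metadata without
        # decoding the column; we compute it to keep the sketch short.
        return max(self.columns[column])


def scan(row_groups: List[RowGroup], projection: List[str],
         filter_column: str, lower_bound: int,
         batch_size: int = 2) -> Iterator[List[dict]]:
    """Yield small batches of rows where filter_column >= lower_bound."""
    for group in row_groups:
        # Row group pruning: skip the whole group if no row can match.
        if group.max_of(filter_column) < lower_bound:
            continue
        # Projection pushdown: touch only the columns the query needs.
        wanted = set(projection) | {filter_column}
        needed = {name: group.columns[name] for name in wanted}
        n_rows = len(needed[filter_column])
        # Streaming: hand back at most batch_size rows at a time.
        for start in range(0, n_rows, batch_size):
            stop = min(start + batch_size, n_rows)
            batch = [
                {name: needed[name][i] for name in projection}
                for i in range(start, stop)
                if needed[filter_column][i] >= lower_bound
            ]
            if batch:
                yield batch


groups = [
    RowGroup({"id": [1, 2, 3], "price": [5, 6, 7], "note": [0, 0, 0]}),
    RowGroup({"id": [4, 5, 6], "price": [50, 60, 70], "note": [0, 0, 0]}),
]
# Roughly "SELECT id WHERE price >= 50": the first group is pruned,
# the "note" column is never read, and rows arrive two at a time.
batches = list(scan(groups, ["id"], "price", 50))
```

Because `scan` is a generator, only one small batch is materialized at a time; the caller decides whether to accumulate the results.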
Certain operations in DataFusion are likely to consume large amounts of memory, notably "Sort" and "Join" (as well as grouping when there is a large number of distinct groups).
I am not sure there is any documentation written about tuning the resource usage of DataFusion -- perhaps @andygrove would know if such documentation exists.
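As a rough illustration of why grouping can be expensive even when the input is streamed, the following toy Python sketch keeps one running sum per distinct key. It is a simplified model, not DataFusion's implementation, and `streaming_sum_by_key` is a hypothetical name.

```python
from typing import Dict, Hashable, Iterable, List, Tuple

Batch = List[Tuple[Hashable, int]]


def streaming_sum_by_key(batches: Iterable[Batch]) -> Dict[Hashable, int]:
    """Sum values per key while consuming the input one batch at a time."""
    state: Dict[Hashable, int] = {}
    for batch in batches:
        for key, value in batch:
            # One entry per distinct key stays alive until the input ends,
            # so memory tracks the number of groups, not the batch size.
            state[key] = state.get(key, 0) + value
    return state


few_groups = streaming_sum_by_key([[("a", 1), ("b", 2)], [("a", 3)]])
many_groups = streaming_sum_by_key([[(i, 1)] for i in range(1000)])
```

Here `few_groups` holds two entries while `many_groups` holds one thousand, even though every batch in the second call contains a single row. A sort or a join build side behaves similarly in that results cannot be emitted until the relevant input has been buffered.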