data.table and most other approaches (except pydatatable and spark) are showing out-of-memory errors right now for the 50 GB join: https://h2oai.github.io/db-benchmark/
Was this always the case for data.table, or is this a (memory) performance regression?
It might be informative to also show the memory needed for a join to happen, i.e. constrain a machine image to a certain amount of memory, run the tests, see which ones succeed, then increase the memory allocation by 10 GB and repeat (see the sketch below).
You'd need a server with more than the current 125 GB though.
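A rough sketch of that incremental-cap idea, just for illustration: the script name, step size, and use of `ulimit` are assumptions here, not part of db-benchmark.

```python
import subprocess

# Hypothetical sketch: re-run the join under an increasing address-space cap
# and note the first limit at which it completes.  "join-datatable.R" is a
# placeholder for the actual benchmark script.
GB_IN_KB = 1024 ** 2  # `ulimit -v` takes kilobytes

for limit_gb in range(10, 130, 10):          # 10 GB steps up to 120 GB
    cmd = f"ulimit -v {limit_gb * GB_IN_KB} && Rscript join-datatable.R"
    result = subprocess.run(["bash", "-c", cmd])
    print(f"{limit_gb} GB cap -> exit code {result.returncode}")
    if result.returncode == 0:
        break  # smallest cap (in 10 GB steps) under which the join succeeded
```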
@myoung3 thanks for those questions. I think they belong in the h2oai/db-benchmark repository. I will answer them here and close the issue. If you have any follow-up questions please submit them in the db-benchmark repo.
It was always like this, so it is not a regression.
Showing how much memory a join would require is not really doable. What could be useful instead is to present how much memory the 1e7 and 1e8 data sizes required; estimating the memory required for 1e9 would then be easier. This feature can be tracked in "measure memory usage" h2oai/db-benchmark#9.
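As a sketch of that kind of estimate, assuming peak memory grows roughly linearly with row count (the numbers below are placeholders, not measured values):

```python
# Placeholder peak-memory figures for the smaller sizes (GB); real values
# would come from the measurement tracked in h2oai/db-benchmark#9.
peak_1e7 = 2.0
peak_1e8 = 20.0

# If peak memory scales roughly with row count, a 10x larger input
# needs roughly 10x the memory.
growth = peak_1e8 / peak_1e7
est_1e9 = peak_1e8 * growth
print(f"estimated peak for 1e9 rows: ~{est_1e9:.0f} GB")  # ~200 GB, above the 125 GB of RAM
```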
We don't want a server with more memory, because we want the benchmark to exercise the out-of-memory scenario. pydatatable and Spark are the only tools that handle out-of-memory data (Dask, despite having those capabilities, cannot resolve those queries).
For the join script we are loading 4 datasets: 50 + 0.5 + 5 + 50 = 105.5 GB of csv. This 105.5 GB is just the data, not the memory required for reading, parsing, aggregating, finding groups, etc. Therefore it is expected that the 1e9 data size for the join task will require out-of-memory support to complete. We hope that over time more tools will add this feature; for data.table you can track it in #1336. To present the out-of-memory scenario for the groupby task we need a bigger data size than we currently have, and the request for that can be tracked in h2oai/db-benchmark#39.
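A back-of-the-envelope check of why 105.5 GB of csv won't fit in 125 GB of RAM once working space is included (the overhead factor below is an assumption for illustration only):

```python
# Raw csv inputs for the 1e9 join task (GB), as listed above.
csv_gb = [50, 0.5, 5, 50]
raw_total = sum(csv_gb)                 # 105.5 GB of input data alone

# Assumed overhead factor for parsing buffers, group indexes and join output;
# 1.5x is illustrative, not a measured figure.
overhead = 1.5
ram_gb = 125

working_set = raw_total * overhead
print(f"input: {raw_total} GB, rough working set: ~{working_set:.0f} GB, RAM: {ram_gb} GB")
# ~158 GB comfortably exceeds 125 GB, hence out-of-memory support is needed.
```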