
data.table 50gb join benchmarks #4900

Closed · myoung3 opened this issue Feb 12, 2021 · 1 comment

myoung3 (Contributor) commented Feb 12, 2021

data.table and most other solutions (except pydatatable and spark) are currently showing out-of-memory errors for the 50gb join:
https://h2oai.github.io/db-benchmark/

  1. Was this always the case for data.table, or is this a (memory) performance regression?
  2. It might be informative to also show the memory needed for a join to succeed, i.e. constrain a machine image to a fixed amount of memory, run the tests, see which ones succeed, then increase the memory allocation by 10gb and repeat (a sketch of this idea follows below).

You'd need a server with more than the current 125gb though.
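A minimal sketch of that increment-and-retry idea, assuming Docker is available to cap container memory; the image name and script name here are hypothetical placeholders, not real benchmark artifacts:

```r
# Re-run the join under an increasing container memory cap until it
# succeeds; --memory is the standard docker flag for capping RAM.
for (mem_gb in seq(30L, 120L, by = 10L)) {
  status <- system2("docker",
                    c("run", "--rm", sprintf("--memory=%dg", mem_gb),
                      "db-benchmark",          # placeholder image name
                      "Rscript", "join.R"))    # placeholder script
  cat(sprintf("cap %d GB -> exit status %d\n", mem_gb, status))
  if (status == 0L) break  # smallest tested cap at which the join succeeds
}
```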

jangorecki (Member) commented

@myoung3 thanks for those questions. I think they belong in the h2oai/db-benchmark repository, but I will answer them here and close this issue. If you have any follow-up questions, please submit them in the db-benchmark repo.

  1. It has always been like this.
  2. Showing how much memory a join would require is not really doable up front. What could be useful instead is to present how much memory the 1e7 and 1e8 data sizes required; guessing the memory required for 1e9 would then be easier. This feature can be tracked in "measure memory usage" h2oai/db-benchmark#9 (a rough sketch of a per-query measurement follows this list).
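A minimal sketch of such a per-query measurement: gc(reset = TRUE) and the "max used" column are base R, but the table size and join key here are illustrative only, not the benchmark's actual setup:

```r
library(data.table)
n <- 1e7
x <- data.table(id = sample(n), v = runif(n))
y <- data.table(id = sample(n), w = runif(n))
invisible(gc(reset = TRUE))   # reset R's "max used" counters
ans <- x[y, on = "id"]        # the query being measured
gc()[, "max used"]            # peak Ncells/Vcells allocated since the reset
```

Note this only captures R-level allocations; the benchmark would likely want an OS-level measurement (e.g. cgroup peak usage) as well.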

We don't want a server with more memory because we want the benchmark to exercise the out-of-memory scenario. Pydatatable and Spark are the only tools that handle out-of-memory data (dask, despite having those capabilities, cannot resolve those queries).
For the join script we are loading 4 datasets: 50 + 0.5 + 5 + 50 = 105.5 GB of csv (see the sketch below). This 105.5 GB is just the data, not the extra memory required for reading, parsing, aggregating, finding groups, etc. Therefore it is expected that the 1e9 data size for the join task will require out-of-memory support to complete. We hope that over time more tools will add this feature; for data.table you can track it in #1336. To present the out-of-memory scenario for the groupby task we need a bigger data size than we currently have, and the request for that can be tracked in h2oai/db-benchmark#39.
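For reference, a sketch of what the join script loads; the file names and the join key below are assumptions based on the sizes quoted above, not verified paths:

```r
library(data.table)
X      <- fread("J1_1e9_NA_0_0.csv")    # ~50  GB  left-hand table
small  <- fread("J1_1e9_1e3_0_0.csv")   # ~0.5 GB  small right table
medium <- fread("J1_1e9_1e6_0_0.csv")   # ~5   GB  medium right table
big    <- fread("J1_1e9_1e9_0_0.csv")   # ~50  GB  big right table
ans <- X[big, on = "id3"]               # e.g. the big-to-big join query
```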
