
data.table 50gb join benchmarks #4900

Closed · myoung3 opened this issue Feb 12, 2021 · 1 comment

myoung3 (Contributor) commented Feb 12, 2021

data.table and most other solutions (except pydatatable and spark) are currently showing out-of-memory errors for the 50gb join:
https://h2oai.github.io/db-benchmark/

  1. Was this always the case for data.table, or is this a (memory) performance regression?
  2. It might be informative to also show the memory needed for a join to succeed, i.e. constrain a machine image to a fixed amount of memory, run the tests, see which ones succeed, then increase the memory allocation by 10gb and repeat (a sketch of this idea follows below).

You'd need a server with more than the current 125gb though.
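A minimal sketch of that increment-and-retry idea, assuming Docker is available to cap container memory; the image name and script name here are hypothetical placeholders, not real benchmark artifacts:

```r
# Re-run the join under an increasing container memory cap until it
# succeeds; --memory is the standard docker flag for capping RAM.
for (mem_gb in seq(30L, 120L, by = 10L)) {
  status <- system2("docker",
                    c("run", "--rm", sprintf("--memory=%dg", mem_gb),
                      "db-benchmark",          # placeholder image name
                      "Rscript", "join.R"))    # placeholder script
  cat(sprintf("cap %d GB -> exit status %d\n", mem_gb, status))
  if (status == 0L) break  # smallest tested cap at which the join succeeds
}
```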

jangorecki (Member) commented

@myoung3 thanks for those questions. I think they belong in the h2oai/db-benchmark repository, but I will answer them here and close this issue. If you have any follow-up questions, please submit them in the db-benchmark repo.

  1. It has always been like this.
  2. Showing how much memory a join would require is not really doable up front. What could be useful instead is to present how much memory the 1e7 and 1e8 data sizes required; guessing the memory required for 1e9 would then be easier. This feature can be tracked in "measure memory usage" h2oai/db-benchmark#9 (a rough sketch of a per-query measurement follows this list).
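A minimal sketch of such a per-query measurement: gc(reset = TRUE) and the "max used" column are base R, but the table size and join key here are illustrative only, not the benchmark's actual setup:

```r
library(data.table)
n <- 1e7
x <- data.table(id = sample(n), v = runif(n))
y <- data.table(id = sample(n), w = runif(n))
invisible(gc(reset = TRUE))   # reset R's "max used" counters
ans <- x[y, on = "id"]        # the query being measured
gc()[, "max used"]            # peak Ncells/Vcells allocated since the reset
```

Note this only captures R-level allocations; the benchmark would likely want an OS-level measurement (e.g. cgroup peak usage) as well.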

We don't want a server with more memory because we want the benchmark to exercise the out-of-memory scenario. Pydatatable and Spark are the only tools that handle out-of-memory data (dask, despite having those capabilities, cannot resolve those queries).
For the join script we are loading 4 datasets: 50 + 0.5 + 5 + 50 = 105.5 GB of csv (see the sketch below). This 105.5 GB is just the data, not the extra memory required for reading, parsing, aggregating, finding groups, etc. Therefore it is expected that the 1e9 data size for the join task will require out-of-memory support to complete. We hope that over time more tools will add this feature; for data.table you can track it in #1336. To present the out-of-memory scenario for the groupby task we need a bigger data size than we currently have, and the request for that can be tracked in h2oai/db-benchmark#39.
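For reference, a sketch of what the join script loads; the file names and the join key below are assumptions based on the sizes quoted above, not verified paths:

```r
library(data.table)
X      <- fread("J1_1e9_NA_0_0.csv")    # ~50  GB  left-hand table
small  <- fread("J1_1e9_1e3_0_0.csv")   # ~0.5 GB  small right table
medium <- fread("J1_1e9_1e6_0_0.csv")   # ~5   GB  medium right table
big    <- fread("J1_1e9_1e9_0_0.csv")   # ~50  GB  big right table
ans <- X[big, on = "id3"]               # e.g. the big-to-big join query
```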
