Investigate using Ibis for the common interface library to any DF backend #89
This seems amazing! A single API with a Polars-like feel, 20+ backends, and great performance: it should be a priority. The API seems very expressive, and it also supports geospatial operations. It looks like it can also expose the chosen backend's native functions, although the mechanism is not very straightforward (https://ibis-project.org/reference/scalar-udfs#ibis.expr.operations.udf.scalar.builtin). There is no mention of a GPU backend (maybe we can open an issue to ask them about it?), but they might use RAPIDS-Polars when it comes out. Regarding pandas performance, I believe it might always be a bottleneck; we might consider ditching it completely in the future.
I read a bit about Ibis, and its greatest strength is the option of using DuckDB as the backend (which might be even more performant than lazy Polars). EDIT: Another pro of Ibis is the possibility of using a distributed engine like PySpark.
The problem is that we already have an abstract DF interface, which is yet another Ibis. Ibis itself has long tried to support pandas because it is crucial to their user-base growth. Do you resonate with Ibis' gripes while writing the concrete pandas backend for mesa-frames?
I'd say the main reason for using mesa-frames is performance, and people would be willing to use a faster option than pandas as long as the syntax is not as tedious as CUDA/FLAME 2. What is the fastest we can get while staying on pandas syntax (e.g. Dask or Modin), and is that faster than lazy Polars? We need to know this before proceeding.
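For the comparison suggested above, a small best-of-n timing harness is enough to get a first read on each candidate backend; this is a sketch using only the standard library, and the workload passed to it is a placeholder:

```python
import time

def bench(fn, repeats=5):
    """Return the best-of-n wall-clock time for fn(); a crude micro-benchmark."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical usage: pass the same model step expressed in pandas, Dask/Modin,
# and lazy Polars, and compare the numbers on identical data.
elapsed = bench(lambda: sum(i * i for i in range(100_000)))
print(f"{elapsed:.4f}s")
```

Taking the best of several repeats reduces noise from warm-up and caching, which otherwise dominates short runs.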
I think both Dask and Modin are definitely slower than Polars/DuckDB on a single node. Performance is key; however, I fear modelers might not use mesa-frames because it's yet another API to learn. Having a single DF backend would simplify development greatly. We could also have a single Ibis backend and return values as a pandas DataFrame for convenience; pandas now supports the PyArrow backend, so it shouldn't be too slow. I have to run some experiments to compare the performance of the different approaches, but this would be the easiest path.
Opened an issue on GPU Acceleration.
@rht check out the answer to the issue. The potential with Ibis is quite promising. If Theseus gets released in the future, we could run Ibis models on multi-node GPUs, which would be a game-changer for companies using ABMs in production, since it could enable much larger-scale simulations. You could transition smoothly from prototyping on a single CPU to running on multi-node GPUs without code changes. Interestingly, I read that GPUs provide the most significant speed-up for join operations, which are our main bottleneck. Adopting Ibis also seems like a good way to future-proof the codebase and reduce its complexity.
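For readers unfamiliar with why joins dominate, here is a sketch of the per-step join an ABM typically performs; the schema (agents placed on grid cells holding a resource) is hypothetical:

```python
import pandas as pd

agents = pd.DataFrame({"agent_id": [0, 1, 2], "cell_id": [5, 5, 9]})
cells = pd.DataFrame({"cell_id": [5, 9], "resource": [3.0, 7.0]})

# Each step, every agent looks up the state of its cell: a many-to-one join
# repeated once per step, so its cost scales with both model size and run length.
stepped = agents.merge(cells, on="cell_id", how="left")
print(stepped)
```

This is exactly the operation class where the GPU speed-ups mentioned above would apply.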
Which means the current mesa-frames will soon support single-node GPU via Polars. What Ibis brings to the table is limited to multi-node GPU support.
Unless there is another multi-node GPU library that Ibis already supports, I think adding an Ibis implementation now won't provide any immediate benefit to users. Also, to ease the maintenance cost, supporting Ibis means we would have to drop support for Polars, since Ibis already wraps Polars, and maintaining three backends is tedious.
There is still the benefit of distributed CPU with the Spark backend. But if Theseus gets released in the future (and I don't see why it shouldn't), we would get multi-node GPU for free.
Yes, completely agree on this one.
I'm creating an
https://ibis-project.org/ has various backends: Dask, pandas, Polars, and many more. But apparently, the pandas backend is so problematic to maintain that they have decided to remove it.