feat: Support Daft dataframe as a backend #8904

Open
1 task done
jaychia opened this issue Apr 5, 2024 · 12 comments
Labels
feature (Features or general enhancements), new backend (PRs or issues related to adding new backends)

Comments

@jaychia

jaychia commented Apr 5, 2024

Which new backend would you like to see in Ibis?

Hi! I would like to explore building an Ibis backend for Daft (www.getdaft.io).

I am one of the maintainers of the project, and we have had some user interest in using Ibis as an interface for Daft. Daft is a distributed query engine built with a Python dataframe API, with most of its internals written in Rust.

We're not sure where to begin/how to think about potential integrations but would love some pointers. Primarily:

  1. Is there a core set of features that should be implemented first? And how much surface area does this involve?
  2. How do we then incrementally build out new features to increase support of the Ibis API in totality?

Excited for this :)

Code of Conduct

  • I agree to follow this project's Code of Conduct
@jaychia added the feature and new backend labels Apr 5, 2024
@lostmygithubaccount self-assigned this Apr 8, 2024
@lostmygithubaccount
Member

Hi @jaychia! We'd be happy to help with adding in a Daft backend for Ibis. Does or will Daft have a SQL interface? I'm assuming it's only the Python dataframe interface. Today, Ibis supports 3 Python dataframe backends:

  1. pandas
  2. Dask
  3. Polars

I believe the Daft API is most similar to the Polars API. Thus, I'd suggest taking a look at the Polars backend and starting there for the implementation of Daft. Generally the process will be to get the backend started -- creating a connection, implementing enough functionality to get data into it (create_table, read_parquet, etc.), and implementing basic operations. There's no minimum required per se, but it would be good to support most of the basics upon initial release (ordering, aggregations, filtering, etc.).

Ibis defines over 300 operations, many of which won't be applicable to every backend. You can see current coverage here: https://ibis-project.org/support_matrix. So it's completely fine to start with an MVP for the Daft backend and increase coverage over time.

Let me know if you have any additional questions! I'd recommend essentially copying one of the existing backends (probably Polars), cutting it down, and working to get the test suite passing.
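
The bring-up order described above (connect, get data in, then basic operations) can be sketched in plain Python. This is only a rough illustration under stated assumptions: `ToyBackend` and all of its methods are hypothetical names invented here, not the Ibis backend base class or the Polars backend, and the "tables" are in-memory lists of dicts.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Row = dict[str, Any]


@dataclass
class ToyBackend:
    """Hypothetical stand-in for a new backend; NOT the Ibis base class."""

    tables: dict[str, list[Row]] = field(default_factory=dict)

    # Step 1: create a connection (here, just an empty in-memory catalog).
    @classmethod
    def connect(cls) -> "ToyBackend":
        return cls()

    # Step 2: get data in (stands in for create_table / read_parquet).
    def create_table(self, name: str, rows: list[Row]) -> None:
        self.tables[name] = [dict(r) for r in rows]

    # Step 3: the basics mentioned above: filtering, ordering, aggregation.
    def filter(self, name: str, predicate: Callable[[Row], bool]) -> list[Row]:
        return [r for r in self.tables[name] if predicate(r)]

    def order_by(self, name: str, key: str, descending: bool = False) -> list[Row]:
        return sorted(self.tables[name], key=lambda r: r[key], reverse=descending)

    def aggregate_sum(self, name: str, by: str, value: str) -> dict[Any, Any]:
        totals: dict[Any, Any] = {}
        for r in self.tables[name]:
            totals[r[by]] = totals.get(r[by], 0) + r[value]
        return totals


con = ToyBackend.connect()
con.create_table("t", [{"g": "a", "x": 1}, {"g": "b", "x": 5}, {"g": "a", "x": 3}])
filtered = con.filter("t", lambda r: r["x"] > 2)
ordered = con.order_by("t", "x", descending=True)
totals = con.aggregate_sum("t", by="g", value="x")
```

A real backend would translate Ibis operations into engine calls rather than evaluate them itself; the point here is just the order in which the pieces tend to get built.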

@lostmygithubaccount removed their assignment Apr 8, 2024
@jaychia
Author

jaychia commented Apr 9, 2024

Yes indeed - Daft is probably most similar to the Polars lazy API. We do not yet have a SQL frontend.

Would https://github.com/ibis-project/ibis/blob/main/ibis/backends/polars/tests/conftest.py be a good place to start to implement a backend?

@lostmygithubaccount
Member

@jaychia apologies for the slow response! yes, something like that -- you can take a look at the Polars implementation, get the basic tests passing, and go from there

@jaychia
Author

jaychia commented Aug 29, 2024

A quick update here:

The team is actively looking at building up a SQL frontend to Daft. We have basic support up already, with a more extensive roadmap detailed here: https://github.com/orgs/Eventual-Inc/projects/8/views/1

That might end up being the easiest way to integrate ibis, given that we can use SQL as the narrow waist between the ibis and Daft backends.

Let me know if that makes sense, and if that might be the better way forward?

@lostmygithubaccount
Member

hi @jaychia, that sounds like it would be a great option. is there a specific SQL dialect daft is targeting? if so, we could probably re-use one of the existing SQL compilers within Ibis (provided by SQLGlot)

@jaychia
Author

jaychia commented Sep 1, 2024

No specific dialect at the moment, we're still building out SQL support in Daft and can provide more updates as we go along.

IIUC then if we are compatible with any of SQLGlot's target dialects then we should be good to go? Am I understanding this correctly that Ibis does: dataframe syntax -> some SQL dialect --- SQLGlot ---> some target SQL dialect --> Daft dataframe query plan?

cc @universalmind303 who is working on our SQL support

@gforsyth
Member

gforsyth commented Sep 3, 2024

> IIUC then if we are compatible with any of SQLGlot's target dialects then we should be good to go

Correct.

> Am I understanding this correctly that Ibis does: dataframe syntax -> some SQL dialect --- SQLGlot ---> some target SQL dialect --> Daft dataframe query plan?

Not quite, but close: it's dataframe syntax -> Ibis internal representation -> SQLGlot -> target SQL dialect -> Daft query plan.
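
The shape of that pipeline can be illustrated with a toy compiler. This is a minimal sketch, not how Ibis or SQLGlot are implemented: `Table`, `Filter`, `QUOTE`, and `compile_sql` are hypothetical names made up for this example, and the only dialect difference modeled is identifier quoting (double quotes for Postgres, backticks for MySQL).

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Table:
    """Hypothetical IR node for a named table (not a real Ibis class)."""

    name: str

    def filter(self, column: str, op: str, value: int) -> "Filter":
        # Dataframe-style call: builds an IR node, emits no SQL yet.
        return Filter(self, column, op, value)


@dataclass(frozen=True)
class Filter:
    """Hypothetical IR node for a simple `column <op> literal` predicate."""

    parent: Union["Table", "Filter"]
    column: str
    op: str
    value: int

    def filter(self, column: str, op: str, value: int) -> "Filter":
        return Filter(self, column, op, value)


# Per-dialect identifier quoting; the one dialect difference modeled here.
QUOTE = {"postgres": '"', "mysql": "`"}


def compile_sql(node: Union[Table, Filter], dialect: str) -> str:
    """Walk the IR and render SQL text for the requested dialect."""
    q = QUOTE[dialect]
    if isinstance(node, Table):
        return f"SELECT * FROM {q}{node.name}{q}"
    if isinstance(node, Filter):
        inner = compile_sql(node.parent, dialect)
        return (
            f"SELECT * FROM ({inner}) AS t0 "
            f"WHERE {q}{node.column}{q} {node.op} {node.value}"
        )
    raise TypeError(f"unsupported node: {node!r}")


expr = Table("events").filter("x", ">", 2)  # dataframe syntax -> IR
pg_sql = compile_sql(expr, "postgres")      # IR -> target dialect SQL
my_sql = compile_sql(expr, "mysql")         # same IR, different dialect
# The last hop (SQL text -> engine query plan) belongs to the engine's own
# SQL frontend and is out of scope for this sketch.
```

The same expression compiles to different SQL text per dialect, which is why matching any one existing SQLGlot dialect would be enough for a backend. In Ibis itself, the public way to inspect the generated SQL is `ibis.to_sql(expr, dialect=...)`.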

@afterthought

Wondering if a daft/ibis integration would be able to benefit from a distributed daft to_delta implementation or if ibis has to stream back all the data first as it does in the default implementation here: https://github.com/ibis-project/ibis/blob/main/ibis/backends/__init__.py#L551. Is there a branch going with WIP related to this issue?

@jaychia
Author

jaychia commented Oct 3, 2024

Daft does indeed have a distributed deltalake writer. Not sure how that would need to interface with Ibis though.

We also recently implemented SQL support which might help pave the way for easy Ibis integration, using SQL as the handoff point.

@gforsyth
Member

gforsyth commented Oct 3, 2024

> Is there a branch going with WIP related to this issue?

There is not, but we'd be happy to help someone get started working on it.

> if a daft/ibis integration would be able to benefit from a distributed daft to_delta implementation

Yep, we would just map the to_delta function to whatever call Daft would make to perform the parallel write.

That's effectively what we do with PySpark: https://github.com/ibis-project/ibis/blob/main/ibis/backends/pyspark/__init__.py#L985-L1020
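
The override pattern described here can be sketched roughly as follows. Everything in this block is a hypothetical stand-in written for illustration: `BaseBackend`, `FakeEngineFrame`, `write_native`, and `DistributedBackend` are invented names, not the Ibis, PySpark, or Daft APIs, and no Delta table is actually written.

```python
class BaseBackend:
    """Hypothetical base: the default export pulls every row to the client."""

    def __init__(self, rows):
        self.rows = rows
        self.log = []  # records what happened, for demonstration only

    def execute(self):
        # Default path: materialize the full result on the client.
        self.log.append("collected rows on client")
        return list(self.rows)

    def to_delta(self, path):
        rows = self.execute()
        self.log.append(f"client wrote {len(rows)} rows to {path}")


class FakeEngineFrame:
    """Stand-in for an engine dataframe that has its own parallel writer."""

    def __init__(self, rows):
        self.rows = rows

    def write_native(self, path):
        # A real engine would write partitions from its workers here.
        return f"engine wrote {len(self.rows)} rows to {path} in parallel"


class DistributedBackend(BaseBackend):
    """Overrides to_delta so the data never streams back to the client."""

    def to_delta(self, path):
        frame = FakeEngineFrame(self.rows)
        self.log.append(frame.write_native(path))


default = BaseBackend([1, 2, 3])
default.to_delta("/tmp/out")

distributed = DistributedBackend([1, 2, 3])
distributed.to_delta("/tmp/out")
```

The subclass never calls `execute`, so nothing is collected on the client; the write is handed to the engine's own writer, which in Daft's case would presumably be the distributed deltalake writer mentioned earlier in the thread.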

@afterthought

@jaychia so do you think Daft is far enough along for this ibis work to start?

@adamrimon

adamrimon commented Dec 29, 2024

@gforsyth

> There is not, but we'd be happy to help someone get started working on it.

I'd love to help with that. How can we connect?

Projects
Status: backlog
Development

No branches or pull requests

5 participants