Executing an SQL query over a dataset #2871
Hi, I'll just chime in here with a question on how we should approach this problem. It could be as simple as the above example, or it could be considerably more complex. What do you think @ritchie46 @ghuls? Should we go as simple as adding SQL support for simple queries (SELECT, GROUP BY) on a single DataFrame, or allow registering several frames in a context, something like:

    ctx = pl.sql.Context()
    ctx.register(df1, 'df1')
    ctx.register(df2, 'df2')
    df3 = ctx.execute('... SOME SQL QUERY ')
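For context on what such a context would have to do under the hood, here is a minimal sketch (the frames and column names are made up for illustration) of the lazy-API query that a hypothetical ctx.execute of a two-table join would need to build:

    import polars as pl

    # Made-up frames for this sketch.
    df1 = pl.DataFrame({"key": [1, 2, 3], "a": ["x", "y", "z"]})
    df2 = pl.DataFrame({"key": [2, 3, 4], "b": [10.0, 20.0, 30.0]})

    # Roughly what ctx.execute("SELECT df1.key, a, b FROM df1 JOIN df2 USING (key)")
    # would have to translate into with the existing lazy API.
    out = (
        df1.lazy()
        .join(df2.lazy(), on="key", how="inner")
        .select(["key", "a", "b"])
        .collect()
    )
    print(out)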
Hi, what I wanted is to create something more complex, as you put in your example, because my example is too simple in the end.
@potter420, I think we should add a context if we decide to do this, where pandas-like behavior is a subset of that context. There has already been a very simple start on this here: #2880. Anybody willing to improve/extend upon that is more than welcome. It is something I'd like to have in this project, but I currently don't have the bandwidth to do it on my own. There is already a Postgres-compliant SQL parser in Rust, so we need to make the mapping between that AST and a polars logical plan. This should also open up the possibility of a polars SQL CLI that queries different files.
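To make the AST-to-logical-plan mapping concrete, here is a toy Python sketch; SelectStmt and to_lazy_plan are made-up names standing in for the real sqlparser AST and translation layer, and it only covers a bare SELECT/WHERE shape:

    from __future__ import annotations

    from dataclasses import dataclass

    import polars as pl

    # Hand-rolled stand-in for a parsed SELECT statement. In a real implementation
    # this would be the AST produced by the Rust sqlparser crate, not a dataclass.
    @dataclass
    class SelectStmt:
        table: str
        columns: list[str]
        where: pl.Expr | None = None

    def to_lazy_plan(stmt: SelectStmt, tables: dict[str, pl.LazyFrame]) -> pl.LazyFrame:
        """Map the toy AST onto polars' existing lazy (logical) plan."""
        plan = tables[stmt.table]
        if stmt.where is not None:
            plan = plan.filter(stmt.where)
        return plan.select(stmt.columns)

    # Equivalent to: SELECT a, b FROM t WHERE a > 1 (made-up data).
    t = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).lazy()
    stmt = SelectStmt(table="t", columns=["a", "b"], where=pl.col("a") > 1)
    print(to_lazy_plan(stmt, {"t": t}).collect())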
Alright, complex it is then. I'll lay out a to-do list here; please correct me :D
Please do comment if anything is missing and I will add it to the list. IMO, since we're dealing with immutable DataFrames ...
Yep, sounds good. Note that polars does allow mutations since not too long ago, but idiomatically I don't think we will need them much.
@potter420 You might find some good inspiration from duckdb, which does much the same thing (we're now using it internally at work for operating more efficiently on intermediary pandas DataFrames from research, but it also groks Arrow, CSV, Parquet, etc.; evaluating it for some out-of-core workloads next).
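For reference, a minimal sketch of the duckdb ergonomics being described, run against a made-up pandas frame (method names reflect the duckdb Python API as I understand it and may differ by version):

    import duckdb
    import pandas as pd

    # Made-up frame for illustration.
    df = pd.DataFrame({"grp": ["a", "a", "b"], "val": [1, 2, 3]})

    con = duckdb.connect()
    con.register("df", df)  # expose the DataFrame to SQL by name

    result = con.execute("SELECT grp, SUM(val) AS total FROM df GROUP BY grp").df()
    print(result)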
Hi, I am also very interested in querying Polars DataFrames with SQL. I am working with the folks over at PRQL, and in particular I'm building a prql-tool to make it easier to use PRQL to query existing data sources. I would love to support Polars as a backend, but we currently only emit SQL.

Moreover, there is now a proliferation of DataFrame APIs which are all very similar to each other but also independent: the Polars API, Pandas API, Datafusion DataFrame API, Ibis, dplyr, Spark, to name just a few, and there are probably more. The Substrait project recognises this and tries to provide a single Intermediate Representation (IR) to allow more interoperability.

How about implementing a Substrait consumer rather than YASP (Yet Another SQL Parser)? My sense is that the Substrait IR would be much more closely aligned to the Polars API and therefore simpler to implement. A further benefit is that you would get more interoperability with other query frontends without limiting yourself to a particular SQL dialect.

Datafusion already has a Substrait extension package in Rust, datafusion-substrait, with producer and consumer implementations, and their consumer implementation might be a useful starting point for assessing the viability of implementing a Substrait consumer for Polars. In order to still be able to consume SQL queries, you might be able to lean on the Datafusion Substrait producer to parse the SQL queries and produce the Substrait IR.

What do you think @potter420 and @ritchie46 ?

P.S. Early adopters of Substrait appear to be Ibis, Datafusion, and DuckDB.
P.P.S. I searched through your Issues list and there seemed to be no other mention of Substrait so far.
This is an excellent overview of the Substrait project:
I am interested in supporting this. The big impediment is currently my limited bandwidth, so any help would be great.
Just in case y'all haven't seen it yet, there's a Substrait validator written in Rust (https://github.com/substrait-io/substrait-validator) that could at the very least get you started on Substrait protobufs and what they contain. I'd love this feature to help make R bindings for polars (I'm involved in writing an R Substrait producer); if polars implemented a consumer, it would be much, much easier to make happen. As incentive, DuckDB implements SQL->Substrait, so you'd get that for free (well, for an optional DuckDB dependency).
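For anyone exploring that route, a hedged sketch of pulling a Substrait plan out of DuckDB from Python; it assumes the substrait extension can be installed in the running duckdb build and that the Python client exposes a get_substrait helper, which may differ by version:

    import duckdb

    con = duckdb.connect()
    # The substrait extension is not bundled by default; this assumes it can be
    # installed and loaded in the duckdb version in use.
    con.install_extension("substrait")
    con.load_extension("substrait")

    con.execute("CREATE TABLE t AS SELECT 1 AS a, 'x' AS b")

    # get_substrait is provided by the extension (an assumption here); it returns
    # the serialized substrait.Plan protobuf for the given SQL query, which a
    # polars Substrait consumer could then translate into a LazyFrame.
    plan_blob = con.get_substrait("SELECT a, b FROM t WHERE a > 0").fetchone()[0]
    print(len(plan_blob), "bytes of Substrait plan")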
Closing this as completed.
Related to Substrait: #7404
I would like to execute SQL queries over a DataFrame or LazyFrame. We would expect to run queries like:

    df.query("select * from df")
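Since the issue was ultimately closed as completed, here is a minimal sketch of the SQL context that landed in polars, assuming a reasonably recent version (exact names and defaults may differ):

    import polars as pl

    # Made-up frame for illustration.
    df = pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

    # SQLContext with register/execute, assuming the API sketched earlier in the
    # thread is roughly what shipped; eager=True collects the result immediately.
    ctx = pl.SQLContext()
    ctx.register("df", df)
    out = ctx.execute("SELECT a, b FROM df WHERE a > 1", eager=True)
    print(out)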