
Executing an SQL query over a dataset #2871

Closed
calavia88 opened this issue Mar 10, 2022 · 12 comments
Labels: help wanted

Comments

@calavia88

I would like to execute SQL queries over a DataFrame or LazyFrame.

We expect to launch queries like: df.query("select * from df")

@potter420 (Contributor)

Hi, I'll just chime in here with a question on how we should approach this problem.

It could be as simple as the above example, or as complex as what datafusion has already achieved.

What do you think @ritchie46 @ghuls? Should we go as simple as adding SQL support for simple queries (SELECT, GROUP BY) on a single DataFrame, similar to pandas' query? Or should we have a context, register DataFrames with that context, and execute queries on it (similar to Spark and datafusion), which could then support JOIN, UNION, and many other complex things?
For example:

ctx = pl.sql.Context()                    # create a query context
ctx.register(df1, 'df1')                  # register DataFrames under table names
ctx.register(df2, 'df2')
df3 = ctx.execute('... SOME SQL QUERY ')  # run SQL against the registered tables

@calavia88 (Author)

Hi,

What I want is to build something more complex, as in your example; my original example was too simple.

@ritchie46 (Member)

@potter420, I think we should add a context if we decide to do this; the pandas-like behavior can then be a subset of that context.

There has already been a very, very simple start on this here: #2880

Anybody willing to improve/extend upon that is more than welcome. It is something I'd like to have in this project, but I currently don't have the bandwidth to do it on my own. There is already a Postgres-compliant SQL parser in Rust, so we need to make the mapping from that AST to a polars logical plan.

This should also open up the possibility of a polars SQL CLI that queries different files.
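
To make that AST-to-logical-plan mapping concrete, here is a minimal Python sketch. The dict below is a hand-rolled stand-in for a parsed AST (the real implementation would walk the parser's AST on the Rust side); the polars calls are the real lazy API:

import polars as pl

# Hand-rolled stand-in for the parsed form of
# "SELECT a, b FROM df WHERE a > 1" (hypothetical shape).
ast = {"select": ["a", "b"], "from": "df", "where": ("a", ">", 1)}

tables = {"df": pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).lazy()}

# Translate the WHERE clause into a polars expression ...
col, op, value = ast["where"]
predicate = {
    ">": pl.col(col) > value,
    "<": pl.col(col) < value,
    "=": pl.col(col) == value,
}[op]

# ... then assemble the logical plan and collect the result.
out = tables[ast["from"]].filter(predicate).select(ast["select"]).collect()
print(out)  # rows where a > 1, columns a and b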

ritchie46 added the help wanted label on Mar 23, 2022
@potter420 (Contributor) commented Mar 24, 2022

Alright, complex it is then. I'll lay out a to-do list here; please correct me :D

  • Context Object (a minimal sketch follows this list):

    • HashMap<String, DataFrame> that maps a name to a DataFrame.
    • execute
    • register
  • Query Parser: basically translates an SQL query to a polars lazy expression. It should support:

    • SELECT
    • GROUP BY
    • JOIN
    • UNION/EXCEPT/INTERSECT
    • WINDOW AGG
    • SQL functions to polars: a static map; will consult the Postgres documentation as a reference
    • CREATE EXTERNAL TABLE: used to read files, similar to Impala. "External" in the sense that the DataFrame was not explicitly declared in the Context
    • CREATE TABLE AS ...: run a query, then save the result into the context for further processing.
    • UNLOAD (EXPR) ...: write a file to storage, possibly external (optional), similar to Redshift's UNLOAD
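
A minimal Python sketch of that Context piece (the class name, method names, and argument order are hypothetical, following the example earlier in the thread); execute is wired to a toy translator that only understands SELECT * FROM <table>:

import re
import polars as pl

class Context:
    """Hypothetical sketch: a registry mapping table names to LazyFrames."""

    def __init__(self):
        self._tables = {}  # name -> pl.LazyFrame

    def register(self, frame, name):
        # Accept eager or lazy frames; store everything as a LazyFrame.
        self._tables[name] = frame.lazy() if isinstance(frame, pl.DataFrame) else frame

    def execute(self, query):
        # Toy translator: a real implementation would map a full SQL AST
        # onto the polars lazy logical plan.
        m = re.fullmatch(r"\s*SELECT\s+\*\s+FROM\s+(\w+)\s*", query, flags=re.IGNORECASE)
        if m is None:
            raise NotImplementedError("this sketch only handles SELECT * FROM <table>")
        return self._tables[m.group(1)].collect()

ctx = Context()
ctx.register(pl.DataFrame({"a": [1, 2]}), "df1")
print(ctx.execute("SELECT * FROM df1"))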

Please do comment if anything is missing and I will add it to the list.

IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. Rather, we should create new DataFrames that are the result of the DML: INSERT through bulk insert (create a new DF and union it with the existing one), UPDATE through creating new columns, DELETE by creating a new filtered DF.
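
For illustration, those three rewrites look like this with today's polars API (the calls below are real; mapping SQL statements onto them automatically is the open part):

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

# INSERT: build a new frame and union it with the existing one.
inserted = pl.concat([df, pl.DataFrame({"id": [4], "value": [40]})])

# UPDATE ... SET value = 0 WHERE id = 2: derive a replacement column.
updated = df.with_columns(
    pl.when(pl.col("id") == 2).then(0).otherwise(pl.col("value")).alias("value")
)

# DELETE ... WHERE id = 3: keep everything except the matching rows.
deleted = df.filter(pl.col("id") != 3)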

@ritchie46 (Member)

> IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. [...]

Yep, sounds good. Note that polars does allow mutations since not too long ago, but idiomatically I don't think we will need them much.

@alexander-beedie (Collaborator) commented Aug 10, 2022

@potter420 You might find some good inspiration in duckdb, which does much the same thing (we're now using it internally at work to operate more efficiently on intermediate pandas DataFrames from research, but it also groks Arrow, CSV, Parquet, etc.; evaluating it for some out-of-core workloads next).
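
For reference, the duckdb pattern being described, querying an in-scope pandas DataFrame by its variable name via duckdb's replacement scans (assumes the duckdb and pandas Python packages):

import duckdb
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# duckdb resolves the bare name "df" to the local pandas DataFrame.
result = duckdb.sql("SELECT b, a * 2 AS a2 FROM df WHERE a > 1").df()
print(result)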

@snth commented Sep 21, 2022

Hi,

I am also very interested in querying Polars DataFrames with SQL. I am working with the folks over at PRQL, and in particular I'm building a prql-tool to make it easier to use PRQL to query existing data sources. I would love to support Polars as a backend, but we currently only emit SQL.

Moreover, there is now a proliferation of DataFrame APIs which are all very similar to each other but independent. In particular I'm thinking of the Polars API, the Pandas API, the Datafusion DataFrame API, Ibis, dplyr, and Spark, to name just a few, and there are probably more.

The Substrait project recognises this and tries to provide a single Intermediate Representation (IR) to allow more interoperability. How about implementing a Substrait Consumer rather than YASP (Yet Another SQL Parser)? My sense is that the Substrait IR would be much more closely aligned with the Polars API and should therefore be simpler to implement. A further benefit is that you would get more interoperability with other query frontends without limiting yourself to a particular SQL dialect.

Datafusion already has a substrait extension package in Rust datafusion-substrait with producer and consumer implementations and their consumer implementation might be a useful starting point for you to assess the viability of implementing a Substrait Consumer for Polars. In order to still be able to consume SQL queries, you might be able to use/lean on the Datafusion Substrait Producer to parse the SQL queries and produce the Substrait IR.
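
To sketch the proposed split (the interfaces below are hypothetical; neither polars nor datafusion-substrait exposes these exact names), the producer/consumer contract could look like:

from typing import Protocol

import polars as pl

class SubstraitProducer(Protocol):
    """Any frontend (SQL, PRQL, Ibis, ...) that emits the Substrait IR."""
    def to_substrait(self, query: str) -> bytes: ...

class SubstraitConsumer(Protocol):
    """What polars would implement: turn a Substrait plan into a lazy query."""
    def from_substrait(self, plan: bytes) -> pl.LazyFrame: ...

def run(producer: SubstraitProducer, consumer: SubstraitConsumer, query: str) -> pl.DataFrame:
    # A single IR in the middle decouples every frontend from every engine.
    return consumer.from_substrait(producer.to_substrait(query)).collect()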

What do you think @potter420 and @ritchie46 ?

P.S. Early adopters of Substrait appear to be Ibis, Datafusion, and DuckDB.

P.P.S. I searched through your Issues list and there seemed to be no other mention of Substrait so far.

@snth commented Sep 22, 2022

This is an excellent overview of the Substrait project:
Substrait: Rethinking DBMS Composability (VLDB 2022, CDMS workshop)


@ritchie46 (Member)

I am interested in supporting this. The big impediment is currently my limited bandwidth, so any help would be great.

@paleolimbot

Just in case y'all haven't seen it yet, there's a Substrait validator written in Rust (https://github.com/substrait-io/substrait-validator) that could at the very least get you started on Substrait protobufs and what they contain. I'd love this feature so I can make R bindings for polars (I'm involved in writing an R Substrait producer... if polars implemented a consumer, it would be much, much easier to make happen). As incentive, DuckDB implements SQL->Substrait, so you'd get that for free (well, for an optional DuckDB dependency).

@universalmind303 (Collaborator)

Closing this as completed.

@eitsupi (Contributor) commented May 17, 2023

Related to substrait #7404
