
Executing an SQL query over a dataset #2871

Closed
calavia88 opened this issue Mar 10, 2022 · 12 comments
Labels: help wanted

Comments

@calavia88

I would like to execute SQL queries over a DataFrame or LazyFrame.

We expect to launch queries like: df.query("select * from df")

@potter420 (Contributor)

Hi, I'll just chime in here with a question on how we should approach this problem.

It could be as simple as the above example, or as complex as what datafusion has already achieved.

What do you think @ritchie46 @ghuls? Should we go as simple as adding SQL support for simple queries (SELECT, GROUP BY) on a single DataFrame, similar to pandas' query? Or should we have a context, register DataFrames with that context, and execute queries on it (similar to Spark and datafusion), which could then support JOIN, UNION, and many other complex things?
For example:

ctx = pl.sql.Context()                    # create a query context
ctx.register(df1, 'df1')                  # register DataFrames under table names
ctx.register(df2, 'df2')
df3 = ctx.execute('... SOME SQL QUERY ')  # run SQL against the registered tables

@calavia88 (Author)

Hi,

What I want is to build something more complex, as in your example; my original example was too simple.

@ritchie46 (Member)

@potter420, I think we should add a context if we decide to do this; the pandas-like behavior can then be a subset of that context.

There has already been a very, very simple start on this here: #2880

Anybody willing to improve/extend upon that is more than welcome. It is something I'd like to have in this project, but I currently don't have the bandwidth to do it on my own. There is already a Postgres-compliant SQL parser in Rust, so we need to make the mapping from that AST to a polars logical plan.

This should also open up the possibility of a polars SQL CLI that queries different files.
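
To make that AST-to-logical-plan mapping concrete, here is a minimal Python sketch. The dict below is a hand-rolled stand-in for a parsed AST (the real implementation would walk the parser's AST on the Rust side); the polars calls are the real lazy API:

import polars as pl

# Hand-rolled stand-in for the parsed form of
# "SELECT a, b FROM df WHERE a > 1" (hypothetical shape).
ast = {"select": ["a", "b"], "from": "df", "where": ("a", ">", 1)}

tables = {"df": pl.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]}).lazy()}

# Translate the WHERE clause into a polars expression ...
col, op, value = ast["where"]
predicate = {
    ">": pl.col(col) > value,
    "<": pl.col(col) < value,
    "=": pl.col(col) == value,
}[op]

# ... then assemble the logical plan and collect the result.
out = tables[ast["from"]].filter(predicate).select(ast["select"]).collect()
print(out)  # rows where a > 1, columns a and b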

ritchie46 added the help wanted label on Mar 23, 2022
@potter420 (Contributor) commented Mar 24, 2022

Alright, complex it is then. I'll lay out a to-do list here; please correct me :D

  • Context Object (a minimal sketch follows this list):

    • HashMap<String, DataFrame> that maps a name to a DataFrame.
    • execute
    • register
  • Query Parser: basically translates an SQL query to a polars lazy expression. It should support:

    • SELECT
    • GROUP BY
    • JOIN
    • UNION/EXCEPT/INTERSECT
    • WINDOW AGG
    • SQL functions to polars: a static map; will consult the Postgres documentation as a reference
    • CREATE EXTERNAL TABLE: used to read files, similar to Impala. "External" in the sense that the DataFrame was not explicitly declared in the Context
    • CREATE TABLE AS ...: run a query, then save the result into the context for further processing.
    • UNLOAD (EXPR) ...: write a file to storage, possibly external (optional), similar to Redshift's UNLOAD
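
A minimal Python sketch of that Context piece (the class name, method names, and argument order are hypothetical, following the example earlier in the thread); execute is wired to a toy translator that only understands SELECT * FROM <table>:

import re
import polars as pl

class Context:
    """Hypothetical sketch: a registry mapping table names to LazyFrames."""

    def __init__(self):
        self._tables = {}  # name -> pl.LazyFrame

    def register(self, frame, name):
        # Accept eager or lazy frames; store everything as a LazyFrame.
        self._tables[name] = frame.lazy() if isinstance(frame, pl.DataFrame) else frame

    def execute(self, query):
        # Toy translator: a real implementation would map a full SQL AST
        # onto the polars lazy logical plan.
        m = re.fullmatch(r"\s*SELECT\s+\*\s+FROM\s+(\w+)\s*", query, flags=re.IGNORECASE)
        if m is None:
            raise NotImplementedError("this sketch only handles SELECT * FROM <table>")
        return self._tables[m.group(1)].collect()

ctx = Context()
ctx.register(pl.DataFrame({"a": [1, 2]}), "df1")
print(ctx.execute("SELECT * FROM df1"))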

Please do comment if anything is missing and I will add it to the list.

IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. Rather, we should create new DataFrames that are the result of the DML: INSERT through bulk insert (create a new DF and union it with the existing one), UPDATE through creating new columns, DELETE by creating a new filtered DF.
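
For illustration, those three rewrites look like this with today's polars API (the calls below are real; mapping SQL statements onto them automatically is the open part):

import polars as pl

df = pl.DataFrame({"id": [1, 2, 3], "value": [10, 20, 30]})

# INSERT: build a new frame and union it with the existing one.
inserted = pl.concat([df, pl.DataFrame({"id": [4], "value": [40]})])

# UPDATE ... SET value = 0 WHERE id = 2: derive a replacement column.
updated = df.with_columns(
    pl.when(pl.col("id") == 2).then(0).otherwise(pl.col("value")).alias("value")
)

# DELETE ... WHERE id = 3: keep everything except the matching rows.
deleted = df.filter(pl.col("id") != 3)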

@ritchie46 (Member)

> IMO, since we're dealing with immutable DataFrames, I would suggest that this context stay away from all DML syntax like INSERT, UPDATE, DELETE. [...]

Yep, sounds good. Note that polars does allow mutations since not too long ago, but idiomatically I don't think we will need them much.

@alexander-beedie (Collaborator) commented Aug 10, 2022

@potter420 You might find some good inspiration in duckdb, which does much the same thing (we're now using it internally at work to operate more efficiently on intermediate pandas DataFrames from research, but it also groks Arrow, CSV, Parquet, etc.; evaluating it for some out-of-core workloads next).
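
For reference, the duckdb pattern being described, querying an in-scope pandas DataFrame by its variable name via duckdb's replacement scans (assumes the duckdb and pandas Python packages):

import duckdb
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# duckdb resolves the bare name "df" to the local pandas DataFrame.
result = duckdb.sql("SELECT b, a * 2 AS a2 FROM df WHERE a > 1").df()
print(result)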

@snth commented Sep 21, 2022

Hi,

I am also very interested in querying Polars DataFrames with SQL. I am working with the folks over at PRQL, and in particular I'm building a prql-tool to make it easier to use PRQL to query existing data sources. I would love to support Polars as a backend, but we currently only emit SQL.

Moreover, there is now a proliferation of DataFrame APIs which are all very similar to each other but independent. In particular I'm thinking of the Polars API, the Pandas API, the Datafusion DataFrame API, Ibis, dplyr, and Spark, to name just a few, and there are probably more.

The Substrait project recognises this and tries to provide a single Intermediate Representation (IR) to allow more interoperability. How about implementing a Substrait Consumer rather than YASP (Yet Another SQL Parser)? My sense is that the Substrait IR would be much more closely aligned with the Polars API and should therefore be simpler to implement. A further benefit is that you would get more interoperability with other query frontends without limiting yourself to a particular SQL dialect.

Datafusion already has a substrait extension package in Rust datafusion-substrait with producer and consumer implementations and their consumer implementation might be a useful starting point for you to assess the viability of implementing a Substrait Consumer for Polars. In order to still be able to consume SQL queries, you might be able to use/lean on the Datafusion Substrait Producer to parse the SQL queries and produce the Substrait IR.
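
To sketch the proposed split (the interfaces below are hypothetical; neither polars nor datafusion-substrait exposes these exact names), the producer/consumer contract could look like:

from typing import Protocol

import polars as pl

class SubstraitProducer(Protocol):
    """Any frontend (SQL, PRQL, Ibis, ...) that emits the Substrait IR."""
    def to_substrait(self, query: str) -> bytes: ...

class SubstraitConsumer(Protocol):
    """What polars would implement: turn a Substrait plan into a lazy query."""
    def from_substrait(self, plan: bytes) -> pl.LazyFrame: ...

def run(producer: SubstraitProducer, consumer: SubstraitConsumer, query: str) -> pl.DataFrame:
    # A single IR in the middle decouples every frontend from every engine.
    return consumer.from_substrait(producer.to_substrait(query)).collect()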

What do you think @potter420 and @ritchie46 ?

P.S. Early adopters of Substrait appear to be Ibis, Datafusion, and DuckDB.

P.P.S. I searched through your Issues list and there seemed to be no other mention of Substrait so far.

@snth commented Sep 22, 2022

This is an excellent overview of the Substrait project:
Substrait: Rethinking DBMS Composability (VLDB 2022, CDMS workshop)


@ritchie46 (Member)

I am interested in supporting this. The big impediment is currently my limited bandwidth, so any help would be great.

@paleolimbot

Just in case y'all haven't seen it yet, there's a Substrait validator written in Rust (https://github.com/substrait-io/substrait-validator) that could at the very least get you started on Substrait protobufs and what they contain. I'd love this feature so I can make R bindings for polars (I'm involved in writing an R Substrait producer... if polars implemented a consumer, it would be much, much easier to make happen). As incentive, DuckDB implements SQL->Substrait, so you'd get that for free (well, for an optional DuckDB dependency).

@universalmind303 (Collaborator)

Closing this as completed.

@eitsupi (Contributor) commented May 17, 2023

Related to substrait #7404
