
Add Backend.upsert #5391

Open
jcrist opened this issue Jan 31, 2023 · 15 comments
Labels
ddl Issues related to creating or altering data definitions feature Features or general enhancements

Comments

jcrist (Member) commented Jan 31, 2023

Currently ibis supports create_table for creating (or replacing) a table, and insert for inserting new data into a table. Another common operation is upsert, which is supported by some backends. This is like insert, but also supports updating existing rows.

  • For the data-analysis DB tasks that ibis targets, is there a use case for this operation?
  • What backends support this operation?
  • If we add support, what options should Backend.upsert support?
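For readers new to the term, the semantics being requested can be illustrated without any database at all. A minimal sketch using a plain dict keyed by primary key (the function and names here are illustrative only, not part of any proposed API):

```python
# Upsert semantics in miniature: insert the row if its key is new,
# otherwise update the existing row in place.

def upsert_rows(table: dict, rows: list[dict], key: str) -> dict:
    """Insert each row if its key is absent, else merge it into the existing row."""
    for row in rows:
        existing = table.get(row[key])
        if existing is None:
            table[row[key]] = dict(row)  # plain insert
        else:
            existing.update(row)         # update the existing row
    return table

table = {1: {"id": 1, "name": "a", "score": 10}}
upsert_rows(
    table,
    [{"id": 1, "score": 99}, {"id": 2, "name": "b", "score": 5}],
    key="id",
)
# table[1] keeps name "a" but gets score 99; a new row 2 is inserted
```

Real backends differ on exactly this point: some (MERGE, ON CONFLICT DO UPDATE) merge columns as above, while others replace the whole conflicting row.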
@jcrist jcrist added feature Features or general enhancements ddl Issues related to creating or altering data definitions labels Jan 31, 2023
jcrist (Member, Author) commented Jan 31, 2023

cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.

cpcloud (Member) commented Apr 22, 2023

This might be useful, but let's wait until someone requests it.

@cpcloud cpcloud closed this as completed Apr 22, 2023
mw3i commented Dec 11, 2023

This would be a really nice feature to have. It would basically mean the only SQL library my team needs is ibis, rather than the current awkward mix of sqlalchemy + dataset (for upserting) + ibis.

lostmygithubaccount (Member) commented:

Reopening for triage, thanks for the input @mw3i!

NickCrews (Contributor) commented:

I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.

@mw3i do you have any other tips or resources I should look at as I go down this path?

prmoore77 commented:

> cc @prmoore77 - someone mentioned in passing that you might have some use cases for upsert support.

hey @jcrist - yes, upsert (known as MERGE in quite a few databases) is a very useful ETL tool. Having that functionality in backends which support the MERGE DML statement would be very useful, and would likely get more folks to use Ibis for their ETL needs. Thanks!

lostmygithubaccount (Member) commented:

I think this, plus some related work ensuring Ibis DDL operations work with PyArrow tables and batches, would be great and would help close the loop. Then you could run queries on one backend and use the results to populate tables in another, e.g. PySpark -> Postgres or Trino -> ClickHouse, as we've heard requested recently.

mw3i commented May 13, 2024

> I have a data web app that has historically only been used for analysis using ibis, but I am about to need to add UPDATE features into it. I would love to stay inside the ibis ecosystem.
>
> @mw3i do you have any other tips or resources I should look at as I go down this path?

Sorry for the late reply. Since ibis doesn't have upsert, our team uses the dataset library for inserts/updates/upserts. We do all our normal code in ibis, and then updates/upserts are done with:

import dataset

params = f"{dialect}://{user}:{password}@{url}:{port}/{database}" # <-- sqlalchemy database uri string
with dataset.connect(params) as dbx:
    dbx[tablename].upsert_many(
        list_of_dictionaries, # <-- each dict is a db entry
        column_to_use_as_identifier,
    )

It's not ideal but it lets us use ibis for everything except db writes (and ibis is just too good of a library to give up)

NickCrews (Contributor) commented:

Thank you @mw3i, that looks great! Really, if we implement #9186 and #9185, it would be easy to basically hand this responsibility off to users to do themselves with an external library. The same goes for other feature requests that come in: the easier we make it to interoperate with other libs, the less pressure there is for us to implement things ourselves.

mw3i commented May 13, 2024

Fair point about interoperability (anything that reduces internal developer workload is a plus), though the dataset library in particular might be idiosyncratic (I don't know how many other libraries expect data to be formatted that way).

Given that ibis already leverages sqlalchemy, it might not be too hard to maintain just two functions (insert and update, where update has an 'upsert' argument) that implement sqlalchemy inserts and upserts in the background, without needing backend-specific functions (though apologies if I'm being naive here).

I've seen some relatively straightforward gists that implement pseudo-upserts in sqlalchemy. I don't know how easily something like that could fit into the framework ibis uses.
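The two-function shape mw3i suggests can be sketched without sqlalchemy at all, since the standard `ON CONFLICT ... DO UPDATE` clause does the heavy lifting. A minimal sketch using stdlib sqlite3 as a stand-in backend (the helper name and schema are made up for illustration; SQLite >= 3.24 and Postgres accept this clause, other dialects would need their own SQL):

```python
import sqlite3

def insert_rows(conn, table, rows, key, upsert=False):
    """Insert dict rows; with upsert=True, update rows whose key already exists."""
    cols = list(rows[0])
    placeholders = ", ".join("?" for _ in cols)
    sql = f"INSERT INTO {table} ({', '.join(cols)}) VALUES ({placeholders})"
    if upsert:
        # "excluded" refers to the row that failed to insert.
        updates = ", ".join(f"{c} = excluded.{c}" for c in cols if c != key)
        sql += f" ON CONFLICT({key}) DO UPDATE SET {updates}"
    conn.executemany(sql, [tuple(r[c] for c in cols) for r in rows])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
insert_rows(conn, "t", [{"id": 1, "name": "a"}], key="id")
insert_rows(conn, "t", [{"id": 1, "name": "b"}, {"id": 2, "name": "c"}],
            key="id", upsert=True)
# t now holds (1, "b") and (2, "c")
```

sqlalchemy's dialect modules expose the same idea as `insert(...).on_conflict_do_update(...)` for Postgres and SQLite, which is roughly what those gists wrap.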

ramil-hx commented Dec 26, 2024

Can we implement Backend.upsert for Snowflake, please? I need it for my project, and I'm stuck on how to insert a record, or update it if it already exists. Please advise on alternatives.

NickCrews (Contributor) commented:

@ramil-hx as a workaround, this is what I currently use to upsert records into a duckdb database. To work with Snowflake's dialect you would probably need to modify it slightly, both with regard to the SQL and to how Snowflake handles primary keys.

from ibis.backends.sql import SQLBackend
from ibis.expr import types as ir

def upsert(db: SQLBackend, table_name: str, new_data: ir.Table) -> ir.Table:
    """Upsert new data into a table using duckdb's INSERT OR REPLACE INTO tbl.

    This assumes that the destination table has a primary key constraint.
    Otherwise, the new rows will just be appended.

    Not all fields need to be present in the new data.
    Missing fields will be set to NULL.

    Returns the appended table.
    """
    existing_schema = db.table(table_name).schema()
    new_data = new_data.cast(existing_schema)
    new_data = new_data.select(*existing_schema)
    new_sql = db.compile(new_data)
    sql = f"INSERT OR REPLACE INTO {table_name} BY NAME \n{new_sql};"
    db.raw_sql(sql)
    return db.table(table_name)
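One caveat worth checking before porting this helper: `INSERT OR REPLACE` replaces the *entire* conflicting row rather than merging columns, which is exactly why the helper casts the new data to the full destination schema first. SQLite supports the same statement, so the row-level behavior can be verified without duckdb (table and columns below are illustrative only):

```python
import sqlite3

# INSERT OR REPLACE deletes the conflicting row and inserts the new one.
# Columns absent from the new data come back NULL, not preserved.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT, score INTEGER)")
conn.execute("INSERT INTO t VALUES (1, 'a', 10)")

# Replace row 1 without supplying score: the old score of 10 is lost.
conn.execute("INSERT OR REPLACE INTO t (id, name) VALUES (1, 'b')")
row = conn.execute("SELECT id, name, score FROM t").fetchone()
# row == (1, 'b', None)
```

A dialect whose upsert is `ON CONFLICT DO UPDATE` or `MERGE` merges columns instead, so a Snowflake port of this helper would not have the NULLing behavior described in the docstring above.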

lmores commented Jan 10, 2025

Hi @NickCrews (small world, eh!), have you ever used your workaround when new_data is a memtable? I get: duckdb.duckdb.CatalogException: Catalog Error: Table with name ibis_pandas_memtable_yynerbs36bezdf3us5awk7grji does not exist!.

In my use case the generated sql string is:

INSERT INTO target_table SELECT * FROM "ibis_pandas_memtable_yynerbs36bezdf3us5awk7grji" AS "t0" ON CONFLICT DO NOTHING;

It's slightly different from yours, since I need to run it on both duckdb and postgres and it needs to do nothing rather than update; anyway, I totally see why the db engine complains! I'm wondering if you managed to make it work under these conditions too.

NickCrews (Contributor) commented:

Lol hi!

I assume the problem is that the memtable actually lives in a different backend (the default duckdb one?) than the one you are upserting into. I'm not at a computer to test, but if you start with new_data = backend.create_table("remember_to_cleanup", new_data), does that get the new data into the backend where you can reach it?

lmores commented Jan 17, 2025

Forgot to get back to you: you were right! My backend is postgres, creating an intermediate table as you suggested made it work.
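The fix that worked here — materializing the in-memory data as a staging table inside the *target* backend before running the conflict-handling INSERT — can be sketched backend-agnostically. A minimal sketch with stdlib sqlite3 standing in for the target database (names are made up for illustration; the `WHERE true` is needed in SQLite to disambiguate `INSERT ... SELECT ... ON CONFLICT`):

```python
import sqlite3

# Pattern: stage in-memory rows as a temp table inside the target
# database, run the conflict-handling INSERT from that table, then
# drop the staging table. This avoids referencing a table that only
# exists in another backend (the memtable error above).

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, v TEXT)")
conn.execute("INSERT INTO target VALUES (1, 'old')")

new_rows = [(1, "new"), (2, "fresh")]
conn.execute("CREATE TEMP TABLE staging (id INTEGER, v TEXT)")
conn.executemany("INSERT INTO staging VALUES (?, ?)", new_rows)
conn.execute(
    "INSERT INTO target "
    "SELECT id, v FROM staging WHERE true "  # lmores' variant: skip existing rows
    "ON CONFLICT(id) DO NOTHING"
)
conn.execute("DROP TABLE staging")
# target keeps (1, 'old') and gains (2, 'fresh')
```

In the ibis version of this pattern, the staging table is the `backend.create_table(...)` call from the previous comment, and dropping it afterwards is the "remember to clean up" step.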
