
Add support for INSERT IGNORE / ON CONFLICT DO NOTHING #16949

Open
Tracked by #14496 ...
shravan2x opened this issue Aug 5, 2019 · 28 comments

@shravan2x

shravan2x commented Aug 5, 2019

I propose adding an API to indicate to EF Core that an insert (via Add or AddRange and their variants) may fail because the row already exists and should simply be skipped. This is useful when adding data that may already exist in a table.

For instance, I have a program that collects analytics data from a source and inserts it all into a table. Some of these rows may be duplicates of previously inserted rows and so will fail unique constraints. These errors can be safely ignored since the rows already exist in the table.

Using individual INSERT IGNORE ... statements results in horrible performance (~15x slower than AddRangeAsync).

@ErikEJ
Contributor

ErikEJ commented Aug 5, 2019

What Database provider?

@ajcvickers
Contributor

EF Team Triage: This issue is not something that our team is planning to address in the EF6.x code base. This does not mean that we would not consider a community contribution to address this issue.

Moving forwards, our team will be fixing bugs, implementing small improvements, and accepting community contributions to the EF6.x code base. Larger feature work and innovation will happen in the EF Core code base (https://github.com/aspnet/EntityFramework).

Closing an issue in the EF6.x project does not exclude us addressing it in EF Core. In fact, a number of popular feature requests for EF have already been implemented in EF Core (alternate keys, batching in SaveChanges, etc.).

BTW this is a canned response and may have info or details that do not directly apply to this particular issue. While we'd like to spend the time to uniquely address every incoming issue, we get a lot of traffic on the EF projects and that is not practical. To ensure we maximize the time we have to work on fixing bugs, implementing new features, etc., we use canned responses for common triage decisions.

@shravan2x
Author

@ErikEJ MySql.Data and MySqlConnector (I tested on both).

@ajcvickers Isn't this already the EF Core repo? I'm confused what you mean by your comment.

@ajcvickers
Contributor

@shravan2x It's a boilerplate message--see the note at the end of it. The point is we're not going to do it for EF6.

@shravan2x
Author

shravan2x commented Aug 5, 2019

@ajcvickers Oh I see. I don't know which repo hosts the EF6 code base, but since this one is already the EF Core code base, can we leave the issue open?

EDIT: Hopefully I understood what you meant :)

@ajcvickers
Contributor

@shravan2x Are you asking for this to be implemented in EF Core or EF6?

@shravan2x
Author

shravan2x commented Aug 5, 2019

@ajcvickers EF Core. Sorry, my original question was mis-worded. I'll fix that.

@ajcvickers ajcvickers reopened this Aug 5, 2019
@ajcvickers
Contributor

ajcvickers commented Aug 5, 2019

@shravan2x We will discuss in triage.

@ajcvickers ajcvickers added this to the Backlog milestone Aug 12, 2019
@ajcvickers
Contributor

Notes from triage: it's not 100% clear that this is really useful in the context of an OR/M, but moving it to the backlog to consider at a later time.

@LeaFrock

LeaFrock commented Oct 28, 2019

Notes from triage: it's not 100% clear that this is really useful in the context of an OR/M, but moving it to the backlog to consider at a later time.

At least for MySQL, it's very useful.
I've seen a lot of questions like INSERT-IGNORE-using-EF-Core about how to insert data into a table that already has data and a unique index/key, without any exceptions (if the data already exists, just skip it). MySQL supports this with INSERT IGNORE INTO, while I still don't know what the equivalent is on SQL Server.
If EF Core provided this API, we'd have a best practice to follow.

@mqudsi

mqudsi commented Apr 16, 2020

@ajcvickers can you explain what your doubts about its suitability for an ORM are? I'm not sure I see the difference between this or other SQL translations.

When attempting to insert data that may already be present in the database without insert or ignore, you basically need to either prepopulate some sort of Bloom filter or hash set beforehand and pray you don't run into any coherence issues, or else use fine-grained transactions and throw and catch lots of exceptions.

While the SQL specifics are certainly provider-dependent (just like everything else), it's certainly not a concept specific to a single SQL dialect. INSERT IGNORE is very closely linked with upsert, which is quickly becoming "a thing" in the RDBMS world. MySQL has INSERT IGNORE and REPLACE INTO, SQLite has INSERT OR IGNORE for primary key conflicts and ON CONFLICT [IGNORE|REPLACE], PostgreSQL has ON CONFLICT ____ DO [NOTHING|UPDATE ...], and SQL Server has IF NOT EXISTS(..) on old versions and now MERGE ....

Please correct me if I'm wrong, but we have a general concept (need to insert content that may or may not already exist in the database and either update/abort/ignore if it already does) that is available in the most popular database options supported/targeted by EF Core, but requires different syntax for each implementation, which to me makes it definitely ORM-worthy.

The only issue is that it does not map cleanly to LINQ, because LINQ expressions don't really have any concept of constraints. But for the most part, updates and inserts are orthogonal to querying, and I imagine this could be implemented without touching the IQueryable interface at all. If you take the simplest example of either completely replacing or completely ignoring (i.e. disregarding the more complex options afforded by the PostgreSQL and SQL Server dialects), then in its most basic form this could simply be an optional parameter to the DbSet<>.Add() method specifying the conflict resolution action, with ABORT or equivalent (which is basically the current behavior) being the default and the only required option to support.
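
To make that concrete, here is a purely hypothetical sketch of what such an overload could look like - none of this exists in EF Core today, and the OnConflict enum and the extra Add/AddRange parameter are invented for illustration only:

// Hypothetical API sketch - not part of EF Core.
public enum OnConflict
{
    Abort,   // current behavior: SaveChanges throws on a unique constraint violation
    Ignore,  // INSERT IGNORE / ON CONFLICT DO NOTHING
    Replace  // REPLACE INTO / ON CONFLICT DO UPDATE
}

// Usage: the provider would translate the chosen action into its own dialect.
db.Readings.Add(newReading, OnConflict.Ignore);
db.Readings.AddRange(newReadings, OnConflict.Ignore);
await db.SaveChangesAsync();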

@ajcvickers
Contributor

@mqudsi I don't remember the full extent of the discussion with the team. Nevertheless, this is something we have on the backlog to consider for a future release.

@tombohub

tombohub commented Sep 2, 2021

Does anybody have a workaround?

@shravan2x
Author

@tombohub I use this package which works fine - https://github.com/artiomchi/FlexLabs.Upsert
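
For anyone else landing here, a rough sketch of how that package expresses "insert or ignore" (based on my reading of its README; the Reading entity and its key columns are made-up names, so check the library docs before copying this):

// FlexLabs.EntityFrameworkCore.Upsert: insert the row if no row with the same
// (SourceId, TakenAt) key exists yet, and do nothing if one does.
await db.Readings
    .Upsert(new Reading { SourceId = 1, TakenAt = date, Value = 42.0 })
    .On(r => new { r.SourceId, r.TakenAt }) // columns of the unique index to match on
    .NoUpdate()                             // skip instead of updating when a match is found
    .RunAsync();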

@roji
Member

roji commented Sep 2, 2021

This is probably related to upsert support (#4526) - the two should probably be at least considered together.

On PostgreSQL, INSERT has an ON CONFLICT clause. ON CONFLICT DO NOTHING is the same as INSERT IGNORE, whereas ON CONFLICT DO UPDATE is upsert.

  • A tricky aspect here is the interaction with change tracking: if an insert is ignored, we end up with a discrepancy (since the existing row in the database may have different values from what we're trying to insert). This may make sense only with non-tracking inserts (Do not track after SaveChanges() #9118).
  • PostgreSQL allows defining ON CONFLICT actions per constraint (or not), so that primary key violations may be ignored/upserted while other unique index violations may still throw. MySQL INSERT IGNORE doesn't seem to support this - not sure if that's a problem.
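
To make the difference concrete with a made-up table: PostgreSQL can optionally scope the clause to a specific conflict target, while MySQL cannot.

-- PostgreSQL: ignore a conflict on any unique constraint...
INSERT INTO readings (source_id, taken_at, value)
VALUES (1, '2021-09-02', 42.0)
ON CONFLICT DO NOTHING;

-- ...or only on a specific unique index, so other violations still throw
INSERT INTO readings (source_id, taken_at, value)
VALUES (1, '2021-09-02', 42.0)
ON CONFLICT (source_id, taken_at) DO NOTHING;

-- MySQL: every ignorable error (including all unique key violations) is suppressed
INSERT IGNORE INTO readings (source_id, taken_at, value)
VALUES (1, '2021-09-02', 42.0);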

@mqudsi

mqudsi commented Sep 2, 2021

An issue/problem is that arbitrary columns may be passed to PostgreSQL ON CONFLICT but it'll fail at runtime if the columns aren't covered by a unique index/constraint. MySQL's INSERT IGNORE is much simpler (and therefore more ORM-friendly).

One would be forgiven for thinking that this particular sub-feature could be elided from any ignore/upsert implementation, but such an abstraction would be severely crippled, because any unique constraint that includes a nullable column would never trigger (since NULL != NULL in SQL). PostgreSQL users could get around this with a functional index/constraint that coalesces the nullable column with a reserved invalid value (i.e. default(Guid)), but I don't think MySQL can do that without also using generated columns (where supported) - and a WHERE clause would have to be provided.
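
A sketch of that PostgreSQL workaround, with a made-up table and an all-zeros GUID standing in for default(Guid):

-- A plain UNIQUE (tenant_id, external_id) never fires for rows where external_id
-- IS NULL, because NULL <> NULL. Coalescing the nullable column to a reserved
-- sentinel value inside an expression index makes such rows conflict again.
CREATE UNIQUE INDEX ux_items_tenant_external
    ON items (tenant_id, COALESCE(external_id, '00000000-0000-0000-0000-000000000000'::uuid));

(Note that for ON CONFLICT to infer this index, the insert would have to repeat the same COALESCE expression in its conflict target.)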

(With postgres, you could also use (one or more) partial index(es) but that would require the ORM ignore/upsert API to additionally support/require a WHERE clause specifying the same predicate as the partial index so it would apply - and you'd still have to specifically name the columns in question since, IIRC, postgres index inference only searches for matching non-partial indices if no columns are specifically named.)

Otherwise, multiple indices covering all permutations of one nullable column + all other columns would be required.

@roji
Member

roji commented Sep 3, 2021

An issue/problem is that arbitrary columns may be passed to PostgreSQL ON CONFLICT but it'll fail at runtime if the columns aren't covered by a unique index/constraint. MySQL's INSERT IGNORE is much simpler (and therefore more ORM-friendly).

Interesting - what specific issues do you see with PG's arbitrary column support, and how do you see INSERT IGNORE as more ORM-friendly? At the very least, the PG provider could generate ON CONFLICT DO NOTHING without any columns, providing (I think) the same behavior as MySQL, no? In that sense PG seems to be a superset of the MySQL functionality?

One would be forgiven for thinking that this particular sub-feature could be elided from any ignore/upsert implementation, but such an abstraction would be severely crippled because any unique constraints that include a nullable column would never trigger (because NULL != NULL in SQL).

I don't think this is really worse than other places where different database behavior leaks through - EF Core doesn't really pretend to provide a uniform abstraction over all databases (e.g. string comparison is case-sensitive in some databases and insensitive in others). In other words, it may make sense to provide a general relational "add or ignore" mechanism, which would be implemented by ON CONFLICT on PG and INSERT IGNORE on MySQL, even if there are some behavioral differences. But I admit I haven't gone into the fine details here.

But your points (and suggested workarounds) are interesting... Note that PG also has exclusion constraints, which could be used with IS DISTINCT FROM to disallow multiple NULLs, if that's desired (since standard SQL DISTINCT does consider NULLs equal). This would obviously be a PG-specific thing.

@AndriySvyryd AndriySvyryd changed the title Add support for INSERT IGNORE Add support for INSERT IGNORE / ON CONFLICT DO NOTHING Nov 4, 2021
@mqudsi

mqudsi commented Mar 10, 2022

Not directly related to the issue but with regards to how it would be implemented and my earlier comments about the difficulties in using a unique index covering potentially null fields to handle insert ignore conflicts: PostgreSQL 15 is going to ship w/ support for UNIQUE [NULLS NOT DISTINCT], which is kind of a big deal. I guess this is a more ergonomic and directly usable alternative to the exclusion constraints Shay mentioned. I've also heard we can expect to see other providers also adding support for this?

More here: commit and discussion

@roji
Member

roji commented Mar 11, 2022

@mqudsi thanks for pointing that out. Do you have any indication of other databases implementing this feature? For now, I've opened npgsql/efcore.pg#2298 to track this as an Npgsql-specific feature, but we can always promote it into EF Core's relational layer if multiple other databases add support.

@mqudsi

mqudsi commented Mar 12, 2022

@roji Thanks for the interest in including support for this in Npgsql!

I can't seem to find the article I was reading that mentioned support for another db, so perhaps I shouldn't have said others might be adding support.

(Technically the equivalent of a unique index that includes n non-distinct nullable columns (the current pgsql approach is that either all or none of the nullable columns are treated as distinct, which helps preserve our sanity) is 2^n unique indexes like on table(col1, col2) where col3 is null, on table(col1, col3) where col2 is null, ..., which an ORM could easily calculate, but I'm not sure it would be a smart thing to do unless the dev fully understands the implications!)

Regarding exclusion constraints - I looked rather hard, but it seems that PostgreSQL requires an actual symbolic operator rather than an equivalent expression for exclusion constraints (and there is none equivalent to IS DISTINCT FROM, excepting custom operators), but perhaps I missed something? While playing around with that, I did discover an alternative approach that might work even on other db engines: if you abuse the fact that ARRAY[NULL] = ARRAY[NULL] (and I don't know whether this, syntax aside, is postgres-specific), then you can actually create just one functional/computed index like

CREATE UNIQUE INDEX ON table( (ARRAY[notnull1, notnull2, nullable1, nullable2, nullable3]) ) 
        [ WHERE num_nulls(nullable1, nullable2, nullable3) > 0 ]

(You could optionally do USING HASH since we will never look up on its basis, depending on how often you expect a collision leading to a disk access, iiuc?)


(I never did reply to your question about why PostgreSQL's ability to name specific columns for an ON CONFLICT clause would be a problem if the shared functionality with MySQL's barebones INSERT IGNORE (an ON CONFLICT with no named columns) isn't a problem. And you're right: I just got so tangled up in thinking about how best to abuse pgsql's features for my purposes that I completely missed the forest for the trees.)

@roji
Member

roji commented Mar 14, 2022

Thanks for the continued insights - this is definitely interesting and valuable.

(Technically the equivalent of a unique index that includes n non-distinct nullable columns (current pgsql approach is that either all or none of the nullable columns are treated as distinct, which helps preserve our sanity) is equivalent to 2^n - 1 unique indexes like on table(col1, col2) where col3 is null, on table(col1, col3) where col 2 is null, ... which an ORM could easily calculate, but I'm not sure it would be a smart thing to do unless the dev fully understands the implications!)

I'm not sure exactly what you mean here - can you elaborate?

Regarding exclusion constraints - I looked rather hard but it seems that postgresql requires an actual symbolic operator rather than an equivalent expression for exclusion constraints (and there is none equivalent to IS DISTINCT FROM, excepting custom operators), but perhaps I missed something?

That's interesting - and I think you're right, exclusion constraints seem to only work with basic operators, which IS DISTINCT FROM apparently is not. I think this could also mean that IS DISTINCT FROM may not (currently) be sped up with indexes, which I vaguely remember is indeed the case.

@mqudsi

mqudsi commented Mar 22, 2022

I'm not sure exactly what you mean here - can you elaborate?

Sure. Let's say you have multiple one-to-many mappings all for the same type, and for some reason it's going through a mapping table instead of having the principal key on each foreign entity directly: e.g. owners(user_id, document_id?, product_id?, foo_id?, ...) and you want to make sure no one document/product/foo is owned twice. With PostgreSQL 15 you could use

ALTER TABLE owners ADD CONSTRAINT one_owned_type CHECK( num_nonnulls(document_id, product_id, foo_id) = 1 );
CREATE UNIQUE INDEX owned_only_once on owners(document_id, product_id, foo_id) NULLS NOT DISTINCT;

You can accomplish the functional equivalent in current versions of PostgreSQL (or other databases) by creating multiple indexes that handle the various nullability cases:

/* some of the following index combinations will never be used because of our CHECK CONSTRAINT, but I've left them to illustrate the general approach */
CREATE UNIQUE INDEX owned_only_once1 on owners(document_id, product_id, foo_id) WHERE document_id IS NOT NULL AND product_id IS NOT NULL AND foo_id IS NOT NULL;
CREATE UNIQUE INDEX owned_only_once2 on owners(document_id, product_id) WHERE document_id IS NOT NULL AND product_id IS NOT NULL AND foo_id IS NULL;
CREATE UNIQUE INDEX owned_only_once3 on owners(document_id, foo_id) WHERE document_id IS NOT NULL AND product_id IS NULL AND foo_id IS NOT NULL;
CREATE UNIQUE INDEX owned_only_once4 on owners(document_id) WHERE document_id IS NOT NULL AND product_id IS NULL AND foo_id IS NULL;
CREATE UNIQUE INDEX owned_only_once5 on owners(product_id, foo_id) WHERE document_id IS NULL AND product_id IS NOT NULL AND foo_id IS NOT NULL;
CREATE UNIQUE INDEX owned_only_once6 on owners(product_id) WHERE document_id IS NULL AND product_id IS NOT NULL AND foo_id IS NULL;
CREATE UNIQUE INDEX owned_only_once7 on owners(foo_id) WHERE document_id IS NULL AND product_id IS NULL AND foo_id IS NOT NULL;
/* if you had non-nullable columns included in the index, this additional index would be required: */
CREATE UNIQUE INDEX owned_only_once8 on owners(/* non-nullable member cols here */) WHERE document_id IS NULL AND product_id IS NULL AND foo_id IS NULL;

Those indexes should be functionally equivalent (from the perspective of what entries are allowed or rejected) to the NULLS NOT DISTINCT index from postgres 15, just much more of a pita to maintain. I believe the pg15 impl also performs better, though the size (minus constant per-index sizes) should be the same w/ all the WHERE preconditions left in.

(I realized after writing that the example I picked is not the best since it's typically going to have the only-one-non-null constraint for sanity reasons. I couldn't come up with a better example off the top of my head, so I just kept it!)

@roji
Member

roji commented Mar 22, 2022

Thanks @mqudsi, makes sense.

Note SqlServerIndexConvention, which is a SQL Server-specific convention that filters out NULLs on unique non-clustered indexes.

But let's continue any conversation on this in npgsql/efcore.pg#2298, as this is quite off-topic here.

@dragoon

dragoon commented Jul 12, 2022

I also think this is a very important feature that even the most basic app needs.

Consider a scenario where there is a third-party data provider from which I can only query the last N objects, which I then transform and insert into my DB. Without this functionality I need to first query the DB to find which objects already exist, exclude the existing IDs, and only insert the new ones.
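
Until there's first-class support, that workaround looks roughly like the sketch below (Item, db.Items and the provider call are placeholders for your own types):

// Fetch the last N objects from the third-party provider (placeholder call).
List<Item> incoming = await provider.GetLatestAsync();
var incomingIds = incoming.Select(i => i.Id).ToList();

// Extra round-trip: find which of those IDs are already stored.
var existingIds = new HashSet<int>(
    await db.Items
        .Where(i => incomingIds.Contains(i.Id))
        .Select(i => i.Id)
        .ToListAsync());

// Insert only the genuinely new rows. Note the race: a concurrent import can still
// insert the same ID between the query and SaveChanges, which is exactly what a
// native INSERT IGNORE / ON CONFLICT DO NOTHING would avoid.
db.Items.AddRange(incoming.Where(i => !existingIds.Contains(i.Id)));
await db.SaveChangesAsync();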

@roji
Member

roji commented Oct 19, 2022

Note #14464 which was about using the following technique to achieve "add or ignore":

INSERT INTO Users (UserName, Email)
SELECT @UserName, @Email
EXCEPT 
SELECT UserName, Email
FROM Users
WHERE Email = @Email

On databases where a specific syntax already exists for add or ignore, that's obviously best. On SQL Server - where no such thing exists - the above could be an alternative implementation. Another way to approach this would be with MERGE - which may be more efficient - though MERGE would probably have concurrency issues (all this needs to be researched).
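
For reference, a MERGE-based "add or ignore" on SQL Server could look like the sketch below (reusing the Users table from the snippet above; the HOLDLOCK hint is one common way to narrow the race between the match check and the insert, but the concurrency caveats above still apply):

MERGE Users WITH (HOLDLOCK) AS target
USING (VALUES (@UserName, @Email)) AS source (UserName, Email)
    ON target.Email = source.Email
WHEN NOT MATCHED THEN
    INSERT (UserName, Email)
    VALUES (source.UserName, source.Email);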

@borisdj

borisdj commented May 8, 2023

To add one solution: using EFCore.BulkExtensions, this can be done as follows.

If we want to insert only new rows and skip the ones that already exist in the DB (insert-if-not-exists), use BulkInsertOrUpdate with the config PropertiesToIncludeOnUpdate = new List<string> { "" }.
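
In code that looks roughly like this (a sketch based on the description above; check the library docs for the exact BulkConfig options):

// EFCore.BulkExtensions: BulkInsertOrUpdate normally upserts, but with an empty
// update-column list it effectively becomes "insert if not exists".
var bulkConfig = new BulkConfig
{
    PropertiesToIncludeOnUpdate = new List<string> { "" }
};
await context.BulkInsertOrUpdateAsync(newEntities, bulkConfig);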

Disclosure: I'm the author, and the library now has a dual license - a fix for OSS sustainability.
(The community version is free and covers most users; the commercial one is for companies over 1 million in revenue.)

@kitgrose

I would really like to see something like this supported in EF Core.

My specific use-case: I'm writing a (simple) data importer which periodically imports CSV data into my application database. The data being imported in each CSV is relatively small, but the operation happens often enough that the database is expected to be very large over time, and the operation may run in parallel with other imports. Since the data in the database only comes from the same source, it's never an issue if a new record can't be added (because its primary key is already in the table); in fact it's very likely since the CSV data deliberately contains an hour or so of overlapping data each import to avoid missing records added around the time of export.

On SQL Server this can be achieved with the IGNORE_DUP_KEY argument on the index, but using that throws exceptions in EF Core, presumably because of entity change tracking. I'm sure you're already familiar with this property, but I struggled to find this issue again because none of the other comments mention that specific argument by name.
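
For others searching for it, the index option mentioned above looks like this (made-up table and column names):

-- SQL Server silently discards duplicate keys instead of raising an error, but the
-- statement then reports fewer affected rows than EF Core expects, which is
-- presumably why SaveChanges throws.
CREATE UNIQUE INDEX IX_Readings_Source_TakenAt
    ON Readings (SourceId, TakenAt)
    WITH (IGNORE_DUP_KEY = ON);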

@roji
Member

roji commented May 15, 2024

@kitgrose on SQL Server, this kind of thing is typically achieved via the MERGE statement, which is presumably what we'd use to implement the future insert-or-ignore API. For now, you should be able to drop down to SQL to code this specific importer.
