Bulk import for efficient importing of data from the client into the database #27333

Open · roji opened this issue Feb 1, 2022 · 10 comments

roji commented Feb 1, 2022

Databases usually provide various mechanisms for efficient bulk import of data from the client (e.g. .NET application) into the database. We already have issues tracking improving INSERT performance (#15059, #9118, #10443), but bulk import is a specialized scenario where the native database mechanisms work more efficiently (and often considerably so) than multiple INSERTs. Note also that this is different from bulk copy of data across tables within the database (INSERT ... SELECT, tracked by #27320).

Bulk import only allows targeting a table, so a simple method on DbSet should suffice (LINQ operators don't make sense here):

ctx.Blogs.ImportFrom(blogs);

ImportFrom should accept an IEnumerable parameter; ImportFromAsync should have overloads accepting both IEnumerable and IAsyncEnumerable parameters. The method would pull entity instances and import them into the database table via the provider-specific mechanism.
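
A rough sketch of what that could look like, written here as extension methods purely for illustration (none of these names or shapes are final):

using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public static class BulkImportDbSetExtensions
{
    // Import from any in-memory or lazily produced sequence.
    public static void ImportFrom<TEntity>(this DbSet<TEntity> set, IEnumerable<TEntity> entities)
        where TEntity : class
        => throw new NotImplementedException(); // provider-specific implementation goes here

    // Async counterpart accepting IEnumerable<TEntity>.
    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IEnumerable<TEntity> entities, CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException();

    // Async counterpart accepting IAsyncEnumerable<TEntity>, e.g. when streaming rows from another source.
    public static Task ImportFromAsync<TEntity>(
        this DbSet<TEntity> set, IAsyncEnumerable<TEntity> entities, CancellationToken cancellationToken = default)
        where TEntity : class
        => throw new NotImplementedException();
}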

Additional notes

  • EF Core can provide a default implementation which just uses regular INSERTs (without tracking (#9118) and possibly with other optimizations). This would make the API always work, regardless of whether a provider has supplied a specialized implementation for it.
  • This method would be non-tracking. Tracking typically isn't needed when bulk-importing, and the optimized import mechanisms typically don't allow reading back generated values (i.e. the IDs).
  • Most databases also provide bulk import mechanisms which take CSV or other textual data. While this may be used under the hood (implementation detail), we wouldn't expose a user-facing API that deals with textual data; the focus is on an API where the user provides entity instances. An EF Core API accepting textual data would have no added value beyond using the underlying database API directly.
    • Similarly, some implementations allow importing data from a file on the database server - we again wouldn't expose an API for this as EF Core has no added value here.
  • We may want to provide some sort of hook for allowing users to pass provider-specific parameterization of the import process (e.g. SqlBulkCopy allows controlling the batch size, the timeout...). For example, the SQL Server provider could expose an additional overload accepting a custom SqlServerBulkCopyOptions, and there would be a way for that overload to pipe the user-provided options down to the provider's implementation (see the sketch after this list).
  • Naming-wise, the word "import" is used because "copy" seems more ambiguous with copying across tables in the database (#27320).
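
For context, these are the kinds of knobs a provider-specific options overload would presumably pipe down to SqlBulkCopy on SQL Server (a minimal sketch; connectionString and the blogsTable DataTable are assumed to already exist):

using Microsoft.Data.SqlClient;

using var bulkCopy = new SqlBulkCopy(connectionString)
{
    DestinationTableName = "Blogs",
    BatchSize = 10_000,    // rows sent to the server per batch
    BulkCopyTimeout = 300  // seconds before the copy times out
};

bulkCopy.WriteToServer(blogsTable); // also accepts an IDataReader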

Database support

  • SQL Server (SqlBulkCopy) - takes DataTable or IDataReader as input.
  • PostgreSQL (Npgsql) - specialized API for binary import.
  • MySql
    • MySqlConnector - similar to SQL Server's SqlBulkCopy (but seems to support only DataTable/IEnumerable, no IDataReader).
    • Official driver (seems to be CSV/tab-delimited only).
  • SQLite: apparently no special bulk import API, but there are well-known tips for making inserts faster (e.g. a single transaction with a reused parameterized command - see the sketch after this list). It may make sense to wrap that in a bulk import implementation.
  • Cosmos. This is a good reason to have this abstraction in core rather than relational.
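
For SQLite, a minimal sketch of the fast-insert recipe referenced above - one transaction and one reused parameterized command (the Blogs table and blogs collection are illustrative):

using Microsoft.Data.Sqlite;

using var connection = new SqliteConnection("Data Source=blogs.db");
connection.Open();

using var transaction = connection.BeginTransaction();
using var command = connection.CreateCommand();
command.Transaction = transaction;
command.CommandText = "INSERT INTO Blogs (Name, Url) VALUES ($name, $url)";

var nameParam = command.CreateParameter();
nameParam.ParameterName = "$name";
command.Parameters.Add(nameParam);

var urlParam = command.CreateParameter();
urlParam.ParameterName = "$url";
command.Parameters.Add(urlParam);

foreach (var blog in blogs)
{
    nameParam.Value = blog.Name;
    urlParam.Value = blog.Url;
    command.ExecuteNonQuery(); // same command reused for every row
}

transaction.Commit(); // committing once avoids a disk sync per insert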

Community implementations

AndriySvyryd commented

We might need an overload with anonymous types for shadow properties.
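
For example, something roughly like this (purely illustrative - assumes CreatedOn is a shadow property on Blog with no CLR member):

ctx.Blogs.ImportFrom(blogs.Select(b => new
{
    b.Name,
    b.Url,
    CreatedOn = DateTime.UtcNow // supplies the shadow property's value
}));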

Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.

roji commented Feb 2, 2022

> We might need an overload with anonymous types for shadow properties.

Yep..

> Note that this would also allow inserts for keyless entity types. So we might need a non-collection overload too.

Can you elaborate?

AndriySvyryd commented

> Can you elaborate?

Just some sugar:

ctx.KeylessBlogs.ImportFrom(new KeylessBlog());

roji commented Feb 2, 2022

Ah I see - you're saying this would be the only way to insert keyless entity types (because we don't support them in Add/SaveChanges), so it makes sense to allow "bulk import" of a single instance... Right.

roji commented Feb 20, 2022

Note: we have #9118 for optionally not tracking after SaveChanges, which would unlock using SqlBulkCopy for standard SaveChanges. At that point, the advantage of a dedicated bulk import API (this issue) becomes bypassing the change tracking machinery, which may or may not be worth it.

yzorg commented Jun 24, 2022

Vote for taking a very close look at EFCore.BulkExtensions. One of the most underrated .NET OSS libraries I use.

It already respects EF Core table name mappings and column/property mappings. It supports all the "batch" scenarios: "upsert", insert only (only new), and "delete if missing" (MERGE with a DELETE clause). I've needed all those scenarios. All are much faster than current EF Core for 10k+ rows (e.g. importing a full dataset daily/monthly/etc.). It supports pulling new identity values back into entities, but it's off by default, which I think is sensible for bulk scenarios. If you can reuse roji's RETURNING work here, that might speed it up.

roji commented Jun 26, 2022

@yzorg thanks, yeah - we're aware of EFCore.BulkExtensions.

The plan specifically in this issue is only to wrap the database's native bulk import mechanism - SqlBulkCopy for SQL Server, binary COPY for PostgreSQL, etc. These lower-level APIs generally allow simply copying large quantities of data into a table in the most efficient manner possible.

Various "merge" functionality such as upserting, deleting if missing, etc. are higher-level functionality which generally isn't covered by a database's bulk import mechanism, and are implemented by using additional database functionality. For example, it's typical to first use bulk import to efficiently copy data into a temporary table (e.g. via SqlBulkCopy), and then use MERGE to perform upsert/delete-if-missing/whatever between the temp table and the target table.

Such "compound" operations would be out of scope for this specific issue, but it would provide the first building block (bulk import). You could then use raw SQL to express whatever MERGE operation you want (we could possibly even provide an EF API for that, though I'm not sure it would be very valuable).

khteh commented Feb 17, 2023

https://www.npgsql.org/doc/copy.html#binary-copy is rather crude. I need to ingest tens of thousands of records from an Excel sheet into the DB without any duplicates.

roji commented Feb 18, 2023

@khteh in what way is Npgsql's binary copy support crude? Deduplicating isn't a concern of bulk import in itself - that's something that needs to be done at a higher level (and therefore wouldn't be covered here in any case). More details on exactly what you need could help clarify things.
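
For reference, a minimal binary COPY sketch with Npgsql looks roughly like this (the blogs table/collection is illustrative):

using Npgsql;
using NpgsqlTypes;

using var conn = new NpgsqlConnection(connectionString);
conn.Open();

using var writer = conn.BeginBinaryImport(
    "COPY blogs (name, url) FROM STDIN (FORMAT BINARY)");

foreach (var blog in blogs)
{
    writer.StartRow();
    writer.Write(blog.Name, NpgsqlDbType.Text);
    writer.Write(blog.Url, NpgsqlDbType.Text);
}

writer.Complete(); // the import is discarded if the importer is disposed without Complete()

Deduplication would still need to happen around this step, e.g. by staging into a temporary table and then inserting into the target with ON CONFLICT DO NOTHING.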

alrz commented Oct 30, 2024

Would love to see this support IAsyncEnumerable as the input, in case we're pulling from another data source like mongo into sql.
