Optimize multiple insertions #113
Comments
Agree, this would be great. Here is a link which describes how it can be done with the ADO.NET provider for SQL Server. Some sample code taken from that page:

```csharp
using (SqlBulkCopy copy = new SqlBulkCopy(conn))
{
    copy.DestinationTableName = "Quotes";
    DataTable table = new DataTable("Quotes");
    table.Columns.Add("Symbol", typeof(string));
    table.Columns.Add("QuoteDate", typeof(DateTime));
    table.Columns.Add("Price", typeof(decimal));
    for (int i = 0; i < nRecords; ++i)
    {
        table.Rows.Add("MSFT", new DateTime(2000, 1, 1).AddDays(i), 12.3m);
    }
    copy.WriteToServer(table);
}
```

If we could get an API that is basically swapping …
@perlun, I think there's a bit of confusion here. Npgsql already has an efficient bulk copy API which provides the same benefits as SqlBulkCopy. This issue is about using that bulk copy API from Npgsql's Entity Framework Core provider. EF Core already supports batched updates: if you perform several updates and then call SaveChanges(), they are all sent to the database in a single round-trip. This is far from a trivial change, and we need to benchmark first to see if the performance benefits (i.e. bulk copy vs. batched inserts) justify it.
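To make the batching behaviour concrete, here is a minimal sketch, assuming a hypothetical `Blog` entity and `BlogContext` (not from this thread) configured to use the Npgsql EF Core provider:

```csharp
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical entity and context, used only to illustrate the batching
// behaviour described above; names and connection string are made up.
public class Blog
{
    public int Id { get; set; }
    public string Name { get; set; } = "";
}

public class BlogContext : DbContext
{
    public DbSet<Blog> Blogs => Set<Blog>();

    protected override void OnConfiguring(DbContextOptionsBuilder options)
        => options.UseNpgsql("Host=localhost;Username=test;Password=test;Database=test");
}

public static class Program
{
    public static async Task Main()
    {
        await using var db = new BlogContext();

        for (var i = 0; i < 1000; i++)
            db.Blogs.Add(new Blog { Name = $"blog-{i}" });

        // A single SaveChanges call sends all pending INSERTs to PostgreSQL
        // as batched commands in one round-trip (subject to the provider's
        // maximum batch size), without using COPY.
        await db.SaveChangesAsync();
    }
}
```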
As part of a POC I did some performance optimizations around this library (mainly ported the code from the SQL Server provider): https://gist.github.com/zlepper/7f55ab76547d81eb6eb403ad4feab06b You will probably want someone who knows this much better than me to actually look it over and do it, but I hope it can help optimize inserts :) From personal experience: it doesn't speed up PostgreSQL a whole lot (the difference I saw in local testing could be attributed to chance). However, in the case of CockroachDB I saw up to 15x throughput, simply because it can batch transactions much better, and it will probably matter even more for distributed clusters (I was just testing with a 3-node Docker cluster on the same machine).
@zlepper can you provide some context on exactly what your POC is doing? The provider already batches multiple updates/inserts into a single DbCommand, which means that everything happens in a single round-trip. This issue is about implementing inserts specifically via Npgsql's binary COPY API, which is a super-optimized, PostgreSQL-specific way to get data into the database.
@roji Yes, I'm basically trying to insert a couple of billion rows into the database. So the optimization I did converts (with just a lot more rows per save):

```sql
insert into foo(a, b)
values ('a', 'b');
insert into foo(a, b)
values ('c', 'd');
insert into foo(a, b)
values ('e', 'f');
```

into

```sql
insert into foo(a, b)
values ('a', 'b'),
       ('c', 'd'),
       ('e', 'f');
```

(Of course it's still parameterized, I just didn't want to type that out, since that doesn't matter for the specific changes I did.) So it's still just one round-trip, but each statement becomes a transaction (if I have understood things correctly; I'm not that sharp on database things). My purpose of inserting the data doesn't actually have anything to do with inserting it, I just need the data there, to test how some different databases handle it when there is a lot of data already. Do also watch out for the binary copy: in my experience that doesn't work in Cockroach (at least the thing …). For my specific case I ended up using the actual import APIs, since they are by far the fastest, but that's a whole other thing.
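For what it's worth, a parameterized version of that rewritten multi-row INSERT might look roughly like this with plain Npgsql (a sketch only, not the actual POC code; the table, columns and connection string are made up):

```csharp
using System.Text;
using Npgsql;

await using var conn = new NpgsqlConnection("Host=localhost;Username=test;Password=test");
await conn.OpenAsync();

var rows = new[] { ("a", "b"), ("c", "d"), ("e", "f") };

await using var cmd = conn.CreateCommand();
var sql = new StringBuilder("INSERT INTO foo (a, b) VALUES ");

for (var i = 0; i < rows.Length; i++)
{
    // One pair of parameters per row, all sent as a single statement.
    sql.Append(i == 0 ? "" : ", ").Append($"(@a{i}, @b{i})");
    cmd.Parameters.AddWithValue($"a{i}", rows[i].Item1);
    cmd.Parameters.AddWithValue($"b{i}", rows[i].Item2);
}

cmd.CommandText = sql.ToString();
await cmd.ExecuteNonQueryAsync();
```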
@zlepper When using the former (without your change), have you tried turning on automatic statement preparation? It has the potential to make the inserts run much faster than your modified version. Regardless, if you really are inserting billions of rows, consider bypassing EF and coding directly against the Npgsql COPY API - you'll definitely get the best perf possible. You can still use EF Core for all other access to the database.
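For readers following along, a rough sketch of both suggestions (connection string, table and row count are placeholders): automatic statement preparation is enabled via the Max Auto Prepare connection string setting, and the binary COPY import API is used directly on an NpgsqlConnection, much like in the benchmark further down.

```csharp
using Npgsql;

// "Max Auto Prepare" enables automatic statement preparation: commands that
// are executed repeatedly get prepared server-side after a few executions.
await using var conn = new NpgsqlConnection(
    "Host=localhost;Username=test;Password=test;Max Auto Prepare=10");
await conn.OpenAsync();

// Npgsql's binary COPY import API, bypassing EF Core entirely.
// Table and column here are hypothetical.
using var importer = conn.BeginBinaryImport(
    "COPY foo (data) FROM STDIN (FORMAT BINARY)");

for (var i = 0; i < 1_000_000; i++)
{
    await importer.StartRowAsync();
    await importer.WriteAsync(i);
}

await importer.CompleteAsync();
```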
Would this then help even more with the bulk version also? But no, I was not aware that was a thing.
I mean, that is basically what I ended up doing, just using … I have no doubt your suggestion will help with Postgres; however, my main problem was actually the performance in CockroachDB, where the bottleneck comes from the consistency (which requires communication between nodes), and there the problem is insert statements with just one pair of values. Also, very interestingly, I cannot find the …
It might, if you execute the same command (with the same number of parameters) multiple times. Yeah, if you're using Cockroach then everything definitely needs to be checked there specifically... I have no idea exactly what they support.
Considering that EF is currently chunking our inserts at 1000 rows per query, then yes, we would definitely fit under that :D But either way, if you can use what I sent, feel free. If not, feel free not to :) Just thought it might help some of the general investigation work here :)
Note: dotnet/efcore#9118 is planned for EF Core 7.0, which would unlock this. There are various methods we could use:

- batched INSERT statements (what the provider does today);
- a single INSERT with multiple rows in its VALUES clause;
- INSERT ... SELECT FROM unnest() over array parameters;
- PostgreSQL's binary COPY.
@YohDeadfall did some great benchmarking in npgsql/npgsql#2779 (comment). tl;dr insert with arrays works really well up until somewhere beyond 100 rows, at which point COPY overtakes it. One reason why COPY may be slower for fewer rows is that it currently requires multiple round-trips, i.e. BeginBinaryImport and the terminating Complete (this also means that whether COPY is worth doing depends on the latency to PostgreSQL). We could think about changing the ADO.NET COPY import API to be single-roundtrip, i.e. make BeginBinaryImport do nothing (a bit like BeginTransaction), and only flush and wait on Complete. We may want this to be opt-in, not sure. I'm also not sure whether COPY always works and has no limitations, so an opt-out is a good idea (probably a minimum row threshold for switching to COPY, which can be set to 0 to disable). Note also dotnet/efcore#27333, which is about a special EF Core API for bulk importing. So it may make sense to implement that API with COPY, and to implement SaveChanges with insert with arrays. Though if we do implement single-roundtrip COPY, COPY may be a better option all the time - experiment.
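To make the "insert with arrays" candidate concrete, here is a minimal sketch (table, column and connection string are placeholders) mirroring the unnest-based statement used in the benchmark below; since the SQL text doesn't change with the number of rows, it also keeps the statement stable for automatic preparation.

```csharp
using System.Linq;
using Npgsql;

await using var conn = new NpgsqlConnection("Host=localhost;Username=test;Password=test");
await conn.OpenAsync();

// A single array parameter carries all the values; unnest() expands it into
// rows server-side, so one statement inserts any number of rows.
await using var cmd = new NpgsqlCommand(
    "INSERT INTO foo (data) SELECT * FROM unnest(@i)", conn);
cmd.Parameters.AddWithValue("i", Enumerable.Range(0, 1000).ToArray());

await cmd.ExecuteNonQueryAsync();
```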
Note that even if we have single-roundtrip COPY at the ADO layer, it still won't be possible to batch it with other, non-COPY commands at the EF level. We'd need some way to e.g. integrate a COPY operation inside an NpgsqlBatch, which would mean representing a COPY operation via NpgsqlBatchCommand. That would be a whole new API which pulls rows from the user (more similar to SqlBulkCopy), rather than today's API, where the user pushes rows into the database. Or we could allow the user to embed a lambda in the special NpgsqlBatchCommand, where they push the rows with today's API. It's an interesting idea, but a bit far-fetched. Probably just go with array insert for now.
One more argument against COPY here is that if the user really is importing a massive number of rows - enough for the difference between array insert and COPY to be significant - they will also want to avoid the overhead associated with change tracking in EF Core. So they would want to use a dedicated import API as in dotnet/efcore#27333, which bypasses all that.
Updated benchmark with the various candidate methods, based on @YohDeadfall's benchmarks in npgsql/npgsql#2779 (comment):
tl;dr we should do Insert_with_arrays starting from 2-5 rows (benchmark to get the exact threshold). Notes:
/cc @AndriySvyryd

Benchmark code:

```csharp
using System.Text;
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Npgsql;

BenchmarkRunner.Run<Benchmark>();

public class Benchmark
{
    [Params(1, 2, 5, 10, 100, 500, 1000, 10000)]
    public int NumRows { get; set; }

    private NpgsqlConnection _connection;
    private NpgsqlCommand _command;

    private async Task Setup()
    {
        _connection = new NpgsqlConnection("Host=localhost;Username=test;Password=test;Max Auto Prepare = 10");
        await _connection.OpenAsync();

        await using var command = _connection.CreateCommand();
        command.CommandText = "DROP TABLE IF EXISTS foo; CREATE TABLE foo (id INT, data INT)";
        await command.ExecuteNonQueryAsync();
    }

    [GlobalSetup(Target = nameof(Batched_inserts))]
    public async Task Setup_Batched_inserts()
    {
        await Setup();

        _command = _connection.CreateCommand();
        _command.CommandText = new StringBuilder()
            .AppendJoin(" ", Enumerable.Range(0, NumRows).Select(i => $"INSERT INTO foo (data) VALUES (@p{i});"))
            .ToString();

        for (var i = 0; i < NumRows; i++)
        {
            var param = _command.CreateParameter();
            param.ParameterName = "p" + i;
            param.Value = i;
            _command.Parameters.Add(param);
        }
    }

    [GlobalSetup(Target = nameof(Insert_with_multiple_rows))]
    public async Task Setup_Insert_with_multiple_rows()
    {
        await Setup();

        _command = _connection.CreateCommand();
        var stringBuilder = new StringBuilder("INSERT INTO foo (data) VALUES");

        for (var i = 0; i < NumRows; i++)
        {
            stringBuilder
                .Append(i == 0 ? " " : ", ")
                .Append($"(@p{i})");

            var param = _command.CreateParameter();
            param.ParameterName = "p" + i;
            param.Value = i;
            _command.Parameters.Add(param);
        }

        _command.CommandText = stringBuilder.ToString();
    }

    [GlobalSetup(Target = nameof(Insert_with_multiple_rows_sorted))]
    public async Task Setup_Insert_with_multiple_rows_sorted()
    {
        await Setup_Insert_with_multiple_rows();

        _command.CommandText = @$"WITH bar AS
(
    {_command.CommandText} RETURNING id
)
SELECT * FROM bar ORDER BY id";
    }

    [GlobalSetup(Target = nameof(Insert_with_arrays))]
    public async Task Setup_Insert_with_arrays()
    {
        await Setup();

        _command = _connection.CreateCommand();
        _command.CommandText = "INSERT INTO foo (data) SELECT * FROM unnest(@i)";

        var param = _command.CreateParameter();
        param.ParameterName = "i";
        param.Value = Enumerable.Range(0, NumRows).ToArray();
        _command.Parameters.Add(param);
    }

    [GlobalSetup(Target = nameof(Insert_with_arrays_sorted))]
    public async Task Setup_Insert_with_arrays_sorted()
    {
        await Setup_Insert_with_arrays();

        _command.CommandText = @$"WITH bar AS
(
    {_command.CommandText} RETURNING id
)
SELECT * FROM bar ORDER BY id";
    }

    [GlobalSetup(Target = nameof(Copy))]
    public Task Setup_Copy()
        => Setup();

    [Benchmark(Baseline = true)]
    public async Task Batched_inserts()
        => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_multiple_rows()
        => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_multiple_rows_sorted()
        => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_arrays()
        => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Insert_with_arrays_sorted()
        => await _command.ExecuteNonQueryAsync();

    [Benchmark]
    public async Task Copy()
    {
        using var importer = _connection.BeginBinaryImport(
            "COPY foo (data) FROM STDIN (FORMAT binary)");

        for (var i = 0; i < NumRows; i++)
        {
            await importer.StartRowAsync();
            await importer.WriteAsync(i);
        }

        await importer.CompleteAsync();
    }
}
```
An important problem with the above optimizations is that they don't guarantee the ordering of database-generated values returned via the RETURNING clause; this means we can't match up the returned IDs to their corresponding entity instances client-side. The solution here would probably be to use MERGE in the same way as SQL Server - MERGE is coming to PostgreSQL 15.
This is great stuff! It's always been a pain point with DB-seeding scripts that use business logic for seeding: you end up generating your graph in 20-30 seconds and then waiting 5 minutes for the EF Core inserts 😅
@douglasg14b in general, EF's seeding feature isn't suitable for use with many rows for various reasons (e.g. seeding is present in the migration snapshots on disk, which would become huge). So I don't think this is very relevant for seeding.
@roji I'm not talking about using EF's seeding features. Just logic that generates data & state, and many entities, in C#, and with DbContext.
Note that for inserting a large number of rows (bulk insert/import), dotnet/efcore#27333 is in general a better approach (expose Npgsql's binary COPY via a standard EF API). |
The SQL Server provider has some sort of bulk insert optimization; look into it. At the most extreme we can do a COPY operation, although that would probably be going a bit overboard.