Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New System.Data.Common batching API #28633

Closed
Tracked by #24110 ...
roji opened this issue Feb 6, 2019 · 245 comments · Fixed by #54467
Closed
Tracked by #24110 ...

New System.Data.Common batching API #28633

roji opened this issue Feb 6, 2019 · 245 comments · Fixed by #54467
Assignees
Labels
api-approved API was approved in API review, it can be implemented area-System.Data
Milestone

Comments

@roji
Copy link
Member

roji commented Feb 6, 2019

This proposal has been approved in principle, subject to confirmation by several database providers that the API shape is good.

Issues tracking provider implementations:


This issue is based on previous discussions in #15375 and #17445.

Background

Batching multiple SQL statements can be critical for database performance, as they can be executed in a single roundtrip rather than waiting for each statement to complete before sending the next one. System.Data.Common doesn't currently provide a first-class API for this, but some providers (e.g. SqlClient) allow batching statements by concatenating them together into DbCommand.CommandText, separated by semicolons:

var cmd = connection.CreateCommand();
cmd.CommandText = "INSERT INTO table (1); INSERT INTO table (2)";

The problem with this approach, is that most databases require separate protocol messages for each statement (e.g. PostgreSQL, MySQL), forcing the database's ADO.NET provider to parse the SQL client-side and to split on semicolons. This is both unreliable (parsing SQL is difficult) and bad for performance - ideally an ADO.NET provider should simply forward user-provided SQL to the database, without parsing or rewriting it (see #25022 for a related issue forcing providers to parse SQL).

As a specific case, SQL Server does natively support the multi-statement commands, but even there this has drawbacks for performance, detailed by @GSPP in dotnet/SqlClient#19. Also, as @bgribaudo pointed out, the current concatenation-based approach doesn't allow invoking multiple stored procedures in batching fashion. SqlClient does include SqlCommandSet, which performs efficient, non-concatenation-based batching, but that class is internal and currently usable only via DbDataSet, and not for general usage (NHibernate apparently accesses this via reflection). This proposal would allow exposing SqlCommandSet via a public API.

Goals

  • Provide a structured way to execute multiple SQL statements in a single roundtrip, without any need for client-side parsing of SQL.
  • Keep the API consistent with other ADO.NET APIs, and specifically close to DbCommand (both are "executable"), to reduce the conceptual complexity for adoption.
  • Allow mixing in different types of statements in the same batch (insert, update, select). The current concatenation approach supports this, and so does our reader API (multiple resultsets).
  • Provide non-aggregated access to the number of rows affected, for each individual command in the batch.

Proposed API

public abstract class DbBatch : IDisposable, IAsyncDisposable
{
    public DbBatchCommandCollection BatchCommands => DbBatchCommands;
    protected abstract DbBatchCommandCollection DbBatchCommands { get; }

    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected abstract DbBatchCommand CreateDbBatchCommand();

    #region Execution (mirrors DbCommand)

    // Delegates to ExecuteDbDataReader
    public DbDataReader ExecuteReader(CommandBehavior behavior = CommandBehavior.Default);
    protected abstract DbDataReader ExecuteDbDataReader(CommandBehavior behavior);

    // Delegate to ExecuteDbDataReaderAsync
    public Task<DbDataReader> ExecuteReaderAsync(CancellationToken cancellationToken = default);
    public Task<DbDataReader> ExecuteReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken = default);

    protected abstract Task<DbDataReader> ExecuteDbDataReaderAsync(CommandBehavior behavior, CancellationToken cancellationToken);

    public abstract int ExecuteNonQuery();
    public abstract Task<int> ExecuteNonQueryAsync(CancellationToken cancellationToken = default);

    public abstract object ExecuteScalar();
    public abstract Task<object> ExecuteScalarAsync(CancellationToken cancellationToken = default);

    #endregion

    #region Execution properties (mirrors DbCommand)

    public abstract int Timeout { get; set; }

    // Delegates to DbConnection
    public DbConnection Connection { get; set; }
    protected abstract DbConnection DbConnection { get; set; }

    // Delegates to DbTransaction
    public DbTransaction Transaction { get; set; }
    protected abstract DbTransaction DbTransaction { get; set; }

    #endregion
    
    #region Other methods mirroring DbCommand
    
    public abstract void Prepare();
    public abstract Task PrepareAsync(CancellationToken cancellationToken = default);
    public abstract void Cancel();
        
    #endregion

    #region Standard dispose pattern

    public void Dispose() { ... }
    protected virtual void Dispose(bool disposing) {}

    #endregion
}

public class DbBatchCommandCollection : Collection<DbBatchCommand>
{
}

public abstract class DbBatchCommand
{
    public abstract string CommandText { get; set; }
    public abstract CommandType CommandType { get; set; }
    public abstract int RecordsAffected { get; }
    
    public DbParameterCollection Parameters => DbParameterCollection;
    protected abstract DbParameterCollection DbParameterCollection { get; }
}

public class DbConnection
{
    public DbBatch CreateBatch() => CreateDbBatch();
    protected virtual DbBatch CreateDbBatch() => throw new NotSupportedException();
    
    public DbBatchCommand CreateBatchCommand() => CreateDbBatchCommand();
    protected virtual DbBatchCommand CreateDbBatchCommand() => throw new NotSupportedException();
    
    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbProviderFactory
{
    public virtual DbBatch CreateBatch() => throw new NotSupportedException();
    public virtual DbBatchCommand CreateBatchCommand() => throw new NotSupportedException();
    
    // Covers both CreateBatch and CreateBatchCommand
    public virtual bool CanCreateBatch => false;
}

public class DbException
{
    public DbBatchCommand BatchCommand => DbBatchCommand;
    protected virtual DbBatchCommand DbBatchCommand => null;
}

General usage and examples

Usage is fairly trivial and aligned with the existing DbCommand APIs. Users first create a new DbBatch, either by calling DbCommandFactory.CreateBatch(), or by instantiating one directly. Commands are added into the batch, execution properties (e.g. Timeout) are set on it, and finally on of the Execute*() methods are called on it. Connection, Transaction and Timeout are specified on the DbBatch, like they are set on DbCommand for un-batched operations.

Here is a code sample using DbProviderFactory for database portability:

using var batch = dbProviderFactory.CreateBatch();

var cmd1 = dbProviderFactory.CreateBatchCommand();
cmd1.CommandText = "UPDATE table SET f1=@p1 WHERE f2=@p2";
var p1 = dbProviderFactory.CreateParameter();
var p2 = dbProviderFactory.CreateParameter();
p1.Value = 8;
p2.Value = 9;
cmd1.Parameters.Add(p1);
cmd1.Parameters.Add(p2);
batch.Add(cmd1);

var cmd2 = dbProviderFactory.CreateBatchCommand();
cmd2.CommandText = "SELECT * FROM table WHERE f2=@p3";
var p3 = dbProviderFactory.CreateParameter();
p3.Value = 8;
cmd2.Parameters.Add(p3);
batch.Add(cmd2);

batch.Connection = conn;
batch.Transaction = transaction;

using var reader = batch.ExecuteReader();
// read contains one resultset, from SELECT

The verboseness of this API corresponds to how ADO.NET currently looks. General usability improvements are planned for later.

Here is a suggested code sample working against a specific provider and using initializers for better readability and terseness:

using var batch = new XyzBatch
{
    Connection = conn,
    Transaction = transaction,
    BatchCommands =
    {
        new XyzBatchCommand("UPDATE table SET f1=@p1 WHERE f2=@p2")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
                new XyzParameter("p2", 9),
            }
        },
        new XyzBatchCommand("SELECT * FROM table WHERE f2=@p1")
        {
            Parameters =
            {
                new XyzParameter("p1", 8),
            }
        }
    }
};

using var reader = batch.ExecuteReader();
// read contains one resultset, from SELECT

Design and naming motivations

The original proposal had the batch holding a list of DbCommand instead of the new DbBatchCommand type. The change, originally proposed by @bgribaudo, was motivated by the following reasons:

  • DbCommand would have had some dead properties when executed as part of a batch (Connection, Transaction, CommandTimeout).
  • DbCommand is disposable, raising various questions around its lifespan (does the command get disposed with the batch or not, is it OK for DbException to reference a disposable object...).

The name DbBatchCommand will hopefully convey the similarity between that type and DbCommand (both hold SQL, parameters and a CommandType), while at the same time keeping them distinct (DbBatchCommand isn't executable outside of a DbBatch). The name DbStatement was also considered, although it seems this would introduce more confusion:

  • It's not directly identifiable as a batch component ("why can't I execute it?")
  • On some providers (e.g. SqlClient) a single DbBatchCommand can still contain concatenation batching, so a DbStatement would actually contain multiple statements.

Affected rows

DbCommand has APIs for finding out how many rows were affected by the command:

  • The return value of ExecuteNonQuery[Async]() returns an int
  • The DbDataReader.RecordsAffected property

However, both of these provide an aggregate number; if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately.

Providing non-aggregate affected rows could be done via an overload of ExecuteNonQuery[Async]() which returns an int[], or which accepts a user-provided int[] and populates it. The approaches have their complexities (perf, array management...), and as we're introducing a new DbBatchCommand dedicated to batching, we propose to simply add a RecordsAffected property on it instead. The provider would populate this property for each DbBatchCommand to indicate how many rows were affected by that command.

As with DbDataReader.RecordsAffected, this property would contain -1 for SELECT statements, and 0 if no rows were affected or the statement failed.

Command behavior

DbCommand.ExecuteReader() accepts a CommandBehavior enum which allows execution behavior to be tweaked. Even if it seems to be a rare case, users may need to specify different behaviors for different batch commands. As a result, DbBatch.ExecuteReader() does not accept a behavior parameter, and DbBatchCommand has a CommandBehavior property instead.

This requires the user to set any non-default behavior on each and every batch command. A batch-wide default could be accepted by DbBatch.ExecuteReader(), but that would require a way to distinguish between the batch default and the enum's default value (CommandBehavior.Default), e.g. by making it nullable on the DbCommandBatch. The introduced complexity doesn't seem to be worth it.

Note that not all providers will support all command behaviors on batched commands - it is expected that in some cases the entire batch will have to share the same behavior. Also, CommandBehavior.CloseConnection makes no sense on batched commands except the last, and the provider should probably throw on this.

DbException.Command

Since we now execute batches, we should provide a way of letting users know which command in the batch triggered an exception. We propose to do this by introducing a new virtual property on DbException, called BatchCommand, pointing to the relevant DbBatchCommand instance. In non-batched command execution this property would be left unassigned.

Notes on batch size management

In some cases, there may be some value in having the provider's batching implementation implicitly manage batch sizes. Examples:

  • There may be some database hard limit (on the number of statements, on the number of parameters). The implementation could break down the user-provided batch into several sub-batches, to save the user the trouble of doing that. This would mean wrapping multiple internal readers with a single reader, which isn't trivial (although we could provide an implementation).
    • Note: it seems that the SQL Server 2100-parameter limit will not apply here, as it applies to each individual statement rather than to the batch. So we're not sure we have a good real-world example.
  • It may be inefficient to batch statements below a certain number (because of overheads associated with batches). In this case the implementation would simply not batch.
    • For example, with current concatenation-based batching on SQL Server, batching 2 statements performs worse than not batching.

The pro here is to free the user from knowing low-level, database-specific details, transparently increasing perf and reducing friction for everyone (we're assuming the user has no special knowledge which would assist in the decision-making). The con here is more complexity and a general uncertainty that this is needed - any examples from the community would be helpful.

Regardless, this can be viewed as an implementation detail of a provider; although we can provide guidelines, at the end of the day each provider will implement the behavior they desire.

Backwards-compatibility

As a new API, no breakage is being introduced, but it has been proposed to provide a backwards compatibility shim for the new API that either performs concatenation-based batching, or simply executes without batching at all.

As of now, the decision is to not provide any sort of shim: trying to use the new batching API with providers that haven't implemented support for it will throw a NotSupportedException. The reasons for this are:

  • Silently falling to non-batching execution would give the false expectation/illusion of batching without delivering on it.
  • Concatenation-based batching doesn't work if the different batched commands have named parameters with the same name (since parameter lists would need to be unified and names must be unique). This is expected to be a common scenario as multiple identical statements are executed with different parameter values (but identical names).
  • We want to avoid the complexity and effort of implementing the shim. For example, a non-batching shim would need to somehow combine the readers from the individual commands into a single reader interface for user consumption. We don't think this is worth the effort.
  • Most applications work against a single database type, and so can either use the new API or not, based on knowledge of what the provider supports. Multi-database applications (and layers) are much more rare, and it's expected they can take the added burden of detecting whether batching is supported or not, and implementing two code paths.

Feature detection

As the API will throw for providers which don't implement it, a way must be provided for consumers to know whether the API is supported. The proposal is to add a CanCreateBatch bool property to both DbProviderFactory and DbConnection, alongside the CreateCommandSet() methods.

Additional Notes

  • No guarantees are made with regards to batch transactionality (do the batch commands execute in an implicit transaction) or to error handling (if a command fails, are later commands automatically skipped). These details are provider-dependent. However, wherever possible, it is suggested that implementations provide both transactionality and skip-on-failure behavior.
  • DbBatch is disposable, like DbCommand. This is to allow for cases where it maps to a native resource that needs to be freed, and may also help with pooling/recycling of instances (to be elaborated in a separate proposal).
  • We hope to add a DbBatch.Add(string commandText), which would create the CommandText internally and be a lighter API, but some design issues need to resolved around parameter management. We're deferring this for now as it's non-critical sugar.
  • An alternative proposal was made by @bgribaudo in API Proposal: DbCommandSet -- allows multiple DbCommands to be transmitted to server in single message #28794, in which a separate DbDataReader is returned for each DbBatchCommand, instead of one covering the entire batch. The goal was to help user identify command boundaries in resultsets when commands return more than one resultset (or even a variable number of resultsets). Due to the assumed rareness of the problem and the existence of other workarounds we think it's better to leave on DbDataReader for the entire batch.

Open questions

  • DbCommand and DbBatch share a lot of surface APIs. In theory there could be an interface capturing those similarities, allowing users to abstract away what's being executed (but suffers from the same versioning flaws of interfaces, modulu default interface methods).
  • Does DbBatch need to implement System.ComponentModel.Component, like all other ADO objects (for Visual Studio designer)? There seems to be little change that VS would be updated for this, but possibly for consistency.

Edit history

Date Modification
2019-02-14 DbCommandSet now implements IEnumerable<DbCommand> and includes the two GetEnumerator() methods, as per @bgrainger's suggestion
2019-02-18 Updated proposal with different concrete options for backwards compatibility, batch size management, and added a usage example.
2019-02-22 Updated with the current decision on backwards compatibility (no shim), added feature detection discussion, added non-aggregated rows affected to goals.
2019-02-23 Added missing DbConnection.CreateCommandSet() and renamed DbProviderFactory.CreateCommandSet() (it previously had a Db which doesn't belong). Added CanCreateCommandSet property to DbProviderFactory, DbConnection.
2019-02-25 Added public non-virtual DbConnection.CreateCommandSet() along protected virtual CreateDbCommandSet()
2019-03-04 Added DbConnection, DbTransaction to DbCommandSet. Corrected ExecuteReaderAsync overloads, corrected some comments.
2019-03-06 DbCommandSet now includes a Commands list instead of implementing IEnumerable<DbCommand> itself.
2019-03-12 Renamed DbCommandSet to DbBatch, and changed it to hold a new DbBatchCommand type instead of DbCommand. Added readable/terse code sample, added explanatory notes and rearranged sections slightly.
2019-04-18 Added CommandBehavior to DbBatchCommand and removed the commandBehavior parameter to DbBatch.ExecuteReader(); added a section to explain. Added RecordsAffected to DbBatchCommand and updated the affected rows section.
2019-04-20 Final typo pass (thanks @bgrainger) and added sentence on CommandBehavior support.
2019-05-14 Note on values of RecordsAffected (for SELECT, for failed statements), add standard Dispose pattern methods to DbBatch, fixed issues raised by @bgrainger
2019-05-23 Apply standard ADO.NET pattern to DbBatch with virtual ExecuteDbDataReader and non-virtual ExecuteDataReader
2019-06-17 Removed DbBatch.CancelAsync() following removal of DbCommand.CancelAsync() in #28596, thanks @Wraith2. Replaced IList<DbBatchCommand> with DbBatchCommandCollection.
2021-06-21 Make DbException.DbBatchCommand protected
2021-06-29 Move CommandBehavior from DbBatchCommand to DbBatch.ExecuteReader
2021-07-01 Remove the setter for DbBatchCommand.RecordsAffected
2021-07-01 Added DbBatch.CreateBatchCommand and CreateDbBatchCommand
@roji
Copy link
Member Author

roji commented Feb 6, 2019

/cc @divega @ajcvickers @bgrainger @ErikEJ @bricelam @AndriySvyryd

@divega
Copy link
Contributor

divega commented Feb 6, 2019

Thanks @roji for putting this together. A couple of points:

@bgrainger
Copy link
Contributor

bgrainger commented Feb 6, 2019

Typo: IAsyncDisposal.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Backwards-compatibility shim

Add:

  • The shim will set the CommandTimeout, Connection and Transaction properties on the batch command to the values of its corresponding properties.
  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)


MySQL implementation thoughts:

For the "text" protocol (the default), MySQL already supports all commands in a batch being sent in a single network packet. MySqlConnector would probably end up concatenating all the SQL from the DbCommandSet into one payload (maybe by performing multiple UTF-8 conversions into one output buffer, not by really concatenating actual C# strings). The DbCommandSet may not provide a practical enhancement for most MySQL users (who use the "text" protocol). Additionally, concatenating all the SQL means that it would be highly unlikely that DbException.Command could be set correctly (just as with the backwards-compatibility shim).

For the "binary" protocol (i.e., prepared commands), this would be a helpful API improvement. MySQL disallows more than one statement being prepared at once, so MySqlConnector has to parse SQL and split commands apart. The design goals of this proposal eliminate that requirement.

However, for MySQL, the second command in the batch wouldn't actually be sent until DbDataReader.NextResult[Async] is called. (I.e., DbCommandSet.ExecuteReaderAsync doesn't send all "binary protocol" commands in one network payload.) I don't think that would cause any problems with this proposal but executing “multiple SQL statements in a single roundtrip” would not be possible.

@bgribaudo
Copy link
Contributor

@roji, thank you for writing this up. I am excited about the possibilities!

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@divega maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case. Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@bgrainger

Typo: IAsyncDisposal.

Thanks, fixed.

Will DbCommandSet implement the Dispose pattern (i.e., protected virtual Dispose(bool disposing), or public virtual Dispose() { /* no op */ }, or force all derived types to implement both Dispose and DisposeAsync (to satisfy the interfaces' requirements)?

Good question. Most of the ADO.NET base classes implement System.ComponentModel.Component, which implements IDisposable with the dispose pattern (i.e. virtual empty implementation of Dispose(bool)). It definitely seems like DbCommandSet should also implement that pattern as well (possibly by extending Component). I'm guessing that the same pattern applies for IAsyncDisposable - I'll look into this soon and update the proposal. Let me know what you think.

Is it worth implementing IEnumerable a) to retrieve the commands in the batch, b) to enable C# collection initialization syntax?

Richer access to the commands in the set is one thing I wanted to get feedback on. I'm not sure it makes sense for the command set to expose full list semantics - what's your view?

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable (which says nothing about addition/modification); all it requires is for the type in question to have an Add() method, which we already have in the proposal above.

The shim will set the CommandTimeout, Connection and Transaction properties on the batch command to the values of its corresponding properties.

Are you referring only to the shim, or to the DbCommandSet spec in general? For the latter, I'm not sure this makes much sense - what value does it provide over clearly specifying that they are ignored in batched execution? In addition, setting CommandTimeout could be construed as meaning that each command has its own timeout, which definitely doesn't work in the general batching case. So all in all, unless there's a persuasive reason, I'd rather we just ignored these.

  • Prepare[Async]() will simply call the same method on each of commands in the batch.

Or should it call the same method on the batch command it creates? (I don't understand the benefit of preparing the DbCommand objects in the DbCommandSet, which won't actually be executed themselves.)

You're right, this was a mistake. In theory it's possible to prepare the batch command as you suggest, and reuse it again with different parameter values. The issue is how to invalidate if any of the user-provided commands are modified (e.g. change in CommandText) - it doesn't seem like it will be possible to handle this by the shim. It's probably better to simply not do anything here.

Thanks for your comments on the MySQL provider. To be sure I'm understanding things correctly, here are a few questions:

  • Are parameters supported in the text protocol? If so, does your provider format parameter values in text, and are they integrated inline in the SQL or sent out-of-band?
  • Is the choice between the text and binary protocols something that the user controls somehow? Or is text used for regular commands and binary for prepared?
  • I seem to remember a conversation about this, but to be sure - so the MySQL binary protocol does not and cannot allow for batching at all (or is this not yet supported by your driver)? Can't multiple messages be sent (with increasing sequence IDs) be pipelined without waiting for previous responses? Also, this page seems to indicate that it may at least be possible to concatenate multiple statements in a single COM_STMT_PREPARE. I'm definitely far from an MySQL expert though, so all this may be wrong.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@bgribaudo,

Currently, the proposal allows command behavior to be specified once for the entire set of commands. However, command behavior can be command-specific. The same behavior may not apply to all commands that could go into a command set. To accommodate this, what would you think of changing the proposal so that behavior may optionally be specified when commands are added (e.g. Add(someCommand, CommandBehavior.KeyInfo))?

Most of the value on CommandBehavior seem like they make sense only at the batch level - CloseConnection, SingleResult, SingleRow, SequentialAccess (since a single reader is returned, and would not likely support switching between sequential and non-sequential between commands across providers). A good way to think about CommandBehavior is that actually affects the reader, rather than the command - it's a parameter to DbCommand.ExecuteReader(), and cannot be specified where there is no reader (ExecuteNoQuery(), ExecuteScalar()). In batched mode, we're still going to have a single reader (possibly with multiple resultsets), so multiple CommandBehaviors don't really seem to fit well.

@bgrainger
Copy link
Contributor

Will DbCommandSet implement the Dispose pattern

In favour of implementing the standard Dispose pattern for consistency. I have no experience with IAsyncDisposable but would lean towards implementing the same pattern if one exists.

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

Also, this page seems to indicate that it may at least be possible to concatenate multiple statements in a single COM_STMT_PREPARE.

I have not found this to be true in practice. I thought I had once found documentation confirming that it's not supported but don't remember where that was. But that's a good point that some future MySQL Server might permit it.

@divega
Copy link
Contributor

divega commented Feb 7, 2019

Regarding the additional overload accepting an int[] for records affected, I'm not sure what you mean anymore... The proposal above includes an overload which returns int[], but of course feel free to edit and add. I think this is one of the main open questions which still need to be resolved.

@roji no problem. I added what I described on my emails as option 2, since it is a variation of option 1.

maybe it makes sense to repurpose dotnet/SqlClient#19 to track the implementation of the new batching API in SqlClient specifically? That seems to be its main focus in any case.

Ok, done.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

Unless I'm mistaken, C# collection initialization syntax actually has nothing to do with IEnumerable

I believe you are mistaken; see docs.

Sorry, my bad - you're right of course. That's what comes out of sloppy testing at 3AM.

Perhaps unfortunately, this does immediately imply access to the commands in the collection, which feels like a distraction from the point of this class. It does already have Add and Clear, and Count might make sense, which is halfway to ICollection<T>. But Contains, CopyTo, and Remove feel completely unnecessary, make implementations more complex, and imply some kind of equality for DbCommand objects. (This would probably end up just being reference equality, but is that what users expect?)

It doesn't feel like the point of DbCommandSet is to be a generic container of commands with full IList API. There's no obviously best answer here (to me), so I think I would lean towards keeping your proposed API, not implementing IEnumerable, and giving up C# collection initialisation until it becomes clear that that would be a high-value addition for typical users of the API. (I don't yet have a good sense for who will be the typical user; if it ends up primarily being used in ORM implementations, then some C# syntax sugar is less important.)

I agree with most of what you say, although I do think there's real value in providing collection initialization - maybe a good mid-way solution is to implement IEnumerable<DbCommand> but not IList<DbCommand>.

The shim will set the CommandTimeout, Connection and Transaction properties

Are you referring only to the shim

Yes, I was suggesting that you specify the behaviour of the shim implementation w.r.t. those properties. I think the shim should do something with them when creating its batch command, and this seems like the right thing to do.

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others? Another possible issue is that it would be somewhat incoherent to have the shim do it, but not to require the same of actual provider implementations, and I don't really see the sense in that either...

Are parameters supported in the text protocol?

No, they are not. MySqlConnector parses the CommandText, finds ? or @param tokens (with the minimum smarts necessary to not detect these in string literals, comments, etc.) then substitutes in the (escaped) parameter values.

I see, so if I understand correctly you support formatting and parsing parameter values both in text (for unprepared) and in binary (for prepared)... Npgsql used to work the same way a long while ago. Note also #25022 which I plan to tackle soon, and which may also obviate the need to parse for parameter substitution/interpolation. The ultimate goal here is to allow you to write your driver with any sort of parsing/rewriting - hopefully that can work out for you. If there are any other reasons you parse/rewrite (aside from batching and parameter substitution), could you post a note about them?

Is the choice between the text and binary protocols something that the user controls somehow?

Yes, through a convoluted mechanism. Binary protocol is implied by calling MySqlCommand.Prepare. However, because prepared commands support only a limited subset of SQL, and this might be surprising to users, Connector/NET added a IgnorePrepare option, which defaults to True. MySqlConnector implements the same convention. To use the binary protocol, you have to set IgnorePrepare=False, then call .Prepare() on each command.

Thanks for this information - this kind of cross-provider/cross-database information is very important. In any case, it's unfortunate that MySQL imposes these restrictions...

so the MySQL binary protocol does not and cannot allow for batching at all

Correct. Each individual statement is given its own (integral) statement ID, and an "execute statement" command specifies a single statement ID. The client prepares all the statements in advance, then sends "execute statement ID 1", then reads the result set, then sends "execute statement ID 2", then reads its result set, etc.

Some MySQL Servers will permit pipelining, i.e., the client sending one network packet that contains multiple "execute statement ID n" payloads, then the server reading those one-at-a-time and processing them in order. However, there are a wide variety of proxies and backends (Aurora, Azure, TiDB, Percona, etc.) that speak the MySQL protocol and pipelining tends to break them. (It could possibly be added as an opt-in option for people who needed maximum performance and were confident their DB supported it.)

OK. IMHO batching is important enough for perf that it may make sense to provide an opt-in for this, although it's true that you also have the text protocol that already supports this. Regarding special backends and their capabilities, in the PostgreSQL case it's sometimes possible for the driver to detect this and automatically switch off/on certain features, maybe it's possible in your case as well (just thinking out loud).

Thanks again for all the detail and the valuable input into this - please don't hesitate to share any other thoughts on this proposal etc.

@FransBouma
Copy link
Contributor

Looks great :)

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one.

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

This is for .NET Core x.y only? Or will this be available on .net full as well?

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@FransBouma

For the shim: e.g. DB2 requires BEGIN/END around batched statements, so the shim itself needs to consult the ADO.NET provider whether it can simply concat the statements or needs to do additional things. Some databases don't support select statements batched together with update statements for instance, so your shim should anticipate on that, consulting the ADO.NET provider. Some databases don't support batching at all (access/oledb) so for these, the shim should simply execute them one by one.

I wasn't planning on the shim actually implementing all the different behaviors, but it's an interesting thought... Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

When concatenating statements, you obviously run into the problem of parameters with the same name. You therefore have to make the parameter names unique. This can be problematic as it requires you to parse the statement (as far as I can see).

Yes, I noted this problem above in the proposal - my plan is currently to throw in this case. It definitely seems like a non-starter for the shim to parse SQL to uniquify parameter names (especially as different databases have different SQL dialects etc.). This is after all only a backwards-compat shim, and it seems reasonable for it not to work for some edge cases - of course provider implementations of DbCommandSet can do what they want.

Most ADO.NET providers have no real limit on the # of parameters except... SQL Server! :) So you have to consult the ADO.NET provider there too in the shim how much statements you can batch in 1 go automatically based on the # of parameters. I don't think it's that important to offer a 'batch size' value for the shim.

That's a good point. I'd expect this kind of thing to be managed as an implementation detail by SqlClient's future implementation of DbCommandSet (as mentioned above, SqlClient actually has an internal SqlCommandSet which actually isn't publicly accessible). I do agree that we shouldn't concern ourselves with this at the general API level. Of course it's also possible for the user to execute two batches instead of one in certain situation (not that passing this onto the user is a good idea).

This is for .NET Core x.y only? Or will this be available on .net full as well?

I don't have a definite answer for this yet, but as a general rule new APIs get added to .NET Core (or rather to .NET Standard) and not necessarily to .NET Framework.

@FransBouma
Copy link
Contributor

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

@bgribaudo
Copy link
Contributor

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command.

For example, let's say I execute a command set consisting of two commands, both of which call StoredProcA. This proc returns zero, one or two result sets depending on circumstances. If executing the command set returns four result sets, I know that the first two correspond with the first call to StoredProcA and the last two correspond with the second call. However, if, say, only two total result sets are returned, sorting out which came from where is a bit trickier--did they both come from the first proc invocation, the second invocation or did one come from each?

To accommodate situations like this, is something like the following an option?

public class DbCommandSet … {
  // executes each command in the set
  CommandSetResults Execute(); 
  Task<CommandSetResults> ExecuteAsync();
   … 
}

// corresponds with the results returned by a single command in the set
public class CommandSetResults {
   DbDataReader GetReader(); // returns data reader for first result set from command; subsequent result sets from the command are accessed by calling LastResults() on the last-returned data reader instance
   object GetScaler(); 
   CommandSetResults CommandSetResults(); // returns a CommandSetResults for the next command's set of result sets, similar to how DbDataReader.NextResults() is used to get the next result set from a single command
   int AffectedRows { get; }
}

With an API like this, the developer passes in a set of commands then fetches out a set of command results--the granularity passed in is the same as what is returned. It also eliminates the need to figure out how to return a combined records effected count (because that number is available on a per-command basis) and potentially simplifies the exception situation (an exception associated with a command's results can be thrown while that command's results are being worked with, eliminating the need to figure out how to identify how to tag the exception to indicate which command triggered it).

@GSPP
Copy link

GSPP commented Feb 7, 2019

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.


The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also G0 garbage.


Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.


Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.


Feel free to close my now superseded issue https://github.com/dotnet/corefx/issues/31070. I am very happy to see progress on this. This is going to a major performance improvement.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@FransBouma

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support. Providers which do simply have their own implementation of DbCommandSet which does not derive from the shim - they can do whatever it is they want. Thus, DB2 would automatically add BEGIN/END, SqlClient would split the batch when there are too many parameters, etc.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

I likely overlooked it, but if you don't have this yet, I think if there are commands added to the DbCommandSet which are assigned to different connections it should throw.

The idea was simply to ignore the Connection property on the commands in the set, as well as the Transaction and Timeout properties - only those on the DbCommandSet are used.

@GSPP
Copy link

GSPP commented Feb 7, 2019

Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

We'd need to identify all the common but relevant properties in which providers differ.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@GSPP

if the command contained more than one statement, it's currently impossible to find out how many rows were affected by each statement separately

This is critical for ORMs.

In general, this new API should expose all the raw power that is there. It should not drop information or limit the possible actions. This API is for consumers that maximally care about performance and flexibility.

We're in agreement, the question is really about how best to expose the non-aggregated rows affected.

The int[] recordsAffectedBreakdown allocation seems insignificant compared to the cost of making a database call. Even SELECT NULL is like 10000x the cost of that allocation. It's also G0 garbage.

That may be true in the general case, but don't forget that this is an API abstraction that's also used to contact local and even in-memory databases such as Sqlite. We've also seen that when fetching cached data from PostgreSQL over a low-latency 10Gbps network link, even minor-seeming allocations start to have an effect in the TechEmpower benchmarks. I agree that these cases may be somewhat artificial and not related to real-world usage, but I'm generally averse to introducing APIs that necessarily cause allocations if there's another way.

Is the timeout considered global or per command? This was not clearly stated as far as I can tell. IMO, it should be a global timeout. Most of the time, you want a timeout for the entire business transactions. If you expect creating a customer order takes 100ms then you want to abort if the entire thing takes e.g. 10s. You don't care about the duration of individual queries at all.

I agree - and per-command timeouts are probably not technically possible in at least some (or even most) of the providers. There's a note on all this in the proposal above - CommandTimeout is ignored on the individual commands (like Connection and Transaction), and DbCommandSet has its own Timeout property.

Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

Feel free to close my now superseded issue dotnet/SqlClient#19. I am very happy to see progress on this.

As @divega wrote on that issue, it's still useful to keep it to track the implementation of the SqlClient DbCommandSet - presumably the existing internal SqlCommandSet would be changed to conform to the new API being shaped here. So I think we should keep it open for now.

This is going to a major performance improvement.

Great :) Thanks for your own analysis and description in dotnet/SqlClient#19, it helped push the realization that a true batching API really is necessary.

Maybe the provider should be able to tell if it does support batching and other parameters such as what the maximum efficient number of commands and parameters is (before splitting needs to happen). Callers such as EF could interrogate the provider and adapt their behavior.

This is similar to the point @FransBouma made above. The way I'm looking at it, the API consumer (e.g. EF Core) should not be concerned with such implementation details - they should simply construct a single big batch and execute it. If a specific provider's batching implementation considers that the batch should be split - either because it must be (e.g. too many parameters) or because it would run more efficiently that way - then it would do so internally in a transparent way.

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see dotnet/efcore#9270). The SqlClient implementation would simply not batch at all for these cases.

There may be some details which make it impossible or undesirable to totally abstract this away from the user - possibly around exception handling - but at the moment I can't see a reason not to.

@FransBouma
Copy link
Contributor

FransBouma commented Feb 7, 2019

@roji

Since I don't think a way exists to ask providers whether/how they support batching, I assume you're expecting the shim to check which provider is being used and to hardcode the different behaviors, right?

yes, in the case when it's created directly. It could do that when the first DbCommand is added. With the DbProviderFactory route I recon the ADO.NET provider could create a specific derived type of the DbCommandSet shim and simply execute things sequentially if batching isn't supported.

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases.

Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.

This raises the question of how desirable it is for the shim to start implementing provider-specific logic, which really belongs in the provider implementations. Some basic functionality may make sense: knowing that some specific providers simply don't support batching, and executing the command sequentially could be useful for the transitional period. But ultimately the expectation is for providers to provide their own implementation, totally unrelated to the shim.

yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.).

Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount var of the db (e.g. @@ROWCOUNT) . If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user.

So for each query concatenated you append a unique parameter and a unique snippet. A bit overkill (I decided not to go for this in my framework, as when 1 query fails, e.g. in the middle, the whole batch is rolled back but you likely have to rerun the whole batch anyway as that's way easier than to sift through which command is the wrong one and why, then schedule the batch again without the ones already succeeded).

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@bgribaudo

If I'm understanding this correctly, when a command set returns multiple result sets, the first result set is accessed by calling ExecuteReader() (or ExecuteReaderAsync()). Then, subsequent result sets are fetched by calling NextResult() on the last-returned DbDataReader instance (as returned by execute or the last call to NextResult()). All result sets returned by the set of commands are returned one after the other, with no distinction made as to which result set came from which command.

That's correct. Note that this is how the ADO.NET API already works when you use concatenation-based batching.

It's possible (at least in the SQL Server world) for a single command to return zero, one or multiple result sets--and the number returned can dynamically vary (e.g. a stored procedure might be programmed to return one result set under certain conditions and two under others). Making sense out of the returned result sets in a situation like this can get tricky/be impossible without knowing which set came from which command.
[...]

That's a very valid point - the scenario of a single command returning a variable number of resultsets is indeed tricky, even if it seems like quite an edge case. Your proposal for a CommandSetResults (or some version thereof) could help, but IMHO this approach has the following drawbacks:

  • It makes the API for the common case much clunkier and more complicated. A user batching a couple queries now has to go through (and understand) an additional layer. Everyone already knows DbDataReader, so it's desirable to leave that in place.
  • There's an advantage in having DbCommandSet have the same API as DbCommand - the two are quite similar at the moment, and that similarity could even be captured by an interface. Aside from being facilitating a coherent and more understandable API, this could enable scenarios where one part of a program creates a command or commandset, and another part executes it and consumes the results without knowing whether it was a batch or not.

In addition, even without this something like this, there are some workarounds which can allow users to distinguish which resultset belongs to which command: one could look at the shape of the resultset (which columns it has) to determine which stored procedure generated it, or one could even insert an arbitrary SELECT 'delimiter' in the batch, which consuming code would look for as a marker that the previous command completed. This could be considered a bit hacky, but it works well and doesn't seem to have any downsides.

To summarize, I think the concern you raise is valid. However, in my opinion it's too much of a rare case to consider (significantly) complicating the API surface for the general, simple case - especiall when valid workarounds seem to exist.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@FransBouma

The way I was seeing it, the shim is only there for providers which haven't (yet) implemented any support.

Ok, though if the shim fails, it's of no use to the user, so the user should be aware of the limits up front. E.g. if DB2 doesn't yet have their own implementation (and thus falls back on the shim, like any other provider) it will fail. This might be OK, but the user might be confused: "I use the default one and it breaks". So documentation might be in order that using the shim likely doesn't work in some cases.

Agreed. Another approach would be for the shim to contain a hard-coded list of the specific providers which support cancatenation-based batching, and only do it for them; for other providers it would fall back to non-batched execution. This would ensure that the shim always works, although it doesn't always actually batch.

A more extreme approach would be to give up entirely on actually batching in the shim, and always execute in non-batched fashion (this could make sense if we think that the duplicate parameter name problem is going to occur a lot - not sure about this).

Be aware too that mixing select commands and update commands might not always work with all db's. So if you concatenate statements in the shim, the user might run into this limitation too. I think documentation is enough here, as it breaks at runtime and it's not determinable up front if it will.

Absolutely - this is a database-specific detail that we shouldn't make any attempt to handle. Documentation will be important here.

yeah this is a bit of a tough call: concatenating statements and then not uniquing the parameter names already makes it fail on most systems, so if you don't do that, the user has to do that, and they likely already take other precautions (like group inserts together in 1 shim instance, updates in another etc.).

That's true - I think we should do our reasonable best, but rewriting user-provided SQL of unknown dialects seems like it's a non-starter. Here as well, documentation will hopefully help people out, and providers will hopefully implement their own DbCommandSet to obviate the shim entirely.

Regarding the values returned by ExecuteNonQuery(Async) which uses the concatenated query command: There is a trick to use here which allows us to determine per update statement whether it affected 0 or more rows: append ;SELECT param=rowcountvar to the update query, where param is an output parameter and rowcountvar is the rowcount var of the db (e.g. @@ROWCOUNT) . If the parameter is 0, the update statement affected 0 rows and should be reported as such to the user.
[...]

Yes, this is similar to what EF Core does for SQL Server in order to implement optimistic concurrency, which requires knowing for each update if it worked or not. This is exactly the kind of details that I'd expect the SqlDbCommandBatch to take care of internally, abstracting away the task from O/RMs and users. By the way, PostgreSQL provides this information as part of the wire protocol, so no additional hacky queries are needed.

@GSPP
Copy link

GSPP commented Feb 7, 2019

Thanks @roji. It seems I overlooked a few statements in this long thread.

should not be concerned with such implementation details

Unless I'm mistaken, for SQL Server there's even a threshold below which batching isn't efficient at all (see aspnet/EntityFrameworkCore#9270). The SqlClient implementation would simply not batch at all for these cases.

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.


A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW. Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

@bgrainger
Copy link
Contributor

OK, I understand, but I don't see why the shim should do anything with these properties - can you explain why you think that's important? The shim would obviously set those properties on the internally-used batch command it creates (since that's the one that actually executes), but why the others?

I think we're saying the exact same thing here; I may have just worded it confusingly.

I am not proposing that DbCommandSet modify properties on any DbCommand that is added to the set.

All I'm suggesting is adding to the spec a bullet point that explains how the shim will set the properties on the internally-created batch command. Your spec mentions how CommandText is used; I thought it would be good to be thorough and explicitly document how all properties on DbCommandSet get applied to the internally-created batch command. (The bullet point I suggested adding deliberately paralleled your existing second bullet point.)

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@bgrainger ahh, understood! I indeed thought you wanted the shim to modify the properties on the user-provided commands. Added a bullet point as suggested.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@GSPP

These details do not matter for semantics but for performance. The caller can adjust its behavior in ways that the provider cannot. For example, the provider does not understand the nature of the queries and the network latency environment.
[...]
I believe quite strongly, that the provider should never try to optimize the batch size or suppress batching in some cases. It's the wrong place to make these decisions.

I'd like to understand this point a bit better - it also seems to contradict somewhat with your comment above that proposes the provider expose the maximum efficient number of commands/parameters. After all, if the provider can expose this knowledge, why shouldn't it just act on it and remove the burden from the user?

But more importantly, I'm interested in exactly what kind of user knowledge could be useful for determining batch sizes/thresholds, that the provider does not have. Shouldn't batching always be desirable regardless of the network latency environment? How would the nature of the query impact the decision-making here? Some real-world examples could be useful on this.

Note that the user would always have the option of creating multiple DbCommandSets, according to whatever criteria they see fit. The question here is whether the provider implementation should make its own decisions on how to execute the commands given by the user. Don't forget we have some very clear provider-specific cases, like the max parameter limit per batch in the SQL Server case - do you really think this kind of thing should be the user's responsibility rather than the provider's?

The EF issue does not state what mechanism of batching was used. Maybe just concatenated SQL which has different properties than command sets (most significantly, recompilation). If command sets ever are slower I'd like to know why. What specific property (of the protocol?) caused that? It would be surprising to me.

That's very possible - I was also very surprised that batching two statements could actually be slower than not batching them, and was explained that it's related to overhead involved in batching. @divega and/or @AndriySvyryd may be able to provide more info.

EF might have 1 billion inserts queued but it will likely not want to create 1 billion command objects. So some moderation/negotiation on the number of commands per set has to take place in any case. If EF could query the maximum command and parameter count it could directly materialize the perfect number of commands.

Regardless of the above, if there really are many commands to execute, EF may decide to execute them in several batches - regardless of anything provider- or environment-specific, i.e. as a means of avoiding instantiation of lots of command instances. I'm not sure this is actually necessary, but the point is that it's a different kind of batch-breaking-down as the one discussed above.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@GSPP one more comment - note that whether a particular implementation of DbCommandBatch splits the batch or not is entirely internal and provider-specific. We can of course publish guidelines and recommendations, but at the end of the day each provider will do whatever it is they think is necessary - in other words, it's not part of the API itself.

@roji
Copy link
Member Author

roji commented Feb 7, 2019

@GSPP missed this last comment of yours:

A concatenation based shim would not work for statements that must be the single statement in the batch e.g. CREATE VIEW

Would it work in a non-concatenation-based batch? If not, then that seems OK - it's simply a limitation of SQL Server for both forms of batching. If CREATE VIEW can be batched, then it's an odd limitation :) But even in this case, it seems OK for the shim not to always work - it's a transitional mechanism until SqlClient gets its own optimized implementation, and it's a new API so nothing is being broken.

Also, renaming parameters is observable. T-SQL can query the catalog views to obtain the currently executing SQL text. It might have implications for parameterization, query store and other tooling. Of course, SQL Server will get a true implementation but these examples show how difficult it is to shim this.

I assume that by renaming parameters you mean changing their names so that they can be unique across all commands in the batch. If so, then I agree that this isn't viable, if only for practical reasons - we're not going to be able build an SQL parser into the shim that can detect placeholders which aren't part of SQL literals etc.

Maybe the shim should be the safest option possible and all optimizations should be left to the provider.

That seems to point towards the shim executing the commands without any sort of batching, as proposed above in the discussion with @FransBouma. I agree there's some merit in this...

@Antaris
Copy link

Antaris commented Feb 7, 2019

Is there going to be an considerations for T-SQLs GO batch separator? One thing that has always been a pain when building DB migration bits is when I want to execute scripts generated by SSMS that typically generate a number of batch scripts separated by GO.

Also, how will the different providers hand scope? For instance if I do the following:

DECLARE @val INT;
SELECT @val = 1;

If this were two commands, what happens to the scope of @val or is this actually a non-issue here?

Could an API be provided to accept a SQL statement and return a command batch? Is there any merit in such an API?

@bgribaudo
Copy link
Contributor

[@GSPP] Regarding error handling: It must be possible for providers to keep executing in case of error and keep incurring errors. SQL Server will happily continue in case of many errors. All those errors must be reported the API caller somehow. SQL Server commands can also generate info messages. These must be provided as well.

[@roji] This is of course very provider-specific; the general batching API doesn't define any mandatory behavior but definitely should not get in the way either (there's already a note in the proposal above). I personally think that it makes more sense to abort the batch once an error occurs, but if continuing is the desired behavior for a specific provider (e.g. SQL Server), then it could throw an aggregate exception containing all the individual exceptions at the end of batch execution. At the moment I don't see anything in the proposal that needs to be changed, let me know if you see a problem.

I concur with the idea of allowing execution to continue on error when the server supports this. However, I think it's also important to have the ability to know about errors as they are encountered vs. after the entire batch has been executed.

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

With this API as it currently stands, if a consumer is reading a result set that was not fully provided by the server due to an error, it sounds like the result set will look like a valid, complete result set. The consumer won't be able rely on that actually being the case until the entire set of commands has finishes executing. If the consumer wants to perform an irrevocable action on a per-data set basis, the consumer will need to cache all data sets as they are received while they wait for execution to complete and the possible aggregate exception before taking action.

If, instead, errors are raised as they are encountered, the consumer knows that when they read a complete data set, it is actually complete, and so can be actioned on immediately. If a related error is encountered during reading, the consumer is informed and gets to decide whether or not to continue on (e.g. move to the next data reader).

Another benefit of immediate exception on error is that, if the consumer decides that an error warrants cancelling the command set's execution, they can close the connection (perhaps automatically as the exception exits a using statement). The server can then detect this and cancel execution instead of continue on only for the consumer to ignore/discard the remainder of what's returned.

@FransBouma
Copy link
Contributor

Currently, with ADO.Net (at least, the SQL Server implementation), error exceptions are raised when the error is read from the incoming data stream. The consumer can then respond/react to the exception, as appropriate.

I don't know, if the error isn't a transient error, the current transaction is likely to be aborted on the server side, meaning whatever you decide to do, you have to run the previous commands again as they've been rolled back due to the aborted transaction.

@expcat
Copy link

expcat commented Apr 20, 2021

Not pushing.
Any news for this in 6.0?
or still be postponed?

@roji
Copy link
Member Author

roji commented Apr 20, 2021

@expcat still in the plan for 6.0 - this is very high up on my priority list.

@roji
Copy link
Member Author

roji commented Jun 20, 2021

I've just submitted #54467, apologies to all involved in the original discussions that this took so long. I've more or less implemented this proposal for Npgsql in npgsql/npgsql#3860, giving us further confidence that the API shape is correct (beyond @bgrainger's implementation in MySqlConnector and @Wraith2 experimental implementation in SqlClient).

Note that we ideally need to merge this in time for preview6 (in three weeks) to ensure it goes in for .NET 6.0.

Following my implementation, I'm proposing the following two tiny changes to the API proposal:

  • DbBatch.BatchCommands
    • This was originally proposed as an IList<DbBatchCommand>. During API review, it was recommended to make this into a dedicated type (DbBatchCommandCollection), in order to avoid exposing an interface (which users would then likely cast down into a concrete type).
    • Originally, it was proposed that this type extend Collection<DbBatchCommand>, which wraps an IList<T> (List<T> by default); however, during implementation of Npgsql batching support this got in the way; for example, it isn't possible to make it wrap a List<NpgsqlBatchCommand> (since IList<T> isn't covariant).
    • As a result, I propose that DbBatchCommandCollection simply extend IList<DbBatchCommand> directly; this entails a bit more boilerplate for provider implementors, but provides greater flexibility.
  • I propose moving DbConnection.CreateBatchCommand and CreateDbBatchCommand to DbBatch, similar to how CreateParameter is on DbCommand and not on DbConnection.

@roji
Copy link
Member Author

roji commented Jun 21, 2021

One additional, possibly more controversial thing...

Our proposal allows specifying different command behaviors for different commands in the batch. Looking at this now, it seems to introduce quite a bit of complexity, as well as API inconsistency with DbCommand, for what would very likely be a very rare edge case. I think we should consider reverting to the DbCommand API of accepting CommandBehavior in ExecuteReader.

Some objections:

  • This breaks the symmetry between the DbCommand/DbBatch APIs which we've tried to maintain. This symmetry reduces conceptual load for users understanding the new DbBatch API, and could also be captured in an interface down the road if needed.
  • To apply the same behavior to the entire batch, users must manually apply it to each and every batch command.
  • From the experience of implementing the batching proposal in Npgsql, this requirement requires quite a bit of rewriting, since behavior currently applies to the resulting DbDataReader as a whole. I can't say for sure this applies to other drivers, and in general implementation concerns shouldn't affect the abstraction design too much, but I sense that the possibility to actually have different behaviors would not be widely supported (@bgrainger any input on this?)
  • It's important to remember that it's always possible to split the batch into two for different command behaviors. That does mean an extra roundtrip, but assuming these are exotic edge cases, that seems more than reasonable.

Here are some notes per value of CommandBehavior:

CloseConnection

This makes no sense in batching, except possibly in the very last DbBatchCommand. We'd have to define the behavior for when it's specify in non-last commands (ignore? apply after the batch?). This value in itself is a an argument for not allowing per-command behavior.

SchemaOnly

Wanting to mix real command execution (non-SchemaOnly) with SchemaOnly in a batch seems very exotic. You typically either execute or retrieve the schema, but not both in the same batch.

KeyInfo

This is used to receive extra metadata information when using SchemaOnly. Wanting to get this extra information for some queries and absolutely not wanting it for others doesn't seem useful.

SequentialAccess

This is the one option which seems remotely plausible - you want to execute multiple commands, one of which has a huge column and should be traversed sequentially. Without per-command behavior, you can still process the entire batch sequentially - there isn't a big downside of that (aside from not being able to access result columns in random fashion).

SingleResult and SingleRow

As far as I know, these are mainly useful in cases where a stored procedure is being executed, but only the first resultset or row of that sproc is needed, and it's not possible to modify the sproc to accept a parameter managing that. It's possible this could be useful when executing multiple sprocs in a batch, but overall it seems like quite the corner case.

@bgrainger
Copy link
Contributor

From the experience of implementing the batching proposal in Npgsql, this requirement requires quite a bit of rewriting, since behavior currently applies to the resulting DbDataReader as a whole. I can't say for sure this applies to other drivers, and in general implementation concerns shouldn't affect the abstraction design too much, but I sense that the possibility to actually have different behaviors would not be widely supported (@bgrainger any input on this?)

In MySqlConnector, MySqlDataReader does have a (single) CommandBehavior associated with the reader as a whole. However, this was only used to support CommandBehavior.CloseConnection, which is explicitly disallowed by MySqlConnector for batch commands (even for the last command in a batch). (It seems like this restriction could be relaxed to allow CloseConnection only for the last command in a batch, and that would be fairly simple to implement in MySqlConnector.)

Per-command CommandBehavior is supported for SchemaOnly, SingleRow, and SingleResult; this does result in SET sql_select_limit=1; (or similar) being sent to the MySQL as an optimisation hint (and all other rows are silently discarded). There is one integration test for CommandBehavior.SingleRow but the others are untested (so conceivably don't actually work).

No other CommandBehavior enum values (KeyInfo, SequentialAccess) are supported by MySqlConnector (with or without batching).

@bgrainger
Copy link
Contributor

SingleResult and SingleRow
As far as I know, these are mainly useful in cases where a stored procedure is being executed, but only the first resultset or row of that sproc is needed, and it's not possible to modify the sproc to accept a parameter managing that. It's possible this could be useful when executing multiple sprocs in a batch, but overall it seems like quite the corner case.

FWIW, the only feature request MySqlConnector has had for CommandBehavior.SingleRow was for a non-sproc query: mysql-net/MySqlConnector#679.

It does seem like the easier solution would be to add LIMIT 1 to the query, but I can imagine simple SQL generation scenarios where that's not trivial to add. Dapper automatically adds SingleRow to its QueryFirstOrDefault call: https://github.com/DapperLib/Dapper/blob/42987a597b9c6b8844b0876e3facfc640f08a5cc/Dapper/SqlMapper.cs#L1182-L1184

In MySql.Data (and now MySqlConnector), this triggers an optimisation that sends a hint to MySQL that only one row should be returned, which can result in a dramatic speed-up in query execution.

I can still see the value of this on a per-command basis.

@bgribaudo
Copy link
Contributor

To apply the same behavior to the entire batch, users must manually apply it to each and every batch command.

Hmm...would it help if the per-batch command CommandBehavior behaved as an override which is applied only if it is set? If it is not set then the CommandBehavior from the parent DbCommand would be used (so the parent DbCommand's CommandBehavior applies to all commands in the batch except for any commands whose CommandBehavior is set directly).

@roji
Copy link
Member Author

roji commented Jun 22, 2021

@bgribaudo this was briefly discussed in #28633. The problem is that at that point we need a distinction between set-to-default and no-set-at-all, e.g. making behavior nullable on the individual batch commands. While this is possible, it introduces even further complexity for what is really edge-casey.

@bgrainger thanks for the added info. I can also see scenarios where per-command behavior could be useful, but I'm not sure they're useful enough (and often enough) when weighed against the disadvantages (i.e. consistency with the DbCommand API, having to set on all commands, etc.). It's also important to remember that users don't always have to batch everything in every case - there's always the option of not batching for the specific edge case which requires different behaviors; the downside is only an extra roundtrip, rather than blocking some functionality etc.

But I'm definitely OK with going with what people think here.

@bgrainger
Copy link
Contributor

users don't always have to batch everything

This is a good counterpoint.

It also seems plausible to me that some database backends might not even support customising the behaviour on a per-command basis (in order to provide high-efficiency batching). So there's a question of whether the API should expose something that may not be feasible for some providers. (CommandBehavior.CloseConnection is perhaps an obvious example of a behaviour that can only be applied to the batch as a whole, not to individual batch commands. OTOH, CommandBehavior.SingleResult doesn't really make much sense for a batch (which would frequently be used to retrieve multiple result sets) but could make sense for individual commands.)

I have no objection to moving the CommandBehavior to the batch.

@Wraith2
Copy link
Contributor

Wraith2 commented Jun 22, 2021

I haven't investigated from the SqlClient perspective so I don't know if any can be used.
It doesn't really matter, if there is one implementation that can use it at that level it should go into the shared api and the others can ignore or partial implement however they choose.

@roji
Copy link
Member Author

roji commented Jun 29, 2021

Looking at the CommandBehavior question again and after some online consultations, I've removed CommandBehavior from DbBatchCommand and added it as a parameter to DbBatch.ExecuteReader (and async).

Summary:

  • Some (even most) behavior options make little sense on a per-command basis (e.g. CloseConnection).
  • Even for those which could make sense, the use cases seem very rare, and users always have the option of executing two batches with different behaviors rather than a single batch with different behaviors.
  • We suspect a significant number of providers won't be able to support some/all behaviors on a per-command basis.
  • As @bgribaudo wrote above, if different behaviors are indeed needed in a batch, then it would help to be able to define a batch-wide default, and override on a specific batch command. Deferring per-command behavior now allows us to introduce this cleanly later (e.g. via a nullable behavior on DbBatchCommand) - if it really is needed. This seems better than the proposal as it was, which forced behavior to always be specified on each behavior, and at the same time doesn't go into the complexity of a batch default with command override too early, before the feature is requested etc.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Jul 4, 2021
roji added a commit that referenced this issue Jul 4, 2021
roji added a commit that referenced this issue Jul 29, 2021
Allow implementations of DbBatchCommandCollection to hide the indexer
with a narrower-typed version for their own implementation of
DbBatchCommand, as elsewhere in ADO.NET.

Fixup for #54467, part of #28633
roji added a commit that referenced this issue Jul 29, 2021
Allow implementations of DbBatchCommandCollection to hide the indexer
with a narrower-typed version for their own implementation of
DbBatchCommand, as elsewhere in ADO.NET.

Fixup for #54467, part of #28633
@ghost ghost locked as resolved and limited conversation to collaborators Aug 3, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
api-approved API was approved in API review, it can be implemented area-System.Data
Projects
None yet
Development

Successfully merging a pull request may close this issue.