API Proposal: DbCommandSet -- allows multiple DbCommands to be transmitted to server in single message #28794

Closed
bgribaudo opened this issue Feb 26, 2019 · 17 comments
Labels
api-suggestion Early API idea and discussion, it is NOT ready for implementation area-System.Data
Milestone

Comments

@bgribaudo
Contributor

This is offered in the friendliest way possible as an alternative to #28633 (this general idea was discussed in that thread but didn't fit the direction being taken for that proposal). Both have the same high-level goal but differ in the details of how they achieve that goal.

Need/Problem Statement

Within the realm of SQL-based relational database engines:

  • All support executing command batches that consist of a single SQL statement.
  • Some support command batches that consist of multiple SQL statements (e.g. SQL Server supports SELECT * FROM TableA; SELECT * FROM TableB).
  • Some also support multiple independent command batches being sent in the same wire protocol level message. (In turn, depending on the server, these independent batches may themselves each contain multiple SQL statements.) Transmitting multiple commands in the same message can offer performance benefits.

The first two scenarios are supported by ADO.Net's DbCommand. However, no public ADO.Net API currently exists which allows multiple DbCommands to be grouped into a single wire protocol level message. This proposal suggests one way to add this functionality.

Design Strategy

Commands are passed in as independent units. Results are returned in similar fashion—as independent units, one unit per command passed in.

In general, the consumer's experience in retrieving results should be (almost) identical to the experience they'd have if they executed each command separately. For example, the consumer should be able to easily determine which result set (or result sets, as a single command can return several) came from which command; affected row counts should be determinable on a per-command basis; and the reader-level optimizations achieved by methods like ExecuteScalar, as well as CommandBehavior options, should be available on a per-command basis (as different optimizations may apply to different commands in the set).

API Proposal Draft

public abstract class DbCommandSet : IDisposable, IAsyncDisposable, IEnumerable<(DbCommand, CommandBehavior)>
{

  #region Command-Set Management

  // Adds a single command to the set
  public abstract void Add(DbCommand command);
  public abstract void Add(DbCommand command, CommandBehavior commandBehavior);

  // Adds multiple commands to the set
  public virtual void AddRange(IEnumerable<DbCommand> commands);
  public virtual void AddRange(IEnumerable<DbCommand> commands, CommandBehavior commandBehavior);
  public virtual void AddRange(IEnumerable<(DbCommand, CommandBehavior)> commands);

  // Clears all commands 
  public abstract void Clear();

  public abstract IEnumerator<(DbCommand, CommandBehavior)> GetEnumerator();
  IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

  #endregion


  #region Execution

  public abstract DbCommandSetResults Execute();
  public abstract Task<DbCommandSetResults> ExecuteAsync(CancellationToken cancellationToken = default);

  // Shortcut - executes then returns sum of number of rows affected by each command in the set
  public abstract int ExecuteNonQuery();
  public abstract Task<int> ExecuteNonQueryAsync(CancellationToken cancellationToken = default);

  public abstract void Prepare();
  public abstract Task PrepareAsync(CancellationToken cancellationToken = default);

  public abstract void Cancel();
  public abstract Task CancelAsync(CancellationToken cancellationToken = default);
  
  #endregion
  

  // Could be used for any command that does not have a timeout set; or individual command timeouts could be ignored in deference to this timeout.
  public abstract int Timeout { get; set; }

  // Specifies the connection to be used when executed. (If a connection is set on an individual command and it does not match this connection, possibly an exception should be thrown.)
  public abstract DbConnection Connection { get; set; }

  // If set, used for all commands in the set. (If a transaction is set on an individual command and it does not match this transaction, possibly an exception should be thrown.)
  public abstract DbTransaction Transaction { get; set; }
}

public abstract class DbCommandSetResults : IDisposable, IAsyncDisposable {
  public virtual DbDataReader GetReader();
  public abstract DbDataReader GetReader(CommandBehavior commandBehavior);
  public virtual Task<DbDataReader> GetReaderAsync(CancellationToken cancellationToken = default);
  public abstract Task<DbDataReader> GetReaderAsync(CommandBehavior commandBehavior, CancellationToken cancellationToken = default);

  public abstract int GetNonQueryResult();
  public abstract Task<int> GetNonQueryResultAsync(CancellationToken cancellationToken = default);
  
  public abstract object GetScalarResult();
  public abstract Task<object> GetScalarResultAsync(CancellationToken cancellationToken = default);

  // Advances DbCommandSetResults to the results associated with the next command in the set.
  public abstract bool NextCommandSetResults();
}

Notes

  • CommandBehavior may optionally be passed when adding commands to the set and when reading results. Depending on the database server, some behaviors (e.g. KeyInfo, SchemaOnly) affect the query executed, so they need to be specified before execution (e.g. when the command is added to the set). Other behaviors may only be of interest to the data reader (for example, so it can optimize reading: SequentialAccess, SingleResult, SingleRow) and so only need to be passed at the time of reading. If a reading-specific behavior is passed when a command is added, or an execution-level behavior is passed during reading, an exception should be thrown.

Example

using (var set = new SqlDbCommandSet(connection)) {
  set.Add(command1);
  set.Add(command2, CommandBehavior.SchemaOnly);
  set.AddRange(otherCommandsEnumerable);
  
  using (var setResults = set.Execute()) {  
    // Get affected row count from first command
    var command1AffectedRows = setResults.GetNonQueryResult();
    
    // Move to second command's results
    setResults.NextCommandSetResults();
    
    // Get schema details from second command's results
    using (var schemaReader = setResults.GetReader()) {
      var command2Schema = schemaReader.GetColumnSchema();
    }
    
    // Loop through remaining commands' results, processing each
    while (setResults.NextCommandSetResults()) {
      using (var reader = setResults.GetReader()) {
        ProcessReader(reader);
      }
    }
  }
}
@bgribaudo
Copy link
Contributor Author

To recap: the above proposal is offered in the friendliest way possible as an alternative to #28633. Both have the same high-level goal but differ in the details of how they achieve that goal.

Differences

#28633's approach flattens the result sets returned by the various commands so that they are accessed sequentially using a single data reader. For example, imagine a command set that consists of a command returning one result set, a command returning two result sets and another command returning one result set. #28633 will expose the results for this set via a single data reader which is used to access all four result sets.

With this proposal, results are grouped by command. Using the example scenario from the preceding paragraph: first a data reader containing the first command's single result set is provided, then a reader for the second command's two result sets, followed by a reader for the third command's single result set.
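
To make the contrast concrete, here is a minimal sketch of consuming that 1 + 2 + 1 scenario under this proposal (assuming a populated command set named set, and the draft API above):

// Sketch: one reader per command; NextResult() is used only *within*
// the second command's reader, which returns two result sets.
using (var results = set.Execute())
{
  using (var reader1 = results.GetReader())
  {
    while (reader1.Read()) { /* first command's single result set */ }
  }

  results.NextCommandSetResults();
  using (var reader2 = results.GetReader())
  {
    while (reader2.Read()) { /* second command's first result set */ }
    reader2.NextResult();
    while (reader2.Read()) { /* second command's second result set */ }
  }

  results.NextCommandSetResults();
  using (var reader3 = results.GetReader())
  {
    while (reader3.Read()) { /* third command's single result set */ }
  }
}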

The "flatten results from each command into a single data reader" approach of #28633 streamlines use-cases such as executing a set of select statements where each returns a fixed number of result sets and where errors are not expected, as well as executing sets of insert statements where all are expected to succeed together or the entire set should die on first failure.

However, not every scenario where command sets would be useful aligns with this streamlining. For example, the ability to work with the results from each command separately comes in handy in the following situations:

  • Errors - Imagine a set of commands where the first calls a stored procedure that returns one result set, the second calls the same stored procedure (though perhaps with different arguments) and subsequent commands return other result sets. When executed, the following are returned in the order given: an error, a result set, an error, a result set.

    How should this be interpreted? Possibilities include:

    • First error indicates first stored procedure failed, first result set is from second stored procedure, second error is from the command coming after the second stored procedure call, ….
    • Both first error and first result set came from the first procedure, second error came from second stored procedure, second result set came from the command following the second stored procedure call.
    • Both first error and first result set came from first procedure, both second error and second result set came from second procedure.

    If all result sets are flattened, it's hard to tell (in fact, it may not be possible to read beyond the first error, as that error may kill the data reader); if result sets are segmented by command, making the association is much easier (and since data readers are command-specific, an error that kills one command's data reader doesn't automatically break the ability to read the next command's results).

  • Variable Number of Result Sets - Not every command returns the same number of result sets each time it is called. Take, for example, a call to SQL Server's sp_help with a table name passed in. The number of result sets returned can vary depending on how the particular table is defined.

    Imagine performing database metadata discovery using a set of sp_help calls combined into a command set. In this particular situation, all the consumer cares about is the first result set from each proc invocation. If result sets are segmented by command, it's easy for the consumer to get the results for a command, read the first result set, then jump to the next command's results and read the first, etc. (a sketch of this follows the list). If, instead, result sets are flattened, the consumer needs to apply more advanced logic to determine where one command's results end and the next begins (maybe the consumer needs to check each result set to see if its schema matches what the first result set's schema should look like).

  • Reader Performance - Grouping results by command offers the option of specifying a CommandBehavior specific to each command's results. This allows the relevant reader to be tuned to how those particular results will be consumed (e.g. from the first command, just need a scalar; the second, just a single row; from the third, need everything). With result flattening, performance tuning at this level isn't available.

  • Affected Rows - With this proposal, affected row counts are provided on a per-command basis, similar to how they would be if each command was executed separately.

  • Separation of Responsibilities/Legacy Code - Existing code written to work with data readers can work with the data readers returned by this proposal without modification, as each data reader works (almost exactly) the same as it would if it were produced by executing the command outside of the command set.

    With result flattening, this is not necessarily true: Code consuming a flattened reader may need to be adapted to work in the new, shared context. For example, if a method that processes a reader from one command is passed a reader from the command set, the method may need to be modified to not close the reader when done (because the reader may provide access to other data not relevant to the method at hand) and to not assume that it can loop through all results returned by the reader (as [some] of those results may be from other commands).
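
Returning to the variable-result-set bullet above, here is a rough sketch of the per-command consumption it describes, using the draft API (assumes the sp_help calls have already been added to set):

// Sketch: read only the first result set from each sp_help invocation,
// then jump straight to the next command's results; any extra result
// sets a given call produced are simply never read.
using (var results = set.Execute())
{
  do
  {
    using (var reader = results.GetReader(CommandBehavior.SingleResult))
    {
      while (reader.Read()) { /* consume the first result set only */ }
    }
  }
  while (results.NextCommandSetResults());
}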

In summary, by grouping results by command, this API proposal operates at a slightly lower level than #28633. By operating at this slightly lower level, it's possible to support situations that a flattened-results approach either doesn't accommodate or doesn't accommodate as gracefully. It also more closely mirrors how results can be returned by the database server (e.g. when independent commands are executed in a single message, SQL Server segments the various result sets by command).

Flattening can easily be implemented on top of this proposal, for those use cases where flattening is ideal.

@roji
Member

roji commented Mar 1, 2019

@bgribaudo, sorry it took so long to answer - I only just recently realized this issue existed (always a good idea to mention relevant people to make sure they're aware).

First, it's really great to see this: regardless of what I think of the proposal or what happens to it, it shows that there's real interest and passion around this subject, and it makes me confident that we're on the right track trying to evolve the API. So thanks for engaging.

Having gone through the issue, my opinion is still the same as described in #28633. I will explain why I don't think this is the right API, point by point, but that's just my opinion and I really am open to having my mind changed (I'm also not making any decisions alone on this). In a nutshell:

  1. This proposal adds considerable complexity to the API, for all uses of batching, and not just for commands returning a variable number of resultsets. In that sense it's not "pay-for-play" (complexity is added everywhere, not just for the problematic cases).
  2. I believe that the problematic cases you cite (variable number of resultsets) are rare, possibly even extremely rare.
  3. Reasonable alternative solutions exist for finding out which resultset belongs to which command. These require slightly more advanced work by consumers, but only when needed, and do not make the general API more complex.

In other words, I don't think your proposal is wrong or useless - it does solve real problems - but it does so by adding complexity for everyone, whereas the problems it solves seem rare and solvable via other means.

Added complexity

First, to be sure we're all in sync, here's what consuming two resultsets would look like according to #28633:

DbCommandSet dbSet = ...; // Set up a command set
using (var reader = dbSet.ExecuteReader())
{
   // Regular, familiar pattern for accessing DbDataReader
   while (reader.Read()) {
      // Process first resultset
   }
   reader.NextResult();
   while (reader.Read()) {
      // Process second resultset
   }   
}

And here's what it would look like according to this proposal:

DbCommandSet dbSet = ...; // Set up a command set
using (var results = dbSet.Execute())
{
   using (var reader1 = results.GetReader()) {
      while (reader1.Read()) {
         // Process first resultset
      }
   }
   results.NextCommandSetResults();
   using (var reader2 = results.GetReader()) {
      while (reader2.Read()) {
         // Process second resultset
      }
   }
}
  • Your proposal adds a new concept that doesn't exist in #28633 (DbCommandSetResults). This is another thing for users to understand, and makes it harder to adopt the new API.
  • Consuming DbCommand results and consuming DbCommandSet results would be pretty different things, and users would have to think about the difference.
  • Batching is already used in many applications, by concatenating SQL statements. Your proposal would make it harder to port to the new API, as multiple readers would now be needed where only one was previously needed. #28633 makes porting easier by making DbCommandSet behave very similarly to a DbCommand.
  • There's a new asymmetry between the batched and non-batched APIs. The non-batched API doesn't have a DbCommandResults type similar to what you propose. Also, in the non-batched API we have Execute{Reader,Scalar,NonQuery} on the command, whereas in your proposal we have Get{Reader,Scalar,NonQuery} on the results.

To me, this added API complexity (which again, applies to everyone regardless of whether they're using variable resultsets) outweighs the value this proposal brings, even if I do agree that it does bring value.

Rareness of the problem

First, it's obviously hard to prove, but I have a strong feeling that in 99% of cases, users execute batches of a fixed number of commands, returning a fixed and known number of resultsets. This comes from long experience working with users - variable resultsets have been a very minor edge-case - and #28633 was designed with this in mind. However, I could be biased (I come from the PostgreSQL world), and I'm open to hearing otherwise.

Second, once a structured batching API is introduced, there shouldn't really be a reason for users to continue using the legacy, concatenation batching "API". This isn't to say it shouldn't work anymore - providers can continue supporting concatenation batching even within a structured batch ("batching within batching") - but at the moment I simply don't see a compelling reason for users to do this. If that's correct, that would mean that the variable resultset problem is in effect restricted to stored procedures only.

Finally, even stored procedures returning variable resultsets seem to be a rare exception, rather than any sort of norm. Examples like sp_help do exist, but adding complexity to the API just to allow users to discern which resultset belongs to which command, just for this case, doesn't seem right to me.

Other solutions

If there were no way at all to discern which resultset belongs to which command, I'd agree that we have a problem. However, as I wrote here, there are other solutions for this:

  • One can look at the shape of the new resultset (e.g. via DbDataReader.GetSchemaTable() or GetColumnSchema()). The number of columns in the resultset or their types can be used to understand that we've now switched to consuming the next command's resultset.
  • One can simply insert a "marker query", producing a resultset that we'd look for. For example, we would execute SomeStoredProcReturningVariableResult(), then SELECT 'DONE', then SomeOtherThing(). When parsing the results, we'd simply examine each new resultset to see whether it contains DONE, and thereby know whether we've finished processing the first procedure's resultsets.
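
For illustration, here is a rough sketch of the marker-query technique against the single flattened reader of #28633 (aliasing the marker as SELECT 'DONE' AS Marker is an assumption made here to keep detection simple):

// Sketch: the batch is SomeStoredProcReturningVariableResult();
// SELECT 'DONE' AS Marker; SomeOtherThing()
using (var reader = dbSet.ExecuteReader())
{
   // Consume however many resultsets the first procedure returns.
   do
   {
      if (reader.FieldCount == 1 && reader.GetName(0) == "Marker")
         break; // hit the marker resultset: the first procedure is done
      while (reader.Read()) { /* process one of the procedure's resultsets */ }
   }
   while (reader.NextResult());

   reader.NextResult(); // move past the marker to SomeOtherThing()'s results
   while (reader.Read()) { /* process SomeOtherThing()'s resultset */ }
}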

As variable resultsets are probably rare, the above seem to be perfectly reasonable ways to get the needed information. I'd personally want to understand what's insufficient about them before making the general API more complex for everyone.

It's also worth remembering that this is only a batching API - it improves performance but there's always the option of not batching if absolutely necessary (although at the moment I'm really not aware of such a situation).

Answers to your specific points

Errors

I'm still skeptical of the idea that query/command handling can continue in a reliable and sane way after an error occurs - or that it's a desirable feature. Examples of these up to now have been very theoretical; it would be good to see a reasonable, real-world example here, to avoid the risk that we're discussing an exotic/contrived scenario.

But more importantly, error handling doesn't seem to add an additional argument here - the only point seems to be again, the purported difficulty to discern which resultset belongs to which command. It seems that the solutions proposed above (via resultset query shape, via marker query) work just as well in error situations, unless I'm missing something. I don't see any reason to suppose that the reader will be killed upon error - that's a provider-specific implementation detail. If the provider does support handling commands in the batch after an error (again, I'm skeptical), then there's no reason it wouldn't allow you to continue using the same reader for that.

Variable Number of Result Sets
[...]
If [...] result sets are flattened, the consumer needs to apply (slightly) more advanced logic to determine where one command's results end and the next begins (maybe the consumer needs to check each result set to see if its schema matches what the first result set’s schema should look like).

Already addressed above. Yes, the consumer would need to apply more advanced logic in this scenario - but it's possible and seems reasonable given the supposed rareness of the need. On the other hand, if we change to a non-flattened API everyone has to do more work, all the time, just to go over regular batch results.

Reader Performance - Grouping results by command offers the option of specifying a CommandBehavior specific to each command's results. This allows the relevant reader to be tuned to how those particular results will be consumed (e.g. from the first command, just need a scalar; the second, just a single row; from the third, need everything). With result flattening, performance tuning at this level isn't available.

The problem with this argument is that CommandBehavior seems to make sense only when given at execution time, before command(s) are sent to the database (or as part of them). In other words, by the time you pass CommandBehavior.SingleRow to the proposed DbCommandSetResults.GetReader(), the entire query batch has already been sent to the database and all rows are already being sent back from the database. Of course, the driver can still implement SingleRow client-side by filtering out subsequent rows, but the whole point of something like SingleRow is to avoid that. This seems to hold for KeyInfo, SchemaOnly, SingleResult and SingleRow - leaving only SequentialAccess and CloseConnection, the latter of which doesn't make much sense in this context.

Note that for SingleRow and SingleResult it's usually possible to achieve the same thing in SQL (e.g. with TOP(1)).

NOTE: When I wrote the above, I was referring to the overload of DbCommandSetResults.GetReader() which accepts a CommandBehavior. It's been pointed out to me since that you also include methods on DbCommandSet itself which add a command along with a CommandBehavior - my objection above doesn't hold for that. See the follow-up comment below for more.

Affected Rows

It's true that your proposal provides a simple solution for getting the non-aggregated affected rows. However, we already have two other proposals which provide a solution to this. I do agree they're a bit less elegant, but on the other hand I also consider non-aggregated affected rows an advanced/rare scenario mainly to be used by ORMs and other layers, for features such as optimistic concurrency. I really don't suspect end consumers will have much need to directly use this API feature.

Separation of Responsibilities/Legacy Code - Existing code written to work with data readers can work with the data readers returned by this proposal without modification, as each data reader works (almost exactly) the same as it would if it were produced by executing the command outside of the command set.
With result flattening, this is not necessarily true: Code consuming a flattened reader may need to be adapted to work in the new, shared context. For example, if a method that processes a reader from one command is passed a reader from the command set, the method may need to be modified to not close the reader when done (because the reader may provide access to other data not relevant to the method at hand) and to not assume that it can loop through all results returned by the reader (as [some] of those results may be from other commands).

I don't really agree with this point, and think that from a compatibility/legacy point of view the situation is in fact the opposite. As I wrote above, lots of code already exists out there which makes use of concatenation batching. If #28633 is implemented as-is, then the existing code need only be modified to replace the concatenated DbCommand with a DbCommandSet - the readers produced by both should be identical. However, if your proposal is implemented, then the consuming code must be modified since there's no longer a single reader as before, but multiple ones. In that sense your proposal makes it more complicated, and not easier, to switch to the new batching API.

It is true that if you have an existing method that accepts a DbDataReader, closes it with the assumption that it has only one resultset, and you want to start passing it a batch-produced reader, then yes, it would need to be modified to not close the reader. However, that seems reasonable for that very specific design - I'd prefer current uses of concatenation batching to be easier to port.

/cc @divega @ajcvickers

@ajcvickers
Member

@roji Often when there is a common, simple experience that works for most people, but some things need to be tweaked or need more access for advanced cases, we approach that by creating building blocks that allow the customization when needed, and then we also create convention/sugar experiences on top of that so that it doesn't become "harder for everyone". Do you think there is any room for this here?

Of course, taking that approach too far leads to over-engineered and heavy solutions. For example, I'm not sure too many layers here is a great idea because of perf (including excessive allocations), but then again, maybe the low-level building blocks can be fast, and only the super sugary layer a bit more allocation-heavy.

Regardless, just something to consider.

@roji
Member

roji commented Mar 1, 2019

@ajcvickers it's certainly possible to think about two layers here - a lower-level one which doesn't flatten command results and an upper-level one which does. As you wrote, I'm also not keen to go down that path for reasons of complexity, perf, etc.

But a crucial point is that at the moment, I'm not aware of anything which wouldn't be possible with the simpler API being proposed (#35135) and that would be made possible with this proposal; mechanisms do exist for discerning command edges when consuming resultsets - it's just that the basic API doesn't do it for you automatically. In that sense, this proposal seems to add convenience rather than functionality, making a more complex (but IMO quite rare) task easier to perform. This is why I don't think it's a good idea to complicate things, either by providing one more complex API (this proposal) or one simple and one more complex API.

I really could be wrong here - I'm just waiting for the additional information to show it. If we feel that variable resultsets really are a very common scenario, then it may make sense to make it easier to consume them via this proposal. Or if we find some serious fault with the proposed alternative solutions for discerning edges (i.e. resultset shape, marker query) - some important scenario which isn't handled - then once again it would make sense to look further. But at this point I haven't seen one.

@roji
Member

roji commented Mar 1, 2019

One additional important note about CommandBehavior. In your proposal there are two places where CommandBehavior can be specified:

  1. DbCommandSetResults.GetReader()
  2. DbCommandSet.Add() (and AddRange()).

The first one doesn't work as I explained above, since it occurs after everything has already been sent to the database, and so cannot really impact performance as it should.

However, the second option doesn't suffer from this problem, and is definitely a possible improvement to the API. It's important to note that this is completely orthogonal to the rest of your proposal, i.e. it has nothing to do with flattened/non-flattened resultsets - the only change is to the DbCommandSet API itself. As such, I'm going to add it as an open question to #28633 and we can continue discussion there. We can keep this issue to focus more specifically on the resultset issue and the proposed DbCommandSetResults API.

@bgribaudo
Contributor Author

@roji, thank you for taking the time to read this proposal and for the detailed comments. We are both pulling toward the same big picture goal.

Taking a step back from implementation details, one question I’ve been thinking about is “what are the use cases for command sets?” There are many different ways to slice and dice when grouping and describing use cases. Here’s one attempt….

Use Cases

ORM Scenario (a.k.a. the Template Command)

I get the sense that (one of) the main use case(s) motivating #28633 comes from the ORM world. Say, changes have been made to a set of objects which need to be persisted to the database. When those changes are written to the db, revised values for the affected rows should be returned from the db (e.g. so that object properties can be updated with values from database computed columns, from columns updated by triggers, etc.). Also, applying the changes from the objects should all either succeed or fail together.

The query that needs to be executed might be something like: UPDATE dbo.SomeTable SET Col1 = @Value1 OUTPUT INSERTED.Col1, INSERTED.OtherValue WHERE Col2 = @Value2;. The simple way to execute it is to load the statement into a DbCommand, then iterate through the objects, setting the command's parameters for the current object, executing, and moving to the next object. This works. However, it's very inefficient due to the various latencies introduced by needing to wait for each command to finish before beginning the next.

The common way around this is to concatenate multiple statements together into a single command (UPDATE …; UPDATE …; etc.). This also works, potentially providing a significant performance improvement over the simplistic alternative. However, the developer needs to build the concatenated statement. Simply cloning the SQL statement several times isn't that big of a deal. The complexity comes from the fact that parameters in each clone need to be renamed and mapped to the appropriate source object's parameters. It would be really nice if each project didn't need to reinvent the wheel to implement this functionality…hence the other proposal.
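
To illustrate the renaming chore, here is a minimal hypothetical sketch (objectsToSave and its Value1/Value2 properties are invented for the example):

// Sketch: concatenation batching with per-clone parameter renaming.
var sql = new StringBuilder();
using (var cmd = connection.CreateCommand())
{
  for (var i = 0; i < objectsToSave.Count; i++)
  {
    sql.Append($"UPDATE dbo.SomeTable SET Col1 = @Value1_{i} WHERE Col2 = @Value2_{i};");
    var p1 = cmd.CreateParameter();
    p1.ParameterName = $"@Value1_{i}";
    p1.Value = objectsToSave[i].Value1;
    cmd.Parameters.Add(p1);
    var p2 = cmd.CreateParameter();
    p2.ParameterName = $"@Value2_{i}";
    p2.Value = objectsToSave[i].Value2;
    cmd.Parameters.Add(p2);
  }
  cmd.CommandText = sql.ToString();
  cmd.ExecuteNonQuery();
}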

In this use case, the need is for:

  • The same SQL batch to be executed multiple times,
  • with different parameters each time,
  • under the expectation that the entire set of commands either succeeds or fails together—and if failure occurs, dies on first failure,
  • where the SQL command consists of a single SQL statement, and
  • where each command always produces either zero result sets or one result set.

In essence, a single command template is being applied multiple times with different arguments each time.

General Use Case

This includes all situations which do not exactly match the need list for the ORM Scenario/Template Command. :-) In essence, the developer decides that several distinct commands would be more efficiently executed if they were multiplexed together for transmission to the database server.

For example:

  • During application start-up, a dozen distinct queries need to be run to pull information. (Doesn’t fit the ORM Scenario as distinct commands are executed.)
  • Search functionality where a user inputs a list of item numbers, which is split by delimiter and each item passed as a parameter to a separate stored procedure call. If an item is not found, the stored proc returns an error. When this occurs, details about the other items should still be returned. (Doesn't fit the ORM Scenario, as execution of other procs/processing of their results should continue on failure and the number of result sets returned varies based on the number of non-erroring proc calls.)
  • Document production tool that displays a grid of documents of various types. User selects the ones they want to produce (e.g. print), then clicks the action button. Statements are executed for each selected document, with the procs/queries used differing based on the document’s type. If a document has already been produced since the grid was last updated, the queries/stored procs won’t be able to find it—but this shouldn’t stop production of the other selected documents. (Doesn’t fit the ORM Scenario as distinct commands are executed and the set shouldn’t die on first failure; possibly also because of varying number of result sets).
  • Various other situations the broad span of developers who use ADO.Net will come up with.

Real-World Example

A few years ago, I worked on a .Net project where command sets would (presumably) have significantly helped performance. In a nutshell, the tool being built batch imported data: read from a source file, apply transformations, then dynamically map to tables/columns in a SQL database (with mappings computed based both on the source file's headers and the database schema). Database errors from foreign key constraint violations were expected and needed both to be logged and to cause the current source row to be skipped, but could not abort the entire import.

Since errors needed to be tied to the specific source row, the data for each source row (which could target multiple tables) was inserted separately. Imagine a SQL batch something like: BEGIN TRAN; INSERT TableA (Col1…) VALUES (…); DECLARE @Id int = SCOPE_IDENTITY(); INSERT Table2 (ForeignKey, ColA) VALUES (@Id, …); COMMIT; being executed with SET XACT_ABORT ON.

A great candidate for command sets—though definitely not an ORM Scenario/Command Template situation (multiple statements in the command, continue on failure, etc.).

One idea which we didn’t get to implement at the time but which came up and had real-world applicability was for the tool to log the identity values of the just-inserted rows. To do this, the INSERTs would change to something like “INSERT TableA … OUTPUT INSERTED.IdColumn;…”. If this were done and a foreign key constraint violation caused an insert in a particular batch to fail, the remaining inserts in that batch would have been skipped over by the server as it moved to start on the next batch in the set, leading to the result sets returned for that source row being less than the expected quantity (a.k.a. a variable number of results being returned).

Is this a reasonably fair way to articulate the use cases for command sets?

@bgribaudo
Contributor Author

How results are returned

Describing use cases still leaves the big question we've been discussing about how results are returned. :-)

Under either proposal, in the case of the ORM Scenario/Template Command, the DbCommands in the command set are not—in essence—separate distinct commands but rather instances of the same command paired with different parameters. In this context, there is no need or interest to group the result sets by the command instance they correspond to. Segmenting the results into separate data reader instances is not needed and so adds unnecessary complexity, because those results are coming (in a sense) from the same command (just different instances of it).

For the General Use Case, returning the results separated by command batch maintains symmetry with the fact that separate command batches were passed in, and allows the data readers to behave almost exactly as they would if the commands were executed separately. Since the developer is consciously aware that they are multiplexing distinct command batches and is using an API from a section of .Net that works with lower-level data access functionality, I'd propose that it is not unduly burdensome to expose them to the same level of detail/responsibility on the result consumption side (that is, by returning results segmented by command batch).

All this said, even if it is not unduly burdensome, it is still fair to ask if this level of detail is necessary or if it could be eliminated for convenience purposes.

If it is felt that it could be appropriate to segment by command batch, at least in some cases, but the concern with providing an API for this is not wanting to complicate the use case inspiring the original proposal (ORM Use Case/Template Command), what about providing a simplified template command API for that use case and then having a more general API that accommodates the other/more advanced uses folks might make of it?

An API tailored just for the idea of a template command could be streamlined even more than either of the current proposals. Instead of requiring the consumer to create a set of commands, why not let them define a command template, then pass an enumerable of enumerables of parameters (where each item in the outer enumerable corresponds with one invocation of the command and the items in the inner enumerables are the parameters for that particular invocation). Executing the set would then prepare the template, then transmit the set of commands using the prepared reference (instead of paying the transport byte cost to re-transmit the full SQL statement once for each invocation).

using (var command = new SqlTemplateCommand(connection)) {
   command.CommandText = "INSERT …"; // alternative: command.TemplateCommand = new SqlCommand { CommandText = "INSERT …" };
   command.Parameters = new[] { new[] { /* params for first insert */ }, new[] { /* params for second insert */ } /* , … */ };
   command.ExecuteNonQuery();
}

In other words, instead of trying to have a one-size-fits-all API that makes the simple more complex than it needs to be (the ORM Use Case) and the complex less accommodating than it could be (the General Use Case), what about splitting the two and so bypassing the problem? :-) (Behind the scenes, the implementation of template command could certainly leverage the general purpose API.)

@bgribaudo
Contributor Author

@roji, some thoughts on specific comments you graciously shared. All of these are primarily given in the context of/from the perspective of the General Use Case.

Batching is already used in many applications, by concatenating SQL statements. Your proposal would make it harder to port to the new API, as multiple readers would now be needed where only one was previously needed.

Actually, only commands (a.k.a. batches) would have their results segmented into separate data readers. However, if a single command returns multiple result sets (e.g. because several SQL statements are concatenated together), nothing changes. Each of those result sets would still be returned by the same data reader (identical to how DbCommand handles this).

There's a new asymmetry between the batched and non-batched APIs.

Conceptually, I look at it as a symmetrical input-to-output flow: “consumer passes a set of distinct commands to the multiplexer, consumer receives back a set of distinct command results from the multiplexer.” :-)

Second, once a structured batching API is introduced, there shouldn't really be a reason for users to continue using the legacy, concatenation batching "API". This isn't to say it shouldn't work anymore - providers can continue supporting concatenation batching even within a structured batch ("batching within batching") - but at the moment I simply don't see a compelling reason for users to do this.

I’m not so sure. :-) In simpler situations where using a single DbCommand satisfies all of a consumer’s needs, I suspect there’s a good chance consumers will keep concatenating. After all, in the more basic scenarios, using a command set adds complexity without offering a measurable advantage (imagine a train of thought of: why create a command set instance and separate command instances for each statement if putting them all in the same string works fine?). If consumers keep doing this, then some of those concatenated-statement commands will find their way into command sets.

Also, there are advanced situations where command sets will be most helpful but multiple statements need to be placed in a single command batch within that set to achieve the desired effect (e.g. executing an insert into several related tables where the key from the first table needs to be used as a foreign key reference by the others).

I'm still skeptical of the idea that query/command handling can continue in a reliable and sane way after an error occurs - or that it's a desirable feature. …. But more importantly, error handling doesn't seem to add an additional argument here - the only point seems to be again, the purported difficulty to discern which resultset belongs to which command. It seems that the solutions proposed above (via resultset query shape, via marker query) work just as well in error situations, unless I'm missing something. I don't see any reason to suppose that the reader will be killed upon error - that's a provider-specific implementation detail.

In situations like the batch insert tool I mentioned above or the idea of making distinct calls to the ‘look up item details’ stored procs—in both cases, a failure with one command shouldn’t kill the entire set.

Currently, if SqlClient encounters an error before the first result set (e.g. from a variable DECLARE/SET statement’s expression, from early in a stored proc’s logic), the exception for the error is returned from ExecuteReader(). It’s not possible to move the reader past that error because no reader instance was returned. If the same behavior is kept with SQL Server’s implementation of command sets, then it won’t be possible to read past the exception to the results from subsequent commands in the set—even though those commands are still being executed and results returned for them by the server.
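
Under this proposal, that isolation could look something like the following sketch (assuming the provider surfaces a failed command’s error from that command’s own reader, as SqlClient does today for a standalone command):

// Sketch: one command's failure doesn't prevent reading its neighbors' results.
using (var results = set.Execute())
{
  do
  {
    try
    {
      using (var reader = results.GetReader())
      {
        while (reader.Read()) { /* process this command's results */ }
      }
    }
    catch (DbException ex)
    {
      // Log and skip just this command; the other commands' readers are unaffected.
      Console.Error.WriteLine(ex.Message);
    }
  }
  while (results.NextCommandSetResults());
}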

(I’m assuming that API implementers will maintain similar behavior between the data readers returned by their DbCommand derivatives and those returned from this API—because they likely will want to reuse as much existing logic as possible; as well as because users of their DbCommands have established expectations of how the associated readers handle errors and it would be confusing to introduce a new data reader flavor that handles errors differently.)

@bgribaudo
Contributor Author

@roji

Reader Performance - Grouping results by command offers the option of specifying a CommandBehavior specific to each command's results. This allows the relevant reader to be tuned to how those particular results will be consumed (e.g. from the first command, just need a scalar; the second, just a single row; from the third, need everything). With result flattening, performance tuning at this level isn't available.

The problem with this argument is that CommandBehavior seems to make sense only when given at execution time, before command(s) are sent to the database (or as part of them). In other words, by the time you pass CommandBehavior.SingleRow to the proposed DbCommandSetResults.GetReader(), the entire query batch has already been sent to the database and all rows are already being sent back from the database. Of course, the driver can still implement SingleRow client-side by filtering out subsequent rows, but the whole point of something like SingleRow is to avoid that. This seems to hold for KeyInfo, SchemaOnly, SingleResult and SingleRow - leaving only SequentialAccess and CloseConnection, the latter of which doesn't make much sense in this context.

Maybe this varies client by client. With SqlClient, I believe only KeyInfo and SchemaOnly change the query, while SingleResult, SingleRow and SequentialAccess come into play in the data reader (e.g. ref1, ref2).

@roji
Member

roji commented Mar 6, 2019

@bgribaudo, thanks for your responses - it's very valuable to get this kind of feedback and the discussion is very important.

Use cases and a "SIMD" API

First, regarding use cases... I don't necessarily think that ORMs behave in some common way that's distinct from other non-ORM scenarios - different ORMs do different things in different situations, and a lot of what they do (possibly even all) overlaps with non-ORM usage. The "ORM Template command" scenario you described - where the same SQL is executed several times with different parameter sets - is indeed important; let's call it "SIMD" for brevity. EF Core, for one, makes heavy use of batching (when the user calls SaveChanges()), but it definitely does not just do SIMD - INSERTs and UPDATEs are mixed into the same batch, for different tables with different shapes. So I'll talk about SIMD rather than ORMs from here on.

I think that as you suggest, an API tailored for SIMD batching does make some sense. FWIW this is exactly the way the JDBC API works - see section 14.1.4 of the JDBC 4.3 specification. Unfortunately, in JDBC this comes at the expense of batching multiple heterogeneous statements in a single PreparedStatement - only SIMD is supported. Multiple heterogeneous statement batching is supported by the non-prepared Statement, but this means that no parameters are supported (nor preparation, of course). We're trying to do better with ADO.NET.

The important point about a SIMD-specific API, is that it can easily be implemented on top of either one of our proposals: some API would accept a single SQL statement and multiple parameters sets, and would simply create a DbCommandSet with many commands, each having the same statement and one of the parameter sets. One could even merge the resultsets coming back from this into a single resultset for easier access, as you suggested (although rows affected for each command would still need to be translated). But at the end of the day, since this can be seen as a purely additional layer, I don't think this belongs in ADO.NET, which is a lower-level API that IMO should aim only to expose access to provider features which cannot simply be implemented on top.
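
As a small sketch of that layering (names hypothetical; assumes the Add shape from either proposal):

// Sketch: a SIMD-style helper built purely on top of a command set.
static void AddTemplateCommands(
    DbCommandSet set, DbConnection connection, string sql,
    IEnumerable<DbParameter[]> parameterSets)
{
   foreach (var parameters in parameterSets)
   {
      var cmd = connection.CreateCommand();
      cmd.CommandText = sql;               // same statement every time
      cmd.Parameters.AddRange(parameters); // different arguments per invocation
      set.Add(cmd);
   }
}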

Note that even if a SIMD batching API were introduced into ADO.NET alongside a more general, non-SIMD batching API, I still would want the latter to produce a single reader, for all the reasons already discussed (I still expect the vast majority of cases to not have variable numbers of resultsets).

Error handling

First, in some of your "general use case" scenarios you describe stored procedure calls that return an error when an item is not found. As a general rule, I'd not expect an actual database error - DbException/SqlException - to be raised in this case, but rather some other way of saying "not found" (the procedure could simply return an empty resultset, or a null, or whatever). Similarly, rather than relying on constraint violations to be generated as part of normal processing, it may be possible to instruct the database to simply ignore the INSERT in that case (PostgreSQL has INSERT ... ON CONFLICT DO NOTHING) - these are all just ways to avoid using exceptions/errors as normal flow control. However, I do understand that in some cases this won't be possible, so this is just a remark, not an actual objection on my part.

But the point is, as I wrote above in my response to errors, that there's nothing in either proposal to prevent this - a provider such as SqlClient can support continuing after error just fine. If we follow #28633, that would mean that the single reader returned from DbCommandSet is "resilient" with regard to errors; calling NextResult() may throw (since that resultset errored), but the next one would work. I don't see how having multiple readers makes things better or worse than having a single reader - it just seems orthogonal. So there's no reason your scenarios wouldn't work just fine on #28633. In fact, they don't seem to have actual variable numbers of resultsets :)
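
A sketch of that "resilient reader" behavior (assuming, as discussed, a provider that allows continuing past a failed command's resultset):

// Sketch: iterate all resultsets of a flattened batch, tolerating per-command errors.
using (var reader = dbSet.ExecuteReader())
{
   var more = true;
   while (more)
   {
      try
      {
         while (reader.Read()) { /* process current resultset */ }
         more = reader.NextResult();
      }
      catch (DbException)
      {
         // This resultset's command failed; assume the provider lets us move on.
         more = reader.NextResult();
      }
   }
}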

In essence, the developer decides that several distinct commands would be more efficiently executed if they were multiplexed together for transmission to the database server.

One note - I'd recommend avoiding the use of the term "multiplexing" here - multiplexing usually means that more than one producer is producing SQL statements to be executed at more or less the same time, and all these go through the same connection - take a look at npgsql/npgsql#1982 for some thoughts on this in the context of the Npgsql provider specifically. I think I understand where this is coming from - SQL Server uses "batch" in a way that corresponds to multiple statements in a single command - so there's an ambiguity. So to avoid any doubt, when I write batching I'm referring to batching commands, not statements.

On specific points

Batching is already used in many applications, by concatenating SQL statements. Your proposal would make it harder to port to the new API, as multiple readers would now be needed where only one was previously needed.

Actually, only commands (a.k.a. batches) would have their results segmented into separate data readers. However, if a single command returns multiple result sets (e.g. because several SQL statements are concatenated together), nothing changes. Each of those result sets would still be returned by the same data reader (identical to how DbCommand handles this).

Right, but the point is that if an application is already using concatenation batching - which is likely if the user is already interested in batching/perf - then #28633 makes porting to the new DbCommandSet API easier (since a single reader is maintained).

Second, once a structured batching API is introduced, there shouldn't really be a reason for users to continue using the legacy, concatenation batching "API". This isn't to say it shouldn't work anymore - providers can continue supporting concatenation batching even within a structured batch ("batching within batching") - but at the moment I simply don't see a compelling reason for users to do this.

I’m not so sure. :-) In simpler situations where using a single DbCommand satisfies all of a consumer’s needs, I suspect there’s a good chance consumers will keep concatenating. After all, in the more basic scenarios, using a command set adds complexity without offering a measurable advantage (imagine a train of thought of: why create a command set instance and separate command instances for each statement if putting them all in the same string works fine?). If consumers keep doing this, then some of those concatenated-statement commands will find their way into command sets.

Also, there are advanced situations where command sets will be most helpful but multiple statements need to be placed in a single command batch within that set to achieve the desired effect (e.g. executing an insert into several related tables where the key from the first table needs to be used as a foreign key reference by the others).

As I wrote above, I'm not proposing that we disallow concatenation batching - a provider is free to continue allowing it. At least for Npgsql and SqlClient, we're assuming that there are some significant perf advantages to command batching as opposed to concatenation batching, and hopefully users will understand that. In any case, this doesn't really seem to matter in the context of this discussion.

Maybe this varies client by client. With SqlClient, I believe only KeyInfo and SchemaOnly change the query, while SingleResult, SingleRow and SequentialAccess come into play in the data reader (e.g. ref1, ref2).

I definitely don't know anything about the SqlClient behavior (and don't have enough time to dive in), but if SingleResult and SingleRow don't actually reduce the data transferred over the wire, then I'd say they're largely useless... Skipping rows (or resultsets) beyond the first one can easily be done by the user themselves without any help from the driver (@Wraith2 or @divega may be able to provide more info on the actual SqlClient behavior). In any case, there are other providers out there where CommandBehavior does impact what is sent out, so the API needs to allow that.

SequentialAccess is indeed a different case as it's not meant to have an effect on what data is transferred, only on how it's accessed.

@bgribaudo
Contributor Author

@roji

In any case, there are other providers out there where CommandBehavior does impact what is sent out, so the API needs to allow that.

I agree. :-) I was trying to allow for this by "If a reading-specific behavior is passed when a command is added or an execution-level behavior is passed during reading, an exception should be thrown." In essence, each provider is responsible for deciding which behaviors are appropriate at which point and should then throw if a behavior is specified when it was inappropriate.

As an alternative, the API could be simplified so that specifying behavior is only allowed when adding a command to the set (e.g. remove the option of specifying it when requesting a reader).

@roji
Member

roji commented Mar 8, 2019

In any case, there are other providers out there where CommandBehavior does impact what is sent out, so the API needs to allow that.

I agree. :-) I was trying to allow for this by "If a reading-specific behavior is passed when a command is added or an execution-level behavior is passed during reading, an exception should be thrown." In essence, each provider is responsible for deciding which behaviors are appropriate at which point and should then throw if a behavior is specified when it was inappropriate.

Understood. I don't think it's a good way forward to allow this to be specified late if many main use-cases wouldn't be supported by it. More importantly, at least in my mind I'm not convinced that a multiple reader approach (as proposed in this issue) is the right way forward, and this question of where to specify the command behavior (execution-time or reader-get-time) is only relevant within this proposal - in #28633 only one reader is returned anyway.

Regardless, the value/desirability of allowing multiple command behaviors within a single batch is still relevant within the scope of #28633.

@roji
Member

roji commented Mar 12, 2019

@bgribaudo as things currently stand I don't think it would make sense to go with this proposal, for the reasons detailed above - do you plan on continuing the discussion here? Note that #28633 still has an open question of allowing behavior to be specified per command, and that I've adopted your suggestion for a lightweight DbCommand-replacement in that proposal.

@bgribaudo
Contributor Author

Hi @roji! Thanks for taking the time to discuss this topic.

I agree that the cases we’ve been talking about aren’t essential to the major use case for the command sets idea (though I’d submit they are valid minor use cases of that idea) and that there seems to always be some kind of workaround (generating placeholder result sets, schema detection, etc.) to make the minor use cases workable. (Actually, I don’t think we’ve ever disagreed on either of these points—just on how to factor in the significance of the minor use cases.)

For the whopping grand 2 cents it’s worth, when one of the minor cases comes into play, I still don’t care for how #28633 necessitates extra development work and possibly database/transport costs (e.g. when placeholder queries/result sets are involved) when the database server is already providing the needed information for free—it’s just not exposed by the API. Without access to that info, it’s not always possible to provide generic workarounds for these minor cases, meaning that each developer who encounters one of them needs to become vested in the details of the situation and then reinvent the wheel to solve it.

In contrast, with something like this proposal, the challenges associated with the minor use cases (almost completely) go away because it’s possible to run just about any valid query in a command set without modification or special handling. For the major use case, which doesn’t need result sets segmented by command, a couple of helper or extension methods could be provided that return flattened results—this way, those who want the flattening have it, while those who need the lower-level details don’t each have to reinvent the wheel.

Again, for the two cents it’s worth… :-)

@roji
Member

roji commented Mar 12, 2019

Thanks for the response @bgribaudo, I appreciate the proposal and the arguments. I think we're more or less in sync about everything - the difference is in how much complexity I consider the non-flattened API to add (i.e. a bit more than you) and the rareness of the cases where I consider non-flattened resultsets to be useful (again, probably quite a bit more than you :)).

It's always possible to allow all options by providing multiple layers for users to choose from, but that also has its complexity price as users have to wrestle with understanding when and why the low-level layer is useful, etc. I think we should really aim for as much simplicity as possible without sacrificing actual functionality. In some way, ADO.NET already suffers from being a "low-level API" that doesn't look like it was conceived for end-users (as opposed to ORMs and other layers). Hopefully we can change that somewhat.

In any case, thanks again for engaging and for all of your ideas! I'll go ahead and close this for now although if you wish to continue the conversation please don't hesitate.

@roji roji closed this as completed Mar 12, 2019
@bgribaudo
Contributor Author

Thanks for the good conversation, @roji!

@roji
Member

roji commented Mar 29, 2019

Likewise @bgribaudo, hope we'll do it again soon!

@msftgits msftgits transferred this issue from dotnet/corefx Feb 1, 2020
@msftgits msftgits added this to the 3.0 milestone Feb 1, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 14, 2020