Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

onurogurtani · 2024-10-18T06:29:24Z

Hello,

I am using CAP in a .NET Core project, and I would like to raise a concern regarding the SQL queries generated in the IDataStorage implementations for PostgreSQL and MSSQL. Currently, we have queries that use the OR operator, which looks like this:

PostgreSQL:

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND (
    ("ExpiresAt" < @TwoMinutesLater AND "StatusName" = '{StatusName.Delayed}') 
    OR 
    ("ExpiresAt" < @OneMinutesAgo AND "StatusName" = '{StatusName.Queued}')
) 
FOR UPDATE SKIP LOCKED;

MSSQL:

SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND (
    (ExpiresAt < @TwoMinutesLater AND StatusName = '{StatusName.Delayed}') 
    OR 
    (ExpiresAt < @OneMinutesAgo AND StatusName = '{StatusName.Queued}')
);

Our DBA team has suggested replacing the OR operator with UNION in these queries to improve performance, especially when dealing with large datasets. This is due to the fact that the OR operator can sometimes lead to suboptimal index usage and degrade query performance.

Here’s how the updated queries in the IDataStorage implementations would look with UNION:

PostgreSQL:

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND "ExpiresAt" < @TwoMinutesLater 
AND "StatusName" = '{StatusName.Delayed}'
FOR UPDATE SKIP LOCKED
UNION ALL
SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND "ExpiresAt" < @OneMinutesAgo 
AND "StatusName" = '{StatusName.Queued}'
FOR UPDATE SKIP LOCKED;

MSSQL:

SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND ExpiresAt < @TwoMinutesLater 
AND StatusName = '{StatusName.Delayed}'
UNION ALL
SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND ExpiresAt < @OneMinutesAgo 
AND StatusName = '{StatusName.Queued}';

This change could potentially improve the performance of the queries in the IDataStorage implementations, and I'd appreciate it if you could consider this suggestion. It would be great to see this improvement in future releases of the CAP library.

Thank you!

Best regards,
Onur

The text was updated successfully, but these errors were encountered:

yang-xiaodong · 2024-10-18T08:53:52Z

Hello @onurogurtani ,

Thank you for your feedback. While using OR might prevent the query optimizer from effectively utilizing indexes, we did not create indexes for columns like StatusName when the table was initially created. The reasoning can be found in this response.

I’m reconsidering whether we should have added certain columns as indexes during table creation. In fact, most queries in IDataStorage use the majority of columns in their WHERE conditions. So, which columns should we designate as indexes to optimize performance? At the moment, it seems that StatusName, ExpiresAt, and Version are suitable candidates for indexing.

What do you think about this? @maisa92 @xiangxiren

Of course, replacing OR with UNION ALL or other alternatives is beneficial regardless, I will be making a PR for this, @onurogurtani if you are interested in this please let me know!

Thanks!

onurogurtani · 2024-10-18T12:05:46Z

Hello @yang-xiaodong

We are using the following two nonclustered indexes:

CREATE NONCLUSTERED INDEX [IDX_Candidate_Published_Version_ExpiresAt_StatusName] 
ON [Candidate].[Published] ([Version] ASC, [ExpiresAt] ASC, [StatusName] ASC)
INCLUDE ([Id], [Content], [Retries], [Added]);

CREATE NONCLUSTERED INDEX [IX_Published_ExpiresAt_CT] 
ON [Candidate].[Published] ([ExpiresAt] ASC, [StatusName] ASC);

We handle around 1 million transactions daily. Success records are deleted after one day, leaving approximately 1 million records in the table at any given time. We are using 6 pods on Kubernetes, with the following CapOptions configuration:

"CapOptions": {
    "FailedRetryCount": 3,
    "CollectorCleaningInterval": 3600,
    "ConsumerThreadCount": 2,
    "ProducerThreadCount": 2
}

onurogurtani · 2024-10-25T06:53:42Z

Yang, I'm interested in this topic. Are there any updates?

yang-xiaodong · 2024-10-25T07:34:05Z

Hi @onurogurtani ,

I am working on this PR and need to point out that in PostgreSQL FOR UPDATE SKIP LOCKED cannot work together with UNION ALL, and in MySQL, it cannot utilize indexes.

So, in PostgreSQL and MySQL we will still continue to using the following SQL.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND (
    ("ExpiresAt" < @TwoMinutesLater AND "StatusName" = '{StatusName.Delayed}') 
    OR 
    ("ExpiresAt" < @OneMinutesAgo AND "StatusName" = '{StatusName.Queued}')
) 
FOR UPDATE SKIP LOCKED;

onurogurtani · 2024-10-25T07:44:57Z

Yang, would you like to try this? We can solve the problem you mentioned using a subquery:↳

Another alternative could be to completely separate these two queries.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
FROM (
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @TwoMinutesLater
      AND "StatusName" = '{StatusName.Delayed}'
    UNION ALL
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @OneMinutesAgo
      AND "StatusName" = '{StatusName.Queued}'
) AS sub
FOR UPDATE SKIP LOCKED;

yang-xiaodong · 2024-10-25T08:00:47Z

Yang, would you like to try this? We can solve the problem you mentioned using a subquery:↳

Another alternative could be to completely separate these two queries.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
FROM (
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @TwoMinutesLater
      AND "StatusName" = '{StatusName.Delayed}'
    UNION ALL
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @OneMinutesAgo
      AND "StatusName" = '{StatusName.Queued}'
) AS sub
FOR UPDATE SKIP LOCKED;

This has an error in PostgreSQL :

ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

It can works in MySQL but does not utilize indexes.

{
  "query_block": {
    "select_id": 1,
    "message": "no matching row in const table",
    "table": {
      "materialized_from_subquery": {
        "using_temporary_table": true,
        "dependent": false,
        "cacheable": true,
        "query_block": {
          "union_result": {
            "using_temporary_table": false,
            "query_specifications": [
              {
                "dependent": false,
                "cacheable": true,
                "query_block": {
                  "select_id": 2,
                  "message": "Impossible WHERE"
                }
              },
              {
                "dependent": false,
                "cacheable": true,
                "query_block": {
                  "select_id": 3,
                  "message": "Impossible WHERE"
                }
              }
            ]
          }
        }
      }
    }
  }
}

yang-xiaodong · 2024-10-25T08:05:26Z

SELECT `Id`,`Content`,`Retries`,`Added`,`ExpiresAt` FROM `cap.published` WHERE `Version`='v1'
AND (`ExpiresAt`< '2024/10/25 14:09:48' AND `StatusName` = 'Delayed') OR (`ExpiresAt`< '2024/10/25 15:09:48' AND `StatusName` = 'Queued') FOR UPDATE SKIP LOCKED;

The explain for SQL in MySQL:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "1.41"
    },
    "table": {
      "table_name": "cap.published",
      "access_type": "range",
      "possible_keys": [
        "IX_Version_ExpiresAt_StatusName",
        "IX_ExpiresAt_StatusName"
      ],
      "key": "IX_ExpiresAt_StatusName",
      "used_key_parts": [
        "ExpiresAt"
      ],
      "key_length": "168",
      "rows_examined_per_scan": 2,
      "rows_produced_per_join": 2,
      "filtered": "100.00",
      "index_condition": "(((`cap`.`cap.published`.`StatusName` = 'Delayed') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 14:09:48')) or ((`cap`.`cap.published`.`StatusName` = 'Queued') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 15:09:48')))",
      "cost_info": {
        "read_cost": "1.21",
        "eval_cost": "0.20",
        "prefix_cost": "1.41",
        "data_read_per_join": "2K"
      },
      "used_columns": [
        "Id",
        "Version",
        "Content",
        "Retries",
        "Added",
        "ExpiresAt",
        "StatusName"
      ],
      "attached_condition": "(((`cap`.`cap.published`.`StatusName` = 'Delayed') and (`cap`.`cap.published`.`Version` = 'v1') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 14:09:48')) or ((`cap`.`cap.published`.`StatusName` = 'Queued') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 15:09:48')))"
    }
  }
}

onurogurtani · 2024-10-25T08:12:12Z

I think that by running the two queries independently or separating them into different methods, we can effectively overcome the limitations with FOR UPDATE SKIP LOCKED in PostgreSQL when using UNION ALL. This strategy also allows for better index utilization in both PostgreSQL and MySQL, improving overall performance.

yang-xiaodong · 2024-10-25T09:32:52Z

We have applied the same query conditions on both sides of the or and have created a composite index. I believe the current query does not need further adjustment.

onurogurtani · 2024-10-25T09:36:30Z

I'm excited to try it out; thank you!

… the query SQL to utilize the optimized indexes. (#1599)

… the query SQL to utilize the optimized indexes. (#1599) (#1601)

onurogurtani · 2024-10-30T07:41:55Z

Hi Yang, I think the development is complete. How can I test it? Are you planning to release a preview package?

yang-xiaodong · 2024-10-31T01:52:06Z

@onurogurtani Version 8.3.1-preview-247022046 is released to NuGet now.

onurogurtani · 2024-10-31T06:21:53Z

Thank you :)

yang-xiaodong · 2024-11-05T04:18:00Z

Fixed in version 8.3.1

malcolby2333 · 2024-11-06T11:14:50Z

@yang-xiaodong We encountered this error after upgrading to 8.3.1.

Npgsql.PostgresException (0x80004005): 54000: index row size 5984 exceeds btree version 4 maximum 2704 for index "idx_published_Version_ExpiresAt_StatusName"

DETAIL: Detail redacted as it may contain sensitive data. Specify 'Include Error Detail' in the connection string to include this information.
   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at DotNetCore.CAP.PostgreSql.DbConnectionExtensions.ExecuteNonQueryAsync(DbConnection connection, String sql, DbTransaction transaction, Object[] sqlParams)
   at DotNetCore.CAP.PostgreSql.DbConnectionExtensions.ExecuteNonQueryAsync(DbConnection connection, String sql, DbTransaction transaction, Object[] sqlParams)
   at DotNetCore.CAP.PostgreSql.PostgreSqlDataStorage.StoreMessageAsync(String name, Message content, Object transaction)
   at DotNetCore.CAP.PostgreSql.PostgreSqlDataStorage.StoreMessageAsync(String name, Message content, Object transaction)
   at DotNetCore.CAP.Internal.CapPublisher.PublishInternalAsync[T](String name, T value, IDictionary`2 headers, Nullable`1 delayTime, CancellationToken cancellationToken)

green-eyed · 2024-11-22T09:19:14Z

CREATE NONCLUSTERED INDEX [IDX_Candidate_Published_Version_ExpiresAt_StatusName] 
ON [Candidate].[Published] ([Version] ASC, [ExpiresAt] ASC, [StatusName] ASC)
INCLUDE ([Id], [Content], [Retries], [Added]);

Include almost the table columns in index? Why? "INCLUDE" looks like something harmful.

yang-xiaodong added the enhancement label Oct 18, 2024

yang-xiaodong added a commit that referenced this issue Oct 25, 2024

Add default indexes to the database tables during creation and adjust…

9e2ea04

… the query SQL to utilize the optimized indexes. (#1599)

yang-xiaodong mentioned this issue Oct 25, 2024

Add default indexes to the database tables during creation #1601

Merged

4 tasks

yang-xiaodong added a commit that referenced this issue Oct 28, 2024

Add default indexes to the database tables during creation and adjust…

1cafc93

… the query SQL to utilize the optimized indexes. (#1599) (#1601)

yang-xiaodong added the fixed label Oct 28, 2024

yang-xiaodong closed this as completed Nov 5, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

onurogurtani commented Oct 18, 2024

yang-xiaodong commented Oct 18, 2024

onurogurtani commented Oct 18, 2024

onurogurtani commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024 •

edited

Loading

onurogurtani commented Oct 25, 2024 •

edited

Loading

yang-xiaodong commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024

onurogurtani commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024

onurogurtani commented Oct 25, 2024

onurogurtani commented Oct 30, 2024

yang-xiaodong commented Oct 31, 2024

onurogurtani commented Oct 31, 2024

yang-xiaodong commented Nov 5, 2024

malcolby2333 commented Nov 6, 2024

green-eyed commented Nov 22, 2024

Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

Comments

onurogurtani commented Oct 18, 2024

yang-xiaodong commented Oct 18, 2024

onurogurtani commented Oct 18, 2024

onurogurtani commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024 • edited Loading

onurogurtani commented Oct 25, 2024 • edited Loading

yang-xiaodong commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024

onurogurtani commented Oct 25, 2024

yang-xiaodong commented Oct 25, 2024

onurogurtani commented Oct 25, 2024

onurogurtani commented Oct 30, 2024

yang-xiaodong commented Oct 31, 2024

onurogurtani commented Oct 31, 2024

yang-xiaodong commented Nov 5, 2024

malcolby2333 commented Nov 6, 2024

green-eyed commented Nov 22, 2024

yang-xiaodong commented Oct 25, 2024 •

edited

Loading

onurogurtani commented Oct 25, 2024 •

edited

Loading