Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance Improvement Suggestion: Replace OR with UNION in IDataStorage SQL Queries for PostgreSQL and MSSQL #1599

Closed
onurogurtani opened this issue Oct 18, 2024 · 16 comments

Comments

@onurogurtani
Copy link

Hello,

I am using CAP in a .NET Core project, and I would like to raise a concern regarding the SQL queries generated in the IDataStorage implementations for PostgreSQL and MSSQL. Currently, we have queries that use the OR operator, which looks like this:

PostgreSQL:

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND (
    ("ExpiresAt" < @TwoMinutesLater AND "StatusName" = '{StatusName.Delayed}') 
    OR 
    ("ExpiresAt" < @OneMinutesAgo AND "StatusName" = '{StatusName.Queued}')
) 
FOR UPDATE SKIP LOCKED;

MSSQL:

SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND (
    (ExpiresAt < @TwoMinutesLater AND StatusName = '{StatusName.Delayed}') 
    OR 
    (ExpiresAt < @OneMinutesAgo AND StatusName = '{StatusName.Queued}')
);

Our DBA team has suggested replacing the OR operator with UNION in these queries to improve performance, especially when dealing with large datasets. This is due to the fact that the OR operator can sometimes lead to suboptimal index usage and degrade query performance.

Here’s how the updated queries in the IDataStorage implementations would look with UNION:

PostgreSQL:

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND "ExpiresAt" < @TwoMinutesLater 
AND "StatusName" = '{StatusName.Delayed}'
FOR UPDATE SKIP LOCKED
UNION ALL
SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND "ExpiresAt" < @OneMinutesAgo 
AND "StatusName" = '{StatusName.Queued}'
FOR UPDATE SKIP LOCKED;

MSSQL:

SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND ExpiresAt < @TwoMinutesLater 
AND StatusName = '{StatusName.Delayed}'
UNION ALL
SELECT Id, Content, Retries, Added, ExpiresAt 
FROM {_pubName} WITH (UPDLOCK, READPAST) 
WHERE Version = @Version 
AND ExpiresAt < @OneMinutesAgo 
AND StatusName = '{StatusName.Queued}';

This change could potentially improve the performance of the queries in the IDataStorage implementations, and I'd appreciate it if you could consider this suggestion. It would be great to see this improvement in future releases of the CAP library.

Thank you!

Best regards,
Onur

@yang-xiaodong
Copy link
Member

Hello @onurogurtani ,

Thank you for your feedback. While using OR might prevent the query optimizer from effectively utilizing indexes, we did not create indexes for columns like StatusName when the table was initially created. The reasoning can be found in this response.

I’m reconsidering whether we should have added certain columns as indexes during table creation. In fact, most queries in IDataStorage use the majority of columns in their WHERE conditions. So, which columns should we designate as indexes to optimize performance? At the moment, it seems that StatusName, ExpiresAt, and Version are suitable candidates for indexing.

What do you think about this? @maisa92 @xiangxiren

Of course, replacing OR with UNION ALL or other alternatives is beneficial regardless, I will be making a PR for this, @onurogurtani if you are interested in this please let me know!

Thanks!

@onurogurtani
Copy link
Author

Hello @yang-xiaodong

We are using the following two nonclustered indexes:

CREATE NONCLUSTERED INDEX [IDX_Candidate_Published_Version_ExpiresAt_StatusName] 
ON [Candidate].[Published] ([Version] ASC, [ExpiresAt] ASC, [StatusName] ASC)
INCLUDE ([Id], [Content], [Retries], [Added]);

CREATE NONCLUSTERED INDEX [IX_Published_ExpiresAt_CT] 
ON [Candidate].[Published] ([ExpiresAt] ASC, [StatusName] ASC);

We handle around 1 million transactions daily. Success records are deleted after one day, leaving approximately 1 million records in the table at any given time. We are using 6 pods on Kubernetes, with the following CapOptions configuration:

"CapOptions": {
    "FailedRetryCount": 3,
    "CollectorCleaningInterval": 3600,
    "ConsumerThreadCount": 2,
    "ProducerThreadCount": 2
}

@onurogurtani
Copy link
Author

Yang, I'm interested in this topic. Are there any updates?

@yang-xiaodong
Copy link
Member

yang-xiaodong commented Oct 25, 2024

Hi @onurogurtani ,

I am working on this PR and need to point out that in PostgreSQL FOR UPDATE SKIP LOCKED cannot work together with UNION ALL, and in MySQL, it cannot utilize indexes.

So, in PostgreSQL and MySQL we will still continue to using the following SQL.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt" 
FROM {_pubName} 
WHERE "Version" = @Version 
AND (
    ("ExpiresAt" < @TwoMinutesLater AND "StatusName" = '{StatusName.Delayed}') 
    OR 
    ("ExpiresAt" < @OneMinutesAgo AND "StatusName" = '{StatusName.Queued}')
) 
FOR UPDATE SKIP LOCKED;

@onurogurtani
Copy link
Author

onurogurtani commented Oct 25, 2024

Yang, would you like to try this? We can solve the problem you mentioned using a subquery:↳

Another alternative could be to completely separate these two queries.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
FROM (
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @TwoMinutesLater
      AND "StatusName" = '{StatusName.Delayed}'
    UNION ALL
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @OneMinutesAgo
      AND "StatusName" = '{StatusName.Queued}'
) AS sub
FOR UPDATE SKIP LOCKED;

@yang-xiaodong
Copy link
Member

Yang, would you like to try this? We can solve the problem you mentioned using a subquery:↳

Another alternative could be to completely separate these two queries.

SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
FROM (
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @TwoMinutesLater
      AND "StatusName" = '{StatusName.Delayed}'
    UNION ALL
    SELECT "Id", "Content", "Retries", "Added", "ExpiresAt"
    FROM {_pubName}
    WHERE "Version" = @Version
      AND "ExpiresAt" < @OneMinutesAgo
      AND "StatusName" = '{StatusName.Queued}'
) AS sub
FOR UPDATE SKIP LOCKED;

This has an error in PostgreSQL :

ERROR: FOR UPDATE is not allowed with UNION/INTERSECT/EXCEPT

It can works in MySQL but does not utilize indexes.

{
  "query_block": {
    "select_id": 1,
    "message": "no matching row in const table",
    "table": {
      "materialized_from_subquery": {
        "using_temporary_table": true,
        "dependent": false,
        "cacheable": true,
        "query_block": {
          "union_result": {
            "using_temporary_table": false,
            "query_specifications": [
              {
                "dependent": false,
                "cacheable": true,
                "query_block": {
                  "select_id": 2,
                  "message": "Impossible WHERE"
                }
              },
              {
                "dependent": false,
                "cacheable": true,
                "query_block": {
                  "select_id": 3,
                  "message": "Impossible WHERE"
                }
              }
            ]
          }
        }
      }
    }
  }
}

@yang-xiaodong
Copy link
Member

SELECT `Id`,`Content`,`Retries`,`Added`,`ExpiresAt` FROM `cap.published` WHERE `Version`='v1'
AND (`ExpiresAt`< '2024/10/25 14:09:48' AND `StatusName` = 'Delayed') OR (`ExpiresAt`< '2024/10/25 15:09:48' AND `StatusName` = 'Queued') FOR UPDATE SKIP LOCKED;

The explain for SQL in MySQL:

{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "1.41"
    },
    "table": {
      "table_name": "cap.published",
      "access_type": "range",
      "possible_keys": [
        "IX_Version_ExpiresAt_StatusName",
        "IX_ExpiresAt_StatusName"
      ],
      "key": "IX_ExpiresAt_StatusName",
      "used_key_parts": [
        "ExpiresAt"
      ],
      "key_length": "168",
      "rows_examined_per_scan": 2,
      "rows_produced_per_join": 2,
      "filtered": "100.00",
      "index_condition": "(((`cap`.`cap.published`.`StatusName` = 'Delayed') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 14:09:48')) or ((`cap`.`cap.published`.`StatusName` = 'Queued') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 15:09:48')))",
      "cost_info": {
        "read_cost": "1.21",
        "eval_cost": "0.20",
        "prefix_cost": "1.41",
        "data_read_per_join": "2K"
      },
      "used_columns": [
        "Id",
        "Version",
        "Content",
        "Retries",
        "Added",
        "ExpiresAt",
        "StatusName"
      ],
      "attached_condition": "(((`cap`.`cap.published`.`StatusName` = 'Delayed') and (`cap`.`cap.published`.`Version` = 'v1') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 14:09:48')) or ((`cap`.`cap.published`.`StatusName` = 'Queued') and (`cap`.`cap.published`.`ExpiresAt` < TIMESTAMP'2024-10-25 15:09:48')))"
    }
  }
}

@onurogurtani
Copy link
Author

I think that by running the two queries independently or separating them into different methods, we can effectively overcome the limitations with FOR UPDATE SKIP LOCKED in PostgreSQL when using UNION ALL. This strategy also allows for better index utilization in both PostgreSQL and MySQL, improving overall performance.

@yang-xiaodong
Copy link
Member

We have applied the same query conditions on both sides of the or and have created a composite index. I believe the current query does not need further adjustment.

@onurogurtani
Copy link
Author

I'm excited to try it out; thank you!

yang-xiaodong added a commit that referenced this issue Oct 25, 2024
… the query SQL to utilize the optimized indexes. (#1599)
yang-xiaodong added a commit that referenced this issue Oct 28, 2024
… the query SQL to utilize the optimized indexes. (#1599) (#1601)
@onurogurtani
Copy link
Author

Hi Yang, I think the development is complete. How can I test it? Are you planning to release a preview package?

@yang-xiaodong
Copy link
Member

@onurogurtani Version 8.3.1-preview-247022046 is released to NuGet now.

@onurogurtani
Copy link
Author

Thank you :)

@yang-xiaodong
Copy link
Member

Fixed in version 8.3.1

@malcolby2333
Copy link

@yang-xiaodong We encountered this error after upgrading to 8.3.1.

Npgsql.PostgresException (0x80004005): 54000: index row size 5984 exceeds btree version 4 maximum 2704 for index "idx_published_Version_ExpiresAt_StatusName"

DETAIL: Detail redacted as it may contain sensitive data. Specify 'Include Error Detail' in the connection string to include this information.
   at Npgsql.Internal.NpgsqlConnector.ReadMessageLong(Boolean async, DataRowLoadingMode dataRowLoadingMode, Boolean readingNotifications, Boolean isReadingPrependedMessage)
   at System.Runtime.CompilerServices.PoolingAsyncValueTaskMethodBuilder`1.StateMachineBox`1.System.Threading.Tasks.Sources.IValueTaskSource<TResult>.GetResult(Int16 token)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlDataReader.NextResult(Boolean async, Boolean isConsuming, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteReader(Boolean async, CommandBehavior behavior, CancellationToken cancellationToken)
   at Npgsql.NpgsqlCommand.ExecuteNonQuery(Boolean async, CancellationToken cancellationToken)
   at DotNetCore.CAP.PostgreSql.DbConnectionExtensions.ExecuteNonQueryAsync(DbConnection connection, String sql, DbTransaction transaction, Object[] sqlParams)
   at DotNetCore.CAP.PostgreSql.DbConnectionExtensions.ExecuteNonQueryAsync(DbConnection connection, String sql, DbTransaction transaction, Object[] sqlParams)
   at DotNetCore.CAP.PostgreSql.PostgreSqlDataStorage.StoreMessageAsync(String name, Message content, Object transaction)
   at DotNetCore.CAP.PostgreSql.PostgreSqlDataStorage.StoreMessageAsync(String name, Message content, Object transaction)
   at DotNetCore.CAP.Internal.CapPublisher.PublishInternalAsync[T](String name, T value, IDictionary`2 headers, Nullable`1 delayTime, CancellationToken cancellationToken)

@green-eyed
Copy link

CREATE NONCLUSTERED INDEX [IDX_Candidate_Published_Version_ExpiresAt_StatusName] 
ON [Candidate].[Published] ([Version] ASC, [ExpiresAt] ASC, [StatusName] ASC)
INCLUDE ([Id], [Content], [Retries], [Added]);

Include almost the table columns in index? Why? "INCLUDE" looks like something harmful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants