sql-stats-compaction: failing with TransactionRetryError #108936

Closed
j82w opened this issue Aug 17, 2023 · 1 comment · Fixed by #108947
Assignees
Labels
A-cluster-observability Related to cluster observability C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.

Comments

@j82w
Contributor

j82w commented Aug 17, 2023

Description:
The sql-stats-compaction job is failing with a TransactionRetryError. This is caused by the internal executor using zero values for its session settings rather than the cluster defaults. The result is SET reorder_joins_limit = 0;, which causes the sql-stats-compaction delete statement to do a full scan. The full scan makes the query take a long time, which in turn causes other queries to conflict with it.
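For context, the session variable in question can be inspected and toggled like this (a minimal sketch; the 0 mimics the zero value the internal executor applied, while the cluster default for reorder_joins_limit is 8):

```sql
SHOW reorder_joins_limit;     -- cluster default: 8
SET reorder_joins_limit = 0;  -- disables join reordering for this session
-- With reordering disabled, the optimizer keeps the join order as written,
-- which for the compaction DELETE left the large input as a FULL SCAN.
RESET reorder_joins_limit;    -- back to the cluster default
```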

Error:
TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key

Got the bundle:

SELECT crdb_internal.request_statement_bundle('DELETE FROM system.statement_statistics WHERE (aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id) IN (SELECT aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id FROM system.statement_statistics WHERE (crdb_internal_aggregated_ts_app_name_fingerprint_id_node_id_plan_hash_transaction_fingerprint_id_shard_8 = $1) AND (aggregated_ts < $1) ORDER BY aggregated_ts ASC LIMIT $1) RETURNING aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id', 0, 0::INTERVAL, 0::INTERVAL);
Explain plan from bundle:
planning time: 8ms
execution time: 49.7s
distribution: local
vectorized: true
rows read from KV: 1,193,100 (3.8 GiB, 390 gRPC calls)
cumulative time spent in KV: 26.8s
maximum memory usage: 34 MiB
network usage: 0 B (0 messages)
regions: ***

• delete
│ estimated row count: 10,000
│ from: statement_statistics
│ auto commit
│
└── • render
    │
    └── • hash join (semi)
        │ nodes: n108
        │ regions: use1
        │ actual row count: 10,000
        │ estimated max memory allocated: 7.1 MiB
        │ estimated max sql temp disk usage: 0 B
        │ estimated row count: 10,000
        │ equality: (aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id) = (aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id)
        │ left cols are key
        │ right cols are key
        │
        ├── • scan
        │     nodes: n108
        │     regions: use1
        │     actual row count: 1,183,100
        │     KV time: 26.5s
        │     KV contention time: 0µs
        │     KV rows read: 1,183,100
        │     KV bytes read: 3.8 GiB
        │     KV gRPC calls: 386
        │     estimated max memory allocated: 14 MiB
        │     estimated row count: 1,121,385 (100% of the table; stats collected 60 minutes ago)
        │     table: statement_statistics@primary
        │     spans: FULL SCAN
        │
        └── • scan
              nodes: n108
              regions: use1
              actual row count: 10,000
              KV time: 305ms
              KV contention time: 0µs
              KV rows read: 10,000
              KV bytes read: 32 MiB
              KV gRPC calls: 4
              estimated max memory allocated: 10 MiB
              estimated row count: 10,000 (0.89% of the table; stats collected 60 minutes ago)
              table: statement_statistics@primary
              spans: [/0 - /0/'2023-08-15 18:59:59.999999+00']
              limit: 10000

A fix to use the correct default values instead of 0 was made in #101486, but it is not being backported.

The solution is to change the query to avoid the join, and thus the full scan:

DELETE FROM system.statement_statistics
WHERE crdb_internal_aggregated_ts_app_name_fingerprint_id_node_id_plan_hash_transaction_fingerprint_id_shard_8 = 0:::INT8 AND aggregated_ts < '2023-08-15 19:00:00+00':::TIMESTAMPTZ
ORDER BY aggregated_ts ASC
LIMIT 10000:::INT8
RETURNING aggregated_ts, fingerprint_id, transaction_fingerprint_id, plan_hash, app_name, node_id;

  project
   └── delete statement_statistics
        └── project
             ├── scan statement_statistics
             │    ├── constraint: /33/23/24/25/26/27/28: [/0 - /0/'2023-08-15 18:59:59.999999+00']
             │    └── limit: 10000
             └── projections
                  ├── app_name NOT LIKE '$ internal%'
                  └── (statistics->'statistics')->'indexes'
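The fixed query above constrains a single bucket of the hash-sharded index. A hypothetical sketch of how the job could cover the whole table (assuming the standard 8 shard buckets and an external loop over them; the parameter names are illustrative, not the actual implementation):

```sql
-- Run once per shard bucket ($1 in 0..7), batching by $3 rows per pass:
DELETE FROM system.statement_statistics
WHERE crdb_internal_aggregated_ts_app_name_fingerprint_id_node_id_plan_hash_transaction_fingerprint_id_shard_8 = $1
  AND aggregated_ts < $2  -- retention cutoff timestamp
ORDER BY aggregated_ts ASC
LIMIT $3                  -- batch size, e.g. 10000
RETURNING aggregated_ts, fingerprint_id, transaction_fingerprint_id,
          plan_hash, app_name, node_id;
```

Because the shard column is the leading index column, each pass stays a constrained scan rather than a full scan.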

Jira issue: CRDB-30707

@j82w j82w added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-cluster-observability A-cluster-observability Related to cluster observability labels Aug 17, 2023
@j82w j82w self-assigned this Aug 17, 2023
craig bot pushed a commit that referenced this issue Aug 18, 2023
108947: sql: fix sql compaction job full scan r=j82w a=j82w

The sql-stats-compaction job is failing with a TransactionRetryError. This is caused by the internal executor using zero values for its session settings rather than the cluster defaults. The result is `SET reorder_joins_limit = 0;`, which causes the `sql-stats-compaction` delete statement to do a full scan. The full scan makes the query take a long time, which in turn causes other queries to conflict with it.

Error:
`TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_SERIALIZABLE - failed preemptive refresh due to a conflict: committed value on key`

A fix to use the correct default values instead of 0 was made in #101486.

The solution is to change the query to avoid the join, and thus the full scan.

Fixes: #108936

Release note (sql change): Optimized the sql-stats-compaction job's delete query to avoid a full scan. This helps avoid the TransactionRetryError that can cause the job to fail.

Co-authored-by: j82w <[email protected]>
@craig craig bot closed this as completed in 91f452f Aug 18, 2023
@pingxianga

Which default value, specifically, causes it to perform a full scan?
