statistics: clean up dropped predicate columns stats usage #53680

Merged

Rustin170506
Member

What problem does this PR solve?

Issue Number: ref #53567

Problem Summary:

We need to clean up the outdated predicate columns that have been dropped from the schema.
Otherwise, we will attempt to analyze non-existent columns.

What changed and how does it work?

  1. Added a test case to verify current behavior.
  2. Added a new function to clean up the non-existent columns before we get the predicate columns.
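The cleanup step in item 2 can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the code in this PR: `trackedColumn` and `splitDroppedColumns` are hypothetical names, and the real change works against the `mysql.column_stats_usage` table rather than in-memory slices.

```go
package main

import "fmt"

// trackedColumn is a hypothetical stand-in for one row of
// mysql.column_stats_usage; it is not a TiDB type.
type trackedColumn struct {
	TableID  int64
	ColumnID int64
}

// splitDroppedColumns partitions tracked predicate columns into those that
// still exist in the table schema and those that have been dropped.
// schemaColumnIDs holds the column IDs currently defined on the table.
func splitDroppedColumns(tracked []trackedColumn, schemaColumnIDs []int64) (live, dropped []trackedColumn) {
	exists := make(map[int64]struct{}, len(schemaColumnIDs))
	for _, id := range schemaColumnIDs {
		exists[id] = struct{}{}
	}
	for _, col := range tracked {
		if _, ok := exists[col.ColumnID]; ok {
			live = append(live, col)
		} else {
			dropped = append(dropped, col)
		}
	}
	return live, dropped
}

func main() {
	// Table 104 tracked column 2 as a predicate column, then that column
	// was dropped, leaving only column 1 in the schema.
	tracked := []trackedColumn{{TableID: 104, ColumnID: 2}}
	live, dropped := splitDroppedColumns(tracked, []int64{1})
	fmt.Println(len(live), len(dropped)) // prints "0 1"
}
```

In this picture, the `dropped` entries are the ones to delete from the usage records before the predicate columns are read, so that ANALYZE never sees a column that no longer exists.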

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

ti-chi-bot added the release-note-none, sig/planner, size/M, and size/L labels and removed the size/M label on May 30, 2024
Rustin170506 force-pushed the rustin-patch-cleanup-predicate-columns branch 4 times, most recently from 971e144 to bb85eef, on May 30, 2024 08:28
Rustin170506 force-pushed the rustin-patch-cleanup-predicate-columns branch from bb85eef to 96f1c07 on May 30, 2024 08:29
Member Author

@Rustin170506 Rustin170506 left a comment

🔢 Self-check (PR reviewed by myself and ready for feedback.)

codecov bot commented May 30, 2024

Codecov Report

Attention: Patch coverage is 71.42857%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 74.4987%. Comparing base (04c66ee) to head (96f1c07).
Report is 3 commits behind head on master.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #53680        +/-   ##
================================================
+ Coverage   74.4884%   74.4987%   +0.0102%     
================================================
  Files          1506       1506                
  Lines        357618     431433     +73815     
================================================
+ Hits         266384     321412     +55028     
- Misses        71857      90098     +18241     
- Partials      19377      19923       +546     
Flag          Coverage Δ
integration   49.3067% <0.0000%> (?)
unit          71.4443% <71.4285%> (-1.9229%) ⬇️

Components    Coverage Δ
dumpling      53.9957% <ø> (-2.3014%) ⬇️
parser        ∅ <ø> (∅)
br            50.4453% <ø> (+6.8818%) ⬆️

@Rustin170506
Member Author

Tested locally:

  1. Start the TiDB cluster with the patched binary: tiup playground nightly --db.binpath /Users/hi-rustin/vsc/tidb/bin/tidb-server
  2. Create a new table:
mysql> create table t (a int, b int);
Query OK, 0 rows affected (0.04 sec)

mysql> insert into t values (1, 1), (2, 2), (3, 3);
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0
  3. Enable predicate column tracking:
mysql> set global tidb_enable_column_tracking = 1;
Query OK, 0 rows affected (0.02 sec)

mysql> select @@tidb_enable_column_tracking;
+-------------------------------+
| @@tidb_enable_column_tracking |
+-------------------------------+
|                             1 |
+-------------------------------+
1 row in set (0.00 sec)
  4. Select data with predicates:
mysql> select * from t where b > 1;
+------+------+
| a    | b    |
+------+------+
|    2 |    2 |
|    3 |    3 |
+------+------+
2 rows in set (0.00 sec)
  5. Wait for 2 minutes, then check the stats usage:
mysql> SELECT * FROM MYSQL.COLUMN_STATS_USAGE;
+----------+-----------+---------------------+------------------+
| table_id | column_id | last_used_at        | last_analyzed_at |
+----------+-----------+---------------------+------------------+
|      104 |         2 | 2024-05-30 16:47:18 | NULL             |
+----------+-----------+---------------------+------------------+
1 row in set (0.01 sec)
  6. Analyze with predicate columns and check the saved analyze options:
mysql> ANALYZE TABLE test.t PREDICATE COLUMNS;
Query OK, 0 rows affected, 1 warning (0.03 sec)
mysql> select * from mysql.analyze_options;
+----------+------------+-------------+---------+------+---------------+------------+
| table_id | sample_num | sample_rate | buckets | topn | column_choice | column_ids |
+----------+------------+-------------+---------+------+---------------+------------+
|      104 |          0 |           0 |       0 |   -1 | PREDICATE     |            |
+----------+------------+-------------+---------+------+---------------+------------+
1 row in set (0.00 sec)
  7. Check analyze jobs:
mysql> SELECT * FROM mysql.analyze_jobs;
+----+---------------------+--------------+------------+----------------+------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
| id | update_time         | table_schema | table_name | partition_name | job_info                                                         | processed_rows | start_time          | end_time            | state    | fail_reason | instance       | process_id |
+----+---------------------+--------------+------------+----------------+------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
|  1 | 2024-05-30 16:52:28 | test         | t          |                | analyze table columns b with 256 buckets, 500 topn, 1 samplerate |              3 | 2024-05-30 16:52:28 | 2024-05-30 16:52:28 | finished | NULL        | 127.0.0.1:4000 |       NULL |
+----+---------------------+--------------+------------+----------------+------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
1 row in set (0.00 sec)
  8. Drop a column:
mysql> alter table t drop column b;
Query OK, 0 rows affected (0.10 sec)
  9. Check the stats usage:
mysql> SELECT * FROM MYSQL.COLUMN_STATS_USAGE;
+----------+-----------+---------------------+---------------------+
| table_id | column_id | last_used_at        | last_analyzed_at    |
+----------+-----------+---------------------+---------------------+
|      104 |         2 | 2024-05-30 16:47:18 | 2024-05-30 16:52:28 |
+----------+-----------+---------------------+---------------------+
1 row in set (0.00 sec)
  10. Try to analyze again:
mysql> ANALYZE TABLE test.t PREDICATE COLUMNS;
Query OK, 0 rows affected, 1 warning (0.02 sec)
mysql> show warnings;
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                                                                                 |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
| Warning | 1105 | No predicate column has been collected yet for table test.t so all columns are analyzed                                                 |
| Note    | 1105 | Analyze use auto adjusted sample rate 1.000000 for table test.t, reason to use this rate is "use min(1, 110000/3) as the sample-rate=1" |
+---------+------+-----------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.00 sec)
  11. Check the jobs again:
mysql> SELECT * FROM mysql.analyze_jobs;
+----+---------------------+--------------+------------+----------------+--------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
| id | update_time         | table_schema | table_name | partition_name | job_info                                                           | processed_rows | start_time          | end_time            | state    | fail_reason | instance       | process_id |
+----+---------------------+--------------+------------+----------------+--------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
|  1 | 2024-05-30 16:52:28 | test         | t          |                | analyze table columns b with 256 buckets, 500 topn, 1 samplerate   |              3 | 2024-05-30 16:52:28 | 2024-05-30 16:52:28 | finished | NULL        | 127.0.0.1:4000 |       NULL |
|  2 | 2024-05-30 16:53:21 | test         | t          |                | analyze table all columns with 256 buckets, 500 topn, 1 samplerate |              3 | 2024-05-30 16:53:21 | 2024-05-30 16:53:21 | finished | NULL        | 127.0.0.1:4000 |       NULL |
+----+---------------------+--------------+------------+----------------+--------------------------------------------------------------------+----------------+---------------------+---------------------+----------+-------------+----------------+------------+
2 rows in set (0.00 sec)
  12. Check stats usage again:
mysql> SELECT * FROM MYSQL.COLUMN_STATS_USAGE;
+----------+-----------+--------------+---------------------+
| table_id | column_id | last_used_at | last_analyzed_at    |
+----------+-----------+--------------+---------------------+
|      104 |         1 | NULL         | 2024-05-30 16:53:21 |
+----------+-----------+--------------+---------------------+
1 row in set (0.00 sec)

I am unsure why we insert the column 'a' into the MYSQL.COLUMN_STATS_USAGE table. I don't think it makes sense, but it is off-topic for this PR. We will fix it later.
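The two messages returned by the second ANALYZE above (the fallback warning and the sample-rate note) can be read as the following logic. This is only a sketch of the observed behavior, not TiDB's implementation: `chooseAnalyzeColumns` and `autoSampleRate` are hypothetical names, the constant 110000 is copied from the note text, and the handling of a non-positive row count is my own assumption.

```go
package main

import (
	"fmt"
	"math"
)

// chooseAnalyzeColumns mirrors the behavior reported by the warning: when no
// predicate column is left for the table, all columns are analyzed instead.
func chooseAnalyzeColumns(predicate, all []string) (cols []string, fellBack bool) {
	if len(predicate) == 0 {
		return all, true
	}
	return predicate, false
}

// autoSampleRate reproduces the arithmetic quoted in the note:
// min(1, 110000/rowCount).
func autoSampleRate(rowCount float64) float64 {
	if rowCount <= 0 {
		// Assumption for this sketch: sample everything for an empty table.
		return 1
	}
	return math.Min(1, 110000/rowCount)
}

func main() {
	// After column b is dropped, no predicate column remains for test.t.
	cols, fellBack := chooseAnalyzeColumns(nil, []string{"a"})
	fmt.Println(cols, fellBack)

	// The table holds 3 rows, so 110000/3 is far above 1 and the rate is 1.
	fmt.Println(autoSampleRate(3))
}
```

With 3 rows the quotient is well above 1, which matches the "sample-rate=1" in the note; the rate would only fall below 1 once the table has more than 110000 rows.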

@Rustin170506
Member Author

I am unsure why we insert the column 'a' into the MYSQL.COLUMN_STATS_USAGE table. I don't think it makes sense, but it is off-topic for this PR. We will fix it later.

I don't know if this is a bug or a feature.

But the minimal reproduction steps are:

mysql> use test;
Database changed
mysql> create table t (a int, b int);
Query OK, 0 rows affected (0.04 sec)

mysql> insert into t values (1, 1), (2, 2), (3, 3);
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> set global tidb_enable_column_tracking = 1;
Query OK, 0 rows affected (0.01 sec)

mysql> analyze table t;
Query OK, 0 rows affected, 1 warning (0.03 sec)

mysql> SELECT * FROM MYSQL.COLUMN_STATS_USAGE;
+----------+-----------+--------------+---------------------+
| table_id | column_id | last_used_at | last_analyzed_at    |
+----------+-----------+--------------+---------------------+
|      104 |         1 | NULL         | 2024-05-30 17:00:36 |
|      104 |         2 | NULL         | 2024-05-30 17:00:36 |
+----------+-----------+--------------+---------------------+
2 rows in set (0.00 sec)

I will figure out why this happens.
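One way to read the output above: a plain ANALYZE writes a row per column with last_analyzed_at set, but last_used_at stays NULL, so the rows record an analysis rather than predicate usage. The sketch below models that distinction. It assumes that a column counts as a predicate column only when last_used_at is non-NULL, which is my reading and is not stated in this thread; `usageRow` and `predicateColumnIDs` are hypothetical names.

```go
package main

import "fmt"

// usageRow is a hypothetical model of one mysql.column_stats_usage row.
// Timestamps are kept as the text printed by the mysql client, and a nil
// pointer stands for SQL NULL.
type usageRow struct {
	TableID        int64
	ColumnID       int64
	LastUsedAt     *string
	LastAnalyzedAt *string
}

// predicateColumnIDs returns the columns that were used in a predicate,
// assuming this is signalled by a non-NULL last_used_at.
func predicateColumnIDs(rows []usageRow) []int64 {
	var ids []int64
	for _, r := range rows {
		if r.LastUsedAt != nil {
			ids = append(ids, r.ColumnID)
		}
	}
	return ids
}

func main() {
	analyzed := "2024-05-30 17:00:36"
	// The two rows produced by the minimal reproduction above.
	rows := []usageRow{
		{TableID: 104, ColumnID: 1, LastAnalyzedAt: &analyzed},
		{TableID: 104, ColumnID: 2, LastAnalyzedAt: &analyzed},
	}
	fmt.Println(len(predicateColumnIDs(rows))) // prints 0
}
```

Under that assumption the extra rows would not change which columns are treated as predicate columns, although they still occupy space in the table.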

Rustin170506 requested review from qw4990 and AilinKid on May 30, 2024 09:20
Rustin170506 changed the title from "statistics: cleanup outdated predicate columns" to "statistics: cleanup dropped predicate columns" on May 30, 2024
Rustin170506 changed the title from "statistics: cleanup dropped predicate columns" to "statistics: clean up dropped predicate columns stats usage" on May 30, 2024
ti-chi-bot added the approved and needs-1-more-lgtm labels on May 30, 2024

ti-chi-bot bot commented May 30, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hawkingrei, qw4990

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ti-chi-bot added the lgtm label and removed the needs-1-more-lgtm label on May 30, 2024

ti-chi-bot bot commented May 30, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-05-30 09:34:48.964889167 +0000 UTC m=+2941842.722024740: ☑️ agreed by qw4990.
  • 2024-05-30 10:04:49.05914344 +0000 UTC m=+2943642.816279015: ☑️ agreed by hawkingrei.

ti-chi-bot merged commit 9745a16 into pingcap:master on May 30, 2024
23 checks passed
Rustin170506 deleted the rustin-patch-cleanup-predicate-columns branch on May 30, 2024 11:07
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. sig/planner SIG: Planner size/L Denotes a PR that changes 100-499 lines, ignoring generated files.