From e72c1503c684307e2cfc77f146cc9840380c8d3e Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Thu, 29 Sep 2022 10:35:10 +0200
Subject: [PATCH 01/12] docs/design: Initial commit of REORGANIZE PARTITION
 design doc

---
 .../design/2022-09-29-reorganize-partition.md | 132 ++++++++++++++++++
 1 file changed, 132 insertions(+)
 create mode 100644 docs/design/2022-09-29-reorganize-partition.md

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
new file mode 100644
index 0000000000000..1c66bc657a8f9
--- /dev/null
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -0,0 +1,132 @@
# TiDB Design Documents

- Author(s): [Mattias Jonsson](http://github.com/mjonss)
- Discussion PR: https://github.com/pingcap/tidb/pull/XXX
- Tracking Issue: https://github.com/pingcap/tidb/issues/15000

## Table of Contents

* [Introduction](#introduction)
* [Motivation or Background](#motivation-or-background)
* [Detailed Design](#detailed-design)
* [Test Design](#test-design)
  * [Functional Tests](#functional-tests)
  * [Scenario Tests](#scenario-tests)
  * [Compatibility Tests](#compatibility-tests)
  * [Benchmark Tests](#benchmark-tests)
* [Impacts & Risks](#impacts--risks)
* [Investigation & Alternatives](#investigation--alternatives)
* [Unresolved Questions](#unresolved-questions)

## Introduction

Support ALTER TABLE t REORGANIZE PARTITION p1,p2 INTO (partition pNew1 values...)

## Motivation or Background

TiDB currently lacks support for changing the partitions of a partitioned table; it only supports adding and dropping LIST/RANGE partitions.
Supporting REORGANIZE PARTITION will allow RANGE partitioned tables to have a MAXVALUE partition to catch all values and split it into new ranges. The same applies to LIST partitions, where one can split or merge different partitions.

When this is implemented, it will also allow transforming a non-partitioned table into a partitioned table, as well as removing partitioning to make a partitioned table a normal non-partitioned table; these are different ALTER statements but can use the same implementation as REORGANIZE PARTITION.

The operation should be online, and must handle multiple partitions as well as large data sets.

Possible usage scenarios:
- Full table copy
  - merging all partitions to a single table (ALTER TABLE t REMOVE PARTITIONING)
  - splitting data from many to many partitions, like changing the number of hash partitions
  - splitting a table to many partitions (ALTER TABLE t PARTITION BY ...)
- Partial table copy (not full table/all partitions)
  - split one or more partitions
  - merge two or more partitions

These different use cases can have different optimizations, but the generic form must still be solved:
- N partitions, where each partition has M indexes

Should we implement both the ingest (lightning way) as well as the merge-txn (row-by-row internal insert batches)?

## Detailed Design

There are two parts of the design:
- Schema change states throughout the operation
- Reorganization implementation

Where the schema change states will clarify which different steps that will be done in which schema state transisions.

### Schema change states for REORGANIZE PARTITION

Since this operation will:
- create new partitions
- drop existing partitions
- copy data from dropped partitions to new partitions
it will use all these schema change stages:

    // StateNone means this schema element is absent and can't be used.
    StateNone SchemaState = iota
    // StateDeleteOnly means we can only delete items for this schema element.
    StateDeleteOnly
    // StateWriteOnly means we can use any write operation on this schema element,
    // but outer can't read the changed data.
    StateWriteOnly
    // StateWriteReorganization means we are re-organizing whole data after write only state.
    StateWriteReorganization
    // StateDeleteReorganization means we are re-organizing whole data after delete only state.
    StateDeleteReorganization
    // StatePublic means this schema element is ok for all write and read operations.
    StatePublic
    // StateReplicaOnly means we're waiting tiflash replica to be finished.
    StateReplicaOnly
    // StateGlobalTxnOnly means we can only use global txn for operator on this schema element
    StateGlobalTxnOnly

Explain the design in enough detail that: it is reasonably clear how the feature would be implemented, corner cases are dissected by example, how the feature is used, etc.

It's better to describe the pseudo-code of the key algorithm, API interfaces, the UML graph, what components are needed to be changed in this section.

Compatibility is important, please also take into consideration, a checklist:
- Compatibility with other features, like partition table, security&privilege, collation&charset, clustered index, async commit, etc.
- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc.
- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc.
- Upgrade compatibility
- Downgrade compatibility

## Test Design

A brief description of how the implementation will be tested. Both the integration test and the unit test should be considered.

### Functional Tests

It's used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered.

### Scenario Tests

It's used to ensure this feature works as expected in some common scenarios.

### Compatibility Tests

A checklist to test compatibility:
- Compatibility with other features, like partition table, security & privilege, charset & collation, clustered index, async commit, etc.
- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc.
- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc.
- Upgrade compatibility
- Downgrade compatibility

### Benchmark Tests

The following two parts need to be measured:
- The performance of this feature under different parameters
- The performance influence on the online workload

## Impacts & Risks

Describe the potential impacts & risks of the design on overall performance, security, k8s, and other aspects. List all the risks or unknowns by far.

Please describe impacts and risks in two sections: Impacts could be positive or negative, and intentional. Risks are usually negative, unintentional, and may or may not happen. E.g., for performance, we might expect a new feature to improve latency by 10% (expected impact), there is a risk that latency in scenarios X and Y could degrade by 50%.

## Investigation & Alternatives

How do other systems solve this issue? What other designs have been considered and what is the rationale for not choosing them?

## Unresolved Questions

What parts of the design are still to be determined?
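As a first illustration of how a DDL job could walk through the schema change states listed above, here is a minimal, hypothetical Go sketch; the `reorgJob` type, its `step` method, and the actions noted in its comments are illustrative placeholders, not existing TiDB APIs, and only a subset of the states is shown.

```go
package main

import "fmt"

// SchemaState mirrors the schema states listed above (only the ones this
// sketch steps through); the values are illustrative, not TiDB's actual order.
type SchemaState int

const (
	StateNone SchemaState = iota
	StateDeleteOnly
	StateWriteOnly
	StateWriteReorganization
	StatePublic
)

// reorgJob is a simplified stand-in for a DDL job; the real TiDB job struct
// carries schema/table IDs, arguments and reorganization metadata.
type reorgJob struct {
	State SchemaState
}

// step advances the job by one schema state, mimicking how a DDL worker
// dispatches on the current state each time it picks the job up. The actions
// named in the comments are placeholders for the work described above.
func (j *reorgJob) step() bool {
	switch j.State {
	case StateNone:
		// validate the new partition definitions, generate physical table IDs,
		// and store them in the metadata so double writing can start
		j.State = StateDeleteOnly
	case StateDeleteOnly:
		j.State = StateWriteOnly
	case StateWriteOnly:
		j.State = StateWriteReorganization
	case StateWriteReorganization:
		// copy rows and index entries into the new partitions, then swap the
		// partition definitions in the table metadata
		j.State = StatePublic
	case StatePublic:
		return true // finished: old partition data is left for delete-range
	}
	return false
}

func main() {
	job := &reorgJob{State: StateNone}
	for !job.step() {
		fmt.Printf("job moved to schema state %d\n", job.State)
	}
	fmt.Println("reorganize partition done")
}
```

Each such transition is persisted as a new schema version and all nodes converge on it before the next transition, which is why every intermediate state must keep concurrent transactions consistent.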
From 090eab3cf2fdf380c4eb9ed3c1f0abd3156ce104 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Fri, 7 Oct 2022 13:54:25 +0200
Subject: [PATCH 02/12] Update 2022-09-29-reorganize-partition.md

---
 .../design/2022-09-29-reorganize-partition.md | 75 ++++++++++++++-----
 1 file changed, 56 insertions(+), 19 deletions(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 1c66bc657a8f9..69492e51cb880 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -1,7 +1,7 @@
 # TiDB Design Documents
 
 - Author(s): [Mattias Jonsson](http://github.com/mjonss)
-- Discussion PR: https://github.com/pingcap/tidb/pull/XXX
+- Discussion PR: https://github.com/pingcap/tidb/pull/38314
 - Tracking Issue: https://github.com/pingcap/tidb/issues/15000
 
 ## Table of Contents
@@ -43,7 +43,8 @@
 These different use cases can have different optimizations, but the generic form must still be solved:
 - N partitions, where each partition has M indexes
 
-Should we implement both the ingest (lightning way) as well as the merge-txn (row-by-row internal insert batches)?
+First implementation should be based on the merge-txn (row-by-row read, update record, write) transactional batches.
+Later we can implement the ingest (lightning way) optimization, since it is not yet completely done.
 
 ## Detailed Design
 
 Since this operation will:
 - create new partitions
 - drop existing partitions
 - copy data from dropped partitions to new partitions
+- it will use all these schema change stages:
 
     // StateNone means this schema element is absent and can't be used.
     StateNone SchemaState = iota
+ - Check if the table structure after the ALTER is valid
+ - Generate physical table ids to each new partition
+ - Update the meta data with the new partitions and which partitions to be dropped (so that new transaction can double write)
+ - TODO: Should we also set placement rules? (Label Rules)
+ - Set the state to StateDeleteOnly
 
     // StateDeleteOnly means we can only delete items for this schema element.
     StateDeleteOnly
+ - Set the state to StateWriteOnly
 
     // StateWriteOnly means we can use any write operation on this schema element,
     // but outer can't read the changed data.
     StateWriteOnly
+ - Set the state to StateWriteReorganization
 
     // StateWriteReorganization means we are re-organizing whole data after write only state.
     StateWriteReorganization
-    // StateDeleteReorganization means we are re-organizing whole data after delete only state.
-    StateDeleteReorganization
-    // StatePublic means this schema element is ok for all write and read operations.
-    StatePublic
+ - Copy the data from the partitions to be dropped and insert it into the new partitions [1]
+ - Write the index data for the new partitions [2]
+ - Replace the old partitions with the new partitions in the metadata
+ - Add a job for dropping the old partitions data (which will also remove its placement rules etc.)
 
     // StateReplicaOnly means we're waiting tiflash replica to be finished.
     StateReplicaOnly
-    // StateGlobalTxnOnly means we can only use global txn for operator on this schema element
-    StateGlobalTxnOnly
+ - TODO: Do we need this stage? How to handle TiFlash for REORGANIZE PARTITIONS?
 
+- [1] will read all rows, update their physical table id (partition id) and write them back.
+- [2] can be done together with [1] to avoid duplicating the work for both reading and getting the new partition id
 
-Explain the design in enough detail that: it is reasonably clear how the feature would be implemented, corner cases are dissected by example, how the feature is used, etc.
+While the reorganization happens in the background, the normal write path needs to check if there are any new partitions in the metadata and also check if the updated/deleted/inserted row would match a new partition, and if so, also do the same operation in the new partition, just as the add index or modify column operations currently do. (To be implemented in `(*partitionedTable) AddRecord/UpdateRecord/RemoveRecord`)
 
-It's better to describe the pseudo-code of the key algorithm, API interfaces, the UML graph, what components are needed to be changed in this section.
+Note that parser support already exists.
+There should be no issues with upgrading, and downgrade will not be supported during the DDL.
 
-Compatibility is important, please also take into consideration, a checklist:
-- Compatibility with other features, like partition table, security&privilege, collation&charset, clustered index, async commit, etc.
-- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc.
-- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc.
-- Upgrade compatibility
-- Downgrade compatibility
+TODO:
+- How does DDL affect statistics? Look into how to add statistics for the new partitions (and that the statistics for the old partitions are removed when they are dropped)
+- How to handle TiFlash?
 
 ## Test Design
 
-A brief description of how the implementation will be tested. Both the integration test and the unit test should be considered.
+Re-use tests from other DDLs like Modify column, but adjust them for Reorganize partition.
 
 ### Functional Tests
 
+TODO
+
 It's used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered.
 
 ### Scenario Tests
 
+Note that this DDL will be online, while MySQL is blocking on MDL.
+
+TODO
+
 It's used to ensure this feature works as expected in some common scenarios.
 
 ### Compatibility Tests
 
+TODO
+
 A checklist to test compatibility:
 - Compatibility with other features, like partition table, security & privilege, charset & collation, clustered index, async commit, etc.
 - Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc.
 - Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc.
 - Upgrade compatibility
 - Downgrade compatibility
 
 ### Benchmark Tests
 
+Correctness and functionality are higher priority than performance; it is also better not to influence the performance of normal load during the DDL.
+
+TODO
+
 The following two parts need to be measured:
 - The performance of this feature under different parameters
 - The performance influence on the online workload
 
 ## Impacts & Risks
 
-Describe the potential impacts & risks of the design on overall performance, security, k8s, and other aspects. List all the risks or unknowns by far.
+Impacts:
+- better usability of partitioned tables
+- online alter in TiDB, where MySQL is blocking
+- all affected data needs to be read (CPU/IO/Network load on TiDB/PD/TiKV)
+- all data needs to be written (duplicated, both row-data and indexes), including transaction logs (more disk space on TiKV, CPU/IO/Network load on TiDB/PD/TiKV and TiFlash if configured on the table).
-Please describe impacts and risks in two sections: Impacts could be positive or negative, and intentional. Risks are usually negative, unintentional, and may or may not happen. E.g., for performance, we might expect a new feature to improve latency by 10% (expected impact), there is a risk that latency in scenarios X and Y could degrade by 50%.
+Risks:
+- introduction of bugs
+  - in the DDL code
+  - in the write path (double writing the changes for transactions running during the DDL)
+- out of disk space
+- out of memory
+- general resource usage, resulting in lower performance of the cluster
 
 ## Investigation & Alternatives
 
 How do other systems solve this issue? What other designs have been considered and what is the rationale for not choosing them?
 
 ## Unresolved Questions
 
 What parts of the design are still to be determined?
+- TiFlash handling

From 6be116305ad6c1b6e2f1b006ddb09cf944865bdf Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Mon, 10 Oct 2022 10:14:06 +0200
Subject: [PATCH 03/12] Update docs/design/2022-09-29-reorganize-partition.md

Co-authored-by: Benjamin2037
---
 docs/design/2022-09-29-reorganize-partition.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 69492e51cb880..baf15b0f897d9 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -44,7 +44,7 @@
 These different use cases can have different optimizations, but the generic form must still be solved:
 - N partitions, where each partition has M indexes
 
 First implementation should be based on the merge-txn (row-by-row read, update record, write) transactional batches.
-Later we can implement the ingest (lightning way) optimization, since it is not yet completely done.
+Later we can implement the ingest (lightning way) optimization, since the DDL module is evolving to do reorg tasks more efficiently.

From 10dcf46299064d3ab163e8be4d2453fbde255a4e Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Mon, 10 Oct 2022 10:14:20 +0200
Subject: [PATCH 04/12] Update docs/design/2022-09-29-reorganize-partition.md

Co-authored-by: Benjamin2037
---
 docs/design/2022-09-29-reorganize-partition.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index baf15b0f897d9..9b7414506b9c7 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -52,7 +52,7 @@
 There are two parts of the design:
 - Schema change states throughout the operation
 - Reorganization implementation
 
-Where the schema change states will clarify which different steps that will be done in which schema state transisions.
+The schema change states will clarify which steps are done in which schema state transitions.
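Before walking through the states, here is a minimal, hypothetical Go sketch of the merge-txn copy loop described above: read the rows of the partition being reorganized, recompute which new partition each row belongs to, and write the row back under the new physical table ID, batch by batch. The `kvStore` and `row` types and the `prune` callback are assumptions for illustration, not TiDB's real types or APIs.

```go
package main

import "fmt"

// row is a flattened record: the physical table (partition) ID it is stored
// under, its handle (row key), and its encoded value.
type row struct {
	physicalID int64
	handle     int64
	value      string
}

// kvStore is a toy stand-in for TiKV: a map from (physicalID, handle) to value.
type kvStore map[[2]int64]string

// copyPartition reads all rows of the old partition, then writes each row
// back under the physical ID of the new partition it belongs to, one batch
// at a time, where each batch models one internal transaction.
func copyPartition(store kvStore, oldID int64, prune func(string) int64, batchSize int) {
	// 1. read the rows of the partition being reorganized
	var rows []row
	for key, value := range store {
		if key[0] == oldID {
			rows = append(rows, row{physicalID: key[0], handle: key[1], value: value})
		}
	}
	// 2. rewrite the record keys with the new partition IDs, batch by batch
	for start := 0; start < len(rows); start += batchSize {
		end := start + batchSize
		if end > len(rows) {
			end = len(rows)
		}
		for _, r := range rows[start:end] {
			newID := prune(r.value) // which new partition owns this row?
			store[[2]int64{newID, r.handle}] = r.value
		}
	}
}

func main() {
	store := kvStore{{1, 1}: "a", {1, 2}: "b"}
	// rows with value "a" go to new partition 2, the rest to partition 3
	copyPartition(store, 1, func(v string) int64 {
		if v == "a" {
			return 2
		}
		return 3
	}, 256)
	fmt.Println(len(store)) // holds old and new copies until the old data is dropped
}
```

Batching keeps each internal transaction small, so the backfill can be retried batch by batch without holding one huge transaction open.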
### Schema change states for REORGANIZE PARTITION

From 7e140c756d7fb4982bdbc084a6003ab0f0e047d6 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Mon, 10 Oct 2022 10:14:27 +0200
Subject: [PATCH 05/12] Update docs/design/2022-09-29-reorganize-partition.md

Co-authored-by: Benjamin2037
---
 docs/design/2022-09-29-reorganize-partition.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 9b7414506b9c7..86be04f5ea389 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -60,7 +60,6 @@
 Since this operation will:
 - create new partitions
 - drop existing partitions
 - copy data from dropped partitions to new partitions
-- it will use all these schema change stages:

From 71a23d954ca2c36eadaf7a1e264a158a699fcb34 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Mon, 10 Oct 2022 10:14:53 +0200
Subject: [PATCH 06/12] Update docs/design/2022-09-29-reorganize-partition.md

Co-authored-by: Benjamin2037
---
 docs/design/2022-09-29-reorganize-partition.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 86be04f5ea389..70b6e605f3c91 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -70,7 +70,7 @@
 - Set the state to StateDeleteOnly
 
-    // StateDeleteOnly means we can only delete items for this schema element.
+    // StateDeleteOnly means we can only delete items for this schema element (the new partition).
     StateDeleteOnly
 - Set the state to StateWriteOnly

From 0bff4401722a7db5029e0b6df10e1f4b889bbd99 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Tue, 18 Oct 2022 22:06:27 +0200
Subject: [PATCH 07/12] Update 2022-09-29-reorganize-partition.md

---
 .../design/2022-09-29-reorganize-partition.md | 73 +++++--------------
 1 file changed, 17 insertions(+), 56 deletions(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 70b6e605f3c91..0fbfe2439eabc 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -1,7 +1,7 @@
 # TiDB Design Documents
 
 - Author(s): [Mattias Jonsson](http://github.com/mjonss)
-- Discussion PR: https://github.com/pingcap/tidb/pull/38314
+- Discussion PR: https://github.com/pingcap/tidb/issues/38535
 - Tracking Issue: https://github.com/pingcap/tidb/issues/15000

 it will use all these schema change stages:
 
    // StateWriteReorganization means we are re-organizing whole data after write only state.
    StateWriteReorganization
- - Copy the data from the partitions to be dropped and insert it into the new partitions [1]
- - Write the index data for the new partitions [2]
- - Replace the old partitions with the new partitions in the metadata
- - Add a job for dropping the old partitions data (which will also remove its placement rules etc.)
+ - Copy the data from the partitions to be dropped (one at a time) and insert it into the new partitions. This needs a new backfillWorker implementation.
+ - Recreate the indexes one by one for the new partitions (one partition at a time) (create an element for each index and reuse the addIndexWorker). (Note: this can be optimized in the future, either with the new fast add index implementation based on lightning, or by writing the index entries at the same time as the records in the previous step, or if the partitioning columns are included in the index or handle.)
+ - Replace the old partitions with the new partitions in the metadata when the data copying is done
+ - Register the range delete of the old partition data (in finishJob / deleteRange).
 
-    // StateReplicaOnly means we're waiting tiflash replica to be finished.
-    StateReplicaOnly
- - TODO: Do we need this stage? How to handle TiFlash for REORGANIZE PARTITIONS?
-
-- [1] will read all rows, update their physical table id (partition id) and write them back.
-- [2] can be done together with [1] to avoid duplicating the work for both reading and getting the new partition id
-
+    // StatePublic means this schema element is ok for all write and read operations.
+    StatePublic
+ - Table structure is now complete and the table is ready to use with its new partitioning scheme
+ - Note that there is a background job for the GCWorker to do in its deleteRange function.
 
 While the reorganization happens in the background, the normal write path needs to check if there are any new partitions in the metadata and also check if the updated/deleted/inserted row would match a new partition, and if so, also do the same operation in the new partition, just as the add index or modify column operations currently do. (To be implemented in `(*partitionedTable) AddRecord/UpdateRecord/RemoveRecord`)
 
 Note that parser support already exists.
 There should be no issues with upgrading, and downgrade will not be supported during the DDL.
 
-TODO:
-- How does DDL affect statistics? Look into how to add statistics for the new partitions (and that the statistics for the old partitions are removed when they are dropped)
-- How to handle TiFlash?
+Notes:
+- statistics should be removed from the old partitions.
+- statistics will not be generated for the new partitions (future optimization possible, to get statistics during the data copying?)
+- the global statistics (table level) will remain the same, since the data has not changed.
+- this DDL will be online, while MySQL is blocking on MDL.
 
 ## Test Design
 
 Re-use tests from other DDLs like Modify column, but adjust them for Reorganize partition.
+A separate test plan will be created and a test report will be written and signed off when the tests are completed.
 
-### Functional Tests
-
-TODO
-
-It's used to ensure the basic feature function works as expected. Both the integration test and the unit test should be considered.
-
-### Scenario Tests
-
-Note that this DDL will be online, while MySQL is blocking on MDL.
-
-TODO
-
-It's used to ensure this feature works as expected in some common scenarios.
-
-### Compatibility Tests
-
-TODO
-
-A checklist to test compatibility:
-- Compatibility with other features, like partition table, security & privilege, charset & collation, clustered index, async commit, etc.
-- Compatibility with other internal components, like parser, DDL, planner, statistics, executor, etc.
-- Compatibility with other external components, like PD, TiKV, TiFlash, BR, TiCDC, Dumpling, TiUP, K8s, etc.
-- Upgrade compatibility
-- Downgrade compatibility
 
 ### Benchmark Tests
 
-Correctness and functionality are higher priority than performance; it is also better not to influence the performance of normal load during the DDL.
-
-TODO
-
-The following two parts need to be measured:
-- The performance of this feature under different parameters
-- The performance influence on the online workload
+Correctness and functionality are higher priority than performance.
 
 ## Impacts & Risks
 
 Impacts:
 - better usability of partitioned tables
 - online alter in TiDB, where MySQL is blocking
-- all affected data needs to be read (CPU/IO/Network load on TiDB/PD/TiKV)
+- all affected data needs to be read (CPU/IO/Network load on TiDB/PD/TiKV), even multiple times in case of indexes.
 - all data needs to be written (duplicated, both row-data and indexes), including transaction logs (more disk space on TiKV, CPU/IO/Network load on TiDB/PD/TiKV and TiFlash if configured on the table).
 
 Risks:
 - introduction of bugs
   - in the DDL code
   - in the write path (double writing the changes for transactions running during the DDL)
 - out of disk space
 - out of memory
 - general resource usage, resulting in lower performance of the cluster
-
-## Investigation & Alternatives
-
-How do other systems solve this issue? What other designs have been considered and what is the rationale for not choosing them?
-
-## Unresolved Questions
-
-What parts of the design are still to be determined?
-- TiFlash handling

From 7906bd21a8459c929680e8d8234e72fa4515e1dc Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Tue, 18 Oct 2022 22:26:46 +0200
Subject: [PATCH 08/12] Update 2022-09-29-reorganize-partition.md

---
 .../design/2022-09-29-reorganize-partition.md | 27 ++++++++++++-------
 1 file changed, 17 insertions(+), 10 deletions(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 0fbfe2439eabc..7f1e5953a381f 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -9,14 +9,12 @@
 * [Introduction](#introduction)
 * [Motivation or Background](#motivation-or-background)
 * [Detailed Design](#detailed-design)
+  * [Schema change states for REORGANIZE PARTITION](#schema-change-states-for-reorganize-partition)
+  * [Error Handling](#error-handling)
+  * [Notes](#notes)
 * [Test Design](#test-design)
-  * [Functional Tests](#functional-tests)
-  * [Scenario Tests](#scenario-tests)
-  * [Compatibility Tests](#compatibility-tests)
   * [Benchmark Tests](#benchmark-tests)
 * [Impacts & Risks](#impacts--risks)
-* [Investigation & Alternatives](#investigation--alternatives)
-* [Unresolved Questions](#unresolved-questions)
 
 ## Introduction
 
@@ -27,14 +25,14 @@
 TiDB currently lacks support for changing the partitions of a partitioned table; it only supports adding and dropping LIST/RANGE partitions.
 Supporting REORGANIZE PARTITION will allow RANGE partitioned tables to have a MAXVALUE partition to catch all values and split it into new ranges. The same applies to LIST partitions, where one can split or merge different partitions.
-When this is implemented, it will also allow transforming a non-partitioned table into a partitioned table, as well as removing partitioning to make a partitioned table a normal non-partitioned table; these are different ALTER statements but can use the same implementation as REORGANIZE PARTITION.
+When this is implemented, it will also allow future PRs to transform a non-partitioned table into a partitioned table, as well as to remove partitioning and make a partitioned table a normal non-partitioned table, and also COALESCE PARTITION and ADD PARTITION for HASH partitioned tables; these are different ALTER statements but can use the same implementation as REORGANIZE PARTITION.
 
 The operation should be online, and must handle multiple partitions as well as large data sets.
 
 Possible usage scenarios:
 - Full table copy
   - merging all partitions to a single table (ALTER TABLE t REMOVE PARTITIONING)
-  - splitting data from many to many partitions, like changing the number of hash partitions
+  - splitting data from many to many partitions, like changing the number of HASH partitions
   - splitting a table to many partitions (ALTER TABLE t PARTITION BY ...)
 - Partial table copy (not full table/all partitions)
   - split one or more partitions
   - merge two or more partitions
 
 These different use cases can have different optimizations, but the generic form must still be solved:
 - N partitions, where each partition has M indexes
 
-First implementation should be based on the merge-txn (row-by-row read, update record, write) transactional batches.
+First implementation should be based on the merge-txn (row-by-row batch read, update record key with new Physical Table ID, write) transactional batches, and then create the indexes in batches, index by index, partition by partition.
 Later we can implement the ingest (lightning way) optimization, since the DDL module is evolving to do reorg tasks more efficiently.
 
 ## Detailed Design
 
 Since this operation will:
 - create new partitions
 - drop existing partitions
 - copy data from dropped partitions to new partitions
+
+It will use all these schema change stages:
 
     // StateNone means this schema element is absent and can't be used.
     StateNone SchemaState = iota
 - Check if the table structure after the ALTER is valid
 - Generate physical table ids to each new partition
 - Update the meta data with the new partitions and which partitions to be dropped (so that new transaction can double write)
- - TODO: Should we also set placement rules? (Label Rules)
+ - Set placement rules
+ - Set TiFlash Replicas
+ - Set legacy Bundles (non-sql placement)
 - Set the state to StateDeleteOnly
 
 While the reorganization happens in the background, the normal write path needs to check if there are any new partitions in the metadata and also check if the updated/deleted/inserted row would match a new partition, and if so, also do the same operation in the new partition, just as the add index or modify column operations currently do. (To be implemented in `(*partitionedTable) AddRecord/UpdateRecord/RemoveRecord`)
 
+### Error handling
+
+If any non-retryable error occurs, we will call onDropTablePartition and adjust the logic in that function to also handle the rollback of reorganize partition, in a similar way as it does with model.ActionAddTablePartition.
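To illustrate the double write described above, here is a minimal, hypothetical Go sketch of the check the write path could make; the `reorgMeta` type and the `addRecord` function are illustrative stand-ins, deliberately much simpler than the real `(*partitionedTable) AddRecord` signature.

```go
package main

import "fmt"

// reorgMeta is a toy model of what the write path sees in the table metadata
// during the reorganization: a pruner for the current (public) partitions
// and, while AddingDefinitions is non-empty, a pruner for the new ones.
type reorgMeta struct {
	prunePublic func(a int) int64 // maps a row to its current partition
	pruneAdding func(a int) int64 // maps a row to its new partition; nil if no reorg
}

// addRecord sketches the double write: the row is always written to its
// current partition and, during the reorganization, also to the new
// partition it would belong to.
func addRecord(meta *reorgMeta, a int, write func(physicalID int64, a int)) {
	write(meta.prunePublic(a), a)
	if meta.pruneAdding != nil {
		write(meta.pruneAdding(a), a) // double write to the matching new partition
	}
}

func main() {
	meta := &reorgMeta{
		// one public partition p0 with physical ID 1...
		prunePublic: func(a int) int64 { return 1 },
		// ...being split into new partitions with physical IDs 2 and 3
		pruneAdding: func(a int) int64 {
			if a <= 3 {
				return 2
			}
			return 3
		},
	}
	addRecord(meta, 2, func(id int64, a int) {
		fmt.Printf("write a=%d to physical table %d\n", a, id)
	})
}
```

The same check applies symmetrically to updates and deletes, so a rollback only has to drop the new partitions without repairing the old data.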
+### Notes
+
 Note that parser support already exists.
 There should be no issues with upgrading, and downgrade will not be supported during the DDL.

From 4c3ffa76303ceb1d90dffbafa62238daf238f2c0 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Tue, 18 Oct 2022 22:32:36 +0200
Subject: [PATCH 09/12] Update 2022-09-29-reorganize-partition.md

---
 docs/design/2022-09-29-reorganize-partition.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 7f1e5953a381f..7a1c5204ad1cc 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -48,7 +48,7 @@
 There are two parts of the design:
 - Schema change states throughout the operation
-- Reorganization implementation
+- Reorganization implementation, which will be handled in the StateWriteReorganization state.
 
 The schema change states will clarify which steps are done in which schema state transitions.
 
 ### Schema change states for REORGANIZE PARTITION
 
 Since this operation will:
 - create new partitions
+- copy data from dropped partitions to new partitions and create their indexes
+- change the partition definitions
 - drop existing partitions
-- copy data from dropped partitions to new partitions

From 808d41369a6a65f67ff218ea1f448a5b4471a8b3 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Mon, 5 Dec 2022 15:48:45 +0100
Subject: [PATCH 10/12] Added StateDeleteReorganization

---
 docs/design/2022-09-29-reorganize-partition.md | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 7a1c5204ad1cc..ee487df5e9742 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -65,8 +65,8 @@ It will use all these schema change stages:
     // StateNone means this schema element is absent and can't be used.
     StateNone SchemaState = iota
 - Check if the table structure after the ALTER is valid
-- Generate physical table ids to each new partition
-- Update the meta data with the new partitions and which partitions to be dropped (so that new transaction can double write)
+- Use the physical table ids for each new partition that were already generated by the client sending the ALTER command.
+- Update the meta data with the new partitions (AddingDefinitions) and which partitions to be dropped (DroppingDefinitions), so that new transactions can double write.
 - Set placement rules
 - Set TiFlash Replicas
 - Set legacy Bundles (non-sql placement)
 - Set the state to StateDeleteOnly
@@ -86,7 +86,15 @@ It will use all these schema change stages:
 - Copy the data from the partitions to be dropped (one at a time) and insert it into the new partitions. This needs a new backfillWorker implementation.
 - Recreate the indexes one by one for the new partitions (one partition at a time) (create an element for each index and reuse the addIndexWorker). (Note: this can be optimized in the future, either with the new fast add index implementation based on lightning, or by writing the index entries at the same time as the records in the previous step, or if the partitioning columns are included in the index or handle.)
 - Replace the old partitions with the new partitions in the metadata when the data copying is done
+ - Set the state to StateDeleteReorganization
+
+    // StateDeleteReorganization means we are re-organizing whole data after delete only state.
+    StateDeleteReorganization - we are using this state in a slightly different way than the comment above says.
+ This state is needed since we cannot directly move from StateWriteReorganization to StatePublic.
+ Imagine that StateWriteReorganization is complete and we are updating the schema version. If a transaction seeing the new schema version writes to the new partitions, those changes need to be written to the old partitions as well, so that transactions in other nodes still using the older schema version can see the changes.
+ - Remove the notion of new partitions (AddingDefinitions) and partitions to be dropped (DroppingDefinitions); double writing will stop when the state goes to StatePublic.
 - Register the range delete of the old partition data (in finishJob / deleteRange).
+ - Set the state to StatePublic
 
     // StatePublic means this schema element is ok for all write and read operations.
     StatePublic
 - Table structure is now complete and the table is ready to use with its new partitioning scheme
 - Note that there is a background job for the GCWorker to do in its deleteRange function.

From e78325e7b2c72a27a22701a243bbbee4da0d30d0 Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Thu, 22 Dec 2022 10:55:12 +0000
Subject: [PATCH 11/12] Update 2022-09-29-reorganize-partition.md

Added example / table about why the StateDeleteReorganize is needed.

---
 .../design/2022-09-29-reorganize-partition.md | 36 +++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index ee487df5e9742..9109dda340922 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -103,6 +103,42 @@
 While the reorganization happens in the background, the normal write path needs to check if there are any new partitions in the metadata and also check if the updated/deleted/inserted row would match a new partition, and if so, also do the same operation in the new partition, just as the add index or modify column operations currently do. (To be implemented in `(*partitionedTable) AddRecord/UpdateRecord/RemoveRecord`)
 
+Example of why an extra state between StateWriteReorganization and StatePublic is needed:
+
+```sql
+-- table:
+CREATE TABLE t (a int) PARTITION BY LIST (a) (PARTITION p0 VALUES IN (1,2,3,4,5), PARTITION p1 VALUES IN (6,7,8,9,10));
+-- during alter operation:
+ALTER TABLE t REORGANIZE PARTITION p0 INTO (PARTITION p0a VALUES IN (1,2,3), PARTITION p0b VALUES IN (4,5));
+```
+
+A partition within parentheses, like `(p0 [])`, is hidden or waiting to be deleted by GC/DeleteRange. The brackets after a partition name list the rows it contains, so `p0a [2]` means partition p0a holds the row with value 2.
+If we go directly from StateWriteReorganization to StatePublic, then clients one schema version behind will not see changes to the new partitions:
+
+| Data (TiKV/Unistore) | TiDB client 1 | TiDB client 2 |
+| --------------------------------------------- | ------------------------------------ | ------------------------------------------------------------------ |
+| p0 [] p1 [] StateWriteReorganization | | |
+| p0 [] p1 [] (p0a [] p0b []) | | |
+| (p0 []) p1 [] p0a [] p0b [] StatePublic | | |
+| (p0 []) p1 [] p0a [2] p0b [] | StatePublic INSERT INTO T VALUES (2) | |
+| (p0 []) p1 [] p0a [2] p0b [] | | StateWriteReorganization SELECT * FROM t => [] (only sees p0,p1) |
+
+But if we add a state between StateWriteReorganization and StatePublic and double write to the old partitions during that state, it works:
+
+| Data (TiKV/Unistore) | TiDB client 1 | TiDB client 2 |
+| ------------------------------------------------------- | --------------------------------------------------- | -------------------------------------------------------------------------- |
+| p0 [] p1 [] (p0a [] p0b []) StateWriteReorganization | | |
+| (p0 []) p1 [] p0a [] p0b [] StateDeleteReorganization | | |
+| (p0 [2]) p1 [] p0a [2] p0b [] | StateDeleteReorganization INSERT INTO T VALUES (2) | |
+| (p0 [2]) p1 [] p0a [2] p0b [] | | StateWriteReorganization SELECT * FROM t => [2] (only sees p0,p1) |
+| (p0 [2]) p1 [] p0a [2] p0b [] StatePublic | | |
+| (p0 [2]) p1 [] p0a [2] p0b [4] | StatePublic INSERT INTO T VALUES (4) | |
+| (p0 [2]) p1 [] p0a [2] p0b [4] | | StateDeleteReorganization SELECT * FROM t => [2,4] (sees p0a,p0b,p1) |
+
 ### Error handling
 
 If any non-retryable error occurs, we will call onDropTablePartition and adjust the logic in that function to also handle the rollback of reorganize partition, in a similar way as it does with model.ActionAddTablePartition.

From e8bf79a7ffb7a1a2c331ac2e6087fc6b02a0091e Mon Sep 17 00:00:00 2001
From: Mattias Jonsson
Date: Wed, 28 Dec 2022 09:24:33 +0000
Subject: [PATCH 12/12] Update 2022-09-29-reorganize-partition.md

---
 docs/design/2022-09-29-reorganize-partition.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/docs/design/2022-09-29-reorganize-partition.md b/docs/design/2022-09-29-reorganize-partition.md
index 9109dda340922..56e380826efa7 100644
--- a/docs/design/2022-09-29-reorganize-partition.md
+++ b/docs/design/2022-09-29-reorganize-partition.md
@@ -159,7 +159,6 @@ Notes:
 Re-use tests from other DDLs like Modify column, but adjust them for Reorganize partition.
 A separate test plan will be created and a test report will be written and signed off when the tests are completed.
-
 ### Benchmark Tests
 
 Correctness and functionality are higher priority than performance.