[CDCSDK] Release retention barriers on children tablets that are not of interest to CDC stream #22773

Closed
yugabyte-ci opened this issue Jun 7, 2024 · 0 comments

yugabyte-ci commented Jun 7, 2024

Jira Link: DB-11676

When a tablet of any table in the database splits, its child tablets are marked as possible candidates for stream consumption. This applies to all tables in the database, not just the ones being polled by the CDC stream. This was done primarily so that any table can be included in the CDC stream if the connector's configuration is updated to add it to `table.include.list`.

If the connector doesn't poll these child tablets, their resources are held until the interval defined by the flag `cdc_intent_retention_ms` elapses. If this flag is set to a high value, resources are retained unnecessarily for a long time.
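
As a rough illustration of the cost described above, the sketch below models an unpolled child tablet whose resources are only freed once it has been inactive for longer than the retention interval. Only the flag name `cdc_intent_retention_ms` comes from this issue; the function names and the exact comparison are assumptions, not the actual expiry check in YugabyteDB.

```python
def barrier_expired(last_active_ms: int, now_ms: int, cdc_intent_retention_ms: int) -> bool:
    """True once an unpolled tablet has been inactive longer than the retention interval."""
    return now_ms - last_active_ms > cdc_intent_retention_ms


def extra_hold_ms(cdc_intent_retention_ms: int, explicit_release_after_ms: int) -> int:
    """How much longer a never-polled child tablet holds resources when we wait
    for expiry instead of releasing the barrier explicitly (hypothetical metric)."""
    return max(0, cdc_intent_retention_ms - explicit_release_after_ms)
```

With a one-hour retention interval and an explicit release after one minute, `extra_hold_ms(3_600_000, 60_000)` shows that waiting for expiry keeps resources for 59 more minutes.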

yugabyte-ci added the area/cdcsdk, jira-originated, kind/bug, and priority/highest labels Jun 7, 2024
yugabyte-ci changed the title from "[CDCSDK] Release retention barriers on children tablets that are not of interest for CDC stream" to "[CDCSDK] Release retention barriers on children tablets that are not of interest to CDC stream" Jun 7, 2024
siddharth2411 added a commit to siddharth2411/yugabyte-db that referenced this issue Jun 14, 2024
…time to non-consistent snapshot streams

Test Plan: Jenkins

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D35832
siddharth2411 added a commit that referenced this issue Jun 26, 2024
…e from CDCSDK stream

Summary:
This diff introduces three new yb-admin commands required to remove a **user table** from a CDCSDK stream.
**`NOTE: All three commands are only meant to be used on CDC streams that are not associated with a replication slot.`**

**Command-1**: yb-admin command to disable dynamic table addition in a CDC stream. It only works when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set to true. **Note: after this command is executed, no dynamic tables (user/non-user) will get added to the CDC stream. Additionally, there is no option to re-enable dynamic table addition for the stream.**
```
yb-admin \
    -master_addresses <master-addresses> \
    disable_dynamic_table_addition_in_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Command-2**: yb-admin command to remove a particular **user** table from the CDC stream metadata and to update the checkpoint of the corresponding state table entries to OpId max. Since the checkpoint is set to max, these entries will later be deleted from the cdc state table by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    remove_user_table_from_change_data_stream <stream_id> <table_id>
```
The command works with a single stream_id & table_id.

**Command-3**: yb-admin command to validate the cdc state table entries for a particular stream. As part of validation, if the table of any cdc state table entry is not present in the CDC stream metadata, the checkpoint of that entry is updated to OpId max, and the entry is later deleted by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    validate_and_sync_cdc_state_table_entries_on_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Advisory for command-usage:**
General guidelines that need to be strictly followed while executing these commands:
  - Ensure no DDLs are performed for 15 minutes before and after executing these commands.

These yb-admin commands are meant to be used when a user is only interested in polling a subset of the tables in the namespace. The user can then remove the extra tables that are not supposed to be polled from the CDC stream. To achieve this, the user needs to first execute command-1, followed by command-2 and command-3.

Example:
Starting state: 5 user tables (t1 to t5) in the CDC stream, of which 4 (t1, t2, t3, t4) are extra tables that are not polled, plus 2 indexes (i1, i2).
Target state: Only t5 and the 2 indexes (i1, i2) should be present in the CDC stream.

To reach the target state, we need to remove the 4 user tables (t1-t4) from the stream metadata along with their state table entries.

**Perform the following steps to remove user tables from the CDC stream:**

  # First, disable dynamic table addition using command-1.
  # Confirm that dynamic table addition is disabled by running the `list_change_data_streams` yb-admin command. The output for that stream should contain the string `cdcsdk_disable_dynamic_table_addition: true`.
  # Remove the table from the stream metadata and update its state table entries using command-2.
  # Confirm that the table is removed from the stream metadata by re-running the `list_change_data_streams` command.
  # Depending on when the user reads the cdc state table (via cqlsh), the state table entries corresponding to this table will either have their checkpoint updated to max or will already have been removed. Note that the deletion of state table entries might take some time, as it is done in a separate thread.
  # Repeat steps 3-5 for all user tables that need to be removed.
  # At the end, once all extra user tables are removed from the stream, execute command-3 as a sanity check to get rid of any cdc state entries that might still be present in the state table even though the corresponding table has been removed from the stream metadata. One scenario where this can happen is when a tablet splits while its table is being removed from the stream metadata. In this case, the children tablet entries get added to the cdc state table, and they are removed when command-3 is executed.
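
The ordering of the steps above can be sketched as a dry-run plan that only builds the yb-admin invocations without executing them. This is a minimal sketch: the helper names `yb_admin` and `removal_plan` and the placeholder IDs are hypothetical, the yb-admin subcommands are the ones quoted in this summary, and the manual cqlsh check (step 5) is not automated here.

```python
import shlex


def yb_admin(master_addresses, *args):
    """Build one yb-admin invocation as an argv list (nothing is executed)."""
    return ["yb-admin", "-master_addresses", master_addresses, *args]


def removal_plan(master_addresses, stream_id, table_ids):
    """Return the ordered yb-admin invocations for removing user tables from a stream."""
    plan = [
        # Step 1: disable dynamic table addition (command-1).
        yb_admin(master_addresses,
                 "disable_dynamic_table_addition_in_change_data_stream", stream_id),
        # Step 2: confirm the stream reports cdcsdk_disable_dynamic_table_addition: true.
        yb_admin(master_addresses, "list_change_data_streams"),
    ]
    for table_id in table_ids:
        # Step 3: remove one user table (command-2).
        plan.append(yb_admin(master_addresses,
                             "remove_user_table_from_change_data_stream",
                             stream_id, table_id))
        # Step 4: confirm the table is gone from the stream metadata.
        plan.append(yb_admin(master_addresses, "list_change_data_streams"))
    # Step 7: final sanity check (command-3).
    plan.append(yb_admin(master_addresses,
                         "validate_and_sync_cdc_state_table_entries_on_change_data_stream",
                         stream_id))
    return plan


if __name__ == "__main__":
    # Placeholder values; substitute real master addresses, stream ID, and table IDs.
    for argv in removal_plan("<master-addresses>", "<stream_id>", ["<t1_id>", "<t2_id>"]):
        print(shlex.join(argv))
```

Printing the plan first makes it easy to review the exact commands before running them by hand.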

**Working**:
Command-1 internally calls the //DisableDynamicTableAdditionOnCDCSDKStream// RPC, which sets the optional field `cdcsdk_disable_dynamic_table_addition` in the stream metadata to true. This prevents any table that is not yet part of the CDC stream from getting added to it.
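
The effect of the field can be illustrated with a small in-memory model. This is only a sketch: `StreamMetadata`, `disable_dynamic_table_addition`, and `try_add_dynamic_table` are hypothetical Python stand-ins, not the catalog manager code; only the field name and the auto flag name are taken from this summary.

```python
from dataclasses import dataclass, field


@dataclass
class StreamMetadata:
    table_ids: set = field(default_factory=set)
    # Mirrors the optional field described above; unset behaves like False.
    cdcsdk_disable_dynamic_table_addition: bool = False


def disable_dynamic_table_addition(stream: StreamMetadata, auto_flag_enabled: bool) -> None:
    """Model of command-1: a one-way switch guarded by the auto flag."""
    if not auto_flag_enabled:
        raise RuntimeError("cdcsdk_enable_dynamic_tables_disable_option is not set")
    stream.cdcsdk_disable_dynamic_table_addition = True


def try_add_dynamic_table(stream: StreamMetadata, table_id: str) -> bool:
    """Model of the dynamic table addition path: skipped once the field is true."""
    if stream.cdcsdk_disable_dynamic_table_addition:
        return False
    stream.table_ids.add(table_id)
    return True
```

There is deliberately no function that resets the field, matching the note that dynamic table addition cannot be re-enabled for the stream.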

Command-2 internally calls the //RemoveUserTableFromCDCSDKStream// RPC, which performs the following:
  # Updates the checkpoint of the tablet entries for the given table in the CDC state table to `OpId::Max()`. This releases the retention barriers on these tablets and allows UpdatePeersAndMetrics to delete the state table entries.
  # Removes the table from the CDC stream metadata and //cdcsdk_tables_to_stream_map_//, and persists the updated metadata in the sys catalog.
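
The two steps can be modelled in memory as follows. This is a sketch under stated assumptions: the state table is a plain dict keyed by `(tablet_id, stream_id)`, `OPID_MAX` is a stand-in for `OpId::Max()`, and `cleanup_pass` is a hypothetical stand-in for the deletion done by UpdatePeersAndMetrics; none of these are the real data structures.

```python
INT64_MAX = 2**63 - 1
OPID_MAX = (INT64_MAX, INT64_MAX)  # stand-in for OpId::Max() as (term, index)


def remove_user_table_from_stream(stream_id, table_id, stream_tables,
                                  state_table, tablet_to_table):
    """Model of command-2 for one stream and one table."""
    # Step 1: set the checkpoint to max for every tablet entry of this table.
    for tablet_id, sid in list(state_table):
        if sid == stream_id and tablet_to_table.get(tablet_id) == table_id:
            state_table[(tablet_id, sid)] = OPID_MAX
    # Step 2: drop the table from the stream metadata.
    stream_tables.discard(table_id)


def cleanup_pass(state_table):
    """Later, asynchronous step: delete entries whose checkpoint is max."""
    for key in [k for k, checkpoint in state_table.items() if checkpoint == OPID_MAX]:
        del state_table[key]
```

Between the two calls, a reader of the state table sees the entry with a max checkpoint; after `cleanup_pass`, the entry is gone. This matches the two outcomes described in step 5 of the procedure.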

Command-3 internally calls the //ValidateAndSyncCDCStateEntriesForCDCSDKStream// RPC, which updates the checkpoint to max for cdc state table entries whose table is not found in the CDC stream metadata.
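
A minimal sketch of this validation, including the tablet split case mentioned in the procedure, might look like the following. The dict-based state table, the `tablet_to_table` lookup, and `OPID_MAX` are assumptions made for illustration and do not reflect the real RPC implementation.

```python
INT64_MAX = 2**63 - 1
OPID_MAX = (INT64_MAX, INT64_MAX)  # stand-in for OpId max as (term, index)


def validate_and_sync(stream_id, stream_tables, state_table, tablet_to_table):
    """Set the checkpoint to max for entries of `stream_id` whose table is not in
    the stream metadata. Returns the sorted tablet IDs that were updated."""
    updated = []
    for (tablet_id, sid), checkpoint in list(state_table.items()):
        if sid != stream_id or checkpoint == OPID_MAX:
            continue
        if tablet_to_table[tablet_id] not in stream_tables:
            state_table[(tablet_id, sid)] = OPID_MAX
            updated.append(tablet_id)
    return sorted(updated)


# Example: t1 was removed from the stream, but a concurrent split added
# entries for its two children tablets.
stream_tables = {"t5"}
tablet_to_table = {"t1_child_a": "t1", "t1_child_b": "t1", "t5_tab": "t5"}
state_table = {
    ("t1_child_a", "s1"): (2, 7),
    ("t1_child_b", "s1"): (2, 7),
    ("t5_tab", "s1"): (3, 42),
}
stale = validate_and_sync("s1", stream_tables, state_table, tablet_to_table)
```

Here `stale` contains only the two children of the removed table; the entry for `t5` keeps its checkpoint.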

**Upgrade/Rollback safety:**
//cdcsdk_disable_dynamic_table_addition// - a new optional field added to the existing protos SysCDCStreamEntryPB and CDCStreamInfoPB. This field is protected and will only be read when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set.

Introduced request and response protos for the new RPCs:

  - DisableDynamicTableAdditionOnCDCSDKStream - DisableDynamicTableAdditionOnCDCSDKStreamRequestPB,  DisableDynamicTableAdditionOnCDCSDKStreamResponsePB
  - RemoveUserTableFromCDCSDKStream - RemoveUserTableFromCDCSDKStreamRequestPB,  RemoveUserTableFromCDCSDKStreamResponsePB
  - ValidateAndSyncCDCStateEntriesForCDCSDKStream - ValidateAndSyncCDCStateEntriesForCDCSDKStreamRequestPB, ValidateAndSyncCDCStateEntriesForCDCSDKStreamResponsePB

Jira: DB-11778, DB-11676

Test Plan:
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnConsistentSnapshotStream

Reviewers: skumar, asrinivasan, stiwary

Reviewed By: asrinivasan, stiwary

Subscribers: ycdcxcluster, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35870
siddharth2411 added a commit that referenced this issue Jun 26, 2024
… remove user table from CDCSDK stream

Summary:
**Backport Description:**
Faced a minor conflict in yb-admin_cli.cc since some xcluster commands are not backported.

**Original Description:**
Original commit: 7c99ff9 / D35870
Jira: DB-11778, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnConsistentSnapshotStream

Reviewers: skumar, asrinivasan, stiwary

Reviewed By: asrinivasan, stiwary

Subscribers: ybase, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36129
siddharth2411 added a commit that referenced this issue Jun 26, 2024
…o remove user table from CDCSDK stream

Summary:
**Backport description:**
Faced merge conflicts because of a missing macro and missing utility methods. Fixed some of them and added some of the missing methods.

**Original description:**
Original commit : 7c99ff9 / D35870


Jira: DB-11778, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: asrinivasan, stiwary

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36128
siddharth2411 added a commit that referenced this issue Jun 26, 2024
…rom existing CDCSDK stream

Summary:
Some non-eligible tables (such as indexes) created after the creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in the dynamic table addition codepath ]]. We do not hold retention barriers on tablets of such tables unless they split, in which case we start holding retention barriers on the children tablets. This leads to heavy resource usage over time, and since the CDC stream never polls the tablets of such tables, the retention barriers are not lifted until the active time of these tablets exceeds `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables does not lead to the addition of children tablets to the cdc state table.
2. Release the retention barriers on the existing tablets that are part of the cdc state table, and finally remove these state table entries.

We achieve the above tasks via a background thread, following the same pattern used for the addition of dynamic tables to CDC streams.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we compute the set difference between the tables present in the stream metadata and the tables in the namespace that are eligible for a CDC stream. This set difference gives us the non-eligible tables that were not supposed to be added to the CDC stream but were added because of the above-mentioned bug. These non-eligible tables are added to `namespace_to_cdcsdk_non_user_table_map_`, which is further processed in the catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.
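
The set difference can be sketched as below. The function name, the dict layout, and the example IDs are hypothetical; how the master actually decides which tables are eligible is not modelled here and is simply passed in as a set.

```python
def find_non_eligible_tables(tables_in_stream_metadata, eligible_tables_in_namespace):
    """Tables present in the stream metadata that are not eligible for CDC."""
    return set(tables_in_stream_metadata) - set(eligible_tables_in_namespace)


def build_non_user_table_map(streams, eligible_by_namespace):
    """Model of populating namespace_to_cdcsdk_non_user_table_map_ while loading streams.

    `streams` maps stream_id -> (namespace_id, set of table IDs in its metadata).
    """
    result = {}
    for namespace_id, table_ids in streams.values():
        extra = find_non_eligible_tables(table_ids, eligible_by_namespace[namespace_id])
        if extra:
            result.setdefault(namespace_id, set()).update(extra)
    return result


# Example: index i1 slipped into stream s1 although only t1 and t2 are eligible.
non_user = build_non_user_table_map(
    {"s1": ("ns1", {"t1", "t2", "i1"}), "s2": ("ns1", {"t1"})},
    {"ns1": {"t1", "t2"}},
)
```

In this example `non_user` maps `ns1` to the single non-eligible table `i1`.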

The bg thread of the catalog manager (CatalogManagerBgTasks) handles the actual table removal with the following methods:

  - //FindCDCSDKStreamsForNonUserTables//: This method runs in every subsequent iteration of the catalog manager bg thread. It scans `cdc_stream_map_`, finds all streams in the ACTIVE/DELETING_METADATA state that have the non-eligible table entry in their stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method runs after //FindCDCSDKStreamsForNonUserTables// and does the following for each stream that contains the non-eligible table entry:
     1. Updates the checkpoint of the cdc state entries related to the non-eligible table to OpId max. In case of colocated tables, entries with a colocated_table_id are deleted.
     2. Removes the table from the stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, the user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In a single iteration of the bg thread, we process only two non-eligible tables across all namespaces. This processing limit is configurable, and we reuse the existing flag `cdcsdk_table_processing_limit_per_run`.
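
One iteration of the background cleanup, including the per-run limit, can be sketched as follows. Everything here is an in-memory model: the dict layouts, the function names, and `OPID_MAX` are hypothetical, the stream state strings are used only as labels, and the special handling of colocated tables is not modelled.

```python
INT64_MAX = 2**63 - 1
OPID_MAX = (INT64_MAX, INT64_MAX)  # stand-in for OpId max as (term, index)


def find_streams_for_tables(pending, streams, limit=2):
    """Pick at most `limit` pending tables across all namespaces, together with
    the streams whose metadata still lists them."""
    picked = []
    for namespace_id in sorted(pending):
        for table_id in sorted(pending[namespace_id]):
            if len(picked) == limit:
                return picked
            stream_ids = [
                sid for sid, s in streams.items()
                if s["state"] in ("ACTIVE", "DELETING_METADATA") and table_id in s["tables"]
            ]
            picked.append((namespace_id, table_id, stream_ids))
    return picked


def remove_tables_from_streams(picked, pending, streams, state_table, tablet_to_table):
    """For each picked table: max out its checkpoints, drop it from each stream,
    then drop it from the pending map."""
    for namespace_id, table_id, stream_ids in picked:
        for stream_id in stream_ids:
            for tablet_id, sid in list(state_table):
                if sid == stream_id and tablet_to_table[tablet_id] == table_id:
                    state_table[(tablet_id, sid)] = OPID_MAX
            streams[stream_id]["tables"].discard(table_id)
        pending[namespace_id].discard(table_id)
        if not pending[namespace_id]:
            del pending[namespace_id]
```

With three pending tables and the limit of two, the first iteration removes two of them and leaves the third for the next iteration of the background thread.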

Additionally, in the tablet split codepath, before adding cdc state entries for children tablets, we now check whether the table is non-eligible for the CDC stream. This also helps prevent a race condition in which a tablet of a non-eligible table is split while, concurrently, a master restart/leadership change occurs and we are trying to remove the table from the stream metadata.
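
The eligibility check in the split path can be sketched like this. The function name, its parameters, and the dict-based state table are hypothetical, and the checkpoint given to the children is simply passed in by the caller rather than derived as the real code would.

```python
def add_child_entries_on_split(table_id, child_tablet_ids, stream_ids,
                               eligible_table_ids, state_table, checkpoint):
    """Add cdc state entries for children tablets only if the table is eligible.

    Returns the list of (tablet_id, stream_id) keys that were added.
    """
    if table_id not in eligible_table_ids:
        # Non-eligible table (for example an index): add nothing, so no new
        # retention barriers are created for its children tablets.
        return []
    added = []
    for stream_id in stream_ids:
        for tablet_id in child_tablet_ids:
            state_table[(tablet_id, stream_id)] = checkpoint
            added.append((tablet_id, stream_id))
    return added
```

Because the check is made at split time, it holds even if the table has not yet been removed from the stream metadata by the background thread.
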
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36031
siddharth2411 added a commit that referenced this issue Jun 26, 2024
… tables for CDC from existing CDCSDK stream

Summary:
**Backport Description:**
No merge conflicts.

**Original Description:**
Original commit: 4e9a81c / D36031
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar, xCluster, hsunder

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36172
jasonyb pushed a commit that referenced this issue Jun 27, 2024
Summary:
 411a32e [DB-11813] Rename ysql_conn_mgr_idle_or_pending_clients metric name
 c2e13ef [#15682] YSQL: Fix stack_is_too_deep function in ASAN
 ef31455 [PLAT-14188] Fixing upgrade disk availability check
 db6b1b7 [#23004] YCQL: Fix tserver crash due to NULL pointer dereference
 0ada80a [PLAT-14433] Use correct kubeconfig for edit provider validation
 eccbc10 [PLAT-14414] Enable Kubernetes provider validation by default
 199f679 [PLAT-14324]: Move all node agent based flags from BETA to INTERNAL in Provider Conf keys file
 86a865d [PLAT-14443] YBA Installer wait for ready time configurable.
 ac184a8 [#22882] YSQL: Fix deadlock in DDL atomicity
 a4218fb [Docs] Sort feature to tables (Where fulfills the criteria) (#22836)
 2f267ca [#22996] xCluster: Add SOURCE_UNREACHABLE and SYSTEM_ERROR enums

Skipped due to conflict:
dee7691 [#21534] docdb: Set owner correctly for cloned databases
34632ba [PLAT-14495] Set up the node_exporter for ynp
7c99ff9 [#22876][#22773] CDCSDK: Add new yb-admin command to remove user table from CDCSDK stream
4e9a81c [#22876][#22835][#22773] CDCSDK: Remove non-eligible tables for CDC from existing CDCSDK stream
f2e574e [#23013] xClusterDDLRepl: Allow table_ids for GetXClusterStreams

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: tfoucher, sanketh, jenkins-bot

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D36184
siddharth2411 added a commit that referenced this issue Jun 27, 2024
… tables for CDC from existing CDCSDK stream

Summary:
**Backport Description:**
Faced merge conflicts because of changed method names and missing or changed flags related to replication commands and replica identity.

**Original Description:**
Original commit: 4e9a81c / D36031
Some non-eligible tables, such as indexes, created after the creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in the dynamic table addition codepath ]]. We do not hold retention barriers on tablets of such tables unless they split, in which case we start holding retention barriers on the children tablets. This leads to heavy resource usage over time, and since the CDC stream never polls the tablets of such tables, retention barriers are not lifted until the active time of these tablets exceeds `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables does not lead to the addition of children tablets to the cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We follow the same pattern as the addition of dynamic tables to CDC streams: the above tasks are carried out by a background thread.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks) handles the actual table removal with the following methods:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to the non-eligible table to OpId max. In case of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In a single iteration of the bg thread, we process only two non-eligible tables across all namespaces. This processing limit is configurable; we reuse the existing flag `cdcsdk_table_processing_limit_per_run`.
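
The per-iteration limit in note 2 can be pictured as a bounded batch taken from a queue of pending tables. This is a simplified Python sketch, not the actual implementation: the function name and table IDs are hypothetical, and the default of two mirrors the limit stated above rather than a value read from `cdcsdk_table_processing_limit_per_run`.

```python
from collections import deque


def process_one_run(pending, limit=2):
    """Take at most `limit` tables off the pending queue for this iteration."""
    batch = []
    while pending and len(batch) < limit:
        batch.append(pending.popleft())
    return batch


# Hypothetical non-eligible tables waiting for cleanup.
pending_tables = deque(["idx_a", "idx_b", "idx_c"])
first_run = process_one_run(pending_tables)
second_run = process_one_run(pending_tables)
print(first_run, second_run)
```

With three pending tables and a limit of two, the third table is only handled on the next iteration of the background thread.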

Additionally, in the tablet split codepath, before adding cdc state entries for children tablets, we now check whether the table is non-eligible for the CDC stream. This also helps prevent a race condition in which a tablet of a non-eligible table splits while, concurrently, a master restart/leadership change is in progress and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: asrinivasan

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36175
jasonyb pushed a commit that referenced this issue Jun 28, 2024
…e from CDCSDK stream

Summary:
This diff introduces three new yb-admin commands required to remove a **user table** from a CDCSDK stream.
**`NOTE: All three commands are only meant to be used on CDC streams that are not associated with a replication slot.`**

**Command-1**: yb-admin command to disable dynamic table addition in a CDC stream. It only works when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set to true. **Note: after this command is executed, no dynamic tables (user/non-user) will get added to the CDC stream. Additionally, there is no option to re-enable dynamic table addition for the stream.**
```
yb-admin \
    -master_addresses <master-addresses> \
    disable_dynamic_table_addition_on_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Command-2**: yb-admin command to remove only a particular **user** table from the CDC stream metadata and to update the checkpoint of the corresponding state table entries to OpId max. Since the checkpoint is set to max, these entries will later be deleted from the cdc state table by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    remove_user_table_from_change_data_stream <stream_id> <table_id>
```
The command works with a single stream_id & table_id.

**Command-3**: yb-admin command to validate cdc state table entries for a particular stream. As part of validation, if the table of any cdc state table entry is not present in the CDC stream metadata, the checkpoint of that entry is updated to OpId max, and the entry is later deleted by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    validate_and_sync_cdc_state_table_entries_on_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Advisory for command-usage:**
General guidelines that need to be strictly followed while executing these commands:
  - Ensure no DDLs are performed in the 15 minutes before or after executing these commands.

These yb-admin commands are meant to be used when a user is interested in polling only a subset of the tables in the namespace. The user can then remove from the CDC stream the extra tables that are not supposed to be polled. To achieve this, the user needs to first execute command-1, followed by command-2 and command-3.

Example:
Starting state: 5 user tables (t1 to t5) in the CDC stream including 4 extra tables that are not polled (t1,t2,t3,t4) + 2 indexes (i1,i2)
Target state: Only t5 + 2 indexes (i1,i2) should be present in CDC stream.

To reach the target state, we need to remove 4 user tables (t1-t4) from stream metadata & their state entries

**Perform the following steps to remove user tables from the CDC stream:**

  # Firstly, disable dynamic table addition using command-1.
  # Confirm that dynamic table addition is disabled by running `list_change_data_streams` yb-admin command. The output for that stream would contain the string `cdcsdk_disable_dynamic_table_addition: true`
  # Remove the table from stream metadata & update its state table entries using command-2.
  # Confirm that the table is removed from stream metadata by re-running `list_change_data_streams` command.
  # Depending on when the user reads the cdc state table (via cqlsh), the state table entries corresponding to this table will either have their checkpoint updated to max or already be removed. Note that deletion of state table entries might take some time, as it is done in a separate thread.
  # Repeat steps 3-5 for every user table that needs to be removed.
  # Finally, once all extra user tables are removed from the stream, execute command-3 as a sanity check to get rid of any cdc state entries that might still linger in the state table even though the corresponding table has been removed from stream metadata. One scenario where this can happen is when a tablet splits while the table is being removed from stream metadata. In this case, the children tablet entries get added to the cdc state table, and they are removed when command-3 is executed.
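
The order of invocations in the steps above can be laid out as a small Python helper that only builds the command lines and does not run them. This is a sketch under assumptions: the helper name is hypothetical, the placeholder arguments are not real IDs, command-1 is spelled with `_on_` as in the backport summary of this diff, and `list_change_data_streams` is invoked without arguments because the steps above give none.

```python
def build_removal_plan(master_addresses, stream_id, table_ids):
    """Return the yb-admin invocations, in order, as argv lists."""
    base = ["yb-admin", "-master_addresses", master_addresses]
    plan = [
        # Step 1: command-1, disable dynamic table addition.
        base + ["disable_dynamic_table_addition_on_change_data_stream", stream_id],
        # Step 2: confirm the stream shows cdcsdk_disable_dynamic_table_addition: true.
        base + ["list_change_data_streams"],
    ]
    for table_id in table_ids:
        # Step 3: command-2, remove one user table.
        plan.append(base + ["remove_user_table_from_change_data_stream", stream_id, table_id])
        # Step 4: confirm the table is gone from stream metadata.
        plan.append(base + ["list_change_data_streams"])
    # Step 7: command-3, final validation and sync of cdc state entries.
    plan.append(
        base + ["validate_and_sync_cdc_state_table_entries_on_change_data_stream", stream_id]
    )
    return plan


for argv in build_removal_plan("<master-addresses>", "<stream_id>", ["<t1_id>", "<t2_id>"]):
    print(" ".join(argv))
```

Step 5 (reading the cdc state table via cqlsh) is left out because it is a manual inspection rather than a yb-admin call.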

**Working**:
Command-1 internally calls the //DisableDynamicTableAdditionOnCDCSDKStream// RPC, which sets the optional field `cdcsdk_disable_dynamic_table_addition` in stream metadata to true. This prevents any table that is not yet part of the CDC stream from getting added to it.

Command-2 internally calls //RemoveUserTableFromCDCSDKStream// RPC that performs the following:
  # Update the checkpoint of tablet entries for the given table in the CDC state table to `OpId::Max()`. This is done to release the retention barriers on these tablets and allow the deletion of the state table entries by UpdatePeersAndMetrics.
  # Remove the table from CDC stream metadata, //cdcsdk_tables_to_stream_map_// and persist the updated metadata in sys catalog.

Command-3 internally calls //ValidateAndSyncCDCStateEntriesForCDCSDKStream// RPC that updates checkpoint to max for cdc state table entries whose table is not found in the CDC stream metadata.
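
The effect of command-3 can be modelled in a few lines of Python. This is only an illustration of the rule described above, not the RPC implementation; the dictionary layout, the `OPID_MAX` marker, and the IDs are assumptions made for the sketch.

```python
OPID_MAX = "OpId::Max"  # stand-in marker for the maximum checkpoint


def validate_and_sync(state_entries, stream_table_ids):
    """Set the checkpoint to max for entries whose table is not in stream metadata.

    Returns the tablet IDs of the entries that were updated.
    """
    updated = []
    for entry in state_entries:
        if entry["table_id"] not in stream_table_ids:
            entry["checkpoint"] = OPID_MAX
            updated.append(entry["tablet_id"])
    return updated


# Hypothetical state: t1 was removed from the stream, but a child tablet entry remains.
entries = [
    {"tablet_id": "tab1", "table_id": "t5", "checkpoint": "1.10"},
    {"tablet_id": "tab2", "table_id": "t1", "checkpoint": "1.7"},
]
updated = validate_and_sync(entries, {"t5"})
print(updated)
```

Entries marked this way are the ones that, per the description above, UpdatePeersAndMetrics later deletes.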

**Upgrade/Rollback safety:**
//cdcsdk_disable_dynamic_table_addition// - added a new optional field in existing protos SysCDCStreamEntryPB, CDCStreamInfoPB. This field is protected and will only be read when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set.

Introduced request, response proto for new RPCs:

  - DisableDynamicTableAdditionOnCDCSDKStream - DisableDynamicTableAdditionOnCDCSDKStreamRequestPB,  DisableDynamicTableAdditionOnCDCSDKStreamResponsePB
  - RemoveUserTableFromCDCSDKStream - RemoveUserTableFromCDCSDKStreamRequestPB,  RemoveUserTableFromCDCSDKStreamResponsePB
  - ValidateAndSyncCDCStateEntriesForCDCSDKStream - ValidateAndSyncCDCStateEntriesForCDCSDKStreamRequestPB, ValidateAndSyncCDCStateEntriesForCDCSDKStreamResponsePB

Jira: DB-11778, DB-11676

Test Plan:
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnConsistentSnapshotStream

Reviewers: skumar, asrinivasan, stiwary

Reviewed By: asrinivasan, stiwary

Subscribers: ycdcxcluster, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35870
jasonyb pushed a commit that referenced this issue Jun 28, 2024
…rom existing CDCSDK stream

Summary:
Some non-eligible tables, such as indexes, created after the creation of a CDC stream were getting added to the CDC stream due to [[ https://phorge.dev.yugabyte.com/D35856 | this missing logic in the dynamic table addition codepath ]]. We do not hold retention barriers on tablets of such tables unless they split, in which case we start holding retention barriers on the children tablets. This leads to heavy resource usage over time, and since the CDC stream never polls the tablets of such tables, retention barriers are not lifted until the active time of these tablets exceeds `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:
1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables does not lead to the addition of children tablets to the cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

We follow the same pattern as the addition of dynamic tables to CDC streams: the above tasks are carried out by a background thread.

**Working:**

  - //FindAllNonUserTablesInCDCSDKStream// - On a master restart/leadership change, while loading CDCSDK streams into memory, we will compute the set difference between tables present in stream metadata and tables in the namespace that are eligible for a CDC stream. This set difference will give us the set of non-eligible tables that were not supposed to get added to the CDC stream, but got added because of the above mentioned bug. These non-eligible tables will be added to `namespace_to_cdcsdk_non_user_table_map_` which is further processed in catalog manager background thread by //FindCDCSDKStreamsForNonUserTables//.

 The bg thread of catalog manager (CatalogManagerBgTasks) handles the actual table removal with the following methods:

  - //FindCDCSDKStreamsForNonUserTables//: This method is run in every subsequent iteration of the bg thread of catalog manager. It scans the cdc_stream_map_ and finds all streams in ACTIVE/DELETING METADATA state which have the non-eligible table entry in stream metadata, and collects the details to be further processed by //RemoveNonUserTablesForCDCSDKStreams//.
  - //RemoveNonUserTablesForCDCSDKStreams//: This method is run after FindCDCSDKStreamsForNonUserTables and does the following for each stream that contains the non-eligible table entry:
     1. Update the checkpoint of cdc state entries related to the non-eligible table to OpId max. In case of colocated tables, entries with a colocated_table_id will be deleted.
     2. Removes the table from stream metadata and cdcsdk_tables_to_stream_map_.
    Once the table is removed from all relevant CDC streams, then we remove the table entry from `namespace_to_cdcsdk_non_user_table_map_`.
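
The two removal steps can be sketched for a single stream as follows. This is a simplified Python model, not the C++ method: the entry layout, the `OPID_MAX` marker, and the IDs are assumptions, and a real cdc state entry is keyed differently than the `table_id` field used here.

```python
OPID_MAX = "OpId::Max"  # stand-in for the maximum checkpoint value


def remove_table_from_stream(stream_table_ids, state_entries, table_id):
    """Apply steps 1 and 2 for one stream; return the surviving state entries."""
    surviving = []
    for entry in state_entries:
        if entry["table_id"] != table_id:
            surviving.append(entry)
        elif entry.get("colocated_table_id"):
            # Colocated entry for the table: deleted outright.
            continue
        else:
            # Step 1: release the retention barrier by moving the checkpoint to max.
            entry["checkpoint"] = OPID_MAX
            surviving.append(entry)
    # Step 2: drop the table from the stream metadata.
    stream_table_ids.discard(table_id)
    return surviving


stream_tables = {"t1", "idx_t1"}
entries = [
    {"tablet_id": "tab1", "table_id": "t1", "checkpoint": "1.4"},
    {"tablet_id": "tab2", "table_id": "idx_t1", "checkpoint": "1.9"},
    {"tablet_id": "tab3", "table_id": "idx_t1",
     "colocated_table_id": "idx_t1", "checkpoint": "1.2"},
]
remaining = remove_table_from_stream(stream_tables, entries, "idx_t1")
print(stream_tables, [e["tablet_id"] for e in remaining])
```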

Note:
1. To enable this cleanup of non-eligible tables, user has to set the master flag `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
2. In a single iteration of the bg thread, we process only two non-eligible tables across all namespaces. This processing limit is configurable; we reuse the existing flag `cdcsdk_table_processing_limit_per_run`.

Additionally, in the tablet split codepath, before adding cdc state entries for children tablets, we now check whether the table is non-eligible for the CDC stream. This also helps prevent a race condition in which a tablet of a non-eligible table splits while, concurrently, a master restart/leadership change is in progress and we are trying to remove the table from stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: asrinivasan, stiwary, skumar

Reviewed By: stiwary

Subscribers: ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36031
siddharth2411 added a commit that referenced this issue Jul 2, 2024
…igible tables in CDC stream

Summary:
This diff is an extension of [[ https://phorge.dev.yugabyte.com/D36031 | D36031 ]], which introduced a cleanup mechanism for non-eligible tables. The mechanism involves two steps:

  # Identification of indexes -> happens during loading of CDC streams into memory on a master restart/leadership change.
  # Removal of these identified indexes by the bg thread.

Without this diff, both of these steps were guarded by a non-auto flag, `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
Therefore, post upgrade, step 1 requires the user to set the above flag and explicitly perform a master restart/leader change.

To avoid this explicit master restart/leader change and still give control to users over this cleanup, we are introducing a new auto flag `cdcsdk_enable_identification_of_non_eligible_tables` that will guard the identification step.

These identified tables will be added to `namespace_to_cdcsdk_non_eligible_table_map_`.  If `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream` is set to true, the catalog manager background thread will pick up these tables for actual cleanup.
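
The split between the two flags can be illustrated with a small Python model. It is a sketch only: the function name and table IDs are hypothetical, the flags are plain dictionary keys here, and the real identification happens while CDC streams are loaded into memory rather than in one combined routine.

```python
def run_master_tasks(flags, stream_tables, eligible_tables, pending):
    """Model identification and cleanup as two independently guarded steps."""
    # Identification, guarded by the new auto flag.
    if flags.get("cdcsdk_enable_identification_of_non_eligible_tables"):
        pending |= stream_tables - eligible_tables
    # Cleanup, guarded by the existing non-auto flag.
    removed = set()
    if flags.get("enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream"):
        removed = set(pending)
        stream_tables -= removed
        pending.clear()
    return removed


stream = {"t1", "idx_t1"}
pending = set()

# Only the auto flag is on: the index is identified but stays in the stream.
identify_only = {"cdcsdk_enable_identification_of_non_eligible_tables": True}
removed = run_master_tasks(identify_only, stream, {"t1"}, pending)
print(removed, pending, stream)
```

Turning on the cleanup flag afterwards lets the next pass remove what was already identified, without requiring another restart in this model.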
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
Existing cdc tests for removal of non-eligible tables
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: xCluster, hsunder, asrinivasan

Reviewed By: asrinivasan

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36240
siddharth2411 added a commit that referenced this issue Jul 2, 2024
…o identify non-eligible tables in CDC stream

Summary:
**Backport Description:**
No merge conflicts

**Original Description:**
Original commit: None / D36240
This diff is an extension of [[ https://phorge.dev.yugabyte.com/D36031 | D36031 ]], which introduced a cleanup mechanism for non-eligible tables. The mechanism involves two steps:

  # Identification of indexes -> happens during loading of CDC streams into memory on a master restart/leadership change.
  # Removal of these identified indexes by the bg thread.

Without this diff, both of these steps were guarded by a non-auto flag, `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
Therefore, post upgrade, step 1 requires the user to set the above flag and explicitly perform a master restart/leader change.

To avoid this explicit master restart/leader change and still give control to users over this cleanup, we are introducing a new auto flag `cdcsdk_enable_identification_of_non_eligible_tables` that will guard the identification step.

These identified tables will be added to `namespace_to_cdcsdk_non_eligible_table_map_`.  If `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream` is set to true, the catalog manager background thread will pick up these tables for actual cleanup.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent
Existing cdc tests for removal of non-eligible tables
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

Reviewers: xCluster, hsunder, asrinivasan

Reviewed By: asrinivasan

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36244
jasonyb pushed a commit that referenced this issue Jul 2, 2024
Summary:
 f8e73e9 [#18192] YSQL: Introduce interface to mock tserver response in MiniCluster tests
 4ae68f4 Build break fix for centos7
 Excluded: 2ec9224 [#23033] Allow running YSQL upgrade unit tests with snapshot other than 2.0.9.0
 37912f1 [#22058] docdb: Disable connections on cloned db until cloning is complete
 059b855 [#22908] xCluster: Use XClusterRemoteClient across XCluster
 5dc5ee7 [#22849] YSQL: Correctly handle reset phase timeout errors in YSQL Connection Manager
 af49a1e [#22876][#22835][#22773] CDCSDK: Add new auto flag to identify non-eligible tables in CDC stream
 f3c4e14 [PLAT-14524] Up-version pekko to fix TLSActor infinite loop
 9388aea [#23052] yugabyted:  Restarting a node fails when data_dir is missing in user specified configuration.
 5cf9736 [PLAT-12685]: Generate a YBA metric for xcluster config table status.
 73fc90a [PLAT-14497]: Fix incremental backup time when none full backup exists
 e9b5ba5 [PLAT-14533]: Modify the gflags metadata support db version check
 8dca952 [PLAT-14432][Platform] Show certificate Database Node Certificate/key and Client Certificate/key for CA certs in certificate details modal
 6551e45 Add utkarsh.munjal to contributors.md
 bafa1cb [#21751] YSQL, ASH: Sampling of wait events

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36325
siddharth2411 added a commit that referenced this issue Jul 4, 2024
…ligible tables from CDCSDK stream

Summary:
**Backport description:**
Faced merge conflicts because of missing macros and utility methods. Fixed some of them and added some of the missing methods.

**Original description:**
Original commits:
92827b3 / D35870
4e9a81c / D36031
af49a1e / D36240

Please refer to the original commits for the full summary of each diff.

**Diff-1 (D35870): Add new yb-admin command to remove user table from CDCSDK stream**

This diff introduces three new yb-admin commands required to remove a **user table** from a CDCSDK stream.
**`NOTE: All three commands are only meant to be used on CDC streams that are not associated with a replication slot.`**

**Command-1**: yb-admin command to disable dynamic table addition in a CDC stream. It only works when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set to true. **Note: after this command is executed, no dynamic tables (user/non-user) will get added to the CDC stream. Additionally, there is no option to re-enable dynamic table addition for the stream.**
```
yb-admin \
    -master_addresses <master-addresses> \
    disable_dynamic_table_addition_on_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Command-2**: yb-admin command to remove only a particular **user** table from the CDC stream metadata and to update the checkpoint of the corresponding state table entries to OpId max. Since the checkpoint is set to max, these entries will later be deleted from the cdc state table by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    remove_user_table_from_change_data_stream <stream_id> <table_id>
```
The command works with a single stream_id & table_id.

**Command-3**: yb-admin command to validate cdc state table entries for a particular stream. As part of validation, if the table of any cdc state table entry is not present in the CDC stream metadata, the checkpoint of that entry is updated to OpId max, and the entry is later deleted by a separate thread (UpdatePeersAndMetrics).

```
yb-admin \
    -master_addresses <master-addresses> \
    validate_and_sync_cdc_state_table_entries_on_change_data_stream <stream_id>
```
The command works with a single stream_id.

**Diff-2 (D36031): Remove non-eligible tables for CDC from existing CDCSDK stream**

Some non-eligible tables, such as indexes, created after the creation of a CDC stream were getting added to the CDC stream due to missing logic in the dynamic table addition codepath. We do not hold retention barriers on tablets of such tables unless they split, in which case we start holding retention barriers on the children tablets. This leads to heavy resource usage over time, and since the CDC stream never polls the tablets of such tables, retention barriers are not lifted until the active time of these tablets exceeds `cdc_intent_retention_ms`.

Therefore, to prevent resource consumption from such tables, we want to achieve the following:

1. Remove these non-eligible tables from the stream metadata so that any further tablet splitting on these tables does not lead to the addition of children tablets to the cdc state table.
2. Release retention barriers on the existing tablets that are part of the cdc state table and finally remove these state table entries.

**Diff-3 (D36240): Add new auto flag to identify non-eligible tables in CDC stream**

This diff is an extension of D36031, which introduced a cleanup mechanism for non-eligible tables. The mechanism involves two steps:

  # Identification of indexes -> happens during loading of CDC streams into memory on a master restart/leadership change.
  # Removal of these identified indexes by the bg thread.

Without this diff, both of these steps were guarded by a non-auto flag, `enable_cleanup_of_non_eligible_tables_from_cdcsdk_stream`.
Therefore, post upgrade, step 1 requires the user to set the above flag and explicitly perform a master restart/leader change.

To avoid this explicit master restart/leader change and still give control to users over this cleanup, we are introducing a new auto flag `cdcsdk_enable_identification_of_non_eligible_tables` that will guard the identification step.

**Upgrade/Rollback safety:**
//cdcsdk_disable_dynamic_table_addition// - added a new optional field in existing protos SysCDCStreamEntryPB, CDCStreamInfoPB. This field is protected and will only be read when the new auto flag `cdcsdk_enable_dynamic_tables_disable_option` is set.

Introduced request, response proto for new RPCs:

  - DisableDynamicTableAdditionOnCDCSDKStream - DisableDynamicTableAdditionOnCDCSDKStreamRequestPB,  DisableDynamicTableAdditionOnCDCSDKStreamResponsePB
  - RemoveUserTableFromCDCSDKStream - RemoveUserTableFromCDCSDKStreamRequestPB,  RemoveUserTableFromCDCSDKStreamResponsePB
  - ValidateAndSyncCDCStateEntriesForCDCSDKStream - ValidateAndSyncCDCStateEntriesForCDCSDKStreamRequestPB, ValidateAndSyncCDCStateEntriesForCDCSDKStreamResponsePB

Jira: DB-11778, DB-11733, DB-11676

Test Plan:
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestDisableOfDynamicTableAdditionOnConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromNonConsistentSnapshotCDCStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableRemovalFromConsistentSnapshotCDCStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestChildTabletsOfNonEligibleTableDoNotGetAddedToConsistentSnapshotStream

./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnNonConsistentSnapshotStream
./yb_build.sh --cxx-test integration-tests_cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestValidationAndSyncOfCDCStateEntriesAfterUserTableRemovalOnConsistentSnapshotStream

Reviewers: skumar, asrinivasan, stiwary, xCluster, hsunder

Reviewed By: asrinivasan

Subscribers: ybase, ycdcxcluster

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36321
yugabyte-ci closed this as not planned on Jul 4, 2024
siddharth2411 added a commit that referenced this issue Aug 12, 2024
…n in table removal from CDC stream

Summary:
This diff makes the following changes:

 # The table (non-eligible/user-created) removal codepath, as well as the addition of children tablets on split, fetches tableInfoPtr/CDCStreamInfoPtr from master. Null checks were missing in some of these places and have now been added.
 # Removes the optimization in `UpdateCheckpointForTabletEntriesInCDCState()` that fetches TabletInfo only for a filtered set of tablet_ids when a particular table is being removed. This set used to only contain CDC state entries for the table being removed. Due to this optimisation, there was a race condition with drop table where this table removal codepath removes the table from stream metadata and the drop table cleanup codepath deletes the relevant state table entries. The race is described below:

Current algorithm for cleanup of a user-created table (yb-admin command):
1. Check that the stream is non-null & not in DELETING state.
2. Check that the table is non-null & not in DELETING state.
3. Update cdc state table entries:

    i) Get from master the tablets belonging to the table being removed.

    ii) Filter the cdc state table entries whose tablets are part of the tablet list collected above.

    iii) Get the TabletInfo of the filtered state table entries and check the table_id of each tablet.

    iv) Iterate over the TabletInfo(s) and update the checkpoint to max if the table_id in TabletInfo doesn't belong to stream metadata.

    v) Delete the colocated table's state table entry.

4. Remove the user table from stream metadata.

In step 3, i) was the optimisation. The race occurs if the table being removed is dropped between step 2 and step 3. When we execute i) of step 3, we receive 0 tablets from master for the table being removed. With 0 tablets, the filter in ii) matches no state table entries, so no state table entry is updated. But in step 4, we still remove the table from stream metadata. When the CDC drop table cleanup background thread then runs, it finds state table entries that don't belong to any table in stream metadata and deletes them directly. Although the end result is as expected, we would like to avoid such partial cleanup from these codepaths.
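
For illustration, the filtering in step 3 can be modelled as below. This is a minimal sketch under stated assumptions, not the code in xrepl_catalog_manager.cc: `StateEntry` and `UpdateFilteredEntries` are hypothetical names, the state table is reduced to a vector, and the stream-metadata check of iv) is left out so that only the effect of an empty tablet list is visible.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical, simplified model of one cdc state table row (not the real schema).
struct StateEntry {
  std::string tablet_id;
  bool checkpoint_is_max;
};

// Models the old step 3 i)-iv): only entries whose tablet appears in the
// tablet list returned by master have their checkpoint set to max.
// Returns the number of entries that were updated.
int UpdateFilteredEntries(const std::set<std::string>& tablets_from_master,
                          std::vector<StateEntry>& state_entries) {
  int updated = 0;
  for (auto& entry : state_entries) {
    if (tablets_from_master.count(entry.tablet_id) > 0) {
      entry.checkpoint_is_max = true;
      ++updated;
    }
  }
  return updated;
}
```

If the table is dropped before step 3, `tablets_from_master` is empty and the function updates nothing, which is the partial cleanup described above.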

The updated algorithm for updating the checkpoint of a state table entry is as follows:

We now scan the entire state table and fetch the TabletInfo for all the tablets in the cdc state table. Whether a state table entry qualifies for a checkpoint update depends on whether a particular table is being removed or whether all the state table entries are just being validated and synced with the tables in stream metadata.

Case-1: Removal of a particular table. The tablet has to satisfy either of the two conditions for a checkpoint update:

  - The table being removed is not a colocated table, so the tablet belongs exclusively to the table being removed.
  - The table being removed is a colocated table, and none of the other colocated tables on the tablet are present in stream metadata.

Case-2: Validation and sync of cdc state entries.

  - A state table entry qualifies for a checkpoint update only if it belongs to none of the tables present in stream metadata.
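
The two cases above can be sketched as a single predicate. This is an illustrative sketch only: `TabletTables` and `QualifiesForCheckpointUpdate` are hypothetical names, not the actual helpers in the diff. It assumes that a null `table_being_removed` selects Case-2, and that in Case-1 a tablet that does not host the table being removed is left untouched.

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

typedef std::string TableId;

// Hypothetical stand-in for the part of TabletInfo the check needs:
// the tables hosted on the tablet (more than one for a colocated tablet).
struct TabletTables {
  std::vector<TableId> table_ids;
};

// Returns true if the state table entry of `tablet` qualifies for a checkpoint update.
// `table_being_removed` is null for validation and sync (Case-2).
bool QualifiesForCheckpointUpdate(const TabletTables& tablet,
                                  const std::set<TableId>& tables_in_stream,
                                  const TableId* table_being_removed) {
  for (const auto& id : tablet.table_ids) {
    // The table being removed is still in stream metadata at this point, so skip it.
    if (table_being_removed != nullptr && id == *table_being_removed) continue;
    // Any other table on the tablet that is part of the stream blocks the update.
    if (tables_in_stream.count(id) > 0) return false;
  }
  if (table_being_removed != nullptr) {
    // Case-1: only tablets that actually host the table being removed qualify.
    return std::find(tablet.table_ids.begin(), tablet.table_ids.end(),
                     *table_being_removed) != tablet.table_ids.end();
  }
  // Case-2: the tablet belongs to none of the tables in stream metadata.
  return true;
}
```

Because the predicate reads tablet-to-table membership from the scanned state table entries instead of from master's tablet list for the table, a concurrent drop of the table no longer empties the candidate set.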
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent, test regex: .*TestNonEligibleTableCleanupWithDropTable.*
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDeleteCDCStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDeleteStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestColocatedUserTableRemovalFromCDCStream

Reviewers: xCluster, hsunder, skumar, stiwary, sumukh.phalgaonkar, asrinivasan

Reviewed By: skumar

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37135
jasonyb pushed a commit that referenced this issue Aug 13, 2024
Summary:
 bd1e19e [PLAT-14835]: Add extra transient YCQL index tables in xClusterTableConfig during GET calls and metrics.
 2715c58 [docs] updated yb version with shortcode (#23456)
 a2b5495 [PLAT-14912] docs warning for install root as subdirectory (#23470)
 53365b1 [#23422] YSQL: Disable random warmup feature by default for connection manager tests
 09d6e96 [#22876][#22835][#22773] CDCSDK: Add null checks & remove optimisation in table removal from CDC stream
 69d4052 [#22862] XCluster: Improving XCluster Index Base WAL Retention Policy
 706e97d [#23460] DocDB: Read vector index data
 b1a90b9 [#23428] docdb: Remove non-transactional snapshot code
 581648f [PLAT-13957] Update RBAC wrapper for xCluster DR
 fbaf945 Whitepaper on migration (#23468)
 f6af2f5 [PLAT-13936] Upgrade Grpc and its dependencies to fix CVEs
 d7027fe [PLAT-14892] Update PITR configuration step text
 92804ac [PLAT-14760] Use new xCluster sync API
 3cb8faf [PLAT-11243] Upgrade python requests to latest version
 Excluded: a036313 [#23070] YSQL, ASH: Replace ysql_session_id with pid
 4d2f71f [PLAT-14882] Retrieve userName from the attribute lists in case not found in dn
 99489c0 [PLAT-14909] Upgrade YBC version to 2.2.0.0-b4

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37272
@yugabyte-ci yugabyte-ci reopened this Aug 14, 2024
siddharth2411 added a commit that referenced this issue Aug 14, 2024
… remove optimisation in table removal from CDC stream

Summary:
**Backport description:**
Minor conflict in xrepl_catalog_manager.cc.

**Original description:**
Original commit: 09d6e96 / D37135
This diff makes the following changes:

 # The table (non-eligible/user-created) removal codepath, as well as the addition of children tablets on split, fetches tableInfoPtr/CDCStreamInfoPtr from master. Some of these places were missing null checks, which have now been added.
 # Removes the optimisation in `UpdateCheckpointForTabletEntriesInCDCState()` that fetched TabletInfo only for a filtered set of tablet_ids when a particular table was being removed. This set contained only the CDC state entries for the table being removed. Because of this optimisation, there was a race condition with drop table, in which the table removal codepath removes the table from stream metadata and the drop table cleanup codepath deletes the relevant state table entries. The race is described below:

Current algorithm for cleanup of a user-created table (yb-admin command):
1. Check that the stream is non-null & not in DELETING state.
2. Check that the table is non-null & not in DELETING state.
3. Update cdc state table entries:

    i) Get from master the tablets belonging to the table being removed.

    ii) Filter the cdc state table entries whose tablets are part of the tablet list collected above.

    iii) Get the TabletInfo of the filtered state table entries and check the table_id of each tablet.

    iv) Iterate over the TabletInfo(s) and update the checkpoint to max if the table_id in TabletInfo doesn't belong to stream metadata.

    v) Delete the colocated table's state table entry.

4. Remove the user table from stream metadata.

In step 3, i) was the optimisation. The race occurs if the table being removed is dropped between step 2 and step 3. When we execute i) of step 3, we receive 0 tablets from master for the table being removed. With 0 tablets, the filter in ii) matches no state table entries, so no state table entry is updated. But in step 4, we still remove the table from stream metadata. When the CDC drop table cleanup background thread then runs, it finds state table entries that don't belong to any table in stream metadata and deletes them directly. Although the end result is as expected, we would like to avoid such partial cleanup from these codepaths.

The updated algorithm for updating the checkpoint of a state table entry is as follows:

We now scan the entire state table and fetch the TabletInfo for all the tablets in the cdc state table. Whether a state table entry qualifies for a checkpoint update depends on whether a particular table is being removed or whether all the state table entries are just being validated and synced with the tables in stream metadata.

Case-1: Removal of a particular table. The tablet has to satisfy either of the two conditions for a checkpoint update:

  - The table being removed is not a colocated table, so the tablet belongs exclusively to the table being removed.
  - The table being removed is a colocated table, and none of the other colocated tables on the tablet are present in stream metadata.

Case-2: Validation and sync of cdc state entries.

  - A state table entry qualifies for a checkpoint update only if it belongs to none of the tables present in stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDeleteCDCStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDeleteStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestColocatedUserTableRemovalFromCDCStream

Reviewers: skumar, stiwary, sumukh.phalgaonkar, asrinivasan

Reviewed By: sumukh.phalgaonkar

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37246
siddharth2411 added a commit that referenced this issue Aug 14, 2024
…emove optimisation in table removal from CDC stream

Summary:
**Backport description:**
Minor conflict in xrepl_catalog_manager.cc

**Original description:**
Original commit: 09d6e96 / D37135
This diff makes the following changes:

 # The table (non-eligible/user-created) removal codepath, as well as the addition of children tablets on split, fetches tableInfoPtr/CDCStreamInfoPtr from master. Some of these places were missing null checks, which have now been added.
 # Removes the optimisation in `UpdateCheckpointForTabletEntriesInCDCState()` that fetched TabletInfo only for a filtered set of tablet_ids when a particular table was being removed. This set contained only the CDC state entries for the table being removed. Because of this optimisation, there was a race condition with drop table, in which the table removal codepath removes the table from stream metadata and the drop table cleanup codepath deletes the relevant state table entries. The race is described below:

Current algorithm for cleanup of a user-created table (yb-admin command):
1. Check that the stream is non-null & not in DELETING state.
2. Check that the table is non-null & not in DELETING state.
3. Update cdc state table entries:

    i) Get from master the tablets belonging to the table being removed.

    ii) Filter the cdc state table entries whose tablets are part of the tablet list collected above.

    iii) Get the TabletInfo of the filtered state table entries and check the table_id of each tablet.

    iv) Iterate over the TabletInfo(s) and update the checkpoint to max if the table_id in TabletInfo doesn't belong to stream metadata.

    v) Delete the colocated table's state table entry.

4. Remove the user table from stream metadata.

In step 3, i) was the optimisation. The race occurs if the table being removed is dropped between step 2 and step 3. When we execute i) of step 3, we receive 0 tablets from master for the table being removed. With 0 tablets, the filter in ii) matches no state table entries, so no state table entry is updated. But in step 4, we still remove the table from stream metadata. When the CDC drop table cleanup background thread then runs, it finds state table entries that don't belong to any table in stream metadata and deletes them directly. Although the end result is as expected, we would like to avoid such partial cleanup from these codepaths.

The updated algorithm for updating the checkpoint of a state table entry is as follows:

We now scan the entire state table and fetch the TabletInfo for all the tablets in the cdc state table. Whether a state table entry qualifies for a checkpoint update depends on whether a particular table is being removed or whether all the state table entries are just being validated and synced with the tables in stream metadata.

Case-1: Removal of a particular table. The tablet has to satisfy either of the two conditions for a checkpoint update:

  - The table being removed is not a colocated table, so the tablet belongs exclusively to the table being removed.
  - The table being removed is a colocated table, and none of the other colocated tables on the tablet are present in stream metadata.

Case-2: Validation and sync of cdc state entries.

  - A state table entry qualifies for a checkpoint update only if it belongs to none of the tables present in stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDeleteCDCStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDeleteStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestColocatedUserTableRemovalFromCDCStream

Reviewers: skumar, stiwary, sumukh.phalgaonkar, asrinivasan

Reviewed By: sumukh.phalgaonkar

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37247
siddharth2411 added a commit that referenced this issue Aug 29, 2024
…ove optimisation in table removal from CDC stream

Summary:
**Backport description:**
Minor merge conflicts due to some refactoring of methods in xrepl_catalog_manager & missing replica
identity flags.

**Original description:**
Original commit: 09d6e96 / D37135
This diff makes the following changes:

 # The table (non-eligible/user-created) removal codepath, as well as the addition of children tablets on split, fetches tableInfoPtr/CDCStreamInfoPtr from master. Some of these places were missing null checks, which have now been added.
 # Removes the optimisation in `UpdateCheckpointForTabletEntriesInCDCState()` that fetched TabletInfo only for a filtered set of tablet_ids when a particular table was being removed. This set contained only the CDC state entries for the table being removed. Because of this optimisation, there was a race condition with drop table, in which the table removal codepath removes the table from stream metadata and the drop table cleanup codepath deletes the relevant state table entries. The race is described below:

Current algorithm for cleanup of a user-created table (yb-admin command):
1. Check that the stream is non-null & not in DELETING state.
2. Check that the table is non-null & not in DELETING state.
3. Update cdc state table entries:

    i) Get from master the tablets belonging to the table being removed.

    ii) Filter the cdc state table entries whose tablets are part of the tablet list collected above.

    iii) Get the TabletInfo of the filtered state table entries and check the table_id of each tablet.

    iv) Iterate over the TabletInfo(s) and update the checkpoint to max if the table_id in TabletInfo doesn't belong to stream metadata.

    v) Delete the colocated table's state table entry.

4. Remove the user table from stream metadata.

In step 3, i) was the optimisation. The race occurs if the table being removed is dropped between step 2 and step 3. When we execute i) of step 3, we receive 0 tablets from master for the table being removed. With 0 tablets, the filter in ii) matches no state table entries, so no state table entry is updated. But in step 4, we still remove the table from stream metadata. When the CDC drop table cleanup background thread then runs, it finds state table entries that don't belong to any table in stream metadata and deletes them directly. Although the end result is as expected, we would like to avoid such partial cleanup from these codepaths.

The updated algorithm for updating the checkpoint of a state table entry is as follows:

We now scan the entire state table and fetch the TabletInfo for all the tablets in the cdc state table. Whether a state table entry qualifies for a checkpoint update depends on whether a particular table is being removed or whether all the state table entries are just being validated and synced with the tables in stream metadata.

Case-1: Removal of a particular table. The tablet has to satisfy either of the two conditions for a checkpoint update:

  - The table being removed is not a colocated table, so the tablet belongs exclusively to the table being removed.
  - The table being removed is a colocated table, and none of the other colocated tables on the tablet are present in stream metadata.

Case-2: Validation and sync of cdc state entries.

  - A state table entry qualifies for a checkpoint update only if it belongs to none of the tables present in stream metadata.
Jira: DB-11778, DB-11733, DB-11676

Test Plan:
Jenkins: urgent, test regex: .*TestNonEligibleTableCleanupWithDropTable.*
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestUserTableCleanupWithDeleteCDCStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDropTable
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestNonEligibleTableCleanupWithDeleteStream
./yb_build.sh --cxx-test cdcsdk_ysql-test --gtest_filter CDCSDKYsqlTest.TestColocatedUserTableRemovalFromCDCStream

Reviewers: xCluster, hsunder, skumar, stiwary, sumukh.phalgaonkar, asrinivasan

Reviewed By: sumukh.phalgaonkar

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37622