
[CDCSDK] Stale entry in CDC Cache causes Stream Expiration. #13693

Closed
sureshdash2022-yb opened this issue Aug 19, 2022 · 0 comments

@sureshdash2022-yb
Contributor

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.
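A minimal sketch of the failure mode described above, assuming a per-tserver cache keyed by tablet and an expiry check against cdc_intent_retention_ms. The names (`is_stream_expired`, the plain dict caches) are illustrative, not the actual YugabyteDB identifiers:

```python
# Hypothetical model of the per-tserver active-time cache. Each tserver
# tracks last_active_time only for the period it served as LEADER, so the
# caches can drift apart across leadership changes.

CDC_INTENT_RETENTION_MS = 4 * 60 * 60 * 1000  # default: 4 hours

def is_stream_expired(cache, tablet_id, now_ms):
    """A stream expires when the serving tserver's cached entry is too old."""
    last_active = cache.get(tablet_id)
    if last_active is None:
        return False  # no entry yet; one is created on first GetChanges
    return now_ms - last_active > CDC_INTENT_RETENTION_MS

# TS1 is LEADER at t=0 and records activity in its own cache only.
ts1_cache, ts2_cache = {}, {}
ts1_cache["tablet-1"] = 0

# Leadership moves to TS2, which keeps the stream active in *its* cache.
t = CDC_INTENT_RETENTION_MS
ts2_cache["tablet-1"] = t

# Leadership returns to TS1 after the retention window: TS1's stale entry
# wrongly expires the stream even though TS2 saw recent activity.
now = t + 1000
assert is_stream_expired(ts1_cache, "tablet-1", now) is True
assert is_stream_expired(ts2_cache, "tablet-1", now) is False
```

The bug is not in the expiry check itself but in the fact that each replica's cache only reflects activity seen during its own leadership terms.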

@sureshdash2022-yb sureshdash2022-yb self-assigned this Aug 19, 2022
aishwarya24 pushed a commit to aishwarya24/yugabyte-db that referenced this issue Aug 19, 2022
…tion.

Summary:
Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.
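The fix can be sketched as a periodic broadcast from the LEADER to its FOLLOWERs. This is a hedged illustration only: the `TServer` class, `update_peers_and_metrics` function, and the max-merge policy are assumptions, not the actual implementation:

```python
# Hypothetical sketch: on each UpdatePeersAndMetrics pass, the LEADER pushes
# its last_active_time to every FOLLOWER's cache so all peers agree on the
# stream's activity, regardless of which replica leads next.

class TServer:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # tablet_id -> last_active_time_ms

def update_peers_and_metrics(leader, followers, tablet_id):
    last_active = leader.cache.get(tablet_id)
    if last_active is None:
        return
    for peer in followers:
        # A follower keeps the newer of its own entry and the leader's value,
        # so a stale broadcast can never roll an entry backwards.
        peer.cache[tablet_id] = max(peer.cache.get(tablet_id, 0), last_active)

ts1, ts2, ts3 = TServer("TS1"), TServer("TS2"), TServer("TS3")
ts2.cache["tablet-1"] = 5_000  # TS2 is the current LEADER
update_peers_and_metrics(ts2, [ts1, ts3], "tablet-1")
assert ts1.cache["tablet-1"] == 5_000 and ts3.cache["tablet-1"] == 5_000
```

With the caches synchronized, the scenario above no longer expires the stream: when TS1 regains leadership, its entry already reflects the activity TS2 observed.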

Test Plan:
Jenkins: skip
Running all the C and Java test cases

Reviewers: abharadwaj, aagrawal, vkushwaha, skumar, srangavajjula

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18882
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ntries in the cdc_state table causing tserver crash

Summary:
Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is created, the previously deleted stream's entries interfere with its functioning, ultimately leading to a tserver crash.

To fix this, we will ignore the deleted stream's metadata entries as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
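A minimal sketch of that two-part fix, under assumed names: a `set_cdc_checkpoint` that skips rows belonging to deleted streams, and a `cleanup_deleted_streams` pass standing in for the deferred removal done by //UpdatePeersAndMetrics//. The row shape and function signatures are illustrative, not the real cdc_state schema:

```python
# Hypothetical model of cdc_state rows: one dict per (stream, tablet) pair.

def set_cdc_checkpoint(cdc_state_rows, live_stream_ids, stream_id, tablet_id,
                       checkpoint):
    if stream_id not in live_stream_ids:
        raise ValueError("stream not found")
    for row in cdc_state_rows:
        # Ignore rows left behind by deleted streams instead of letting
        # their stale metadata interfere with the new stream.
        if row["stream_id"] not in live_stream_ids:
            continue
        if row["stream_id"] == stream_id and row["tablet_id"] == tablet_id:
            row["checkpoint"] = checkpoint

def cleanup_deleted_streams(cdc_state_rows, live_stream_ids):
    # Deferred removal, run once the background maintenance pass is enabled.
    return [r for r in cdc_state_rows if r["stream_id"] in live_stream_ids]

rows = [
    {"stream_id": "old", "tablet_id": "t1", "checkpoint": "1.10"},  # deleted
    {"stream_id": "new", "tablet_id": "t1", "checkpoint": "0.0"},
]
set_cdc_checkpoint(rows, {"new"}, "new", "t1", "2.5")
rows = cleanup_deleted_streams(rows, {"new"})
assert rows == [{"stream_id": "new", "tablet_id": "t1", "checkpoint": "2.5"}]
```

Splitting "ignore now" from "delete later" keeps the checkpoint path cheap while still guaranteeing that stale rows eventually disappear.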

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan: Jenkins: urgent

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19054
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ries in the cdc_state table causing tserver crash

Summary:
Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is created, the previously deleted stream's entries interfere with its functioning, ultimately leading to a tserver crash.

To fix this, we will ignore the deleted stream's metadata entries as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan:
Jenkins: urgent
Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19056
adithya-kb pushed a commit that referenced this issue Aug 31, 2022
Summary:
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.
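The retention decision above can be sketched as a small function over the active streams. This is a simplified model under stated assumptions: `OP_ID_MAX`, the `(term, index)` tuple encoding of an OpId, and `min_retained_op_id` are illustrative stand-ins for the TabletPeer's actual bookkeeping:

```python
# Hypothetical sketch: the TabletPeer retains intents up to the minimum
# checkpoint OpId among *active* streams; if no stream is active it reports
# OpId "max" so all intents become eligible for garbage collection.

OP_ID_MAX = (2**63 - 1, 2**63 - 1)           # stand-in for OpId::max()
CDC_INTENT_RETENTION_MS = 4 * 60 * 60 * 1000  # default: 4 hours

def min_retained_op_id(streams, now_ms):
    """streams: list of (checkpoint_op_id, last_active_time_ms) pairs."""
    active = [op for op, last_active in streams
              if now_ms - last_active <= CDC_INTENT_RETENTION_MS]
    return min(active) if active else OP_ID_MAX

now = 10 * 60 * 60 * 1000  # 10 hours in
streams = [
    ((1, 100), now - 1000),                   # active: seen 1 s ago
    ((1, 50),  now - 5 * 60 * 60 * 1000),     # expired: seen 5 h ago
]
# The expired stream no longer pins intents at (1, 50).
assert min_retained_op_id(streams, now) == (1, 100)
# All streams inactive -> OpId::max(), so all intents can be GC'd.
assert min_retained_op_id([((1, 50), 0)], now) == OP_ID_MAX
```

Because OpIds compare lexicographically by (term, index), taking the minimum over active streams retains exactly the oldest intents any live consumer might still request.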

Test Plan:
Jenkins: skip
Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, mbautin, skumar

Reviewed By: skumar

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19201
adithya-kb pushed a commit that referenced this issue Aug 31, 2022
…able

Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the 'time' when GetChanges API call was made for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan:
Jenkins: urgent
Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19244
adithya-kb pushed a commit that referenced this issue Sep 2, 2022
Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan: Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, skumar

Reviewed By: skumar

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19287
adithya-kb pushed a commit that referenced this issue Sep 7, 2022
Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan:
Jenkins: urgent
Existing cdcsdk test cases

Reviewers: srangavajjula, skumar, sdash

Reviewed By: skumar, sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19346