
[CDCSDK] Stale entry in CDC Cache causes Stream Expiration. #13693

Closed
sureshdash2022-yb opened this issue Aug 19, 2022 · 0 comments

@sureshdash2022-yb
Contributor

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.
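A minimal sketch of the failure mode described above, assuming a per-tserver cache keyed by tablet and an expiry check against cdc_intent_retention_ms. The names (`is_stream_expired`, the plain dict caches) are illustrative, not the actual YugabyteDB identifiers:

```python
# Hypothetical model of the per-tserver active-time cache. Each tserver
# tracks last_active_time only for the period it served as LEADER, so the
# caches can drift apart across leadership changes.

CDC_INTENT_RETENTION_MS = 4 * 60 * 60 * 1000  # default: 4 hours

def is_stream_expired(cache, tablet_id, now_ms):
    """A stream expires when the serving tserver's cached entry is too old."""
    last_active = cache.get(tablet_id)
    if last_active is None:
        return False  # no entry yet; one is created on first GetChanges
    return now_ms - last_active > CDC_INTENT_RETENTION_MS

# TS1 is LEADER at t=0 and records activity in its own cache only.
ts1_cache, ts2_cache = {}, {}
ts1_cache["tablet-1"] = 0

# Leadership moves to TS2, which keeps the stream active in *its* cache.
t = CDC_INTENT_RETENTION_MS
ts2_cache["tablet-1"] = t

# Leadership returns to TS1 after the retention window: TS1's stale entry
# wrongly expires the stream even though TS2 saw recent activity.
now = t + 1000
assert is_stream_expired(ts1_cache, "tablet-1", now) is True
assert is_stream_expired(ts2_cache, "tablet-1", now) is False
```

The bug is not in the expiry check itself but in the fact that each replica's cache only reflects activity seen during its own leadership terms.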

@sureshdash2022-yb sureshdash2022-yb self-assigned this Aug 19, 2022
aishwarya24 pushed a commit to aishwarya24/yugabyte-db that referenced this issue Aug 19, 2022
…tion.

Summary:
Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.
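The fix can be sketched as a periodic broadcast from the LEADER to its FOLLOWERs. This is a hedged illustration only: the `TServer` class, `update_peers_and_metrics` function, and the max-merge policy are assumptions, not the actual implementation:

```python
# Hypothetical sketch: on each UpdatePeersAndMetrics pass, the LEADER pushes
# its last_active_time to every FOLLOWER's cache so all peers agree on the
# stream's activity, regardless of which replica leads next.

class TServer:
    def __init__(self, name):
        self.name = name
        self.cache = {}  # tablet_id -> last_active_time_ms

def update_peers_and_metrics(leader, followers, tablet_id):
    last_active = leader.cache.get(tablet_id)
    if last_active is None:
        return
    for peer in followers:
        # A follower keeps the newer of its own entry and the leader's value,
        # so a stale broadcast can never roll an entry backwards.
        peer.cache[tablet_id] = max(peer.cache.get(tablet_id, 0), last_active)

ts1, ts2, ts3 = TServer("TS1"), TServer("TS2"), TServer("TS3")
ts2.cache["tablet-1"] = 5_000  # TS2 is the current LEADER
update_peers_and_metrics(ts2, [ts1, ts3], "tablet-1")
assert ts1.cache["tablet-1"] == 5_000 and ts3.cache["tablet-1"] == 5_000
```

With the caches synchronized, the scenario above no longer expires the stream: when TS1 regains leadership, its entry already reflects the activity TS2 observed.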

Test Plan:
Jenkins: skip
Running all the C and Java test cases

Reviewers: abharadwaj, aagrawal, vkushwaha, skumar, srangavajjula

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D18882
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ntries in the cdc_state table causing tserver crash

Summary:
Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is created, the previously deleted stream's entries interfere with its functioning, ultimately leading to a tserver crash.

To fix this, we will ignore the deleted stream's metadata entries as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
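A minimal sketch of that two-part fix, under assumed names: a `set_cdc_checkpoint` that skips rows belonging to deleted streams, and a `cleanup_deleted_streams` pass standing in for the deferred removal done by //UpdatePeersAndMetrics//. The row shape and function signatures are illustrative, not the real cdc_state schema:

```python
# Hypothetical model of cdc_state rows: one dict per (stream, tablet) pair.

def set_cdc_checkpoint(cdc_state_rows, live_stream_ids, stream_id, tablet_id,
                       checkpoint):
    if stream_id not in live_stream_ids:
        raise ValueError("stream not found")
    for row in cdc_state_rows:
        # Ignore rows left behind by deleted streams instead of letting
        # their stale metadata interfere with the new stream.
        if row["stream_id"] not in live_stream_ids:
            continue
        if row["stream_id"] == stream_id and row["tablet_id"] == tablet_id:
            row["checkpoint"] = checkpoint

def cleanup_deleted_streams(cdc_state_rows, live_stream_ids):
    # Deferred removal, run once the background maintenance pass is enabled.
    return [r for r in cdc_state_rows if r["stream_id"] in live_stream_ids]

rows = [
    {"stream_id": "old", "tablet_id": "t1", "checkpoint": "1.10"},  # deleted
    {"stream_id": "new", "tablet_id": "t1", "checkpoint": "0.0"},
]
set_cdc_checkpoint(rows, {"new"}, "new", "t1", "2.5")
rows = cleanup_deleted_streams(rows, {"new"})
assert rows == [{"stream_id": "new", "tablet_id": "t1", "checkpoint": "2.5"}]
```

Splitting "ignore now" from "delete later" keeps the checkpoint path cheap while still guaranteeing that stale rows eventually disappear.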

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan: Jenkins: urgent

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19054
sureshdash2022-yb added a commit that referenced this issue Aug 20, 2022
…ries in the cdc_state table causing tserver crash

Summary:
Original commit:
 - 2787d62/D18882
 - 86a78b7/D18986
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is created, the previously deleted stream's entries interfere with its functioning, ultimately leading to a tserver crash.

To fix this, we will ignore the deleted stream's metadata entries as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.

Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially TS1 is the tablet LEADER, so TS1's cache holds an entry tracking the tablet's active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes the LEADER again after cdc_intent_retention_ms has elapsed, its existing cache entry is out of sync, so a GetChanges call will expire the stream.

To handle this, the LEADER will send an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update the //last_active_time// in their CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan:
Jenkins: urgent
Running all the C and Java test cases

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster

Differential Revision: https://phabricator.dev.yugabyte.com/D19056
adithya-kb pushed a commit that referenced this issue Aug 31, 2022
Summary:
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.
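The retention decision above can be sketched as a small function over the active streams. This is a simplified model under stated assumptions: `OP_ID_MAX`, the `(term, index)` tuple encoding of an OpId, and `min_retained_op_id` are illustrative stand-ins for the TabletPeer's actual bookkeeping:

```python
# Hypothetical sketch: the TabletPeer retains intents up to the minimum
# checkpoint OpId among *active* streams; if no stream is active it reports
# OpId "max" so all intents become eligible for garbage collection.

OP_ID_MAX = (2**63 - 1, 2**63 - 1)           # stand-in for OpId::max()
CDC_INTENT_RETENTION_MS = 4 * 60 * 60 * 1000  # default: 4 hours

def min_retained_op_id(streams, now_ms):
    """streams: list of (checkpoint_op_id, last_active_time_ms) pairs."""
    active = [op for op, last_active in streams
              if now_ms - last_active <= CDC_INTENT_RETENTION_MS]
    return min(active) if active else OP_ID_MAX

now = 10 * 60 * 60 * 1000  # 10 hours in
streams = [
    ((1, 100), now - 1000),                   # active: seen 1 s ago
    ((1, 50),  now - 5 * 60 * 60 * 1000),     # expired: seen 5 h ago
]
# The expired stream no longer pins intents at (1, 50).
assert min_retained_op_id(streams, now) == (1, 100)
# All streams inactive -> OpId::max(), so all intents can be GC'd.
assert min_retained_op_id([((1, 50), 0)], now) == OP_ID_MAX
```

Because OpIds compare lexicographically by (term, index), taking the minimum over active streams retains exactly the oldest intents any live consumer might still request.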

Test Plan:
Jenkins: skip
Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, mbautin, skumar

Reviewed By: skumar

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19201
adithya-kb pushed a commit that referenced this issue Aug 31, 2022
…able

Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the 'time' when GetChanges API call was made for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan:
Jenkins: urgent
Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, skumar

Reviewed By: skumar

Subscribers: ycdcxcluster, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19244
adithya-kb pushed a commit that referenced this issue Sep 2, 2022
Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan: Existing cdcsdk test cases

Reviewers: sdash, srangavajjula, skumar

Reviewed By: skumar

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19287
adithya-kb pushed a commit that referenced this issue Sep 7, 2022
Summary:
Original commit: 2b8a52b/D19201
When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource-utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive.

To determine if a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value in the table against each stream/tablet pair. To improve performance, we also maintain an in-memory cache to ensure we don't write this into cdc_state table every time.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage collected.

This information is also sent to follower tablets along with min OpId so that they can also decide when to clear the intents.

Test Plan:
Jenkins: urgent
Existing cdcsdk test cases

Reviewers: srangavajjula, skumar, sdash

Reviewed By: skumar, sdash

Subscribers: bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D19346