[CDCSDK] Stale entry in CDC Cache causes Stream Expiration. #13693
sureshdash2022-yb added the priority/medium (Medium priority issue), 2.12 Backport Required, and 2.14 Backport Required labels on Aug 19, 2022.
aishwarya24 pushed a commit to aishwarya24/yugabyte-db that referenced this issue on Aug 19, 2022:
…tion.

Summary: Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially the tablet LEADER is TS1, so TS1 holds a cache entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes LEADER again after the cdc_intent_retention_ms expiration time, its existing cache entry is out of sync, so a GetChanges call will report the stream as expired. To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update their //last_active_time// in the CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan: Jenkins: skip. Running all the C and Java test cases.

Reviewers: abharadwaj, aagrawal, vkushwaha, skumar, srangavajjula
Reviewed By: skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D18882
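The essence of the fix is that a peer's cache entry stays fresh while another peer is LEADER. Below is a minimal standalone sketch of that idea, not the actual YugabyteDB classes or RPC plumbing; CdcServiceCache, Update, and SyncActiveTimeToFollowers are illustrative names.

```cpp
// Minimal sketch, assuming illustrative types: a per-tserver cache keyed by
// (stream_id, tablet_id), plus the leader-side pass that replicates its
// last_active_time to followers (an RPC in the real system).
#include <algorithm>
#include <chrono>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;
using MonoTime = Clock::time_point;

// Per-tserver CDC Service cache: (stream_id, tablet_id) -> last_active_time.
struct CdcServiceCache {
  std::map<std::pair<std::string, std::string>, MonoTime> last_active_time;

  void Update(const std::string& stream_id, const std::string& tablet_id, MonoTime t) {
    auto& entry = last_active_time[{stream_id, tablet_id}];
    entry = std::max(entry, t);  // never move the active time backwards
  }
};

// What the LEADER's UpdatePeersAndMetrics pass conceptually does for one tablet:
// read its own cache entry and push it to every FOLLOWER's cache, so that a later
// leadership change finds an up-to-date entry instead of a stale one.
void SyncActiveTimeToFollowers(const CdcServiceCache& leader_cache,
                               const std::vector<CdcServiceCache*>& followers,
                               const std::string& stream_id,
                               const std::string& tablet_id) {
  auto it = leader_cache.last_active_time.find({stream_id, tablet_id});
  if (it == leader_cache.last_active_time.end()) {
    return;  // leader has never served this stream/tablet pair
  }
  for (auto* follower : followers) {
    follower->Update(stream_id, tablet_id, it->second);  // RPC in the real system
  }
}

int main() {
  CdcServiceCache ts1, ts2, ts3;  // the three tservers from the scenario above
  // TS2 is the current LEADER and just served GetChanges for the tablet.
  ts2.Update("stream-1", "tablet-1", Clock::now());
  // The periodic sync keeps TS1 and TS3 from holding stale entries.
  SyncActiveTimeToFollowers(ts2, {&ts1, &ts3}, "stream-1", "tablet-1");
  return 0;
}
```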
sureshdash2022-yb added a commit that referenced this issue on Aug 20, 2022:
…ntries in the cdc_state table causing tserver crash

Summary: Original commit: 2787d62/D18882, 86a78b7/D18986.

During analysis we found that when a stream_id is deleted, the metadata related to it is not cleared from the cdc_state table. Even if a new stream is created, the previous (deleted) stream's entries interfere with its functioning, ultimately leading to a tserver crash. To fix this, we ignore those deleted stream metadata entries as part of //setCDCCheckpoint//, and remove them when the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration. Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially the tablet LEADER is TS1, so TS1 holds a cache entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes LEADER again after the cdc_intent_retention_ms expiration time, its existing cache entry is out of sync, so a GetChanges call will report the stream as expired. To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update their //last_active_time// in the CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan: Jenkins: urgent.

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D19054
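A rough sketch of the guard described above, using assumed simplified types (CdcStateRow, a deleted_streams set) rather than the real cdc_state schema or the actual setCDCCheckpoint signature: rows belonging to a deleted stream are skipped when a checkpoint is set, and a separate cleanup pass removes them later.

```cpp
// Illustrative sketch only; the row layout and function names are assumptions,
// not the YugabyteDB implementation.
#include <algorithm>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

struct CdcStateRow {
  std::string stream_id;
  std::string tablet_id;
  int64_t checkpoint_term = 0;
  int64_t checkpoint_index = 0;
};

// Setting a checkpoint skips rows of streams that are already deleted; acting on
// them is what previously interfered with new streams and crashed the tserver.
bool SetCdcCheckpoint(std::vector<CdcStateRow>& cdc_state,
                      const std::set<std::string>& deleted_streams,
                      const CdcStateRow& request) {
  if (deleted_streams.count(request.stream_id) > 0) {
    return false;  // ignore stale metadata; it will be cleaned up later
  }
  for (auto& row : cdc_state) {
    if (row.stream_id == request.stream_id && row.tablet_id == request.tablet_id) {
      row.checkpoint_term = request.checkpoint_term;
      row.checkpoint_index = request.checkpoint_index;
      return true;
    }
  }
  cdc_state.push_back(request);  // first checkpoint for this stream/tablet pair
  return true;
}

// Cleanup pass, run once the UpdatePeersAndMetrics thread is enabled again:
// drop every cdc_state row that belongs to a deleted stream.
void RemoveDeletedStreamEntries(std::vector<CdcStateRow>& cdc_state,
                                const std::set<std::string>& deleted_streams) {
  cdc_state.erase(
      std::remove_if(cdc_state.begin(), cdc_state.end(),
                     [&](const CdcStateRow& row) {
                       return deleted_streams.count(row.stream_id) > 0;
                     }),
      cdc_state.end());
}

int main() {
  std::vector<CdcStateRow> cdc_state{{"old-stream", "tablet-1", 1, 10}};
  std::set<std::string> deleted_streams{"old-stream"};
  SetCdcCheckpoint(cdc_state, deleted_streams, {"old-stream", "tablet-1", 2, 20});  // ignored
  SetCdcCheckpoint(cdc_state, deleted_streams, {"new-stream", "tablet-1", 1, 5});   // applied
  RemoveDeletedStreamEntries(cdc_state, deleted_streams);  // only new-stream's row remains
  return 0;
}
```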
sureshdash2022-yb added a commit that referenced this issue on Aug 20, 2022:
…ries in the cdc_state table causing tserver crash

Summary: Original commit: 2787d62/D18882, 86a78b7/D18986.

During analysis we found that when a stream_id is deleted, the metadata related to it is not cleared from the cdc_state table. Even if a new stream is created, the previous (deleted) stream's entries interfere with its functioning, ultimately leading to a tserver crash. To fix this, we ignore those deleted stream metadata entries as part of //setCDCCheckpoint//, and remove them when the //UpdatePeersAndMetrics// thread is enabled again.

[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration. Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially the tablet LEADER is TS1, so TS1 holds a cache entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes LEADER again after the cdc_intent_retention_ms expiration time, its existing cache entry is out of sync, so a GetChanges call will report the stream as expired. To handle this, the LEADER sends an RPC request, as part of the //UpdatePeersAndMetrics// thread, to all FOLLOWERs to update their //last_active_time// in the CDC Service cache, so that the LEADER and FOLLOWERs stay in sync.

Test Plan: Jenkins: urgent. Running all the C and Java test cases.

Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D19056
adithya-kb pushed a commit that referenced this issue on Aug 31, 2022:
Summary: When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive. To determine whether a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value against each stream/tablet pair. To improve performance, we also maintain an in-memory cache so that we don't write to the cdc_state table on every call.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage-collected. This information is also sent to follower tablets along with the min OpId so that they can decide when to clear the intents.

Test Plan: Jenkins: skip. Existing cdcsdk test cases.

Reviewers: sdash, srangavajjula, mbautin, skumar
Reviewed By: skumar
Subscribers: bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D19201
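A minimal sketch of that retention decision, with simplified OpId and stream-state types standing in for the real TabletPeer code: only streams whose last GetChanges call falls within cdc_intent_retention_ms contribute to the min OpId and oldest active time, and when none are active the result is OpId max so all intents may be garbage-collected.

```cpp
// Standalone sketch; the types and ComputeRetention are illustrative names, not
// the actual YugabyteDB classes.
#include <chrono>
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

using Clock = std::chrono::steady_clock;
using MonoTime = Clock::time_point;

struct OpId {
  int64_t term = 0;
  int64_t index = 0;
  static OpId Max() {
    return {std::numeric_limits<int64_t>::max(), std::numeric_limits<int64_t>::max()};
  }
  bool operator<(const OpId& o) const {
    return term != o.term ? term < o.term : index < o.index;
  }
};

struct StreamState {
  OpId checkpoint;            // last checkpoint acknowledged via GetChanges
  MonoTime last_active_time;  // last time GetChanges was called for this stream/tablet
};

struct RetentionDecision {
  OpId min_op_id;                              // intents below this may be cleaned up
  std::optional<MonoTime> oldest_active_time;  // empty when no stream is active
};

RetentionDecision ComputeRetention(const std::vector<StreamState>& streams,
                                   std::chrono::milliseconds cdc_intent_retention,
                                   MonoTime now) {
  RetentionDecision d{OpId::Max(), std::nullopt};
  for (const auto& s : streams) {
    if (now - s.last_active_time > cdc_intent_retention) continue;  // inactive stream
    if (s.checkpoint < d.min_op_id) d.min_op_id = s.checkpoint;
    if (!d.oldest_active_time || s.last_active_time < *d.oldest_active_time) {
      d.oldest_active_time = s.last_active_time;
    }
  }
  // If every stream is inactive, min_op_id stays at OpId::Max(), which tells the
  // tablet (and, once propagated, its followers) that all intents may be removed.
  return d;
}

int main() {
  const auto now = Clock::now();
  std::vector<StreamState> streams{
      {{5, 100}, now - std::chrono::hours(1)},  // active stream
      {{3, 40}, now - std::chrono::hours(6)},   // expired stream, ignored
  };
  auto d = ComputeRetention(streams, std::chrono::hours(4), now);
  // d.min_op_id is {5, 100}; with no active streams it would be OpId::Max().
  return 0;
}
```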
adithya-kb pushed a commit that referenced this issue on Aug 31, 2022:
…able

Summary: Original commit: 2b8a52b/D19201.

When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive. To determine whether a stream is active, we track the time when the GetChanges API call was made for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value against each stream/tablet pair. To improve performance, we also maintain an in-memory cache so that we don't write to the cdc_state table on every call.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage-collected. This information is also sent to follower tablets along with the min OpId so that they can decide when to clear the intents.

Test Plan: Jenkins: urgent. Existing cdcsdk test cases.

Reviewers: sdash, srangavajjula, skumar
Reviewed By: skumar
Subscribers: ycdcxcluster, bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D19244
adithya-kb pushed a commit that referenced this issue on Sep 2, 2022:
Summary: Original commit: 2b8a52b/D19201.

When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive. To determine whether a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value against each stream/tablet pair. To improve performance, we also maintain an in-memory cache so that we don't write to the cdc_state table on every call.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage-collected. This information is also sent to follower tablets along with the min OpId so that they can decide when to clear the intents.

Test Plan: Existing cdcsdk test cases.

Reviewers: sdash, srangavajjula, skumar
Reviewed By: skumar
Subscribers: bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D19287
adithya-kb pushed a commit that referenced this issue on Sep 7, 2022:
Summary: Original commit: 2b8a52b/D19201.

When CDC clients are not running for a long period of time, we don't want to retain intents, as that may cause a resource utilization issue in a YugabyteDB cluster. We have a GFLAG, cdc_intent_retention_ms (default value of 4 hours), that determines when we can mark a stream as inactive. To determine whether a stream is active, we track the timestamp of the GetChanges API call for the stream/tablet pair. We introduce last_active_time in the cdc_state table and store this value against each stream/tablet pair. To improve performance, we also maintain an in-memory cache so that we don't write to the cdc_state table on every call.

The TabletPeer tracks the 'min' OpId and the 'oldest' active time among all the 'active' streams. If all the streams are inactive, the min OpId is returned as OpId::max() to ensure all the intents are garbage-collected. This information is also sent to follower tablets along with the min OpId so that they can decide when to clear the intents.

Test Plan: Jenkins: urgent. Existing cdcsdk test cases.

Reviewers: srangavajjula, skumar, sdash
Reviewed By: skumar, sdash
Subscribers: bogdan
Differential Revision: https://phabricator.dev.yugabyte.com/D19346
Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream's active time in a per-tserver cache. Initially the tablet LEADER is TS1, so TS1 holds a cache entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the tablet's active time. If TS1 becomes LEADER again after the cdc_intent_retention_ms expiration time, its existing cache entry is out of sync, so a GetChanges call will report the stream as expired.
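To make the failure concrete, here is a tiny sketch of the check that misfires (illustrative names and numbers, not the actual code): cdc_intent_retention_ms defaults to 4 hours, and TS1's cache entry has not been refreshed since before TS2 took over leadership.

```cpp
#include <chrono>
#include <iostream>

int main() {
  using namespace std::chrono;
  const milliseconds cdc_intent_retention_ms = hours(4);  // default GFLAG value
  const auto now = steady_clock::now();

  // TS1's stale cache entry: last updated ~5 hours ago, before TS2 took over.
  const auto ts1_last_active_time = now - hours(5);

  if (now - ts1_last_active_time > cdc_intent_retention_ms) {
    std::cout << "GetChanges on TS1 reports: stream expired\n";  // the reported bug
  }
  // With the fix, UpdatePeersAndMetrics keeps TS1's entry refreshed while TS2 is
  // LEADER, so this branch is not taken for an active stream.
  return 0;
}
```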