[CDCSDK] Deleting stream IDs lead to stale entries in the cdc_state table causing tserver crash #13653
yugabyte-ci added the kind/bug and priority/medium labels on Aug 17, 2022
samiahmedsiddiqui pushed a commit to samiahmedsiddiqui/yugabyte-db that referenced this issue on Aug 19, 2022
… the cdc_state table causing tserver crash
Summary: During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the entries of the previous (deleted) stream interfere with it, which ultimately leads to a tserver crash. To fix this, we will ignore the metadata entries of deleted streams as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
Test Plan: Running all the C and Java test cases
Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: aagrawal, skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D18986
sureshdash2022-yb added a commit that referenced this issue on Aug 20, 2022
…ntries in the cdc_state table causing tserver crash
Summary: "Original commit: - 2787d62/D18882 - 86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the entries of the previous (deleted) stream interfere with it, which ultimately leads to a tserver crash. To fix this, we will ignore the metadata entries of deleted streams as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.
Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream active time in the cache. Initially the tablet LEADER is TS1, so TS1's cache has an entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the active time of the tablet. If TS1 becomes the LEADER again after the cdc_intent_retention_ms expiration time has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired. To handle this, the LEADER will send an RPC request to all FOLLOWERs as part of the //UpdatePeersAndMetrics// thread to update their //last_active_time// in the CDC Service Cache, so that the LEADER and FOLLOWERs stay in sync.
Test Plan: Jenkins: urgent
Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D19054
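The leader-change scenario in this commit message can be modelled with a small simulation. This is only a toy sketch, not YugabyteDB code: the class `TServerCache`, the functions `leader_update` and `simulate`, the tablet name, and the 1000 ms retention value are all hypothetical names and numbers chosen for illustration.

```python
class TServerCache:
    """Hypothetical per-tserver cache: tablet_id -> last active time (ms)."""

    def __init__(self, name):
        self.name = name
        self.last_active_time = {}

    def touch(self, tablet_id, now_ms):
        # Record that the stream was active on this tablet at now_ms.
        self.last_active_time[tablet_id] = now_ms

    def is_expired(self, tablet_id, now_ms, retention_ms):
        # With no entry there is nothing to compare against, so not expired.
        last = self.last_active_time.get(tablet_id)
        if last is None:
            return False
        return now_ms - last > retention_ms


def leader_update(leader, followers, tablet_id, now_ms):
    """The leader refreshes its own entry and, if given, its followers' entries."""
    leader.touch(tablet_id, now_ms)
    for follower in followers:
        follower.touch(tablet_id, now_ms)


def simulate(sync_followers):
    """Return True if TS1 sees the stream as expired when it regains leadership."""
    retention_ms = 1000  # stand-in for cdc_intent_retention_ms (assumed value)
    ts1, ts2, ts3 = (TServerCache(n) for n in ("TS1", "TS2", "TS3"))
    tablet = "tablet-1"

    ts1.touch(tablet, 0)  # TS1 is the initial leader at t=0
    # TS2 is the leader from t=100 to t=1500 and keeps the stream active.
    for now in range(100, 1600, 100):
        followers = [ts1, ts3] if sync_followers else []
        leader_update(ts2, followers, tablet, now)

    # TS1 becomes the leader again at t=1600 and checks its own cache entry.
    return ts1.is_expired(tablet, 1600, retention_ms)


if __name__ == "__main__":
    print("expired without follower sync:", simulate(False))
    print("expired with follower sync:", simulate(True))
```

Without the follower update, TS1's entry still holds the time from its earlier leadership, so the stream looks expired even though it was active on TS2 the whole time. With the update, TS1's entry tracks the leader's and the stream stays alive.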
sureshdash2022-yb added a commit that referenced this issue on Aug 20, 2022
…_state table causing tserver crash
Summary: During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the entries of the previous (deleted) stream interfere with it, which ultimately leads to a tserver crash. To fix this, we will ignore the metadata entries of deleted streams as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
Test Plan: Running all the C and Java test cases
Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: aagrawal, skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D18986
sureshdash2022-yb added a commit that referenced this issue on Aug 20, 2022
…ries in the cdc_state table causing tserver crash
Summary: "Original commit: - 2787d62/D18882 - 86a78b7/D18986"
During the analysis we found that when a stream_id is deleted, its metadata is not cleared from the cdc_state table. Even if a new stream is then created, the entries of the previous (deleted) stream interfere with it, which ultimately leads to a tserver crash. To fix this, we will ignore the metadata entries of deleted streams as part of //setCDCCheckpoint//, and will remove those entries when the //UpdatePeersAndMetrics// thread is enabled again.
[#13693] CDCSDK: Stale entry in CDC Cache causes Stream Expiration.
Consider a cluster with 3 tservers (TS1, TS2, TS3) and a table with a single tablet. Today we maintain and track the stream active time in the cache. Initially the tablet LEADER is TS1, so TS1's cache has an entry for the tablet to track its active time. After some time TS2 becomes the tablet LEADER, so an entry is created in TS2's cache to track the active time of the tablet. If TS1 becomes the LEADER again after the cdc_intent_retention_ms expiration time has passed, its existing cache entry is out of sync, so a GetChanges call will find the stream expired. To handle this, the LEADER will send an RPC request to all FOLLOWERs as part of the //UpdatePeersAndMetrics// thread to update their //last_active_time// in the CDC Service Cache, so that the LEADER and FOLLOWERs stay in sync.
Test Plan: Running all the C and Java test cases. Jenkins: urgent
Reviewers: srangavajjula, vkushwaha, abharadwaj, aagrawal, skumar
Reviewed By: skumar
Subscribers: ycdcxcluster
Differential Revision: https://phabricator.dev.yugabyte.com/D19056
Jira Link: DB-3219
When a stream_id is deleted, the metadata related to it is not cleared from the cdc_state table. Even if a new stream is then created, the entries of the previous (deleted) stream interfere with it, which ultimately leads to a tserver crash.
Steps to reproduce:
yb-admin create_change_data_stream ysql.<namespace-name>
yb-admin delete_change_data_stream <stream-id>
However, a workaround has been identified to unblock users: set yb_system_namespace_readonly to false, then use ycqlsh to delete the entries corresponding to the deleted stream ID in step 5.
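The effect of stale entries, and of ignoring them, can be illustrated with a toy model. This is a sketch under stated assumptions, not the actual tserver logic: the row layout, the names `CdcStateRow`, `split_stale` and `min_checkpoint_per_tablet`, and the idea of taking a per-tablet minimum checkpoint are all hypothetical and only stand in for whatever the real code computes from cdc_state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdcStateRow:
    """Hypothetical simplified cdc_state row."""
    tablet_id: str
    stream_id: str
    checkpoint: int


def split_stale(rows, live_stream_ids):
    """Separate rows of existing streams from rows left behind by deleted ones."""
    live = [r for r in rows if r.stream_id in live_stream_ids]
    stale = [r for r in rows if r.stream_id not in live_stream_ids]
    return live, stale


def min_checkpoint_per_tablet(rows, live_stream_ids):
    """Compute the lowest checkpoint per tablet, ignoring deleted streams' rows."""
    live, _ = split_stale(rows, live_stream_ids)
    result = {}
    for row in live:
        current = result.get(row.tablet_id)
        if current is None or row.checkpoint < current:
            result[row.tablet_id] = row.checkpoint
    return result


if __name__ == "__main__":
    rows = [
        CdcStateRow("tablet-1", "deleted-stream", 5),  # left over after deletion
        CdcStateRow("tablet-1", "new-stream", 20),
    ]
    live_ids = {"new-stream"}
    _, stale = split_stale(rows, live_ids)
    print("rows a cleanup would remove:", stale)
    print("checkpoints used:", min_checkpoint_per_tablet(rows, live_ids))
```

In this model the manual workaround corresponds to deleting the rows returned as stale, while the fix described in the referenced commits corresponds to filtering them out before they are used.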