[DocDB] Skewed Bootstrap timings after applying cdc_write_post_apply_metadata and cdc_immediate_transaction_cleanup gflags #21741

Closed
1 task done
shamanthchandra-yb opened this issue Mar 29, 2024 · 1 comment
shamanthchandra-yb commented Mar 29, 2024

Jira Link: DB-10615

Description

This issue was observed in a manual LRU. The Slack thread can be found in the JIRA description.

This is a CDC LRU on which CDC had only recently been enabled.

March 26, 2024:

  • After CDC was enabled, we identified that the problem was due to the memory overhead of the recently enabled CDC.

March 27, 2024:

  • Therefore, stopped CDC and deleted the stream IDs.
  • Upgraded to 2.23.0.0-b15 without CDC. Bootstrap time without CDC was ~4 mins on each node.
  • After the upgrade, because the build included https://phorge.dev.yugabyte.com/D31900, applied the tserver gflags cdc_write_post_apply_metadata -> true and cdc_immediate_transaction_cleanup -> true. Bootstrap time on each node was ~7 mins 40 seconds.
  • Then recreated the streams on 2 databases and deployed a CDC connector on each, with 2 tasks and 1 task respectively.
  • Changed cdc_intent_retention_ms from 4 hours to 2 hours and noted the bootstrap timings: around 8 mins 30 seconds.
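The retention values in this timeline are quoted in hours, while cdc_intent_retention_ms takes milliseconds. A minimal conversion sketch follows; the helper name `hours_to_ms` is hypothetical and only illustrates the arithmetic (12 hrs corresponds to the 43200000 used later in this report).

```python
MS_PER_HOUR = 60 * 60 * 1000


def hours_to_ms(hours):
    """Convert a retention period in hours to milliseconds."""
    return int(hours * MS_PER_HOUR)


# Retention periods tried in this issue: 1, 2, 4 and 12 hours.
for hours in (1, 2, 4, 12):
    print(f"{hours} h -> cdc_intent_retention_ms = {hours_to_ms(hours)}")
```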

March 28, 2024:

  • Disk usage was at 90+%. As a temporary measure, increased disk capacity from 750 GB to 1000 GB.
  • Also, for one of the tasks in one of the CDC connectors, the CDC stream had expired. I suspect that, because of the issues above, the CDC connector was unable to connect for some reason, since I saw this in the connector log: Cannot proceed, the client to [172.151.17.61:7100, 172.151.26.208:7100, 172.151.31.137:7100] has already been closed.
  • I didn't want GC to occur if the connector doesn't catch up, hence tried to change cdc_intent_retention_ms to 12 hrs (43200000). It failed: the first node took ~38+ minutes to bootstrap.
  • Retried: on retry, it took 47+ minutes to bootstrap.
  • Changed cdc_intent_retention_ms to 4 hrs to check bootstrap timings. It failed: ~18+ mins to bootstrap.
  • Since I saw a decrease in bootstrap timings, tried setting cdc_intent_retention_ms to 1 hr. It failed in the preflight checks with an under-replication error:
java.lang.RuntimeException: Nodes are not safe to take down: TSERVERS: [172.151.26.208] have a problem: Server[YB Master - 172.151.31.137:7100] ILLEGAL_STATE[code 9]: [{b9adbaf1832049fc834e42b0ca16b7f7, 1}, {0e8cb68306584ef1b426b8405f070456, 1}, {dd193518aafe4532aadf4651da6e30f6, 1}, {b0c6709a22ad4a42a45440645da9e1bc, 1}, {617a2c81228544199c9d63fd55c884fd, 1}, {d3ed2c8a599b41aba5449eeb5e5f0231, 1}, {cfd472162f1b48d1bf3405d5c59160f6, 1}, {c76ffb5a5a5f4acca07f57d0d374f48f, 1}, {8417f9e5b9654d4e92bdfac8b5b2add6, 1}, {664f41b36d704f8bb4568415b6795db5, 1}, {f5d9b8de0ee4440cb3d887892aa84f4f, 1}, {018eeb811b7a4d9ca74b1548b03c539d, 1}, {87d6f515874349f8924070dc3c1b031c, 1}, {19c00c06c1e04337b2c23985e17c357f, 1}, {96fd54eb93e74ad494110c834f98f62a, 1}, {e538f5214abd438db97ad31a86306f37, 1}, {fbce62a75dfd44c9849d13da9c02eea2, 1}, {0044cf43d7a94e1ab72eb739568efff9, 1}, {75086676fa6a435fa681fc7218151704, 1}, {6508484f5cc8451b8136e0766edd4996, 1}, {b6b95ccb80894d7c854606165747718d, 1}, {71bb9b8515e14dd38fa1e99a4215dbf2, 1}, {a184bd8bd9194956875f2db8b95ee867, 1}, {b97db16a854748e0851f5ade0cb92aa0, 1}, {9ea40ac9d06e4b36b1ce8e12ddede500, 1}, {89ac492b0d2747b5bef4032d0d69b0db, 1}, {83be28da750f408283ffb4d6ec01c801, 1}, {ab14c14ce85f4cf8b6d0b3798217393d, 1}, {e9047686d2e64e1489576877351754e0, 1}, {a1913c046e6b40a8adb864411b188473, 1}, {84c2e9ad493e4da38058a1d3d9017cba, 1}, {1cddd5bdb3844e779ead1d270b9708fb, 1}, {bf4116ea18004d469d02187a572d0e41, 1}, {0e2c872087264c0682dbe434e23d7b45, 1}, {cec3...3d, 1}, {e9047686d2e64e1489576877351754e0, 1}, {a1913c046e6b40a8adb864411b188473, 1}, {84c2e9ad493e4da38058a1d3d9017cba, 1}, {1cddd5bdb3844e779ead1d270b9708fb, 1}, {bf4116ea18004d469d02187a572d0e41, 1}, {0e2c872087264c0682dbe434e23d7b45, 1}, {cec3001c023945bca9fe0a6757e2f2d1, 1}, {8afd328a71ad45ef8a9e43117e7ca2c9, 1}, {2c73c227b84e4697bf4c221562649fc9, 1}, {3a70898e28424e07a071ab4db8ecc50b, 1}, {d1b5688a20464b9f9578b57c1132b3d0, 1}, {a57e142100b741a0a0d96cf69b977f86, 1}, 
{37143c8f21ff44f788367139aec44f58, 1}, {6d7dee0f44494c55b3c59b9b3e3987e8, 1}, {5f3ac672c173429b89b7f7f7024d9696, 1}, {737575c3b6ff47efbf987f1523c33ae6, 1}, {661c74f6bf4a49048ff91276341e3802, 1}, {041c3a28e05c45c586169f02232304ba, 1}, {957b779f2dd44d34851fe3042ae9ebf7, 1}, {240e433c0f464e4ab4651126472ac4fd, 1}, {534c46714b5b4cbfadc5aa7b9b91c191, 1}, {986e91a9849a41f991f55d20d61d9fa4, 1}, {f026f2b4b27446029ebfb2eca0c9c159, 1}, {65d84ed5af6f405fa1d2de6e1553c1b9, 1}, {34b6e7902c9948fc9a35b3786230c1d4, 1}, {195b8fcdf7364ebe9d6d92c6b1ba47ad, 1}, {c73501383de3438894feba2abfd87f31, 1}, {51e0bb671ae54c70b4025a72c3bf50a3, 1}, {8994c8d8ce7342a59d229080cf2d2d63, 1}, {a2055187e0104d63a40c87d39b3a8839, 1}, {366c1c58d97c48cab25cb49cce76959e, 1}, {b4d75cdc0378444a868f5a085c2e4ac4, 1}, {45fca22d8deb44f1b6b58c66cff81ad3, 1}, {ad6304ce984e49f4bb7cf4a81a55efb7, 1}, {3b3cfc95457c49cdb08e78ce19f61f83, 1}] tablet(s) would be under-replicated. Example: tablet b9adbaf1832049fc834e42b0ca16b7f7 would be under-replicated by 1 replicas.
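The error above is hard to read by eye, so here is a small sketch for pulling the tablet IDs out of such a message. It assumes each `{id, n}` pair is a 32-character hex tablet ID followed by the number of replicas it would be short by (an interpretation based on the "Example:" sentence at the end of the message, not on documented behavior). The `sample` string is an abbreviated excerpt of the log; note that the pasted log contains a truncated entry and repeated entries, which the sketch skips and de-duplicates respectively.

```python
import re

# Matches entries such as "{b9adbaf1832049fc834e42b0ca16b7f7, 1}".
TABLET_ENTRY = re.compile(r"\{([0-9a-f]{32}), (\d+)\}")


def parse_under_replicated(message):
    """Return {tablet_id: count}, ignoring truncated entries and duplicates."""
    result = {}
    for tablet_id, count in TABLET_ENTRY.findall(message):
        result[tablet_id] = int(count)
    return result


# Abbreviated excerpt of the message above.
sample = (
    "ILLEGAL_STATE[code 9]: [{e9047686d2e64e1489576877351754e0, 1}, "
    "{a1913c046e6b40a8adb864411b188473, 1}, {cec3...3d, 1}, "
    "{e9047686d2e64e1489576877351754e0, 1}] tablet(s) would be under-replicated."
)

tablets = parse_under_replicated(sample)
print(len(tablets), "distinct tablets:", sorted(tablets))
```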

In summary, there are 2 issues:

  • Increase in bootstrap timings
  • Why did under-replicated tablets occur here?
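To put the first issue in numbers, the sketch below compares each bootstrap time reported above against the ~4 min baseline measured without CDC. The labels are shorthand for the steps in the timeline, and the "+" timings are treated as lower bounds, so the ratios are approximate.

```python
BASELINE_S = 4 * 60  # ~4 mins per node without CDC

# (step, bootstrap time in seconds), taken from the timeline above.
observations = [
    ("gflags enabled, no streams", 7 * 60 + 40),
    ("streams recreated, retention 2 h", 8 * 60 + 30),
    ("retention 12 h, first attempt", 38 * 60),
    ("retention 12 h, retry", 47 * 60),
    ("retention 4 h", 18 * 60),
]


def slowdown(seconds, baseline=BASELINE_S):
    """Ratio of an observed bootstrap time to the baseline."""
    return seconds / baseline


for label, seconds in observations:
    minutes, rest = divmod(seconds, 60)
    print(f"{label}: {minutes}m{rest:02d}s = {slowdown(seconds):.2f}x baseline")
```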

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shamanthchandra-yb shamanthchandra-yb added area/docdb YugabyteDB core features priority/high High Priority status/awaiting-triage Issue awaiting triage labels Mar 29, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Mar 29, 2024
@shamanthchandra-yb
Author

cc: @rthallamko3 @suranjan

@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Apr 2, 2024
es1024 added a commit that referenced this issue Jun 20, 2024
…et bootstrap

Summary:
When CDC streams are lagging, there may be a large number of intent SST files whose contents have all been applied already, but must be maintained for CDC purposes. We work around the performance implications of having a large number of these files by filtering out SST files by min running hybrid time (D33131 / 97536b4), but this approach does not work as is for bootstrap, since min running hybrid time is not currently determined until bootstrap has finished.

This change adds the saving of min running hybrid time periodically with retryable requests state, and then loads this min running hybrid time into the transaction participant early in bootstrap, to allow the SST file filter used in D33131 / 97536b4 to be used at bootstrap time as well.
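As a rough conceptual model of the filter described above (not the actual YugabyteDB code), the sketch below skips intent SST files whose newest record is older than a persisted min running hybrid time. `IntentSstFile`, `max_hybrid_time`, `files_to_read`, the file names, and the integer hybrid times are all hypothetical, and the direction of the comparison is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class IntentSstFile:
    name: str
    max_hybrid_time: int  # newest record in the file (hypothetical metadata)


def files_to_read(files: Iterable[IntentSstFile],
                  min_running_ht: Optional[int]) -> List[IntentSstFile]:
    """Keep only files that may hold intents of still-running transactions."""
    if min_running_ht is None:
        # No persisted value available: read every file (no filtering).
        return list(files)
    return [f for f in files if f.max_hybrid_time >= min_running_ht]


files = [
    IntentSstFile("000010.sst", 100),
    IntentSstFile("000011.sst", 250),
    IntentSstFile("000012.sst", 900),
]

# Value assumed to have been saved before restart and loaded early in bootstrap.
persisted_min_running_ht = 800
kept = files_to_read(files, persisted_min_running_ht)
print([f.name for f in kept])
```

With a lagging CDC stream, most retained files would fall below the threshold, which is why having the value available at bootstrap time matters for performance.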

To avoid reintroducing the issue introduced by D34389 / 2458c08, this diff also removes the requirement that min running hybrid time must not be set before bootstrap, by moving the requirement to `transactions_loaded_`.

**Upgrade/Rollback safety:**

This change is not guarded by a gflag or autoflag. If the newly added min running hybrid time field is missing (upgrade), we do not apply a filter (the current behavior), and the presence of the optional protobuf field when downgrading is entirely ignored (the old behavior is to unconditionally not apply a filter). There are no correctness issues involved with either applying or not applying the filter, as it is entirely a performance optimization.
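The compatibility argument can be illustrated with a small sketch. JSON stands in for the protobuf message here, and the field and function names are hypothetical. A new reader treats a missing field as "no filter", and an old reader simply never looks at the extra field.

```python
import json
from typing import Optional

FIELD = "min_running_hybrid_time"  # hypothetical name for the optional field


def new_reader(blob: str) -> Optional[int]:
    """New binary: a missing field means no filter is applied."""
    return json.loads(blob).get(FIELD)


def old_reader(blob: str) -> dict:
    """Old binary: only reads the fields it knows about, ignoring the rest."""
    state = json.loads(blob)
    return {"retryable_requests": state.get("retryable_requests", [])}


old_blob = json.dumps({"retryable_requests": []})               # written before upgrade
new_blob = json.dumps({"retryable_requests": [], FIELD: 1200})  # written after upgrade

print(new_reader(old_blob))  # upgrade path: no value, so no filter
print(new_reader(new_blob))
print(old_reader(new_blob))  # downgrade path: extra field is ignored
```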
Jira: DB-10615

Test Plan: Jenkins

Reviewers: yyan, qhu

Reviewed By: yyan, qhu

Subscribers: rthallam, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D35639
karthik-ramanathan-3006 pushed a commit to karthik-ramanathan-3006/yugabyte-db that referenced this issue Jun 24, 2024
…ing tablet bootstrap

Differential Revision: https://phorge.dev.yugabyte.com/D35639
es1024 added a commit that referenced this issue Jun 27, 2024
… time during tablet bootstrap

Summary:
Original commit: 8b23a4e / D35639
Jira: DB-10615

Test Plan: Jenkins

Reviewers: yyan, qhu

Reviewed By: yyan

Subscribers: ybase, rthallam

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36102
@yugabyte-ci yugabyte-ci added the area/cdcsdk CDC SDK label Jun 28, 2024
es1024 added a commit that referenced this issue Jun 29, 2024
…nts by min running hybrid time during tablet bootstrap"

Summary:
This reverts commit 717fbc5, which caused CDC tests to
start failing, in order to unblock 2024.1 branch.

Test Plan: Jenkins: urgent, rebase: 2024.1

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36255
es1024 added a commit that referenced this issue Jun 29, 2024
…tents by min running hybrid time during tablet bootstrap"

Summary:
This reverts commit 717fbc5, which caused CDC tests to
start failing, in order to unblock 2024.1.1 branch.

Test Plan: Jenkins: urgent, rebase: 2024.1.1

Reviewers: rthallam

Reviewed By: rthallam

Subscribers: ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36261