[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses used #21116
Closed
1 task done
Labels
2.20 Backport Required
2024.1 Backport Required
2024.1.3_blocker
2024.1.3.1_blocker
2024.2 Backport Required
area/docdb
YugabyteDB core features
kind/bug
This issue is a bug
priority/highest
Highest priority issue
Comments
qvad added labels area/ysql (Yugabyte SQL) and status/awaiting-triage (Issue awaiting triage) on Feb 21, 2024
qvad changed the title from "[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies with tserver related nemeses" to "[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses" on Feb 21, 2024
qvad changed the title from "[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses" to "[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses used" on Feb 21, 2024
A slightly different error occurs in 2.20.
rthallamko3 added labels area/docdb (YugabyteDB core features) and priority/highest (Highest priority issue), and removed labels area/ysql (Yugabyte SQL) and status/awaiting-triage (Issue awaiting triage), on Feb 21, 2024
yugabyte-ci added label priority/high (High Priority) and removed label priority/highest (Highest priority issue) on May 6, 2024
yugabyte-ci added label priority/highest (Highest priority issue) and removed label priority/high (High Priority) on Oct 1, 2024
es1024 added a commit that referenced this issue on Oct 9, 2024
Summary: In the transaction promotion path, we did not properly replicate and save the promotion: an empty batch carrying the pre-promotion transaction metadata was replicated and written into the WALs instead. As a result, after leader stepdowns and during tablet bootstrap, changes on tablets first touched before promotion and changes on tablets first touched after promotion were treated as separate transactions with the same id but different status tablets, resulting in data loss. For example, if a participant tablet touched before promotion has a leader stepdown after the transaction has been aborted on the old status tablet, its changes are cleaned up, even if the transaction has committed or later commits.

This revision changes the promotion path to send an UpdateTransaction(PROMOTING) with the new status tablet to the participant tablet, which is then replicated and written to WALs. This entirely replaces the old UpdateTransactionStatusLocation RPC calls. The empty batch write in the UpdateTransactionStatusLocation path was also removed, as it was effectively useless.

**Upgrade/Rollback safety:** The change to send UpdateTransaction(PROMOTING) is gated by the auto flag `replicate_transaction_promotion` to ensure that we don't send or write the new value until the upgrade has been finalized; the old UpdateTransactionStatusLocation sending code is used until then. The old UpdateTransactionStatusLocation handling code will be left intact until after 2024.2.

Jira: DB-10148

Test Plan: Jenkins. Added new test:
- `./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoTransactionsPromotionTest.TestParticipantLeaderStepDown -n 100`

Jenkins run with transaction_promotion_use_update_transaction turned off (pre-upgrade case) done on D38793. Ran ysql/sz.ol.geo.append Jepsen workload with 600s timeout 20x without failures.

Reviewers: sergei
Reviewed By: sergei
Subscribers: rthallam, ybase, yql, svc_phabricator
Differential Revision: https://phorge.dev.yugabyte.com/D38718
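The failure mode described in the summary can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not YugabyteDB code: the `Participant` class, its methods, the `WRITE` and `EMPTY_BATCH` entry kinds, and the tablet names are hypothetical, and the `replicate_transaction_promotion` parameter only borrows the name of the auto flag mentioned above. It assumes a new leader rebuilds its view of a transaction purely from replicated WAL entries.

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Toy participant tablet (hypothetical): a replicated WAL plus leader memory."""
    wal: list = field(default_factory=list)            # survives leader stepdown
    status_tablet: dict = field(default_factory=dict)  # txn id -> status tablet (in memory)

    def write(self, txn_id: str, status_tablet: str) -> None:
        # The first write of a transaction records its metadata in the WAL.
        self.wal.append(("WRITE", txn_id, status_tablet))
        self.status_tablet[txn_id] = status_tablet

    def promote(self, txn_id: str, new_status_tablet: str, replicate: bool) -> None:
        if replicate:
            # Fixed path: the promotion is replicated with the NEW status tablet.
            self.wal.append(("PROMOTING", txn_id, new_status_tablet))
        else:
            # Pre-fix path: the replicated entry carries pre-promotion metadata.
            self.wal.append(("EMPTY_BATCH", txn_id, self.status_tablet[txn_id]))
        # The current leader updates its in-memory view either way.
        self.status_tablet[txn_id] = new_status_tablet

    def step_down_and_recover(self) -> None:
        # A new leader knows only what was replicated into the WAL.
        self.status_tablet = {}
        for _kind, txn_id, tablet in self.wal:
            self.status_tablet[txn_id] = tablet


def data_survives(replicate_transaction_promotion: bool) -> bool:
    # Assumed scenario: the transaction was aborted on the old status tablet
    # as part of promotion and committed on the new one.
    txn_status = {"old_status_tablet": "ABORTED", "new_status_tablet": "COMMITTED"}
    p = Participant()
    p.write("txn-1", "old_status_tablet")
    p.promote("txn-1", "new_status_tablet", replicate=replicate_transaction_promotion)
    p.step_down_and_recover()
    # The recovered leader asks the status tablet it believes owns the
    # transaction; an ABORTED answer would make it clean up the changes.
    return txn_status[p.status_tablet["txn-1"]] == "COMMITTED"


if __name__ == "__main__":
    print("pre-fix path keeps data:", data_survives(False))
    print("fixed path keeps data:", data_survives(True))
```

In this model the pre-fix path recovers the old status tablet after stepdown and would discard committed changes, while replicating the promotion lets the recovered leader consult the new status tablet.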
es1024 added a commit that referenced this issue on Oct 23, 2024
Jira: DB-10148
Original commit: 6109c23 / D38718
Reviewers: sergei
Reviewed By: sergei
Subscribers: svc_phabricator, yql, ybase, rthallam
Differential Revision: https://phorge.dev.yugabyte.com/D39216
es1024 added a commit that referenced this issue on Oct 23, 2024
Jira: DB-10148
Original commit: 6109c23 / D38718
Reviewers: sergei
Reviewed By: sergei
Subscribers: rthallam, ybase, yql, svc_phabricator
Differential Revision: https://phorge.dev.yugabyte.com/D39215
es1024 added a commit that referenced this issue on Oct 23, 2024
Jira: DB-10148
Original commit: 6109c23 / D38718
Reviewers: sergei
Reviewed By: sergei
Subscribers: svc_phabricator, yql, ybase, rthallam
Differential Revision: https://phorge.dev.yugabyte.com/D39214
es1024 added a commit that referenced this issue on Nov 8, 2024
Jira: DB-10148
Original commit: 6109c23 / D38718
Reviewers: sergei
Reviewed By: sergei
Subscribers: svc_phabricator, yql, ybase, rthallam
Tags: #jenkins-ready
Differential Revision: https://phorge.dev.yugabyte.com/D39831
Jira Link: DB-10148
Description
Running the geo append tests with the latest fixes on the latest master/2.20 builds leads to an error.
20240220T080350.354Z.zip
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information