[Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses used #21116

Closed
1 task done
qvad opened this issue Feb 21, 2024 · 2 comments

Comments

Contributor

qvad commented Feb 21, 2024

Jira Link: DB-10148

Description

Running the geo append tests with the latest fixes on the latest master/2.20 build leads to an error:

            :not #{:read-uncommitted},
            :also-not #{:consistent-view
                        :cursor-stability
                        :forward-consistent-view
                        :monotonic-atomic-view
                        :monotonic-snapshot-read
                        :monotonic-view
                        :read-committed
                        :repeatable-read
                        :serializable
                        :snapshot-isolation
                        :strong-serializable
                        :strong-session-serializable
                        :strong-session-snapshot-isolation
                        :strong-snapshot-isolation
                        :update-serializable}},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

20240220T080350.354Z.zip

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@qvad qvad added area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Feb 21, 2024
@qvad qvad changed the title [Jepsen][YSQL] Enabling geo transactions lead to inconsistencies with tserver related nemeses [Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses Feb 21, 2024
@qvad qvad changed the title [Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses [Jepsen][YSQL] Enabling geo transactions lead to inconsistencies if tserver related nemeses used Feb 21, 2024
Contributor Author

qvad commented Feb 21, 2024

The error in 2.20 is slightly different:

                                             :type :G-single-realtime}]},
            :not #{:strong-snapshot-isolation},
            :also-not #{:strong-serializable}},
 :valid? false}


Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

@rthallamko3 rthallamko3 added area/docdb YugabyteDB core features priority/highest Highest priority issue and removed area/ysql Yugabyte SQL (YSQL) status/awaiting-triage Issue awaiting triage labels Feb 21, 2024
@yugabyte-ci yugabyte-ci added the kind/bug This issue is a bug label Feb 27, 2024
@yugabyte-ci yugabyte-ci added priority/high High Priority and removed priority/highest Highest priority issue labels May 6, 2024
@yugabyte-ci yugabyte-ci added priority/highest Highest priority issue and removed priority/high High Priority labels Oct 1, 2024
es1024 added a commit that referenced this issue Oct 9, 2024
Summary:
In the transaction promotion path, we did not properly replicate and save the promotion: only an empty batch with the pre-promotion transaction metadata was replicated and written to the WALs. After leader stepdowns and during tablet bootstrap, this causes changes made on tablets first touched before promotion and on tablets first touched after promotion to be treated as separate transactions with the same id but different status tablets, resulting in data loss. For example, once the transaction on the old status tablet has been aborted, a leader stepdown on a participant tablet touched before promotion will cause its changes to be cleaned up, even if the transaction has committed or later commits.

This revision changes the promotion path to send an UpdateTransaction(PROMOTING) with the new status tablet to the participant tablet, which is then replicated and written to WALs. This entirely replaces the old UpdateTransactionStatusLocation RPC calls.

The empty batch write in the UpdateTransactionStatusLocation path was also removed, as it was effectively useless.
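
The failure mode and the fix described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not YugabyteDB code: the `Participant` class, its methods, and the `"local"`/`"global"` status tablet names are hypothetical, and the old path is modeled as logging nothing for the promotion (the real old path logged an empty batch carrying the pre-promotion metadata, which has the same effect on replay).

```python
from dataclasses import dataclass, field


@dataclass
class Participant:
    """Toy participant tablet: in-memory state plus a replayable WAL."""
    replicate_promotion: bool
    wal: list = field(default_factory=list)
    status_tablet: dict = field(default_factory=dict)  # txn id -> status tablet
    intents: dict = field(default_factory=dict)        # txn id -> pending writes

    def write(self, txn, status_tablet, value):
        self.wal.append(("write", txn, status_tablet, value))
        self._apply(self.wal[-1])

    def promote(self, txn, new_status_tablet):
        if self.replicate_promotion:
            # Fixed path: the promotion is a WAL entry and survives replay.
            self.wal.append(("promote", txn, new_status_tablet))
            self._apply(self.wal[-1])
        else:
            # Buggy path (simplified): only in-memory state learns the new tablet.
            self.status_tablet[txn] = new_status_tablet

    def _apply(self, entry):
        if entry[0] == "write":
            _, txn, tablet, value = entry
            self.status_tablet.setdefault(txn, tablet)
            self.intents.setdefault(txn, []).append(value)
        else:
            _, txn, tablet = entry
            self.status_tablet[txn] = tablet

    def bootstrap(self):
        """Simulate a leader stepdown or restart: rebuild state from the WAL only."""
        self.status_tablet, self.intents = {}, {}
        for entry in self.wal:
            self._apply(entry)

    def cleanup(self, statuses):
        """Drop intents whose known status tablet reports the txn as aborted."""
        for txn in list(self.intents):
            if statuses[self.status_tablet[txn]].get(txn) == "aborted":
                del self.intents[txn]


def surviving_writes(replicate_promotion):
    p = Participant(replicate_promotion)
    p.write("t1", "local", "x=1")   # tablet first touched before promotion
    p.promote("t1", "global")       # transaction moves to a new status tablet
    # Old status tablet has aborted its copy; the new one has committed.
    statuses = {"local": {"t1": "aborted"}, "global": {"t1": "committed"}}
    p.bootstrap()
    p.cleanup(statuses)
    return p.intents.get("t1", [])
```

In this model, `surviving_writes(False)` loses the write because the replayed state still points at the old status tablet, while `surviving_writes(True)` keeps it because the promotion entry is replayed from the WAL.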

**Upgrade/Rollback safety:**
The change to send UpdateTransaction(PROMOTING) is gated by the auto flag `replicate_transaction_promotion` to ensure that we don't send or write the new value until the upgrade has been finalized; the old UpdateTransactionStatusLocation sending code is used until then. The old UpdateTransactionStatusLocation handling code will be left intact until after 2024.2.
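
A minimal sketch of the gating described above, assuming a boolean that mirrors the auto flag. The function names and the dictionary request shape are hypothetical and chosen only for illustration; only the RPC names and the PROMOTING status come from this commit message.

```python
def build_promotion_request(txn_id, new_status_tablet, replicate_transaction_promotion):
    """Choose the promotion RPC depending on whether the upgrade is finalized."""
    if replicate_transaction_promotion:
        # Post-finalization: send the new, replicated form.
        return {
            "rpc": "UpdateTransaction",
            "status": "PROMOTING",
            "txn_id": txn_id,
            "status_tablet": new_status_tablet,
        }
    # Pre-finalization: older nodes may not understand PROMOTING,
    # so keep sending the old RPC.
    return {
        "rpc": "UpdateTransactionStatusLocation",
        "txn_id": txn_id,
        "status_tablet": new_status_tablet,
    }


def handle_promotion_request(request):
    """Receiver side: both handlers stay available during the upgrade window."""
    if request["rpc"] == "UpdateTransaction" and request.get("status") == "PROMOTING":
        return "written_to_wal"
    if request["rpc"] == "UpdateTransactionStatusLocation":
        return "not_persisted"
    raise ValueError(f"unexpected request: {request!r}")
```

The point of the sketch is that the sender changes behavior only once the flag is on, while the receiver accepts either form, so a mixed-version cluster never receives a request it cannot handle.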

Jira: DB-10148

Test Plan:
Jenkins.

Added new test:
- `./yb_build.sh --cxx-test pgwrapper_geo_transactions_promotion-test --gtest_filter GeoTransactionsPromotionTest.TestParticipantLeaderStepDown -n 100`

Jenkins run with transaction_promotion_use_update_transaction turned off (pre-upgrade case) done on D38793.

Ran ysql/sz.ol.geo.append Jepsen workload with 600s timeout 20x without failures.

Reviewers: sergei

Reviewed By: sergei

Subscribers: rthallam, ybase, yql, svc_phabricator

Differential Revision: https://phorge.dev.yugabyte.com/D38718
Contributor

rthallamko3 commented Oct 11, 2024

Per @es1024, the fixes for #20321 and #24317 should be backported before this fix is backported; otherwise the geo partitioning tests would fail a lot more frequently.

es1024 added a commit that referenced this issue Oct 23, 2024
Original commit: 6109c23 / D38718

Differential Revision: https://phorge.dev.yugabyte.com/D39216
es1024 added a commit that referenced this issue Oct 23, 2024
Original commit: 6109c23 / D38718

Differential Revision: https://phorge.dev.yugabyte.com/D39215
es1024 added a commit that referenced this issue Oct 23, 2024
Original commit: 6109c23 / D38718

Differential Revision: https://phorge.dev.yugabyte.com/D39214
@es1024 es1024 reopened this Nov 8, 2024
es1024 added a commit that referenced this issue Nov 8, 2024
Original commit: 6109c23 / D38718

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D39831
@es1024 es1024 closed this as completed Nov 8, 2024