-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[YSQL][DDL Atomicity] TServer Error: ReportYsqlDdlTxnStatus Call Failed Due to Transaction State Mismatch (kAborted vs kCommitted) in ysql_ddl_handler.cc #23669
Comments
I found two logs that showed contradicting message:
vs
Both are related to the same txn 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421. |
Summary: From the ticket, the following error is noticed: ``` E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted ``` I found two logs that showed contradicting message: ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true ``` vs ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false ``` Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421. However the first log says the txn has committed (is_committed: true), and the second log says the txn has aborted (is_committed: false) Seems that two different code paths deduced different commit/abort status for the same txn. This diff adds a bit more logging to help debugging to understand which code paths were giving different commit/abort status. Jira: DB-12580 Test Plan: ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17 Reviewers: fizaa, hsunder Reviewed By: hsunder Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D37599
…ailure Summary: From the ticket, the following error is noticed: ``` E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted ``` I found two logs that showed contradicting message: ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true ``` vs ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false ``` Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421. However the first log says the txn has committed (is_committed: true), and the second log says the txn has aborted (is_committed: false) Seems that two different code paths deduced different commit/abort status for the same txn. This diff adds a bit more logging to help debugging to understand which code paths were giving different commit/abort status. Jira: DB-12580 Original commit: 63f1d65 / D37599 Test Plan: ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17 ./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17 Reviewers: fizaa, hsunder Reviewed By: hsunder Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37625
Summary: bfd17b5 [PLAT-14601]Support restore to Point in time part 2 - Backup/Restore/Restore preflight related changes - Add JSON Property 91d000a [doc] Thirdparty to integrations (#23598) Excluded: 6456777 [#23626] allow loading old dumps that do not have index pg_class OIDs c8b8be3 [DB-12586] yugabyted: Update schema migration UI (#23217) Excluded: 4ea354b [#23521] YSQL: Cost YB Bitmap Table Scan remote filters b80999d [PLAT-14974][PLAT-15045] Added prometheus user as part of yugabyte group 63f1d65 [#23669] YSQL: Add more logs to debug an assertion failure 1ad4795 [docs] [TA] Added TA-23476: YCQL currenttimestamp() precision (#23642) 41ae6b4 [#23653] docdb: Adjust waits for MasterPathHandlersItest.TestUndeletedParentTablet in TSAN 78b0ae4 [DB-12587] yugabyted: Update data migration UI (#23291) d234b3a [PLAT-15046] Create log directory with correct permissions to allow users to export logs without using sudo 8713c18 [doc][yba] Clarify pre-req for cloud provider image upgrades (#23285) 9be5c91 [#23448] YSQL: fix failing test PgAutoAnalyzeTest.CheckTableMutationsCount Excluded: 9d54710 [#22147] YSQL, QueryDiagnostics: Pgss support for query diagnostics 417092a [#23373] DocDB: Add max_disk_throughput_mbps gflag to control disk full rejection e3a1a36 [PLAT-15035] Add support to sync gflags secret mount location to actual gflag file used by services 23a6a4c [PLAT-14525][PLAT-14953] Add local provider tests for switchover, failover, change replica, and restart 6026029 [PLAT-15100][Master]Observed two Scheduled Backup Policies tabs in Backup page 2cf648b [#23581] CDCSDK: Support dynamic table addition with table removal b14851d [#23702] xClusterDDLRepl: Add extra logging 8a0d6ff [#23645] docdb: Reorder heartbeat handling logic to fix regression. 2b30b5e [Docs] Changes for Experimental AI (#23714) Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher Differential Revision: https://phorge.dev.yugabyte.com/D37645
Summary: From the ticket, the following error is noticed: ``` E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted ``` I found two logs that showed contradicting message: ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true ``` vs ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false ``` Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421. However the first log says the txn has committed (is_committed: true), and the second log says the txn has aborted (is_committed: false) So two different code paths deduced different commit/abort status for the same txn. DDL atomicity has two ways for docdb to follow and become eventually consistent with the PG transaction status (commit or abort): (1) PG sends a status report to master tell it whether the txn has committed or not. This is the fastest since PG drives the DDL txn and it knows firsthand. (2) Docdb has a background task, periodically polling the transaction status tablet to find out the status of the give txn. If the transaction status tablet response with the txn status as terminated (not pending), then the same background task will begin to read the current PG metadata from sys catalog table, based upon the nature of the DDL, performs a schema comparison to find out whether the current PG catalog schema indicated a commit or abort: (2.1) If the current PG catalog schema matches the current DocDB schema (the new table schema if the DDL successfully commits), it means the DDL txn has committed. (2.2) If the current PG catalog schema matches the previous DocDB schema (the old table schema prior to the DDL starts), it means the DDL txn has aborted. The bug is that there are some DDL statements that do not change the table's schema. For example, `alter table foo alter column id set not null` only bumps up the table foo's schema version without changing the schema at all. As a result, the current schema is identical to the previous schema. In other words, both (2.1) and (2.2) will compare as equal and because we check (2.1) first, we always conclude that the DDL txn has committed. Had we reversed the order by checking (2.2) first, then we would always conclude that the DDL txn has aborted. I made a change to make the type of is_committed from bool to std::optional<bool>. When is_committed.has_value() is true, it means either committed or aborted so we continue with the current logic. When is_committed is nullopt, it means the DDL does not change the table's schema at all. A new state TxnState::kNoChange is added in addition to the current kUnknown, kCommitted, kAborted. It is not possible to tell whether the DDL txn has committed or aborted by doing schema comparison if the DDL does not change the table schema. However it is fine because we do not need to know whether the DDL txn is committed or aborted, we only need to know the DDL txn has terminated. The end result is that the table's schema does not change. We just need to clear the DDL atomicity related state from the table's DocDB SysTablesEntryPB metadata and the table's schema will be correct. Jira: DB-12580 Test Plan: jenkins Reviewers: hsunder, fizaa Reviewed By: fizaa Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D37755
Summary: From the ticket, the following error is noticed: ``` E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted ``` I found two logs that showed contradicting message: ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true ``` vs ``` ./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false ``` Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421. However the first log says the txn has committed (is_committed: true), and the second log says the txn has aborted (is_committed: false) So two different code paths deduced different commit/abort status for the same txn. DDL atomicity has two ways for docdb to follow and become eventually consistent with the PG transaction status (commit or abort): (1) PG sends a status report to master tell it whether the txn has committed or not. This is the fastest since PG drives the DDL txn and it knows firsthand. (2) Docdb has a background task, periodically polling the transaction status tablet to find out the status of the give txn. If the transaction status tablet response with the txn status as terminated (not pending), then the same background task will begin to read the current PG metadata from sys catalog table, based upon the nature of the DDL, performs a schema comparison to find out whether the current PG catalog schema indicated a commit or abort: (2.1) If the current PG catalog schema matches the current DocDB schema (the new table schema if the DDL successfully commits), it means the DDL txn has committed. (2.2) If the current PG catalog schema matches the previous DocDB schema (the old table schema prior to the DDL starts), it means the DDL txn has aborted. The bug is that there are some DDL statements that do not change the table's schema. For example, `alter table foo alter column id set not null` only bumps up the table foo's schema version without changing the schema at all. As a result, the current schema is identical to the previous schema. In other words, both (2.1) and (2.2) will compare as equal and because we check (2.1) first, we always conclude that the DDL txn has committed. Had we reversed the order by checking (2.2) first, then we would always conclude that the DDL txn has aborted. I made a change to make the type of is_committed from bool to std::optional<bool>. When is_committed.has_value() is true, it means either committed or aborted so we continue with the current logic. When is_committed is nullopt, it means the DDL does not change the table's schema at all. A new state TxnState::kNoChange is added in addition to the current kUnknown, kCommitted, kAborted. It is not possible to tell whether the DDL txn has committed or aborted by doing schema comparison if the DDL does not change the table schema. However it is fine because we do not need to know whether the DDL txn is committed or aborted, we only need to know the DDL txn has terminated. The end result is that the table's schema does not change. We just need to clear the DDL atomicity related state from the table's DocDB SysTablesEntryPB metadata and the table's schema will be correct. Jira: DB-12580 Original commit: a6981f2 / D37755 Test Plan: jenkins Reviewers: hsunder, fizaa Reviewed By: fizaa Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D37925
Summary: 956f458 [PLAT-15132] Pass logging during config parsing 34e7ca9 [PLAT-15189]: ReadOnly user needs to know why they cannot perform actions on releases page ba25c6e [PLAT-14949]: Remove unused UI packages cb7fd11 [PLAT-15187]: Expand universe summary modal shows nodes instead of pods for kubernetes universe bdffcc3 [PLAT-15167]: Incorrect UI validation of the hba conf flag while setting up OIDC 855db3e [PLAT-15105]: Fix TP Secondary Dashboard Issues within YBA c9b277a [PLAT-15068]: Lengthy relation/Tablenames are overlapping d806031 [PLAT-14897][PLAT-14916] Install node agents on nodes for the marked universes for public cloud providers c628f76 [PLAT-15031] yba installer os patch guardrails 8a06c3c [Docs][Voyager] Minor fixes in known-issues PG/Oracle pages (#23842) 135aa6f [PLAT-14959] Disable connect only for onprem 80bc6a8 [PLAT-14564] Add reboot required module for YNP 1099625 [#23808] DocDB: Return after executing rpc callback in UpdateTransactionWaitingForStatus a6981f2 [#23669] YSQL: Fix DDL atomicity assertion failure db28666 iTests fix on master 9f50d34 [PLAT-15216] - fix : Missing guard condition to disable connection pool toggle 49f65a6 [PLAT-14888]: Fix old dr config which does not contains pitr params Excluded: 3bf9301 [#23596] YSQL: Fix setrefs for Bitmap Index Scan Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, tfoucher Differential Revision: https://phorge.dev.yugabyte.com/D37938
Jira Link: DB-12580
Description
Version: 2.23.1.0-b62
Logs: Added in Jira
Noticed the following in the TServer error log during one of the cross-DB DDL tests.
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted
Test details:
G-flags
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: