Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL][DDL Atomicity] TServer Error: ReportYsqlDdlTxnStatus Call Failed Due to Transaction State Mismatch (kAborted vs kCommitted) in ysql_ddl_handler.cc #23669

Closed
1 task done
shishir2001-yb opened this issue Aug 27, 2024 · 1 comment
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue qa_stress Bugs identified via Stress automation QA QA filed bugs

Comments

@shishir2001-yb
Copy link

shishir2001-yb commented Aug 27, 2024

Jira Link: DB-12580

Description

Version: 2.23.1.0-b62

Logs: Added in Jira

Noticed the following in the TServer error log during one of the cross-DB DDL tests.
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted

Test details:

        1. Create a cluster with required g-flags
        2. Start the cross DB DDL workload which will execute DDLs and DMLs across databases concurrently (50 colocated
           database and 100 non-colocated database), run this for 20-30 mins
        3. Create a PITR schedule on 20 random database
        4. Start a while loop and run it for 120 mins
          a. Note down time fr PITR(0) 
          b. Create a backup of 1 random database
          c. Start the cross DB DDL workload and stop it after 10 mins
          d. Note down the time for PITR(1)
          e. Start the cross DB DDL workload and run it for 10 mins
          f. Execute PITR on all 10 databases at random times(Between 1-9 sec ago).
          g. Restore to PITR(1)
          h. Validate data
          i. Restore to PITR(0) with a probability of 0.6 and validate data
          j. Delete the PITR schedule for the backup db 
          k. Drop the database 
          l. Restore the backup
          m. Create the snapshot schedule for this new DB

G-flags

           tserver_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "ysql_max_connections": "500",
                'client_read_write_timeout_ms': str(30 * 60 * 1000),
                'yb_client_admin_operation_timeout_sec': str(30 * 60),
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
                "log_ysql_catalog_versions": "true"
            },
            master_gflags={
                "ysql_enable_packed_row": "true",
                "ysql_enable_packed_row_for_colocated_table": "true",
                "enable_automatic_tablet_splitting": "true",
                "consistent_restore": "true",
                "ysql_enable_db_catalog_version_mode": "true",
                "tablet_replicas_per_gib_limit": 0,
                "ysql_pg_conf_csv": "yb_debug_report_error_stacktrace=true",
                "log_ysql_catalog_versions": "true"
            }

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@shishir2001-yb shishir2001-yb added area/ysql Yugabyte SQL (YSQL) QA QA filed bugs status/awaiting-triage Issue awaiting triage qa_stress Bugs identified via Stress automation labels Aug 27, 2024
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Aug 27, 2024
@myang2021
Copy link
Contributor

I found two logs that showed contradicting message:

./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true

vs

./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false

Both are related to the same txn 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985, and their timestamps are close: 04:05:51.536473 vs 04:05:51.54421.
However the first log says the txn has committed (is_committed: true), and the second log says the txn has aborted (is_committed: false)
Seems that two different code paths deduced different commit/abort status for the same txn.

myang2021 added a commit that referenced this issue Aug 28, 2024
Summary:
From the ticket, the following error is noticed:

```
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted
```

I found two logs that showed contradicting message:

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true
```

vs

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false
```

Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their
timestamps are close: 04:05:51.536473 vs 04:05:51.54421.

However the first log says the txn has committed (is_committed: true), and the
second log says the txn has aborted (is_committed: false)
Seems that two different code paths deduced different commit/abort status for the same txn.

This diff adds a bit more logging to help debugging to understand which code
paths were giving different commit/abort status.
Jira: DB-12580

Test Plan:
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19  --clang17

Reviewers: fizaa, hsunder

Reviewed By: hsunder

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D37599
myang2021 added a commit that referenced this issue Aug 28, 2024
…ailure

Summary:
From the ticket, the following error is noticed:

```
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted
```

I found two logs that showed contradicting message:

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true
```

vs

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false
```

Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their
timestamps are close: 04:05:51.536473 vs 04:05:51.54421.

However the first log says the txn has committed (is_committed: true), and the
second log says the txn has aborted (is_committed: false)
Seems that two different code paths deduced different commit/abort status for the same txn.

This diff adds a bit more logging to help debugging to understand which code
paths were giving different commit/abort status.
Jira: DB-12580

Original commit: 63f1d65 / D37599

Test Plan:
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18  --clang17
./yb_build.sh debug --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19  --clang17

Reviewers: fizaa, hsunder

Reviewed By: hsunder

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37625
jasonyb pushed a commit that referenced this issue Aug 29, 2024
Summary:
 bfd17b5 [PLAT-14601]Support restore to Point in time part 2 - Backup/Restore/Restore preflight related changes - Add JSON Property
 91d000a [doc] Thirdparty to integrations (#23598)
 Excluded: 6456777 [#23626] allow loading old dumps that do not have index pg_class OIDs
 c8b8be3 [DB-12586] yugabyted: Update schema migration UI (#23217)
 Excluded: 4ea354b [#23521] YSQL: Cost YB Bitmap Table Scan remote filters
 b80999d [PLAT-14974][PLAT-15045] Added prometheus user as part of yugabyte group
 63f1d65 [#23669] YSQL: Add more logs to debug an assertion failure
 1ad4795 [docs] [TA] Added TA-23476: YCQL currenttimestamp() precision (#23642)
 41ae6b4 [#23653] docdb: Adjust waits for MasterPathHandlersItest.TestUndeletedParentTablet in TSAN
 78b0ae4 [DB-12587] yugabyted: Update data migration UI (#23291)
 d234b3a [PLAT-15046] Create log directory with correct permissions to allow users to export logs without using sudo
 8713c18 [doc][yba] Clarify pre-req for cloud provider image upgrades (#23285)
 9be5c91 [#23448] YSQL: fix failing test PgAutoAnalyzeTest.CheckTableMutationsCount
 Excluded: 9d54710 [#22147] YSQL, QueryDiagnostics: Pgss support for query diagnostics
 417092a [#23373] DocDB: Add max_disk_throughput_mbps gflag to control disk full rejection
 e3a1a36 [PLAT-15035] Add support to sync gflags secret mount location to actual gflag file used by services
 23a6a4c [PLAT-14525][PLAT-14953] Add local provider tests for switchover, failover, change replica, and restart
 6026029 [PLAT-15100][Master]Observed two Scheduled Backup Policies tabs in Backup page
 2cf648b [#23581] CDCSDK: Support dynamic table addition with table removal
 b14851d [#23702] xClusterDDLRepl: Add extra logging
 8a0d6ff [#23645] docdb: Reorder heartbeat handling logic to fix regression.
 2b30b5e [Docs] Changes for Experimental AI (#23714)

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Differential Revision: https://phorge.dev.yugabyte.com/D37645
myang2021 added a commit that referenced this issue Sep 10, 2024
Summary:
From the ticket, the following error is noticed:

```
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted
```

I found two logs that showed contradicting message:

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true
```

vs

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false
```

Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their
timestamps are close: 04:05:51.536473 vs 04:05:51.54421.

However the first log says the txn has committed (is_committed: true), and the
second log says the txn has aborted (is_committed: false)
So two different code paths deduced different commit/abort status for the same txn.

DDL atomicity has two ways for docdb to follow and become eventually
consistent with the PG transaction status (commit or abort):
(1) PG sends a status report to master tell it whether the txn has committed or
not. This is the fastest since PG drives the DDL txn and it knows firsthand.
(2) Docdb has a background task, periodically polling the transaction status
tablet to find out the status of the give txn. If the transaction status tablet
response with the txn status as terminated (not pending), then the same
background task will begin to read the current PG metadata from sys catalog
table, based upon the nature of the DDL, performs a schema comparison to find
out whether the current PG catalog schema indicated a commit or abort:
(2.1) If the current PG catalog schema matches the current DocDB schema (the new
table schema if the DDL successfully commits), it means the DDL txn has
committed.
(2.2) If the current PG catalog schema matches the previous DocDB schema (the
old table schema prior to the DDL starts), it means the DDL txn has aborted.

The bug is that there are some DDL statements that do not change the table's
schema. For example, `alter table foo alter column id set not null` only bumps
up the table foo's schema version without changing the schema at all. As a
result, the current schema is identical to the previous schema. In other words,
both (2.1) and (2.2) will compare as equal and because we check (2.1) first, we
always conclude that the DDL txn has committed. Had we reversed the order by
checking (2.2) first, then we would always conclude that the DDL txn has
aborted.

I made a change to make the type of is_committed from bool to
std::optional<bool>. When is_committed.has_value() is true, it means either
committed or aborted so we continue with the current logic. When is_committed is
nullopt, it means the DDL does not change the table's schema at all.

A new state TxnState::kNoChange is added in addition to the current kUnknown,
kCommitted, kAborted. It is not possible to tell whether the DDL txn has
committed or aborted by doing schema comparison if the DDL does not change the
table schema. However it is fine because we do not need to know whether the DDL
txn is committed or aborted, we only need to know the DDL txn has terminated.
The end result is that the table's schema does not change. We just need to clear
the DDL atomicity related state from the table's DocDB SysTablesEntryPB
metadata and the table's schema will be correct.
Jira: DB-12580

Test Plan: jenkins

Reviewers: hsunder, fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D37755
myang2021 added a commit that referenced this issue Sep 10, 2024
Summary:
From the ticket, the following error is noticed:

```
E0824 04:05:51.544625 262184 pg_client_session.cc:1109] Sending ReportYsqlDdlTxnStatus call failed: Illegal state (yb/master/ysql_ddl_handler.cc:221): Mismatch in txn_state for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985: kAborted vs kCommitted
```

I found two logs that showed contradicting message:

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.536473 291955 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: true
```

vs

```
./yb-master.ip-172-151-28-119.us-west-2.compute.internal.yugabyte.log.INFO.20240824-035800.38216.gz:I0824 04:05:51.544212 185834 ysql_ddl_handler.cc:188] YsqlDdlTxnCompleteCallback for transaction 6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985 is_committed: false
```

Both are related to the same txn `6c3e8bad-d77d-4fc3-b05f-cde9ba7cc985`, and their
timestamps are close: 04:05:51.536473 vs 04:05:51.54421.

However the first log says the txn has committed (is_committed: true), and the
second log says the txn has aborted (is_committed: false)
So two different code paths deduced different commit/abort status for the same txn.

DDL atomicity has two ways for docdb to follow and become eventually
consistent with the PG transaction status (commit or abort):
(1) PG sends a status report to master tell it whether the txn has committed or
not. This is the fastest since PG drives the DDL txn and it knows firsthand.
(2) Docdb has a background task, periodically polling the transaction status
tablet to find out the status of the give txn. If the transaction status tablet
response with the txn status as terminated (not pending), then the same
background task will begin to read the current PG metadata from sys catalog
table, based upon the nature of the DDL, performs a schema comparison to find
out whether the current PG catalog schema indicated a commit or abort:
(2.1) If the current PG catalog schema matches the current DocDB schema (the new
table schema if the DDL successfully commits), it means the DDL txn has
committed.
(2.2) If the current PG catalog schema matches the previous DocDB schema (the
old table schema prior to the DDL starts), it means the DDL txn has aborted.

The bug is that there are some DDL statements that do not change the table's
schema. For example, `alter table foo alter column id set not null` only bumps
up the table foo's schema version without changing the schema at all. As a
result, the current schema is identical to the previous schema. In other words,
both (2.1) and (2.2) will compare as equal and because we check (2.1) first, we
always conclude that the DDL txn has committed. Had we reversed the order by
checking (2.2) first, then we would always conclude that the DDL txn has
aborted.

I made a change to make the type of is_committed from bool to
std::optional<bool>. When is_committed.has_value() is true, it means either
committed or aborted so we continue with the current logic. When is_committed is
nullopt, it means the DDL does not change the table's schema at all.

A new state TxnState::kNoChange is added in addition to the current kUnknown,
kCommitted, kAborted. It is not possible to tell whether the DDL txn has
committed or aborted by doing schema comparison if the DDL does not change the
table schema. However it is fine because we do not need to know whether the DDL
txn is committed or aborted, we only need to know the DDL txn has terminated.
The end result is that the table's schema does not change. We just need to clear
the DDL atomicity related state from the table's DocDB SysTablesEntryPB
metadata and the table's schema will be correct.
Jira: DB-12580

Original commit: a6981f2 / D37755

Test Plan: jenkins

Reviewers: hsunder, fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D37925
jasonyb pushed a commit that referenced this issue Sep 10, 2024
Summary:
 956f458 [PLAT-15132] Pass logging during config parsing
 34e7ca9 [PLAT-15189]: ReadOnly user needs to know why they cannot perform actions on releases page
 ba25c6e [PLAT-14949]: Remove unused UI packages
 cb7fd11 [PLAT-15187]: Expand universe summary modal shows nodes instead of pods for kubernetes universe
 bdffcc3 [PLAT-15167]: Incorrect UI validation of the hba conf flag while setting up OIDC
 855db3e [PLAT-15105]: Fix TP Secondary Dashboard Issues within YBA
 c9b277a [PLAT-15068]: Lengthy relation/Tablenames are overlapping
 d806031 [PLAT-14897][PLAT-14916] Install node agents on nodes for the marked universes for public cloud providers
 c628f76 [PLAT-15031] yba installer os patch guardrails
 8a06c3c [Docs][Voyager] Minor fixes in known-issues PG/Oracle pages (#23842)
 135aa6f [PLAT-14959] Disable connect only for onprem
 80bc6a8 [PLAT-14564] Add reboot required module for YNP
 1099625 [#23808] DocDB: Return after executing rpc callback in UpdateTransactionWaitingForStatus
 a6981f2 [#23669] YSQL: Fix DDL atomicity assertion failure
 db28666 iTests fix on master
 9f50d34 [PLAT-15216] - fix : Missing guard condition to disable connection pool toggle
 49f65a6 [PLAT-14888]: Fix old dr config which does not contains pitr params
 Excluded: 3bf9301 [#23596] YSQL: Fix setrefs for Bitmap Index Scan

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Differential Revision: https://phorge.dev.yugabyte.com/D37938
@yugabyte-ci yugabyte-ci removed the status/awaiting-triage Issue awaiting triage label Jan 5, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/bug This issue is a bug priority/medium Medium priority issue qa_stress Bugs identified via Stress automation QA QA filed bugs
Projects
None yet
Development

No branches or pull requests

3 participants