Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[#23923] YSQL: Fix DDL atomicity check failure
Summary: The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 in asan build fails a lot due to a newly added assertion by commit a6981f2 like: ``` F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn() ``` After debugging, I found that the assertion failure is related to a DDL statement that does alter table add a unique constraint. It creates an index for the base table. The DDL transaction involves two tables: the base table and the index. If we compare the base table's schema to decide whether the PG side of the DDL transaction has committed or aborted, we will find that the base table's schema does not change. So it is not possible to use base table's schema to make a decision of the transaction's commit/abort state. In this case we must use the index for schema comparison because it is newly created. The bug is that the base table's schema is used for schema comparison and we returned std::nullopt to signify that we do not know whether the DDL transaction is committed or aborted. However, the std::nullopt is not only used by the base table, but also used by the index to call `TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in the DDL transaction and we just loop through them using the same std::nullopt. For the index, it is being created so the first DCHECK below fails: ``` if (!txn_success.has_value()) { // This means the DDL txn only increments the schema version and does not change // the DocDB schema at all. It cannot be one of the following 2 cases. DCHECK(!l->is_being_created_by_ysql_ddl_txn()); DCHECK(!l->is_being_deleted_by_ysql_ddl_txn()); return false; } ``` I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to not use std::optional<bool> for its argument `txn_success`. The function used to take a simple `bool txn_success` as argument. I get it back and only call the function when the argument isn't `std::nullopt`. In other words, I now only call the function when the caller has already figured out the DDL transaction's commit/abort status. Added a new unit test that would fail the assertion before the fix. NOTE: this diff only fixes the DCHECK failure so that we are not worse than before. However we are not better than before either. There is still a flaw with the current DDL atomicity work flow based upon schema comparison when the DDL transaction involves multiple tables. Only the first table's schema is used for comparison on the grounds that we only need to compare one table's schema against PG catalog to determine the DDL transaction's committed/aborted status. While this assumption is generally true, it fails when the first table's schema does not change and only involves a schema version increment. Currently it is not straightforward to revise the work flow to pick the right table which can be used to tell the DDL transaction's status based upon schema comparison. Therefore if the PG side has aborted the DDL transaction, schema comparison based method will conclude that there is no schema change of the base table and therefore it simply clears the DDL verification state for **both the base table and the index**, which is equivalent to a successful commit. This isn't right and the DocDB index will be left as garbage that isn't cleaned up. This issue is tracked separately by #23988 Jira: DB-12825 Test Plan: ./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10 ./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1 Reviewers: fizaa Reviewed By: fizaa Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D38107
- Loading branch information