-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 unit test failure #23923
Closed
1 task done
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
kind/failing-test
Tests and testing infra
priority/medium
Medium priority issue
status/awaiting-triage
Issue awaiting triage
Comments
agsh-yb
added
kind/failing-test
Tests and testing infra
area/docdb
YugabyteDB core features
area/unittest
tracking platform unit test related issues
status/awaiting-triage
Issue awaiting triage
labels
Sep 13, 2024
yugabyte-ci
added
priority/medium
Medium priority issue
kind/enhancement
This is an enhancement of an existing feature
area/ysql
Yugabyte SQL (YSQL)
and removed
area/unittest
tracking platform unit test related issues
area/docdb
YugabyteDB core features
labels
Sep 13, 2024
myang2021
added a commit
that referenced
this issue
Sep 19, 2024
Summary: The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 in asan build fails a lot due to a newly added assertion by commit a6981f2 like: ``` F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn() ``` After debugging, I found that the assertion failure is related to a DDL statement that does alter table add a unique constraint. It creates an index for the base table. The DDL transaction involves two tables: the base table and the index. If we compare the base table's schema to decide whether the PG side of the DDL transaction has committed or aborted, we will find that the base table's schema does not change. So it is not possible to use base table's schema to make a decision of the transaction's commit/abort state. In this case we must use the index for schema comparison because it is newly created. The bug is that the base table's schema is used for schema comparison and we returned std::nullopt to signify that we do not know whether the DDL transaction is committed or aborted. However, the std::nullopt is not only used by the base table, but also used by the index to call `TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in the DDL transaction and we just loop through them using the same std::nullopt. For the index, it is being created so the first DCHECK below fails: ``` if (!txn_success.has_value()) { // This means the DDL txn only increments the schema version and does not change // the DocDB schema at all. It cannot be one of the following 2 cases. DCHECK(!l->is_being_created_by_ysql_ddl_txn()); DCHECK(!l->is_being_deleted_by_ysql_ddl_txn()); return false; } ``` I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to not use std::optional<bool> for its argument `txn_success`. The function used to take a simple `bool txn_success` as argument. I get it back and only call the function when the argument isn't `std::nullopt`. In other words, I now only call the function when the caller has already figured out the DDL transaction's commit/abort status. Added a new unit test that would fail the assertion before the fix. NOTE: this diff only fixes the DCHECK failure so that we are not worse than before. However we are not better than before either. There is still a flaw with the current DDL atomicity work flow based upon schema comparison when the DDL transaction involves multiple tables. Only the first table's schema is used for comparison on the grounds that we only need to compare one table's schema against PG catalog to determine the DDL transaction's committed/aborted status. While this assumption is generally true, it fails when the first table's schema does not change and only involves a schema version increment. Currently it is not straightforward to revise the work flow to pick the right table which can be used to tell the DDL transaction's status based upon schema comparison. Therefore if the PG side has aborted the DDL transaction, schema comparison based method will conclude that there is no schema change of the base table and therefore it simply clears the DDL verification state for **both the base table and the index**, which is equivalent to a successful commit. This isn't right and the DocDB index will be left as garbage that isn't cleaned up. This issue is tracked separately by #23988 Jira: DB-12825 Test Plan: ./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10 ./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1 Reviewers: fizaa Reviewed By: fizaa Subscribers: ybase, yql Differential Revision: https://phorge.dev.yugabyte.com/D38107
myang2021
added a commit
that referenced
this issue
Sep 20, 2024
Summary: The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 in asan build fails a lot due to a newly added assertion by commit a6981f2 like: ``` F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn() ``` After debugging, I found that the assertion failure is related to a DDL statement that does alter table add a unique constraint. It creates an index for the base table. The DDL transaction involves two tables: the base table and the index. If we compare the base table's schema to decide whether the PG side of the DDL transaction has committed or aborted, we will find that the base table's schema does not change. So it is not possible to use base table's schema to make a decision of the transaction's commit/abort state. In this case we must use the index for schema comparison because it is newly created. The bug is that the base table's schema is used for schema comparison and we returned std::nullopt to signify that we do not know whether the DDL transaction is committed or aborted. However, the std::nullopt is not only used by the base table, but also used by the index to call `TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in the DDL transaction and we just loop through them using the same std::nullopt. For the index, it is being created so the first DCHECK below fails: ``` if (!txn_success.has_value()) { // This means the DDL txn only increments the schema version and does not change // the DocDB schema at all. It cannot be one of the following 2 cases. DCHECK(!l->is_being_created_by_ysql_ddl_txn()); DCHECK(!l->is_being_deleted_by_ysql_ddl_txn()); return false; } ``` I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to not use std::optional<bool> for its argument `txn_success`. The function used to take a simple `bool txn_success` as argument. I get it back and only call the function when the argument isn't `std::nullopt`. In other words, I now only call the function when the caller has already figured out the DDL transaction's commit/abort status. Added a new unit test that would fail the assertion before the fix. NOTE: this diff only fixes the DCHECK failure so that we are not worse than before. However we are not better than before either. There is still a flaw with the current DDL atomicity work flow based upon schema comparison when the DDL transaction involves multiple tables. Only the first table's schema is used for comparison on the grounds that we only need to compare one table's schema against PG catalog to determine the DDL transaction's committed/aborted status. While this assumption is generally true, it fails when the first table's schema does not change and only involves a schema version increment. Currently it is not straightforward to revise the work flow to pick the right table which can be used to tell the DDL transaction's status based upon schema comparison. Therefore if the PG side has aborted the DDL transaction, schema comparison based method will conclude that there is no schema change of the base table and therefore it simply clears the DDL verification state for **both the base table and the index**, which is equivalent to a successful commit. This isn't right and the DocDB index will be left as garbage that isn't cleaned up. This issue is tracked separately by #23988 Jira: DB-12825 Original commit: 8d228a8 / D38107 Test Plan: ./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10 ./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1 Reviewers: fizaa Reviewed By: fizaa Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D38223
myang2021
added a commit
that referenced
this issue
Sep 20, 2024
Summary: The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 in asan build fails a lot due to a newly added assertion by commit a6981f2 like: ``` F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn() ``` After debugging, I found that the assertion failure is related to a DDL statement that does alter table add a unique constraint. It creates an index for the base table. The DDL transaction involves two tables: the base table and the index. If we compare the base table's schema to decide whether the PG side of the DDL transaction has committed or aborted, we will find that the base table's schema does not change. So it is not possible to use base table's schema to make a decision of the transaction's commit/abort state. In this case we must use the index for schema comparison because it is newly created. The bug is that the base table's schema is used for schema comparison and we returned std::nullopt to signify that we do not know whether the DDL transaction is committed or aborted. However, the std::nullopt is not only used by the base table, but also used by the index to call `TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in the DDL transaction and we just loop through them using the same std::nullopt. For the index, it is being created so the first DCHECK below fails: ``` if (!txn_success.has_value()) { // This means the DDL txn only increments the schema version and does not change // the DocDB schema at all. It cannot be one of the following 2 cases. DCHECK(!l->is_being_created_by_ysql_ddl_txn()); DCHECK(!l->is_being_deleted_by_ysql_ddl_txn()); return false; } ``` I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to not use std::optional<bool> for its argument `txn_success`. The function used to take a simple `bool txn_success` as argument. I get it back and only call the function when the argument isn't `std::nullopt`. In other words, I now only call the function when the caller has already figured out the DDL transaction's commit/abort status. Added a new unit test that would fail the assertion before the fix. NOTE: this diff only fixes the DCHECK failure so that we are not worse than before. However we are not better than before either. There is still a flaw with the current DDL atomicity work flow based upon schema comparison when the DDL transaction involves multiple tables. Only the first table's schema is used for comparison on the grounds that we only need to compare one table's schema against PG catalog to determine the DDL transaction's committed/aborted status. While this assumption is generally true, it fails when the first table's schema does not change and only involves a schema version increment. Currently it is not straightforward to revise the work flow to pick the right table which can be used to tell the DDL transaction's status based upon schema comparison. Therefore if the PG side has aborted the DDL transaction, schema comparison based method will conclude that there is no schema change of the base table and therefore it simply clears the DDL verification state for **both the base table and the index**, which is equivalent to a successful commit. This isn't right and the DocDB index will be left as garbage that isn't cleaned up. This issue is tracked separately by #23988 Jira: DB-12825 Original commit: 8d228a8 / D38107 Test Plan: ./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10 ./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1 Reviewers: fizaa Reviewed By: fizaa Subscribers: yql, ybase Differential Revision: https://phorge.dev.yugabyte.com/D38224
foucher
pushed a commit
that referenced
this issue
Sep 23, 2024
Summary: ead90cc [#23645] docdb: Fix tests timing out on TSAN after 15786f3 0a6a31e [doc] Fix BNL flag defaults (#23945) 54793c8 [#22925] docdb: Persist tserver registry entries to sys catalog fbef568 [PLAT-15378][localProvider][dr] Deflake testDrConfigSetup local provider test 64ac031 [#23978] xCluster: set up sequences_data stream(s) on target universe 8d228a8 [#23923] YSQL: Fix DDL atomicity check failure 903d793 [PLAT-15328] Configure cgroup for non rhel9 machines as part of provision Excluded: 5dc71ea [#23882] YSQL: Improve cache re-invalidation for alter table commands 1e70024 [DOC-480] CDC metric description and voyager minor fixes (#24028) 7a4b409 [#23700] CDCSDK: Use leader epoch instead of leader term in table removal bg task 4d922ca [#23922] docdb: Handle colocated tablets correctly in tablet limit checks. 487bc77 [PLAT-15158 Update replication frequency tooltip 2059eee [#24001] docdb: Replace tablet in tablegroup manager on repartition of colocated table 90d4e93 [#24020] DocDB: Vector LSM 294b7bb [PLAT-14435]Fix args parsing in failure detection py script Excluded: 872b59e [#23542] YSQL: Add new YSQL function yb_servers_metrics() to fetch metrics such as cpu/memory usage from all nodes in cluster 252717b [PLAT-12263] G-Flag upgrade fails for tmp_dir if Rolling restart used Test Plan: Jenkins: rebase: pg15-cherrypicks Reviewers: jason, jenkins-bot Differential Revision: https://phorge.dev.yugabyte.com/D38266
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
area/ysql
Yugabyte SQL (YSQL)
kind/enhancement
This is an enhancement of an existing feature
kind/failing-test
Tests and testing infra
priority/medium
Medium priority issue
status/awaiting-triage
Issue awaiting triage
Jira Link: DB-12825
Description
This test starts failing on alma8-clang17-asan from commit
Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
Analyze_Trends
../../src/yb/integration-tests/ts_itest-base.cc:396 Value of: daemon->IsShutdown() || daemon->IsProcessAlive() Actual: false Expected: true Daemon: m-2
Issue Type
kind/bug
Warning: Please confirm that this issue does not contain any sensitive information
The text was updated successfully, but these errors were encountered: