Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 unit test failure #23923

Closed
1 task done
agsh-yb opened this issue Sep 13, 2024 · 0 comments
Closed
1 task done
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage

Comments

@agsh-yb
Copy link
Contributor

agsh-yb commented Sep 13, 2024

Jira Link: DB-12825

Description

This test starts failing on alma8-clang17-asan from commit

Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
Analyze_Trends
../../src/yb/integration-tests/ts_itest-base.cc:396 Value of: daemon->IsShutdown() || daemon->IsProcessAlive() Actual: false Expected: true Daemon: m-2

Issue Type

kind/bug

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@agsh-yb agsh-yb added kind/failing-test Tests and testing infra area/docdb YugabyteDB core features area/unittest tracking platform unit test related issues status/awaiting-triage Issue awaiting triage labels Sep 13, 2024
@yugabyte-ci yugabyte-ci added priority/medium Medium priority issue kind/enhancement This is an enhancement of an existing feature area/ysql Yugabyte SQL (YSQL) and removed area/unittest tracking platform unit test related issues area/docdb YugabyteDB core features labels Sep 13, 2024
myang2021 added a commit that referenced this issue Sep 19, 2024
Summary:
The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
in asan build fails a lot due to a newly added assertion by commit
a6981f2 like:

```
F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn()
```

After debugging, I found that the assertion failure is related to a DDL statement
that does alter table add a unique constraint. It creates an index for
the base table. The DDL transaction involves two tables: the base table
and the index. If we compare the base table's schema to decide whether
the PG side of the DDL transaction has committed or aborted, we will
find that the base table's schema does not change. So it is not possible
to use base table's schema to make a decision of the transaction's
commit/abort state. In this case we must use the index for schema
comparison because it is newly created.

The bug is that the base table's schema is used for schema comparison
and we returned std::nullopt to signify that we do not know whether the
DDL transaction is committed or aborted. However, the std::nullopt is
not only used by the base table, but also used by the index to call
`TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in
the DDL transaction and we just loop through them using the same
std::nullopt. For the index, it is being created so the first DCHECK
below fails:

```
  if (!txn_success.has_value()) {
    // This means the DDL txn only increments the schema version and does not change
    // the DocDB schema at all. It cannot be one of the following 2 cases.
    DCHECK(!l->is_being_created_by_ysql_ddl_txn());
    DCHECK(!l->is_being_deleted_by_ysql_ddl_txn());
    return false;
  }
```

I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to
not use std::optional<bool> for its argument `txn_success`. The function used to
take a simple `bool txn_success` as argument. I get it back and only call the
function when the argument isn't `std::nullopt`. In other words, I now only call
the function when the caller has already figured out the DDL transaction's
commit/abort status.

Added a new unit test that would fail the assertion before the fix.

NOTE: this diff only fixes the DCHECK failure so that we are not worse than before.
However we are not better than before either. There is still a flaw with the current
DDL atomicity work flow based upon schema comparison when the DDL transaction
involves multiple tables. Only the first table's schema is used for comparison
on the grounds that we only need to compare one table's schema against PG catalog
to determine the DDL transaction's committed/aborted status. While this assumption
is generally true, it fails when the first table's schema does not change and only involves
a schema version increment. Currently it is not straightforward to revise the work flow
to pick the right table which can be used to tell the DDL transaction's status based upon
schema comparison. Therefore if the PG side has aborted the DDL transaction, schema
comparison based method will conclude that there is no schema change of the base table
and therefore it simply clears the DDL verification state for **both the base table and the index**,
which is equivalent to a successful commit. This isn't right and the DocDB index will be left
as garbage that isn't cleaned up. This issue is tracked separately by #23988
Jira: DB-12825

Test Plan:
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10

./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D38107
myang2021 added a commit that referenced this issue Sep 20, 2024
Summary:
The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
in asan build fails a lot due to a newly added assertion by commit
a6981f2 like:

```
F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn()
```

After debugging, I found that the assertion failure is related to a DDL statement
that does alter table add a unique constraint. It creates an index for
the base table. The DDL transaction involves two tables: the base table
and the index. If we compare the base table's schema to decide whether
the PG side of the DDL transaction has committed or aborted, we will
find that the base table's schema does not change. So it is not possible
to use base table's schema to make a decision of the transaction's
commit/abort state. In this case we must use the index for schema
comparison because it is newly created.

The bug is that the base table's schema is used for schema comparison
and we returned std::nullopt to signify that we do not know whether the
DDL transaction is committed or aborted. However, the std::nullopt is
not only used by the base table, but also used by the index to call
`TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in
the DDL transaction and we just loop through them using the same
std::nullopt. For the index, it is being created so the first DCHECK
below fails:

```
  if (!txn_success.has_value()) {
    // This means the DDL txn only increments the schema version and does not change
    // the DocDB schema at all. It cannot be one of the following 2 cases.
    DCHECK(!l->is_being_created_by_ysql_ddl_txn());
    DCHECK(!l->is_being_deleted_by_ysql_ddl_txn());
    return false;
  }
```

I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to
not use std::optional<bool> for its argument `txn_success`. The function used to
take a simple `bool txn_success` as argument. I get it back and only call the
function when the argument isn't `std::nullopt`. In other words, I now only call
the function when the caller has already figured out the DDL transaction's
commit/abort status.

Added a new unit test that would fail the assertion before the fix.

NOTE: this diff only fixes the DCHECK failure so that we are not worse than before.
However we are not better than before either. There is still a flaw with the current
DDL atomicity work flow based upon schema comparison when the DDL transaction
involves multiple tables. Only the first table's schema is used for comparison
on the grounds that we only need to compare one table's schema against PG catalog
to determine the DDL transaction's committed/aborted status. While this assumption
is generally true, it fails when the first table's schema does not change and only involves
a schema version increment. Currently it is not straightforward to revise the work flow
to pick the right table which can be used to tell the DDL transaction's status based upon
schema comparison. Therefore if the PG side has aborted the DDL transaction, schema
comparison based method will conclude that there is no schema change of the base table
and therefore it simply clears the DDL verification state for **both the base table and the index**,
which is equivalent to a successful commit. This isn't right and the DocDB index will be left
as garbage that isn't cleaned up. This issue is tracked separately by #23988
Jira: DB-12825

Original commit: 8d228a8 / D38107

Test Plan:
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10

./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38223
myang2021 added a commit that referenced this issue Sep 20, 2024
Summary:
The unit test Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1
in asan build fails a lot due to a newly added assertion by commit
a6981f2 like:

```
F20240916 20:59:08 ../../src/yb/master/catalog_entity_info.cc:531] Check failed: !l->is_being_created_by_ysql_ddl_txn()
```

After debugging, I found that the assertion failure is related to a DDL statement
that does alter table add a unique constraint. It creates an index for
the base table. The DDL transaction involves two tables: the base table
and the index. If we compare the base table's schema to decide whether
the PG side of the DDL transaction has committed or aborted, we will
find that the base table's schema does not change. So it is not possible
to use base table's schema to make a decision of the transaction's
commit/abort state. In this case we must use the index for schema
comparison because it is newly created.

The bug is that the base table's schema is used for schema comparison
and we returned std::nullopt to signify that we do not know whether the
DDL transaction is committed or aborted. However, the std::nullopt is
not only used by the base table, but also used by the index to call
`TableInfo::IsBeingDroppedDueToDdlTxn` because there are two tables in
the DDL transaction and we just loop through them using the same
std::nullopt. For the index, it is being created so the first DCHECK
below fails:

```
  if (!txn_success.has_value()) {
    // This means the DDL txn only increments the schema version and does not change
    // the DocDB schema at all. It cannot be one of the following 2 cases.
    DCHECK(!l->is_being_created_by_ysql_ddl_txn());
    DCHECK(!l->is_being_deleted_by_ysql_ddl_txn());
    return false;
  }
```

I fixed the DCHECK failure by changing the API of `TableInfo::IsBeingDroppedDueToDdlTxn` to
not use std::optional<bool> for its argument `txn_success`. The function used to
take a simple `bool txn_success` as argument. I get it back and only call the
function when the argument isn't `std::nullopt`. In other words, I now only call
the function when the caller has already figured out the DDL transaction's
commit/abort status.

Added a new unit test that would fail the assertion before the fix.

NOTE: this diff only fixes the DCHECK failure so that we are not worse than before.
However we are not better than before either. There is still a flaw with the current
DDL atomicity work flow based upon schema comparison when the DDL transaction
involves multiple tables. Only the first table's schema is used for comparison
on the grounds that we only need to compare one table's schema against PG catalog
to determine the DDL transaction's committed/aborted status. While this assumption
is generally true, it fails when the first table's schema does not change and only involves
a schema version increment. Currently it is not straightforward to revise the work flow
to pick the right table which can be used to tell the DDL transaction's status based upon
schema comparison. Therefore if the PG side has aborted the DDL transaction, schema
comparison based method will conclude that there is no schema change of the base table
and therefore it simply clears the DDL verification state for **both the base table and the index**,
which is equivalent to a successful commit. This isn't right and the DocDB index will be left
as garbage that isn't cleaned up. This issue is tracked separately by #23988
Jira: DB-12825

Original commit: 8d228a8 / D38107

Test Plan:
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test --gtest_filter PgDdlAtomicityTest.TestAlterTableAddUniqueConstraint -n 10

./yb_build.sh asan --cxx-test tools_yb-admin-snapshot-schedule-test --gtest_filter Colocation/YbAdminSnapshotScheduleTestWithYsqlParam.PgsqlAddUniqueConstraint/1 --clang17 -n 10 --tp 1

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D38224
foucher pushed a commit that referenced this issue Sep 23, 2024
Summary:
 ead90cc [#23645] docdb: Fix tests timing out on TSAN after 15786f3
 0a6a31e [doc] Fix BNL flag defaults (#23945)
 54793c8 [#22925] docdb: Persist tserver registry entries to sys catalog
 fbef568 [PLAT-15378][localProvider][dr] Deflake testDrConfigSetup local provider test
 64ac031 [#23978] xCluster: set up sequences_data stream(s) on target universe
 8d228a8 [#23923] YSQL: Fix DDL atomicity check failure
 903d793 [PLAT-15328] Configure cgroup for non rhel9 machines as part of provision
 Excluded: 5dc71ea [#23882] YSQL: Improve cache re-invalidation for alter table commands
 1e70024 [DOC-480] CDC metric description and voyager minor fixes (#24028)
 7a4b409 [#23700] CDCSDK: Use leader epoch instead of leader term in table removal bg task
 4d922ca [#23922] docdb: Handle colocated tablets correctly in tablet limit checks.
 487bc77 [PLAT-15158 Update replication frequency tooltip
 2059eee [#24001] docdb: Replace tablet in tablegroup manager on repartition of colocated table
 90d4e93 [#24020] DocDB: Vector LSM
 294b7bb [PLAT-14435]Fix args parsing in failure detection py script
 Excluded: 872b59e [#23542] YSQL: Add new YSQL function yb_servers_metrics() to fetch metrics such as cpu/memory usage from all nodes in cluster
 252717b [PLAT-12263] G-Flag upgrade fails for tmp_dir if Rolling restart used

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, jenkins-bot

Differential Revision: https://phorge.dev.yugabyte.com/D38266
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue status/awaiting-triage Issue awaiting triage
Projects
None yet
Development

No branches or pull requests

3 participants