Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[YSQL] YSQL DDL Atomicity #13358

Open
deeps1991 opened this issue Jul 19, 2022 · 1 comment
Open

[YSQL] YSQL DDL Atomicity #13358

deeps1991 opened this issue Jul 19, 2022 · 1 comment
Assignees
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue

Comments

@deeps1991
Copy link
Contributor

deeps1991 commented Jul 19, 2022

Jira Link: DB-2996

Description

Currently YSQL table metadata is stored in both the PG catalog and DocDB sys-catalog. This ticket tracks building infrastructure to roll-back any DocDB side schema changes if the DDL transaction is aborted.

The basic idea is to update the DocDB schema and PG catalog in a 2 phase manner. We bind the DocDB schema change to its corresponding YSQL transaction, maintain both the previous and the in-flux changes in the schema and poll the status of the YSQL transaction. Just two versions of schema is sufficient, because we do not support Concurrent DDL on a table. Once the transaction completes, based on the transaction status, roll-back or roll-forward the changes in the DocDB schema such that they are both consistent.

Preview/Early Adoption

Status Task Comments
Part 1: YB-Master BG Task for DDL Verification DocDB schema changes are bound to a YSQL DDL transaction id. A YB-Master background task is started up for each such DDL, which polls the txn status, compares both schemas and resolves inconsistencies without affecting concurrent DML. Support DROP TABLE, ALTER TABLE ADD COLUMN, Rename of Table/Column
Part 2: YSQL sends DDL status to YB-Master YSQL sends the transaction status to the YB-Master at the end of every DDL transaction, eliminating the need for the expensive YB-Master background task in non-crash scenarios
Part 3: Rollback ALTER TABLE DROP COLUMN Drop Column should happen in 2 phases - first mark the column for deletion and prevent concurrent accesses to the column. After the DDL transaction commits, perform actual cleanup of the data in the column. Otherwise, the delete marker is rolled-back.
Part 4: Support Rollback for Indexes, Tablegroups, Colocated Tables and Materialized Views Extend Rollback support from just regular tables to for Tablegroups, Colocated Tables, Materialized Views and Indexes
Part 5: YB-Backup Support YB-Backup script must be able to handle tables that are undergoing transaction verification at the time of backup
Part 6: Support Sequences #16309

GA

Status Task Comments
⬜️ Part 9: Handle Upgrade Handle the case where a cluster already has tables that have inconsistent DocDB metadata and PG catalog metadata so that future operations that compare PG catalog and DocDB schema do not fail.
⬜️ Part 11: YB-Master identifies stuck rollback tasks In extremely rare cases, we may fail scheduling the YB-Master background task. Add a task to the catalog manager to identify such stuck DDL verification tasks and resume them.

Longer Term/Post GA

Status Task Comments
⬜️ Part 7: 2-phased Rollback of DROP TABLE/INDEX Like DROP COLUMN, implement DROP TABLE/INDEX in 2 phases. First mark the table/index for deletion and prevent concurrent accesses to the table/index. After the DDL transaction commits, perform actual cleanup of the data in the table/index.
⬜️ Part 8: YB-Admin tool to fix inconsistent schemas YB-Admin support that can help identify and resolve any tables that have mismatched schemas
⬜️ Part 10: Support concurrent PITR and DDL Rollback Once #14679 lands, support PITR + Rollback running in parallel
⬜️ Support rolling back other forms of Alter Support for altering the schema, pg_type.
⬜️ Rework Tablespaces Set the tablespace id in the DocDB schema rather than using a periodic background task to read the PG catalog
⬜️ Support Rollback of DDL in Savepoints Support ROLLBACK TO SAVEPOINT when the save point contains DDL operations
@adil246
Copy link
Contributor

adil246 commented Jul 21, 2022

Adding other folks who might be interested in this issue @phendric-uk @jasonriddell

@yugabyte-ci yugabyte-ci added kind/enhancement This is an enhancement of an existing feature and removed kind/bug This issue is a bug labels Oct 5, 2022
deeps1991 added a commit that referenced this issue Nov 17, 2022
Summary:
Currently the metadata for YSQL Tables is stored in both the PG catalog and DocDB schema. Sometimes when a DDL transaction fails, we can hit an issue where these two sources of metadata go out of sync. This is because the PG catalog is modified using the DocDB transactions framework, but the DocDB schema is stored in the Kudu-inherited structures in the YB-Master sys catalog. Thus when a transaction is aborted, only the changes to the PG catalog rollback, but the changes made to the DocDB schema cannot be rolled-back.

This patch contains the first phase of changes required to fix this issue. We now enforce that only one DDL that modifies the DocDB schema can occur on a table. This simplifies the problem because, if a DDL transaction starts, the DocDB schema now only needs to maintain two versions of the schema - the current committed schema, and the uncommitted schema of the ongoing transaction.

This patch then extends the Transaction GC framework introduced in commit: 88e9d59 to also handle ALTER operations and DROP operations. When a DDL transaction modifies the DocDB schema, the transaction metadata, and the current schema of the table is stored in the SysTablesEntryPB. YB-Master monitors the state of the transaction. Once the transaction ends, YB-Master detects whether the transaction was successful or not by comparing the DocDB schema and PG catalog. If the transaction is aborted, YB-Master will rollback the changes made to the DocDB schema using the state present in the SysTablesEntryPB.

**Correctness**
Thus, the two schemas can be described to be "eventually consistent" because we can have periods of time where both the schemas do not match.
However this mostly not affect correctness because of the following two properties:

  - When inconsistent, the DocDB schema always has more columns/constraints than PG schema.
  - Clients always use the PG schema (which is guaranteed to not return uncommitted state) to prepare their read/write requests.

These two properties ensure that we can't have orphaned data or failed integrity checks or use DDL entities created by uncommitted
transactions because of DocDB schema being inconsistent.

**Feature Flag:** ysql_ddl_rollback_enabled
Note that this flag is shared across pg backends and YB-Master. This flag is used by the PG backends to determine whether or not to perform best-effort rollback for ALTER TABLE ADD column introduced in 5a6bfcd
Note that we do not support changing this flag //**during**// a DDL operation.
If this flag is changed to false after a DDL but before the rollback completes, then all future DDL operations will ignore the presence of YsqlTxnVerifierState. Alter operations in particular will clear the YsqlTxnVerifierState as well.

Notes:
  - Since this patch contains only the first phase of changes, this feature is disabled by default using the flag `ysql_ddl_rollback_enabled`
  - This patch only handles Create Table, Alter Table Add Column, Alter Table Add Primary Key, Alter Table Rename Table/Column and Drop Table
  -  This patch does not have support for databases, materialized views, indexes, colocated tables, PG catalog tables and tablegroups.
  - GUC `yb_test_fail_next_ddl` has been changed to fail any upcoming DDL operation, not just CREATE TABLE ddl operations.

Design Doc: https://docs.google.com/document/d/1YAUnSV9X7Gv2ZVpILUe15NbXbVwgFvpV8IRaj6wFMZo/edit?usp=sharing

Test Plan: ybd --cxx-test pg_ddl_atomicity-test

Reviewers: ena, nicolas, myang

Reviewed By: nicolas, myang

Subscribers: zyu, pjain, jmaley, hsunder, yguan, dmitry, ybase, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D18398
jayant07-yb pushed a commit to jayant07-yb/yugabyte-db that referenced this issue Dec 7, 2022
Summary:
Currently the metadata for YSQL Tables is stored in both the PG catalog and DocDB schema. Sometimes when a DDL transaction fails, we can hit an issue where these two sources of metadata go out of sync. This is because the PG catalog is modified using the DocDB transactions framework, but the DocDB schema is stored in the Kudu-inherited structures in the YB-Master sys catalog. Thus when a transaction is aborted, only the changes to the PG catalog rollback, but the changes made to the DocDB schema cannot be rolled-back.

This patch contains the first phase of changes required to fix this issue. We now enforce that only one DDL that modifies the DocDB schema can occur on a table. This simplifies the problem because, if a DDL transaction starts, the DocDB schema now only needs to maintain two versions of the schema - the current committed schema, and the uncommitted schema of the ongoing transaction.

This patch then extends the Transaction GC framework introduced in commit: 88e9d59 to also handle ALTER operations and DROP operations. When a DDL transaction modifies the DocDB schema, the transaction metadata, and the current schema of the table is stored in the SysTablesEntryPB. YB-Master monitors the state of the transaction. Once the transaction ends, YB-Master detects whether the transaction was successful or not by comparing the DocDB schema and PG catalog. If the transaction is aborted, YB-Master will rollback the changes made to the DocDB schema using the state present in the SysTablesEntryPB.

**Correctness**
Thus, the two schemas can be described to be "eventually consistent" because we can have periods of time where both the schemas do not match.
However this mostly not affect correctness because of the following two properties:

  - When inconsistent, the DocDB schema always has more columns/constraints than PG schema.
  - Clients always use the PG schema (which is guaranteed to not return uncommitted state) to prepare their read/write requests.

These two properties ensure that we can't have orphaned data or failed integrity checks or use DDL entities created by uncommitted
transactions because of DocDB schema being inconsistent.

**Feature Flag:** ysql_ddl_rollback_enabled
Note that this flag is shared across pg backends and YB-Master. This flag is used by the PG backends to determine whether or not to perform best-effort rollback for ALTER TABLE ADD column introduced in 5a6bfcd
Note that we do not support changing this flag //**during**// a DDL operation.
If this flag is changed to false after a DDL but before the rollback completes, then all future DDL operations will ignore the presence of YsqlTxnVerifierState. Alter operations in particular will clear the YsqlTxnVerifierState as well.

Notes:
  - Since this patch contains only the first phase of changes, this feature is disabled by default using the flag `ysql_ddl_rollback_enabled`
  - This patch only handles Create Table, Alter Table Add Column, Alter Table Add Primary Key, Alter Table Rename Table/Column and Drop Table
  -  This patch does not have support for databases, materialized views, indexes, colocated tables, PG catalog tables and tablegroups.
  - GUC `yb_test_fail_next_ddl` has been changed to fail any upcoming DDL operation, not just CREATE TABLE ddl operations.

Design Doc: https://docs.google.com/document/d/1YAUnSV9X7Gv2ZVpILUe15NbXbVwgFvpV8IRaj6wFMZo/edit?usp=sharing

Test Plan: ybd --cxx-test pg_ddl_atomicity-test

Reviewers: ena, nicolas, myang

Reviewed By: nicolas, myang

Subscribers: zyu, pjain, jmaley, hsunder, yguan, dmitry, ybase, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D18398
deeps1991 added a commit that referenced this issue Jan 21, 2023
…atus to YB-Master

Summary:
In Part-1, for DDL operations that change the DocDB schema, YB-Master
asynchronously polls the transaction status tablet to fetch the status of those
DDL transactions.
Once the transaction is determined to have completed, Yb-Master performs
the relatively expensive operation of comparing the DocDB schema and PG
schema to determine whether the transaction was a success to know
whether to roll-back the DocDB changes.

This behavior is optimized in this patch, wherein at the end of the
transaction, YSQL sends the status of the transaction to YB-Master. This
way YB-Master can stop polling the transaction status tablet and also
skip the relatively expensive schema comparison.

If the YSQL backend crashes or is unable to send the status of the transaction
to YB-Master, it will still fall-back to the behavior of Part-1 and compare the
DocDB and PG schema to know whether the transaction succeeded or failed.

**Flags**
The above behavior can be controlled by the runtime safe TServer flag **report_ysql_ddl_txn_status_to_master**.
It is turned off for now as ysql_ddl_rollback enabled is false

Test Plan: ybd --cxx-test pg_ddl_atomicity

Reviewers: nicolas, myang, dmitry

Reviewed By: myang, dmitry

Subscribers: lnguyen, yql, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D18915
deeps1991 added a commit that referenced this issue Jun 26, 2023
…OLUMN

Summary:
The current workflow for ALTER TABLE DROP COLUMN is as follows:
1. Start YSQL transaction to update the PG catalog tables
2. Send AlterTable request to the YB-Master. This aborts any ongoing transactions on the table and drops the column. Any future transactions on the table with the old schema fail due to schema version mismatch.
3. YSQL polls the YB-Master until the above Alter operation is complete on all TServers and commits the transaction.

The above workflow has the issue that if the YSQL backend crashes during
Step 3, it is possible that the changes made to the PG catalog are aborted,
but the column (and data) are already dropped on the DocDB side.

It is not safe to merely postpone the actual column drop until the end of
the transaction. In that case, we could have transactions using the dropped
column until either of the following conditions are met:
(1) The new table schema is propagated to all the TServers. This is not instantaneous and can be subject to internal or network delays
(2) The new catalog version is propagated to all the PG backends (which can take a long time, because Alter Table operations do not cause breaking_catalog_version increment. This means caches will be refreshed on the backends only on  transaction boundaries).

This patch introduces the following algorithm for dropping a column:
1. Start YSQL transaction to update the PG catalog tables
2. Send AlterTable request to the YB-Master. This aborts any ongoing transactions on the table and marks the column for deletion. The TServers will fail any future operations on this table that access this dropped column.
3. YSQL polls the YB-Master until the above Alter operation is complete on all TServers and completes the transaction.
4. YB-Master detects that this transaction is complete (either by RPC from YSQL backend ([[ https://phabricator.dev.yugabyte.com/D18915 | Part2]]) or through its background task polling the txn status tablet ([[ https://phabricator.dev.yugabyte.com/D18398 | Part1 ]]).
5. If success, it drops the column. If abort, it rolls-back to the previous schema.

**Design doc:** https://docs.google.com/document/d/1YAUnSV9X7Gv2ZVpILUe15NbXbVwgFvpV8IRaj6wFMZo/edit#heading=h.mc087k1zv9cr

**Note**
1. DROP column now has 2 schema changes. This can cause an increase in the number of SchemaVersionMismatch errors. Ideally, merely marking a column for deletion should not cause schema version mismatch. If the operation is touching a column, the tablet server will drop the operation anyway. Clients on schema n-1 need not fail with SchemaVersionMismatch if they are not accessing this dropped column. This is out of scope for the current diff, which is intended mainly to solve the possibility of data loss when dropping a column. This improvement will be implemented as part of a separate diff.
2. This diff will not catch the case where a query accessing the dropped column falls on an index. This will be covered as part of 2 upcoming changes in DDL atomicity: a) Enabling DDL rollback for indexes b) Implement 2-phase rollback for indexes
Jira: DB-2996

Test Plan:
ybd --cxx-test pg_drop_column_test
ybd --cxx-test pg_ddl_atomicity

Manually run TestAlterTableWithConcurrentTxn.java with ysql_ddl_rollback enabled.

Reviewers: hsunder, myang

Reviewed By: hsunder, myang

Subscribers: pjain, mihnea, tverona, amitanand, ena, jason, ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D19373
deeps1991 added a commit that referenced this issue Aug 17, 2023
…ocatedTables/Tablegroups

Summary:
Commit 6e604ba introduced the YSQL DDL Atomicity framework. However the framework explicitly handled only regular tables.

This diff introduces DDL Atomicity support for Indexes, Matviews, Colocated Tables and Tablegroups and contains the following changes:
(i)  Remove all the checks disabling DDL Atomicity support for entities other than tables.
(ii) If DDL Atomicity is enabled, disable the old method of handling DROP INDEX, DROP TABLEGROUP (The old way involved sending the DocDB request to Drop the entity only after transaction commit)
(iii) The framework in ysql_transaction_ddl must know whether to check pg_class or pg_tablegroup based on the entity for which we are performing DDL verification
(iv) YB-Master hits a deadlock when an index and the indexed table are dropped at the same time. This is because dropping an index acquires a lock on the index and then on the indexed table. Dropping the indexed table acquires a lock on the indexed table and then the index. This case is extremely difficult to hit through YSQL, however if CREATE TABLE with UNIQUE CONSTRAINT runs, we will create a table and index in the same transaction. If this transaction fails, the atomicity framework left as is will issue a DROP for the table and index separately at the same time. This will run into deadlock due to the inverted lock acquisition. To prevent this, introduce separate handling for create-table-with-index/drop-table-with-index - enqueue a transaction verification task only for the indexed table if a CREATE/DROP transaction operates on both the table and the index.

Additionally this diff also fixes the behavior of the ysql_ddl_rollback_enabled flag. Earlier this flag could be modified at either the TServer/YB-Master. Now the flag can only be set at the TServer, and will be propagated to YB-Master as part of the RequestPB proto for the affected DDLs. Thus the value of this flag will be checked only once at the beginning of a DDL transaction.
Jira: DB-2996

Test Plan:
ybd --cxx-test pg_ddl_atomicity
ybd --cxx-test pg_drop_column_test

Reviewers: hsunder, myang

Reviewed By: hsunder, myang

Subscribers: ybase, jhe, ena, yql, bogdan

Differential Revision: https://phorge.dev.yugabyte.com/D22148
deeps1991 added a commit that referenced this issue Sep 25, 2023
Summary:
**Problem**
This patch handles the case where YB-Backup is performed when Ddl verification
is still underway. Currently there is a check in the yb_backup.py script that
loops on the catalog version to verify that no Ddls have happened during the
backupprocess. However, if it is only the DDL verification that happened after a
successful/failed DDL then this will not be picked up the catalog version check.

Thus, it is possible for backup to have run when the PG catalog and DocDB
schema were not in sync with each other. This could lead to the snapshot
containing a table with different metadata compared to the table created on the
target side based on the PG metadata (as it uses ysql_dump).

Currently we try to match the source table and the target table by comparing
their schemas. Since they may mismatch, we can no longer match the tables.
Technically the source side does contain the YsqlDdlTxnVerifierState which
means the snapshot contains both the current and previous schemas. However
it is possible that we find a match for both the current and previous schemas.

Example: Say rename failed for a table because the name was already in use.
It could be possible for the rename to have succeeded on the DocDB side, leading
to 2 tables with the same name on YB-Master. If backup starts at this point, the
snapshot would contain two tables with the same name. When we iterate through
the tables in the snapshot, we would see this table with the transaction
verification state with 2 schemas. However we would find a table that matches
the current schema (the other table with the same name) and a table that matches
the previous schema (the old table before rename). We wouldn't have a simple way
of finding out which table is the matching table.

**Fix**
Fail backup if we find tables in the snapshot with YsqlDdlTxnVerifierState on
them.

**Other minor fixes**
The new tests caught the following problems, so their one-line fixes have been
sneaked into this diff

1.While matching schemas DDL Atomicity should handle the presence of column
**ybidxbasectid** which is a DocDB only column used in UNIQUE indexes
2.Missed a continue statement in YsqlTransactionDdl::MatchPgDocDBSchemaColumns
which was leading to issues in the happy path when ALTER TABLE DROP COLUMN
was successful. It was not caught earlier as existing tests only dropped the
last column in the schema.

**API/Gflag changes**
1.GetTableSchema RPC now additionally returns any YsqlDdlTxnVerifierState for
this table.  Currently this is only used by tests to wait for DDL verification
to end.
2.**FLAGS_retry_if_ddl_txn_verification_pending** is used to indicate whether
CatalogManager::AlterTable and CatalogManager::DeleteTable should best-effort
wait for any existing YsqlDdlTxnVerifierState on the table to be resolved before
throwing an error  (before this patch it threw error immediately). This flag
helps the case when sql scripts (such as running output of ysql_dump/PG regress
tests) run 2 DDLs on a table in quick succession. In this case, without this
best-effort wait, we might fail the sequentially occurring second DDL simply
because DDL verification has not ended. It is not easy for customers to
introduce retries in such DDL scripts, so this best effort check should be
useful. This fix is required in this diff as it is needed for backup tests to
finish successfully (as they run ysql_dump)
Jira: DB-2996

Test Plan:
./yb_build.sh debug --cxx-test yb-backup-cross-feature-test --gtest_filter "*YBDdlAtomicityBackupTest*"
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test

I made some proto changes to YsqlDdlTxnVerifierState structure. It should
theoretically not affect upgrade as it does not change the actual number/type of
fields in the proto but I ran an upgrade test anyway.
1)Revert this diff. Enable DDL rollback but disable DDL verification. Run some
DDLs. This will lead to YB-Master writing SysTableEntryProto with the old format
into disk
2)Stop the cluster. Un-revert the diff and start the cluster with new changes
3)Verify that DDLs were verified and all future DML and DDL succeed with the
new proto changes

Reviewers: jhe, myang, loginov

Reviewed By: jhe, loginov

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D27966
deeps1991 pushed a commit that referenced this issue Oct 20, 2023
…pport

Summary:
Original commit: 6ff2aea / D27966
**Problem**
This patch handles the case where YB-Backup is performed when Ddl verification
is still underway. Currently there is a check in the yb_backup.py script that
loops on the catalog version to verify that no Ddls have happened during the
backupprocess. However, if it is only the DDL verification that happened after a
successful/failed DDL then this will not be picked up the catalog version check.

Thus, it is possible for backup to have run when the PG catalog and DocDB
schema were not in sync with each other. This could lead to the snapshot
containing a table with different metadata compared to the table created on the
target side based on the PG metadata (as it uses ysql_dump).

Currently we try to match the source table and the target table by comparing
their schemas. Since they may mismatch, we can no longer match the tables.
Technically the source side does contain the YsqlDdlTxnVerifierState which
means the snapshot contains both the current and previous schemas. However
it is possible that we find a match for both the current and previous schemas.

Example: Say rename failed for a table because the name was already in use.
It could be possible for the rename to have succeeded on the DocDB side, leading
to 2 tables with the same name on YB-Master. If backup starts at this point, the
snapshot would contain two tables with the same name. When we iterate through
the tables in the snapshot, we would see this table with the transaction
verification state with 2 schemas. However we would find a table that matches
the current schema (the other table with the same name) and a table that matches
the previous schema (the old table before rename). We wouldn't have a simple way
of finding out which table is the matching table.

**Fix**
Fail backup if we find tables in the snapshot with YsqlDdlTxnVerifierState on
them.

**Other minor fixes**
The new tests caught the following problems, so their one-line fixes have been
sneaked into this diff

1.While matching schemas DDL Atomicity should handle the presence of column
**ybidxbasectid** which is a DocDB only column used in UNIQUE indexes
2.Missed a continue statement in YsqlTransactionDdl::MatchPgDocDBSchemaColumns
which was leading to issues in the happy path when ALTER TABLE DROP COLUMN
was successful. It was not caught earlier as existing tests only dropped the
last column in the schema.

**API/Gflag changes**
1.GetTableSchema RPC now additionally returns any YsqlDdlTxnVerifierState for
this table.  Currently this is only used by tests to wait for DDL verification
to end.
2.**FLAGS_retry_if_ddl_txn_verification_pending** is used to indicate whether
CatalogManager::AlterTable and CatalogManager::DeleteTable should best-effort
wait for any existing YsqlDdlTxnVerifierState on the table to be resolved before
throwing an error  (before this patch it threw error immediately). This flag
helps the case when sql scripts (such as running output of ysql_dump/PG regress
tests) run 2 DDLs on a table in quick succession. In this case, without this
best-effort wait, we might fail the sequentially occurring second DDL simply
because DDL verification has not ended. It is not easy for customers to
introduce retries in such DDL scripts, so this best effort check should be
useful. This fix is required in this diff as it is needed for backup tests to
finish successfully (as they run ysql_dump)
Jira: DB-2996

Test Plan:
Jenkins: urgent

./yb_build.sh debug --cxx-test yb-backup-cross-feature-test --gtest_filter "*YBDdlAtomicityBackupTest*"
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test

I made some proto changes to YsqlDdlTxnVerifierState structure. It should
theoretically not affect upgrade as it does not change the actual number/type of
fields in the proto but I ran an upgrade test anyway.
1)Revert this diff. Enable DDL rollback but disable DDL verification. Run some
DDLs. This will lead to YB-Master writing SysTableEntryProto with the old format
into disk
2)Stop the cluster. Un-revert the diff and start the cluster with new changes
3)Verify that DDLs were verified and all future DML and DDL succeed with the
new proto changes

Reviewers: jhe, myang, loginov

Reviewed By: jhe

Subscribers: yql, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29521
myang2021 added a commit that referenced this issue May 17, 2024
Summary:
The way the test YBDdlAtomicityBackupTest::RunDdlAtomicityTest is written, all
the DDLs will be inflight during the backup - so backup will fail. It's better
to test the case where backup fails if any single DDL is inflight. This diff
changes the test to randomly pick one DDL to be inflight and verify the backup
fails.
Jira: DB-2996

Test Plan:
./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter YBDdlAtomicityBackupTest.DdlRollbackAtomicityTest  -n 20

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D35140
myang2021 added a commit that referenced this issue May 17, 2024
…cityTest test change

Summary:
The way the test YBDdlAtomicityBackupTest::RunDdlAtomicityTest is written, all
the DDLs will be inflight during the backup - so backup will fail. It's better
to test the case where backup fails if any single DDL is inflight. This diff
changes the test to randomly pick one DDL to be inflight and verify the backup
fails.
Jira: DB-2996

Original commit: 5ebfef8 / D35140

Test Plan: ./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter YBDdlAtomicityBackupTest.DdlRollbackAtomicityTest  -n 20

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35150
jharveysmith pushed a commit that referenced this issue May 24, 2024
…pport

Summary:
Original commit: 20b330695d58fa4840ab558e1c1134ec74f97721 / D27966
**Problem**
This patch handles the case where YB-Backup is performed when Ddl verification
is still underway. Currently there is a check in the yb_backup.py script that
loops on the catalog version to verify that no Ddls have happened during the
backupprocess. However, if it is only the DDL verification that happened after a
successful/failed DDL then this will not be picked up the catalog version check.

Thus, it is possible for backup to have run when the PG catalog and DocDB
schema were not in sync with each other. This could lead to the snapshot
containing a table with different metadata compared to the table created on the
target side based on the PG metadata (as it uses ysql_dump).

Currently we try to match the source table and the target table by comparing
their schemas. Since they may mismatch, we can no longer match the tables.
Technically the source side does contain the YsqlDdlTxnVerifierState which
means the snapshot contains both the current and previous schemas. However
it is possible that we find a match for both the current and previous schemas.

Example: Say rename failed for a table because the name was already in use.
It could be possible for the rename to have succeeded on the DocDB side, leading
to 2 tables with the same name on YB-Master. If backup starts at this point, the
snapshot would contain two tables with the same name. When we iterate through
the tables in the snapshot, we would see this table with the transaction
verification state with 2 schemas. However we would find a table that matches
the current schema (the other table with the same name) and a table that matches
the previous schema (the old table before rename). We wouldn't have a simple way
of finding out which table is the matching table.

**Fix**
Fail backup if we find tables in the snapshot with YsqlDdlTxnVerifierState on
them.

**Other minor fixes**
The new tests caught the following problems, so their one-line fixes have been
sneaked into this diff

1.While matching schemas DDL Atomicity should handle the presence of column
**ybidxbasectid** which is a DocDB only column used in UNIQUE indexes
2.Missed a continue statement in YsqlTransactionDdl::MatchPgDocDBSchemaColumns
which was leading to issues in the happy path when ALTER TABLE DROP COLUMN
was successful. It was not caught earlier as existing tests only dropped the
last column in the schema.

**API/Gflag changes**
1.GetTableSchema RPC now additionally returns any YsqlDdlTxnVerifierState for
this table.  Currently this is only used by tests to wait for DDL verification
to end.
2.**FLAGS_retry_if_ddl_txn_verification_pending** is used to indicate whether
CatalogManager::AlterTable and CatalogManager::DeleteTable should best-effort
wait for any existing YsqlDdlTxnVerifierState on the table to be resolved before
throwing an error  (before this patch it threw error immediately). This flag
helps the case when sql scripts (such as running output of ysql_dump/PG regress
tests) run 2 DDLs on a table in quick succession. In this case, without this
best-effort wait, we might fail the sequentially occurring second DDL simply
because DDL verification has not ended. It is not easy for customers to
introduce retries in such DDL scripts, so this best effort check should be
useful. This fix is required in this diff as it is needed for backup tests to
finish successfully (as they run ysql_dump)
Jira: DB-2996

Test Plan:
Jenkins: urgent

./yb_build.sh debug --cxx-test yb-backup-cross-feature-test --gtest_filter "*YBDdlAtomicityBackupTest*"
./yb_build.sh debug --cxx-test pg_ddl_atomicity-test

I made some proto changes to YsqlDdlTxnVerifierState structure. It should
theoretically not affect upgrade as it does not change the actual number/type of
fields in the proto but I ran an upgrade test anyway.
1)Revert this diff. Enable DDL rollback but disable DDL verification. Run some
DDLs. This will lead to YB-Master writing SysTableEntryProto with the old format
into disk
2)Stop the cluster. Un-revert the diff and start the cluster with new changes
3)Verify that DDLs were verified and all future DML and DDL succeed with the
new proto changes

Reviewers: jhe, myang, loginov

Reviewed By: jhe

Subscribers: yql, ybase, bogdan

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D29521
svarnau pushed a commit that referenced this issue May 25, 2024
Summary:
The way the test YBDdlAtomicityBackupTest::RunDdlAtomicityTest is written, all
the DDLs will be inflight during the backup - so backup will fail. It's better
to test the case where backup fails if any single DDL is inflight. This diff
changes the test to randomly pick one DDL to be inflight and verify the backup
fails.
Jira: DB-2996

Test Plan:
./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter YBDdlAtomicityBackupTest.DdlRollbackAtomicityTest  -n 20

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D35140
svarnau pushed a commit that referenced this issue May 29, 2024
…cityTest test change

Summary:
The way the test YBDdlAtomicityBackupTest::RunDdlAtomicityTest is written, all
the DDLs will be inflight during the backup - so backup will fail. It's better
to test the case where backup fails if any single DDL is inflight. This diff
changes the test to randomly pick one DDL to be inflight and verify the backup
fails.
Jira: DB-2996

Original commit: 5ebfef8 / D35140

Test Plan: ./yb_build.sh release --cxx-test yb-backup-cross-feature-test --gtest_filter YBDdlAtomicityBackupTest.DdlRollbackAtomicityTest  -n 20

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D35150
myang2021 added a commit that referenced this issue Jul 24, 2024
Summary:
This diff adds a new DDL atomicity stress test. The stress test setup a test
table with some data, then spawns 4 concurrent test threads. The first 3 of
these 4 threads continuously run a pair of complementing DDL statements for 10
iterations, the 4th thread runs a update DML statement for 10 iterations.

Thread 1: alter table add and drop column
Thread 2: add and drop columns with default values
Thread 3: create/drop an index on this table
Thread 4: thread continuously updates the rows on this table.

Because there are many concurrency between DDLs and DMLs, there will be execution
errors that are expected. The test lists all such expected errors. If an error
is one of these expected errors, the test will retry the failed statement until
eventually succeed. The test fails if an unexpected error is encountered.

When all the 4 threads have completed, the test does a number of verifications:

1. the table does not contain any additional columns
2. no indexes are present on this table
3. the rows on this table are updated correctly

The test adds some error injections that are controlled by new test gflags.

This diff also fixes a tsan data race that showed up in the newly added stress tests
by re-acquiring CatalogManager's mutex_ which is already released earlier in order
to fix a deadlock (commit ac184a8).

Note that if we add more test scenarios with more concurrent DDL threads, we may
see more expected errors that are not already listed and the test would fail. In
that case we need to debug the failure. If it turns out that the error is
expected, we need to add the new expected error to the list of expected errors.
If the error indicates a bug, then we need to fix the bug to allow the test to
pass.

The stress test has two sets: non-colocated table and colocated table. A third
set for partitioned table was taken out of this diff. Partitioned table stress
test will be addressed by a separate bug fix diff.
Jira: DB-2996

Test Plan:
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlVerification -n 20 --tp 1

Reviewers: hsunder, zdrudi, fizaa

Reviewed By: zdrudi

Subscribers: ybase, yql

Differential Revision: https://phorge.dev.yugabyte.com/D32682
jasonyb pushed a commit that referenced this issue Jul 24, 2024
Summary:
 Excluded: 7414711 [#20972] YSQL: Fix new local variable names in PG code
 3484013 [#20766] YSQL: Remove time stuff from the GetTableKeyRanges functions.
 8b23629 [#18360] YSQL: fix yb_hash_code compared to an out of range constant
 4ee3511 [PLAT-13438]: Add support for preview gflags
 fe5e675 [#23218] YSQL: Closing the backend connection on receiving the error "Database might have been dropped by another user"
 0901f6c [doc] Default databases (#23261)
 afaf353 [PLAT-14722]Upgrade YBC client and server versio to 2.2.0.0-b3
 fdffc33 [#13358] YSQL: DDL Atomicity Stress Test
 f0b3c46 [ #21499] YSQL: Skipping the test with connection manager as they are unsuitable
 5ed864d [#23260] CDCSDK: Propagate stream expiry & GC error to walsender
 e37d791 [PLAT-14627][PLAT-14729][PLAT-14726]Process group-based specific gflags comprehensively across multiple callsites in YBA
 0e17c97 [PLAT-14737]Missing/incorrect universe status loader positions on dashboard page
 6477bad [PLAT-12434] YNP fix XDG_RUNTIME_DIR to be set in .bashrc for non login shells
 f3ffb18 [docs] Tablet metadata and ASH updates (#23221)

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36814
myang2021 added a commit that referenced this issue Jul 30, 2024
Summary:
This diff adds a new type of DDL atomicity stress test for partitioned table
into the existing DDL atomicity stress test suite. These new tests exposed a
bug that caused them to be flaky (manifested as test timeout after 10 minutes).

The bug happens when the DDL transaction is rolled back. In the DDL transaction
verifier state, we have `verifier_state->tables` to represent all the tables
involved in this transaction. Once we know that the transaction is aborted,
we call `YsqlDdlTxnCompleteCallbackInternal` to process all the tables in
verifier_state->tables. Then we set the verifier state to
YsqlDdlVerificationState::kDdlPostProcessing to indicate that the transaction is
finished.

But it is possible that verifier_state->tables does not already have all the
tables in the txn. For example, a txn involves three tables t1, t2, t3. After t1
and t2 are added to txn, the txn is aborted due to a conflict. In this case
`YsqlDdlTxnCompleteCallbackInternal` is only called on t1 and t2 and the state is
set to kDdlPostProcessing before t3 gets added. Later when t3 gets added. We
currently returns Status::OK() if the state == kDdlPostProcessing, and t3 will
never be processed, leading to test failure due to timeout.

The fix is always process all the tables in verifier_state->tables. In the
above example we re-process t1 and t2 (they will turn into no-op), and process
t3.
Jira: DB-2996

Test Plan:
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlVerification -n 20 --tp 1

Reviewers: hsunder, fizaa

Reviewed By: fizaa

Subscribers: yql, ybase

Differential Revision: https://phorge.dev.yugabyte.com/D36862
myang2021 added a commit that referenced this issue Jul 30, 2024
Summary:
This diff adds a new DDL atomicity stress test. The stress test setup a test
table with some data, then spawns 4 concurrent test threads. The first 3 of
these 4 threads continuously run a pair of complementing DDL statements for 10
iterations, the 4th thread runs a update DML statement for 10 iterations.

Thread 1: alter table add and drop column
Thread 2: add and drop columns with default values
Thread 3: create/drop an index on this table
Thread 4: thread continuously updates the rows on this table.

Because there are many concurrency between DDLs and DMLs, there will be execution
errors that are expected. The test lists all such expected errors. If an error
is one of these expected errors, the test will retry the failed statement until
eventually succeed. The test fails if an unexpected error is encountered.

When all the 4 threads have completed, the test does a number of verifications:

1. the table does not contain any additional columns
2. no indexes are present on this table
3. the rows on this table are updated correctly

The test adds some error injections that are controlled by new test gflags.

This diff also fixes a tsan data race that showed up in the newly added stress tests
by re-acquiring CatalogManager's mutex_ which is already released earlier in order
to fix a deadlock (commit ac184a8).

Note that if we add more test scenarios with more concurrent DDL threads, we may
see more expected errors that are not already listed and the test would fail. In
that case we need to debug the failure. If it turns out that the error is
expected, we need to add the new expected error to the list of expected errors.
If the error indicates a bug, then we need to fix the bug to allow the test to
pass.

The stress test has two sets: non-colocated table and colocated table. A third
set for partitioned table was taken out of this diff. Partitioned table stress
test will be addressed by a separate bug fix diff.
Jira: DB-2996

Original commit: fdffc33 / D32682

Test Plan:
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlVerification -n 20 --tp 1

Reviewers: hsunder, zdrudi, fizaa

Reviewed By: zdrudi

Subscribers: yql, ybase

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36927
jasonyb pushed a commit that referenced this issue Jul 31, 2024
Summary:
 80c02fc [#23296] xCluster: Skip BackfillMetadataForXRepl for SequencesSystemTable
 d4eeace [#23295] YSQL: Fix test failure when the whole of TestPgAuthorization is run.
 baf343f [#13358] YSQL: Fix bug in partitioned table DDL atomicity stress test
 7bf0eed [#23294] YSQL, ASH: Fix flaky test 'TestYbAsh#testEmptyCircularBuffer'
 17828e2 [PLAT-13286] Make ReprovisionNode task retryable
 34e6e7a [#23250] YSQL: UT org.yb.pgsql.TestPgMemoryGC#testPgMemoryGcOrderBy is failing with connection manager
 3104eb7 [PLAT-14204] Azure Provider - Allow the Resource Group for the Network to be overridden per Region.
 e27c3b9 [#18360] YSQL: Rework yb_hash_code comparision with an out of range constants
 2463ee0 [PLAT-14709] Fix DevSpace command for shared Kubernetes clusters

PgDdlAtomicityPartitionedTablesStressTest.* flaky fails with

```
[ts-1] 2024-07-30 15:46:42.577 UTC [16209] ERROR:  Restart read required at: { read: { days: 19934 time: 15:46:42.510692 } local_limit: { days: 19934 time: 15:46:42.510692 } global_limit: <min> in_txn_limit: <max> serial_no: 0 }
[ts-1] 2024-07-30 15:46:42.577 UTC [16209] CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again.
[ts-1] 2024-07-30 15:46:42.577 UTC [16209] STATEMENT:  CREATE INDEX idx_0 ON test_table(key)
[ts-1] TRAP: FailedAssertion("entry->refcount == 1", File: "../../../../../../../src/postgres/src/backend/utils/cache/yb_inheritscache.c", Line: 337, PID: 16209)
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(ExceptionalCondition+0x96)[0xb1f06d]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(YbPgInheritsCacheDelete+0x125)[0xaf8e4c]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(YbPgInheritsCacheInvalidate+0x88)[0xaf8f5f]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX[0xaf90f7]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(InvalidateSystemCachesExtended+0x63)[0xafdd3c]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(CallSystemCacheCallbacks+0x13)[0xafddad]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX[0x9a13fb]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(PostgresMain+0x66c)[0x9a8642]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX[0x8e964d]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(PostmasterMain+0x1d74)[0x8ebca2]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(main+0x0)[0x81ed89]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(main+0x20)[0x81eda9]
[ts-1] /lib64/libc.so.6(__libc_start_main+0xe5)[0x7f1a00fcd7e5]
[ts-1] postgres: postgres yugabyte 127.0.0.1(43848) CREATE INDEX(_start+0x2e)[0x4bbdfe]
```

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36922
myang2021 added a commit that referenced this issue Jul 31, 2024
…icity stress test

Summary:
This diff adds a new type of DDL atomicity stress test for partitioned table
into the existing DDL atomicity stress test suite. These new tests exposed a
bug that caused them to be flaky (manifested as test timeout after 10 minutes).

The bug happens when the DDL transaction is rolled back. In the DDL transaction
verifier state, we have `verifier_state->tables` to represent all the tables
involved in this transaction. Once we know that the transaction is aborted,
we call `YsqlDdlTxnCompleteCallbackInternal` to process all the tables in
verifier_state->tables. Then we set the verifier state to
YsqlDdlVerificationState::kDdlPostProcessing to indicate that the transaction is
finished.

But it is possible that verifier_state->tables does not already have all the
tables in the txn. For example, a txn involves three tables t1, t2, t3. After t1
and t2 are added to txn, the txn is aborted due to a conflict. In this case
`YsqlDdlTxnCompleteCallbackInternal` is only called on t1 and t2 and the state is
set to kDdlPostProcessing before t3 gets added. Later when t3 gets added. We
currently returns Status::OK() if the state == kDdlPostProcessing, and t3 will
never be processed, leading to test failure due to timeout.

The fix is always process all the tables in verifier_state->tables. In the
above example we re-process t1 and t2 (they will turn into no-op), and process
t3.
Jira: DB-2996

Original commit: baf343f / D36862

Test Plan:
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityPartitionedTablesStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest.TestFailDdlVerification -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.BasicTest -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestTxnVerificationFailure -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailCatalogWrites -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlRollback -n 20 --tp 1
./yb_build.sh debug --sj  --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityColocatedStressTest.TestFailDdlVerification -n 20 --tp 1

Reviewers: hsunder, fizaa

Reviewed By: fizaa

Subscribers: ybase, yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D36933
myang2021 added a commit that referenced this issue Aug 5, 2024
Summary:
This diff addresses two previous review comments regarding the DDL atomicity
stress test:

(1) Use parameterized test to repeat the same test logic across different
environments / fixtures instead of virtual methods on test classes.
(2) Make a connection for each test thread instead of making a connection per
request, this increases the concurrency of the test.

As a result of (1), I was able to add one more test scenario naturally/easily:
colocated + partitioned tables.

I also introduced a global_conn that is used for test table/data setup
and test result verification. Added one more allowed error message
"current transaction is expired or aborted" in addition to the existing
"expired or aborted by a conflict".
Jira: DB-2996

Test Plan:
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 -n 20 --tp 1

Reviewers: zdrudi

Reviewed By: zdrudi

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D36996
myang2021 added a commit that referenced this issue Aug 6, 2024
Summary:
This diff addresses two previous review comments regarding the DDL atomicity
stress test:

(1) Use parameterized test to repeat the same test logic across different
environments / fixtures instead of virtual methods on test classes.
(2) Make a connection for each test thread instead of making a connection per
request, this increases the concurrency of the test.

As a result of (1), I was able to add one more test scenario naturally/easily:
colocated + partitioned tables.

I also introduced a global_conn that is used for test table/data setup
and test result verification. Added one more allowed error message
"current transaction is expired or aborted" in addition to the existing
"expired or aborted by a conflict".
Jira: DB-2996

Original commit: ce80f7a / D36996

Test Plan:
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 -n 20 --tp 1
./yb_build.sh fastdebug --gcc11 --sj --cxx-test pg_ddl_atomicity_stress-test --gtest_filter=PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 -n 20 --tp 1

Reviewers: zdrudi

Reviewed By: zdrudi

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37068
jasonyb pushed a commit that referenced this issue Aug 7, 2024
Summary:
 50931bf [#23273] yugabyted: Fix `yugabyted configure_read_replica` commands.
 64e1bf8 [#23278] CDCSDK: Handle non-eligible tables cleanup with drop table while loading CDC stream
 ce80f7a [#13358] YSQL: Refactor DDL Atomicity Stress Test
 Excluded: 6d40d27 [#23407] YSQL: clean up compound BNL logic
 5cb74a7 [PLAT-14164] New Alert for clock drift
 f39c76c [PLAT-14800] Fix yb.allow_db_version_more_than_yba_version being insufficient for YBA/DB version checks
 a42549e [#23377] DocDB: Implement the way to apply vector index updates to DocDB
 3923ec5 [PLAT-14749][Platform]Add a warning message to image upgrade dialog
 709cd92 [PLAT-14848] postgres.service file did not have RestartSec filled out
 da10672 [#23069] docdb: implemented per-iterator readahead for sequential reads
 f439c8a [PLAT-14852]: Do not raise error when JWT_JWKS_URL has valid value and JWT has empty keyset

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37095
myang2021 added a commit that referenced this issue Aug 8, 2024
Summary:
The DDL atomicity stress tests failed more on pg15 branch with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
Jira: DB-2996

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D37111
myang2021 added a commit that referenced this issue Aug 8, 2024
… in tsan build

Summary:
The DDL atomicity stress tests failed more on pg15 branch with an error like:

```
WARNING: ThreadSanitizer: data race (pid=180911)
  Write of size 8 at 0x7b2c000257b8 by thread T17 (mutexes: write M0):
    #0 profile_open_file prof_file.c (libkrb5.so.3+0xf45b3)
    #1 profile_init_flags <null> (libkrb5.so.3+0xfb056)
    #2 k5_os_init_context <null> (libkrb5.so.3+0xe5546)
    #3 krb5_init_context_profile <null> (libkrb5.so.3+0xabc90)
    #4 krb5_init_context <null> (libkrb5.so.3+0xabbd5)
    #5 krb5_gss_init_context init_sec_context.c (libgssapi_krb5.so.2+0x448da)
    #6 acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39159)
    #7 krb5_gss_acquire_cred_from acquire_cred.c (libgssapi_krb5.so.2+0x39072)
    #8 gss_add_cred_from <null> (libgssapi_krb5.so.2+0x1fcd3)
    #9 gss_acquire_cred_from <null> (libgssapi_krb5.so.2+0x1f69d)
    #10 gss_acquire_cred <null> (libgssapi_krb5.so.2+0x1f431)
    #11 pg_GSS_have_cred_cache ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-gssapi-common.c:68:10 (libpq.so.5+0x543fe)
    #12 PQconnectPoll ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2909:22 (libpq.so.5+0x359ca)
    #13 connectDBComplete ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:2241:10 (libpq.so.5+0x30807)
    #14 PQconnectdb ${YB_SRC_ROOT}/src/postgres/src/interfaces/libpq/../../../../../../src/postgres/src/interfaces/libpq/fe-connect.c:719:10 (libpq.so.5+0x30af1)
    #15 yb::pgwrapper::PGConn::Connect(string const&, std::chrono::time_point<yb::CoarseMonoClock, std::chrono::duration<long long, std::ratio<1l, 1000000000l>>>, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:348:24 (libpq_utils.so+0x13c5b)
    #16 yb::pgwrapper::PGConn::Connect(string const&, bool, string const&) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.h:254:12 (libpq_utils.so+0x1a77e)
    #17 yb::pgwrapper::PGConnBuilder::Connect(bool) const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_utils.cc:743:10 (libpq_utils.so+0x1a77e)
    #18 yb::pgwrapper::LibPqTestBase::ConnectToDBAsUser(string const&, string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:54:6 (libpg_wrapper_test_base.so+0x26f34)
    #19 yb::pgwrapper::LibPqTestBase::ConnectToDB(string const&, bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:44:10 (libpg_wrapper_test_base.so+0x26b1e)
    #20 yb::pgwrapper::LibPqTestBase::Connect(bool) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/libpq_test_base.cc:40:10 (libpg_wrapper_test_base.so+0x26b1e)
    #21 yb::pgwrapper::PgDdlAtomicityStressTest::Connect() ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:147:25 (pg_ddl_atomicity_stress-test+0x136d6c)
    #22 yb::pgwrapper::PgDdlAtomicityStressTest::TestDdl(std::vector<string, std::allocator<string>> const&, int) ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:165:15 (pg_ddl_atomicity_stress-test+0x136df5)
    #23 yb::pgwrapper::PgDdlAtomicityStressTest_StressTest_Test::TestBody()::$_2::operator()() const ${YB_SRC_ROOT}/src/yb/yql/pgwrapper/pg_ddl_atomicity_stress-test.cc:316:5 (pg_ddl_atomicity_stress-test+0x13d2eb)
```

It appears that the function `yb::pgwrapper::LibPqTestBase::Connect` isn't
thread safe. I restructured the code to make the connections in a single thread
and then pass them to various concurrent threads for testing.
Jira: DB-2996

Original commit: bd4874b / D37111

Test Plan:
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/0 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/1 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/2 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/3 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/4 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/5 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/6 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/7 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/8 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/9 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/10 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/11 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/12 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/13 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/14 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/15 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/16 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/17 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/18 --clang17
./yb_build.sh tsan --cxx-test pgwrapper_pg_ddl_atomicity_stress-test --gtest_filter PgDdlAtomicityStressTest/PgDdlAtomicityStressTest.StressTest/19 --clang17

Verified that no more tsan errors.

Reviewers: fizaa

Reviewed By: fizaa

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37167
jasonyb pushed a commit that referenced this issue Aug 9, 2024
Summary:
 43d4b49 [PLAT-14771]: Restrict user actions on releases page based on appropriate RBAC permissions
 1473a23 [PLAT-14792]: Master and TServer nodes displays empty tool tip if nodes are unreachable
 Excluded: 78a2bc7 [#23114] YSQL: Refactoring the yb_pg_pgaudit.sql/.out test.
 da9b281 [doc][ybm] Prometheus integration (#23292)
 bd4874b [#13358] YSQL: Fix DDL atomicity stress test failure in tsan build
 edd8e3f [PLAT-14524] Undo all @JsonProperty annotations added earlier to fix APIs
 Excluded: 4a2657e [PLAT-14787] Implement a master tablet lb api for YW
 b9d2e9d [#23445]yugabyted: Node not starting with DNS name and `--secure` option
 f171e13 [#23447]yugabyted: Node doesn't restart with `--secure` enabled
 d4f036d [PLAT-14790] Region metadata is not populated when provisioned nodes are added via Node Agent when using assisted Manual Provisioning (provision_instance.py script)
 71ab66f [doc][ybm] Backup and restore clarifications (#23400)
 4768023 [doc] ysql_yb_bnl_batch_size flag (#23397)
 3d4bc2a [#23117] YSQL: Enable ALTER VIEW in parser
 Excluded: 03bbbed Bumping version to 2.23.1.0 on branch master
 622046d [#23335] DocDB: Set field close timestamp for the log segment being copied
 Excluded: f69b08f [#1999] YSQL: fix temp table INSERT ON CONFLICT
 Excluded: efd4cb7 [#23429] YSQL: fix INSERT ON CONFLICT TupleDesc ref leak
 Excluded: 9e7181f [PLAT-14785] Add REST APIs for job scheduler (auto-master failover)
 d56903c [PLAT-14850]API Token authentication loops through the users and checks token against each of these.

Test Plan: Jenkins: rebase: pg15-cherrypicks

Reviewers: jason, tfoucher

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D37201
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) kind/enhancement This is an enhancement of an existing feature priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

4 participants