Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[docdb] Make YSQL create table metadata consistent #3979

Closed
bmatican opened this issue Mar 16, 2020 · 3 comments
Closed

[docdb] Make YSQL create table metadata consistent #3979

bmatican opened this issue Mar 16, 2020 · 3 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue

Comments

@bmatican
Copy link
Contributor

bmatican commented Mar 16, 2020

Jira Link: DB-2461
Currently, a YSQL issued CREATE TABLE will need to update 2 different types of metadata:

  • pg system tables (reads and writes to the system tables)
  • docdb metadata, via a CreateTable RPC (actually create tablets on TServers and track the table/tablets in the catalog)

While each of these 2 steps could be considered atomic, we need to guarantee consistency across both. Assume these are actually done sequentially right now, first the pg table queries, then the master CreateTable RPC. We consider 4 cases:

  1. Both the pg system table operations and the CreateTable rpc work fine
    Awesome, this is the happy path. Metadata is consistent.

  2. Some failures during the pg system table operations.
    The transaction can just be cleanly aborted and no side effects are left in the system.

  3. All the pg system table operations go through, but the CreateTable rpc fails
    We can hold off on committing the transaction until we actually get a success from the CreateTable RPC. That way, on a failed CreateTable, we can abort the txn, and that part is handled like in 1). As for the CreateTable rpc itself, it should guarantee its own fault tolerance semantics, so upon clear failure, we should be able to try again.

  4. (really 3.1...) All the pg system table operations go through, but the CreateTable rpc times out
    Time outs are tricky. Did the CreateTable even make it to the maser? Did it succeed just fine? Did it fail? Is it still being worked on? If we check for a created table, with the same ID, can we be confident that it was our session that generated it?

Handling this final case is the most problematic. To address this, we're considering the following approach. On timeout, rollback the pg transaction.

  • If the CreateTable failed, we have nothing to worry about.
  • If it succeeded, we have a stray table in the system, so we need a way to cleanup unneeded tables.

We already have a background thread in CatalogManager which does regular work every 1s (load balancing). We can augment this to also check for invalid tables. We can then tag the YSQL tables with the txn ID of the corresponding pg system table transaction that they were supposed to be consistent with. On cleanup, we can then check against the txn status tablet, by this txn ID, to see if:

  • Txn is aborted -- we can delete this table
  • Txn is committed -- this table should be here (this is the happy path, should make sure we don't do this over and over again)
  • Txn info is missing -- could be aborted/committed but cleaned up. Now we have to check the master pg system tables, based on the table UUID (OID for pg system tables). We already have some code to do read/write in sys_catalog, as part of CREATE DATABASE, when copying template1 to a new database name.

Couple of open questions:

  • How do we get the txn ID at the YSQL layer, so we can send it down with the CreateTable request?
  • Do we have an API we can use from the master-side, for checking the status of a txn by ID?
  • How do we read from the pg system tables? Which table should we check against?

cc @mbautin @m-iancu @ndeodhar

@bmatican bmatican added the area/docdb YugabyteDB core features label Mar 16, 2020
@bmatican bmatican assigned nspiegelberg and unassigned hectorgcr Apr 14, 2020
@bmatican
Copy link
Contributor Author

We can probably tackle this as part of #3097, in a more unified manner.

@nspiegelberg
Copy link
Contributor

Chatted with @m-iancu today to clarify this task. It looks like we can break this into 3 parts:

  1. Ensure Failure/Timeout does not prevent us from Creating a Table or Database with the same name. (next release)
  2. Ensure that Failed Table/Database entries are garbage collected. (medium term)
  3. Provide a Rollback execution stack for failed DDL transactions. (long term)

The main work in #1 is running ysqlsh, injecting lower-level failures, and ensuring that a subsequent create with the same parameters will work. This will mostly mean [a] not adding Namespaces (Databases) to a names map if PGSQL. We will also need to [b] ensure that a yb-master of a partially-created or destroyed table & namespace will perform proper cleanup. Finally, we want to ensure that [c] postgres-level transaction timeouts are respected and any costly IO work is canceled after the deadline. [c] might be tricky since these transactions span a multiple YBClient API requests.

nspiegelberg added a commit that referenced this issue Jun 17, 2020
Summary:
Working on integration between YSQL and DocDB for failure scenarios.  I ran a number of
create/delete commands with regular Sys Catalog failures.  A number of issues arose:

  1. Ensuring that we end the YBClient on a failure scenario where the 'done' flag is set.
  2. Allowing YSQL to control name uniqueness instead of DOCDB to handle failure scenarios.
  3. Proper logging of errors. An invalid CHECK would crash the server instead of logging an error.

Test Plan:
./build/latest/bin/yb-ts-cli --server_address=127.0.0.1:7100 set_flag -force TEST_sys_catalog_write_rejection_percentage 33
ysqlsh
  CREATE DATABASE d; // looped
  CREATE TABLE t; // looped

Reviewers: mihnea, bogdan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: yql, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8619
@nspiegelberg
Copy link
Contributor

Note: End solution needs to support the ROLLBACK command for CREATE TABLE (https://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis)

deeps1991 pushed a commit to deeps1991/yugabyte-db that referenced this issue Jul 22, 2020
Summary:
Working on integration between YSQL and DocDB for failure scenarios.  I ran a number of
create/delete commands with regular Sys Catalog failures.  A number of issues arose:

  1. Ensuring that we end the YBClient on a failure scenario where the 'done' flag is set.
  2. Allowing YSQL to control name uniqueness instead of DOCDB to handle failure scenarios.
  3. Proper logging of errors. An invalid CHECK would crash the server instead of logging an error.

Test Plan:
./build/latest/bin/yb-ts-cli --server_address=127.0.0.1:7100 set_flag -force TEST_sys_catalog_write_rejection_percentage 33
ysqlsh
  CREATE DATABASE d; // looped
  CREATE TABLE t; // looped

Reviewers: mihnea, bogdan, rahuldesirazu, hector

Reviewed By: hector

Subscribers: yql, ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8619
m-iancu added a commit that referenced this issue Oct 8, 2020
…olerance for DDLs

Summary:
Maintain the ysql catalog version in a (new) system table for new clusters.
The table schema is:
```
            Table "pg_catalog.pg_yb_catalog_version"
        Column         |  Type  | Collation | Nullable | Default
-----------------------+--------+-----------+----------+---------
 db_oid                | oid    |           | not null |
 current_version       | bigint |           | not null |
 last_breaking_version | bigint |           | not null |
Indexes:
    "pg_yb_catalog_version_db_oid_index" PRIMARY KEY,
```

This means we now:
1. Only modify the version number if the overall DDL statement (txn) committed
   successfully (fixing an important correctness bug with version management)
2. Only increment the version number (at most) once per change (minimizing
   version mismatch errors)
- Explicitly distinguish safe vs unsafe changes, so we only need to force a
  refresh during a transaction if necessary (further minimizing version
  mismatch errors).

Additionally, adjust the order in which internal operations are executed
following this design doc:
https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/online-schema-migrations.md
The goal being to guarantee that under any failure scenario that cluster is
left in a consistent state.

Also added a new (master) test gflag `TEST_ysql_catalog_write_rejection_percentage`
that rejects writes to YSQL system catalog tables with the set percentage (default `0`).

Future work:
1. Implement the upgrade path for old clusters to safely migrate to this new
   versioning method.
2. Maintain the version per-database instead of one per cluster (schema is
   already set up for that).
3. Once more schema changes are handled by the online schema migrations plan,
   adjust their version increment (to become non-breaking) to fully avoid the
   DDL mismatch errors.
4. While all failures should leave the cluster in a consistent state, some may
   leave behind orphan DocDB relations. These are harmless from correctness
   perspective to OID/UUID uniqueness but we should have a simple (perhaps
   automated way to clean them up).
5. Consider how to handle truncate (in particular when used as part of drop colocated table).

Test Plan:
1. Existing jenkins tests
2. Added TestPgDdlFaultTolerance.java
3. Run sysbench with more than one threads.

Reviewers: neha, neil, alex, jason

Reviewed By: alex, jason

Subscribers: alex, vgopalan, mikhail, nicolas, yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D8729
nspiegelberg added a commit that referenced this issue Nov 5, 2020
Summary:
Postgres now creates a Transaction ID for the DDL operation and sends it to yb-master on CREATE.   This enables the following creation flow:

1. Create Transaction ID in Postgres
2. Create in yb-master
3. Postgres Writes its system table data & Commits/Aborts
4. Async thread in yb-master observes the Transaction result and performs a rollback on abort.

We query the Transaction Coordinator for this information because yb-master may start its query on Step #4 before Postgres flushes any actual system table writes to the Transaction Participant.  This is a best-effort cleanup, so we only rollback the entry if we know that the Transaction failed and keep it around if the async job encounters unknown errors.  Can toggle feature with 'FLAGS_enable_transactional_ddl_gc'

Test Plan: ybd --cxx-test pg_libpq-test --gtest_filter PgLibPq*TimeoutTest.* --test-timeout-sec 60

Reviewers: mihnea, bogdan, hector, amitanand, sergei, mikhail

Reviewed By: mikhail

Subscribers: ybase

Differential Revision: https://phabricator.dev.yugabyte.com/D8821
@yugabyte-ci yugabyte-ci added kind/bug This issue is a bug priority/medium Medium priority issue labels Jun 9, 2022
myang2021 added a commit that referenced this issue Jan 22, 2024
Summary:
A customer reported that backup failed and the reason is that the function
`yb_catalog_version()` does not exist. This function is introduced in the
migration script `V1__3979__pg_yb_catalog_version.sql`.

After debugging I found there is a code bug in the YSQL upgrade code.
Specifically:

```
Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) {
  int major_version = 0;

  // Helper macro removing boilerplate.
    if (VERIFY_RESULT(oneliner_with_result)) { \
      ++major_version; \
    } else { \
      return major_version; \
    }

  // V1: #3979 introducing pg_yb_catalog_version table.
  INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version"))
```
This function tries to determine which version we should start from. For
example, it checks the existence of a non-empty table `pg_yb_catalog_version` to
decide whether we need to skip V1 migration script or not. The V1 migration
script has two blocks:

* The first block introduces the table `pg_catalog.pg_yb_catalog_version`
* The second block introduces the function `yb_catalog_version()`

In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists
but `yb_catalog_version()` does not. As a result, the above code will skip V1
migration script and that's why function `yb_catalog_version` isn't introduced.

I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer
release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4,
`pg_catalog.pg_yb_catalog_version` does not exist so the above code
will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()`
also exists so it is correct to skip V1 migration script.

The customer cluster was manually fixed by reapplying V1 migration script. I
made a fix such that after the normal migration has completed, check whether
`yb_catalog_version()` exists or not. If it is missing then apply V1 migration
script.

Test Plan:
(1)

./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions'

(2) Download old release packages such as
yugabyte-2.2.7.0
yugabyte-2.4.0.0
yugabyte-2.4.8.0
yugabyte-2.6.0.0,
yugabyte-2.6.20.0
yugabyte-2.8.0.0
yugabyte-2.8.12.0

Run the following test manually against each of the old releases. For example:

```
./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at yb-tserver.INFO
```
W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1
I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1
I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump
W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0
I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump
W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres
I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump
W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte
I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump
W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform
I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump
```

Rerun the upgrade command
```
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at the yb-tserver.INFO again:
```
I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1
I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0
I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres
I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte
I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform
```

Reviewers: jason, tverona

Reviewed By: jason

Subscribers: yql

Differential Revision: https://phorge.dev.yugabyte.com/D31793
myang2021 added a commit that referenced this issue Jan 24, 2024
…exist

Summary:
A customer reported that backup failed and the reason is that the function
`yb_catalog_version()` does not exist. This function is introduced in the
migration script `V1__3979__pg_yb_catalog_version.sql`.

After debugging I found there is a code bug in the YSQL upgrade code.
Specifically:

```
Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) {
  int major_version = 0;

  // Helper macro removing boilerplate.
    if (VERIFY_RESULT(oneliner_with_result)) { \
      ++major_version; \
    } else { \
      return major_version; \
    }

  // V1: #3979 introducing pg_yb_catalog_version table.
  INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version"))
```
This function tries to determine which version we should start from. For
example, it checks the existence of a non-empty table `pg_yb_catalog_version` to
decide whether we need to skip V1 migration script or not. The V1 migration
script has two blocks:

* The first block introduces the table `pg_catalog.pg_yb_catalog_version`
* The second block introduces the function `yb_catalog_version()`

In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists
but `yb_catalog_version()` does not. As a result, the above code will skip V1
migration script and that's why function `yb_catalog_version` isn't introduced.

I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer
release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4,
`pg_catalog.pg_yb_catalog_version` does not exist so the above code
will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()`
also exists so it is correct to skip V1 migration script.

The customer cluster was manually fixed by reapplying V1 migration script. I
made a fix such that after the normal migration has completed, check whether
`yb_catalog_version()` exists or not. If it is missing then apply V1 migration
script.

Original commit: 55aac07 / D31793
Jira: DB-7466

Test Plan:
(1)

./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions'

(2) Download old release packages such as
yugabyte-2.2.7.0
yugabyte-2.4.0.0
yugabyte-2.4.8.0
yugabyte-2.6.0.0,
yugabyte-2.6.20.0
yugabyte-2.8.0.0
yugabyte-2.8.12.0

Run the following test manually against each of the old releases. For example:

```
./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at yb-tserver.INFO
```
W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1
I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1
I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump
W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0
I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump
W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres
I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump
W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte
I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump
W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform
I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump
```

Rerun the upgrade command
```
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at the yb-tserver.INFO again:
```
I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1
I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0
I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres
I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte
I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform
```

Reviewers: jason, tverona

Reviewed By: jason

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31898
myang2021 added a commit that referenced this issue Jan 24, 2024
…exist

Summary:
A customer reported that backup failed and the reason is that the function
`yb_catalog_version()` does not exist. This function is introduced in the
migration script `V1__3979__pg_yb_catalog_version.sql`.

After debugging I found there is a code bug in the YSQL upgrade code.
Specifically:

```
Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) {
  int major_version = 0;

  // Helper macro removing boilerplate.
    if (VERIFY_RESULT(oneliner_with_result)) { \
      ++major_version; \
    } else { \
      return major_version; \
    }

  // V1: #3979 introducing pg_yb_catalog_version table.
  INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version"))
```
This function tries to determine which version we should start from. For
example, it checks the existence of a non-empty table `pg_yb_catalog_version` to
decide whether we need to skip V1 migration script or not. The V1 migration
script has two blocks:

* The first block introduces the table `pg_catalog.pg_yb_catalog_version`
* The second block introduces the function `yb_catalog_version()`

In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists
but `yb_catalog_version()` does not. As a result, the above code will skip V1
migration script and that's why function `yb_catalog_version` isn't introduced.

I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer
release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4,
`pg_catalog.pg_yb_catalog_version` does not exist so the above code
will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()`
also exists so it is correct to skip V1 migration script.

The customer cluster was manually fixed by reapplying V1 migration script. I
made a fix such that after the normal migration has completed, check whether
`yb_catalog_version()` exists or not. If it is missing then apply V1 migration
script.

Original commit: 55aac07 / D31793
Jira: DB-7466

Test Plan:
(1)

./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions'

(2) Download old release packages such as
yugabyte-2.2.7.0
yugabyte-2.4.0.0
yugabyte-2.4.8.0
yugabyte-2.6.0.0,
yugabyte-2.6.20.0
yugabyte-2.8.0.0
yugabyte-2.8.12.0

Run the following test manually against each of the old releases. For example:

```
./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at yb-tserver.INFO
```
W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1
I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1
I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump
W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0
I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump
W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres
I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump
W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte
I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump
W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform
I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump
```

Rerun the upgrade command
```
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at the yb-tserver.INFO again:
```
I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1
I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0
I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres
I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte
I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform
```

Reviewers: jason, tverona

Reviewed By: jason

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31929
myang2021 added a commit that referenced this issue Jan 25, 2024
…exist

Summary:
A customer reported that backup failed and the reason is that the function
`yb_catalog_version()` does not exist. This function is introduced in the
migration script `V1__3979__pg_yb_catalog_version.sql`.

After debugging I found there is a code bug in the YSQL upgrade code.
Specifically:

```
Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) {
  int major_version = 0;

  // Helper macro removing boilerplate.
    if (VERIFY_RESULT(oneliner_with_result)) { \
      ++major_version; \
    } else { \
      return major_version; \
    }

  // V1: #3979 introducing pg_yb_catalog_version table.
  INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version"))
```
This function tries to determine which version we should start from. For
example, it checks the existence of a non-empty table `pg_yb_catalog_version` to
decide whether we need to skip V1 migration script or not. The V1 migration
script has two blocks:

* The first block introduces the table `pg_catalog.pg_yb_catalog_version`
* The second block introduces the function `yb_catalog_version()`

In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists
but `yb_catalog_version()` does not. As a result, the above code will skip V1
migration script and that's why function `yb_catalog_version` isn't introduced.

I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer
release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4,
`pg_catalog.pg_yb_catalog_version` does not exist so the above code
will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()`
also exists so it is correct to skip V1 migration script.

The customer cluster was manually fixed by reapplying V1 migration script. I
made a fix such that after the normal migration has completed, check whether
`yb_catalog_version()` exists or not. If it is missing then apply V1 migration
script.

Original commit: 55aac07 / D31793
Jira: DB-7466

Test Plan:
(1)

./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment'
./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions'

(2) Download old release packages such as
yugabyte-2.2.7.0
yugabyte-2.4.0.0
yugabyte-2.4.8.0
yugabyte-2.6.0.0,
yugabyte-2.6.20.0
yugabyte-2.8.0.0
yugabyte-2.8.12.0

Run the following test manually against each of the old releases. For example:

```
./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at yb-tserver.INFO
```
W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1
I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1
I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump
W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0
I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump
W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres
I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump
W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte
I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump
W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform
I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql'
I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump
```

Rerun the upgrade command
```
./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql
```

Look at the yb-tserver.INFO again:
```
I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1
I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0
I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres
I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte
I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform
```

Reviewers: jason, tverona

Reviewed By: jason

Subscribers: yql

Tags: #jenkins-ready

Differential Revision: https://phorge.dev.yugabyte.com/D31939
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/bug This issue is a bug priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

4 participants