-
Notifications
You must be signed in to change notification settings - Fork 1.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[docdb] Make YSQL create table metadata consistent #3979
Comments
We can probably tackle this as part of #3097, in a more unified manner. |
Chatted with @m-iancu today to clarify this task. It looks like we can break this into 3 parts:
The main work in #1 is running ysqlsh, injecting lower-level failures, and ensuring that a subsequent create with the same parameters will work. This will mostly mean [a] not adding Namespaces (Databases) to a names map if PGSQL. We will also need to [b] ensure that a yb-master of a partially-created or destroyed table & namespace will perform proper cleanup. Finally, we want to ensure that [c] postgres-level transaction timeouts are respected and any costly IO work is canceled after the deadline. [c] might be tricky since these transactions span a multiple YBClient API requests. |
Summary: Working on integration between YSQL and DocDB for failure scenarios. I ran a number of create/delete commands with regular Sys Catalog failures. A number of issues arose: 1. Ensuring that we end the YBClient on a failure scenario where the 'done' flag is set. 2. Allowing YSQL to control name uniqueness instead of DOCDB to handle failure scenarios. 3. Proper logging of errors. An invalid CHECK would crash the server instead of logging an error. Test Plan: ./build/latest/bin/yb-ts-cli --server_address=127.0.0.1:7100 set_flag -force TEST_sys_catalog_write_rejection_percentage 33 ysqlsh CREATE DATABASE d; // looped CREATE TABLE t; // looped Reviewers: mihnea, bogdan, rahuldesirazu, hector Reviewed By: hector Subscribers: yql, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D8619
Note: End solution needs to support the ROLLBACK command for CREATE TABLE (https://wiki.postgresql.org/wiki/Transactional_DDL_in_PostgreSQL:_A_Competitive_Analysis) |
Summary: Working on integration between YSQL and DocDB for failure scenarios. I ran a number of create/delete commands with regular Sys Catalog failures. A number of issues arose: 1. Ensuring that we end the YBClient on a failure scenario where the 'done' flag is set. 2. Allowing YSQL to control name uniqueness instead of DOCDB to handle failure scenarios. 3. Proper logging of errors. An invalid CHECK would crash the server instead of logging an error. Test Plan: ./build/latest/bin/yb-ts-cli --server_address=127.0.0.1:7100 set_flag -force TEST_sys_catalog_write_rejection_percentage 33 ysqlsh CREATE DATABASE d; // looped CREATE TABLE t; // looped Reviewers: mihnea, bogdan, rahuldesirazu, hector Reviewed By: hector Subscribers: yql, ybase Differential Revision: https://phabricator.dev.yugabyte.com/D8619
…olerance for DDLs Summary: Maintain the ysql catalog version in a (new) system table for new clusters. The table schema is: ``` Table "pg_catalog.pg_yb_catalog_version" Column | Type | Collation | Nullable | Default -----------------------+--------+-----------+----------+--------- db_oid | oid | | not null | current_version | bigint | | not null | last_breaking_version | bigint | | not null | Indexes: "pg_yb_catalog_version_db_oid_index" PRIMARY KEY, ``` This means we now: 1. Only modify the version number if the overall DDL statement (txn) committed successfully (fixing an important correctness bug with version management) 2. Only increment the version number (at most) once per change (minimizing version mismatch errors) - Explicitly distinguish safe vs unsafe changes, so we only need to force a refresh during a transaction if necessary (further minimizing version mismatch errors). Additionally, adjust the order in which internal operations are executed following this design doc: https://github.com/yugabyte/yugabyte-db/blob/master/architecture/design/online-schema-migrations.md The goal being to guarantee that under any failure scenario that cluster is left in a consistent state. Also added a new (master) test gflag `TEST_ysql_catalog_write_rejection_percentage` that rejects writes to YSQL system catalog tables with the set percentage (default `0`). Future work: 1. Implement the upgrade path for old clusters to safely migrate to this new versioning method. 2. Maintain the version per-database instead of one per cluster (schema is already set up for that). 3. Once more schema changes are handled by the online schema migrations plan, adjust their version increment (to become non-breaking) to fully avoid the DDL mismatch errors. 4. While all failures should leave the cluster in a consistent state, some may leave behind orphan DocDB relations. These are harmless from correctness perspective to OID/UUID uniqueness but we should have a simple (perhaps automated way to clean them up). 5. Consider how to handle truncate (in particular when used as part of drop colocated table). Test Plan: 1. Existing jenkins tests 2. Added TestPgDdlFaultTolerance.java 3. Run sysbench with more than one threads. Reviewers: neha, neil, alex, jason Reviewed By: alex, jason Subscribers: alex, vgopalan, mikhail, nicolas, yql, bogdan Differential Revision: https://phabricator.dev.yugabyte.com/D8729
Summary: Postgres now creates a Transaction ID for the DDL operation and sends it to yb-master on CREATE. This enables the following creation flow: 1. Create Transaction ID in Postgres 2. Create in yb-master 3. Postgres Writes its system table data & Commits/Aborts 4. Async thread in yb-master observes the Transaction result and performs a rollback on abort. We query the Transaction Coordinator for this information because yb-master may start its query on Step #4 before Postgres flushes any actual system table writes to the Transaction Participant. This is a best-effort cleanup, so we only rollback the entry if we know that the Transaction failed and keep it around if the async job encounters unknown errors. Can toggle feature with 'FLAGS_enable_transactional_ddl_gc' Test Plan: ybd --cxx-test pg_libpq-test --gtest_filter PgLibPq*TimeoutTest.* --test-timeout-sec 60 Reviewers: mihnea, bogdan, hector, amitanand, sergei, mikhail Reviewed By: mikhail Subscribers: ybase Differential Revision: https://phabricator.dev.yugabyte.com/D8821
Summary: A customer reported that backup failed and the reason is that the function `yb_catalog_version()` does not exist. This function is introduced in the migration script `V1__3979__pg_yb_catalog_version.sql`. After debugging I found there is a code bug in the YSQL upgrade code. Specifically: ``` Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) { int major_version = 0; // Helper macro removing boilerplate. if (VERIFY_RESULT(oneliner_with_result)) { \ ++major_version; \ } else { \ return major_version; \ } // V1: #3979 introducing pg_yb_catalog_version table. INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version")) ``` This function tries to determine which version we should start from. For example, it checks the existence of a non-empty table `pg_yb_catalog_version` to decide whether we need to skip V1 migration script or not. The V1 migration script has two blocks: * The first block introduces the table `pg_catalog.pg_yb_catalog_version` * The second block introduces the function `yb_catalog_version()` In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists but `yb_catalog_version()` does not. As a result, the above code will skip V1 migration script and that's why function `yb_catalog_version` isn't introduced. I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4, `pg_catalog.pg_yb_catalog_version` does not exist so the above code will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()` also exists so it is correct to skip V1 migration script. The customer cluster was manually fixed by reapplying V1 migration script. I made a fix such that after the normal migration has completed, check whether `yb_catalog_version()` exists or not. If it is missing then apply V1 migration script. Test Plan: (1) ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions' (2) Download old release packages such as yugabyte-2.2.7.0 yugabyte-2.4.0.0 yugabyte-2.4.8.0 yugabyte-2.6.0.0, yugabyte-2.6.20.0 yugabyte-2.8.0.0 yugabyte-2.8.12.0 Run the following test manually against each of the old releases. For example: ``` ./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at yb-tserver.INFO ``` W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1 I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1 I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0 I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump ``` Rerun the upgrade command ``` ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at the yb-tserver.INFO again: ``` I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1 I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0 I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform ``` Reviewers: jason, tverona Reviewed By: jason Subscribers: yql Differential Revision: https://phorge.dev.yugabyte.com/D31793
…exist Summary: A customer reported that backup failed and the reason is that the function `yb_catalog_version()` does not exist. This function is introduced in the migration script `V1__3979__pg_yb_catalog_version.sql`. After debugging I found there is a code bug in the YSQL upgrade code. Specifically: ``` Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) { int major_version = 0; // Helper macro removing boilerplate. if (VERIFY_RESULT(oneliner_with_result)) { \ ++major_version; \ } else { \ return major_version; \ } // V1: #3979 introducing pg_yb_catalog_version table. INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version")) ``` This function tries to determine which version we should start from. For example, it checks the existence of a non-empty table `pg_yb_catalog_version` to decide whether we need to skip V1 migration script or not. The V1 migration script has two blocks: * The first block introduces the table `pg_catalog.pg_yb_catalog_version` * The second block introduces the function `yb_catalog_version()` In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists but `yb_catalog_version()` does not. As a result, the above code will skip V1 migration script and that's why function `yb_catalog_version` isn't introduced. I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4, `pg_catalog.pg_yb_catalog_version` does not exist so the above code will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()` also exists so it is correct to skip V1 migration script. The customer cluster was manually fixed by reapplying V1 migration script. I made a fix such that after the normal migration has completed, check whether `yb_catalog_version()` exists or not. If it is missing then apply V1 migration script. Original commit: 55aac07 / D31793 Jira: DB-7466 Test Plan: (1) ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions' (2) Download old release packages such as yugabyte-2.2.7.0 yugabyte-2.4.0.0 yugabyte-2.4.8.0 yugabyte-2.6.0.0, yugabyte-2.6.20.0 yugabyte-2.8.0.0 yugabyte-2.8.12.0 Run the following test manually against each of the old releases. For example: ``` ./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at yb-tserver.INFO ``` W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1 I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1 I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0 I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump ``` Rerun the upgrade command ``` ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at the yb-tserver.INFO again: ``` I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1 I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0 I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform ``` Reviewers: jason, tverona Reviewed By: jason Subscribers: yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31898
…exist Summary: A customer reported that backup failed and the reason is that the function `yb_catalog_version()` does not exist. This function is introduced in the migration script `V1__3979__pg_yb_catalog_version.sql`. After debugging I found there is a code bug in the YSQL upgrade code. Specifically: ``` Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) { int major_version = 0; // Helper macro removing boilerplate. if (VERIFY_RESULT(oneliner_with_result)) { \ ++major_version; \ } else { \ return major_version; \ } // V1: #3979 introducing pg_yb_catalog_version table. INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version")) ``` This function tries to determine which version we should start from. For example, it checks the existence of a non-empty table `pg_yb_catalog_version` to decide whether we need to skip V1 migration script or not. The V1 migration script has two blocks: * The first block introduces the table `pg_catalog.pg_yb_catalog_version` * The second block introduces the function `yb_catalog_version()` In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists but `yb_catalog_version()` does not. As a result, the above code will skip V1 migration script and that's why function `yb_catalog_version` isn't introduced. I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4, `pg_catalog.pg_yb_catalog_version` does not exist so the above code will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()` also exists so it is correct to skip V1 migration script. The customer cluster was manually fixed by reapplying V1 migration script. I made a fix such that after the normal migration has completed, check whether `yb_catalog_version()` exists or not. If it is missing then apply V1 migration script. Original commit: 55aac07 / D31793 Jira: DB-7466 Test Plan: (1) ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions' (2) Download old release packages such as yugabyte-2.2.7.0 yugabyte-2.4.0.0 yugabyte-2.4.8.0 yugabyte-2.6.0.0, yugabyte-2.6.20.0 yugabyte-2.8.0.0 yugabyte-2.8.12.0 Run the following test manually against each of the old releases. For example: ``` ./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at yb-tserver.INFO ``` W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1 I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1 I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0 I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump ``` Rerun the upgrade command ``` ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at the yb-tserver.INFO again: ``` I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1 I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0 I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform ``` Reviewers: jason, tverona Reviewed By: jason Subscribers: yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31929
…exist Summary: A customer reported that backup failed and the reason is that the function `yb_catalog_version()` does not exist. This function is introduced in the migration script `V1__3979__pg_yb_catalog_version.sql`. After debugging I found there is a code bug in the YSQL upgrade code. Specifically: ``` Result<int> GetMajorVersionFromSystemCatalogState(PGConn* pgconn) { int major_version = 0; // Helper macro removing boilerplate. if (VERIFY_RESULT(oneliner_with_result)) { \ ++major_version; \ } else { \ return major_version; \ } // V1: #3979 introducing pg_yb_catalog_version table. INCREMENT_VERSION_OR_RETURN_IT(SystemTableHasRows(pgconn, "pg_yb_catalog_version")) ``` This function tries to determine which version we should start from. For example, it checks the existence of a non-empty table `pg_yb_catalog_version` to decide whether we need to skip V1 migration script or not. The V1 migration script has two blocks: * The first block introduces the table `pg_catalog.pg_yb_catalog_version` * The second block introduces the function `yb_catalog_version()` In a 2.4.x or 2.6.x cluster, `pg_catalog.pg_yb_catalog_version` already exists but `yb_catalog_version()` does not. As a result, the above code will skip V1 migration script and that's why function `yb_catalog_version` isn't introduced. I have verified that any older release < 2.4 (e.g., 2.2.7.0), or any newer release > 2.6 (e.g., 2.8.0.0) do not have this bug. In an older release < 2.4, `pg_catalog.pg_yb_catalog_version` does not exist so the above code will not skip V1 migration script. In a newer release > 2.6, `yb_catalog_version()` also exists so it is correct to skip V1 migration script. The customer cluster was manually fixed by reapplying V1 migration script. I made a fix such that after the normal migration has completed, check whether `yb_catalog_version()` exists or not. If it is missing then apply V1 migration script. Original commit: 55aac07 / D31793 Jira: DB-7466 Test Plan: (1) ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsByNonSuperuser' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsCreatesThemEverywhere' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSharedRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsDontFireTriggers' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemRelsAfterFailure' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#sharedRelsIndexesWork' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#creatingSystemViewsIsLikeInitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#viewReloptionsAreFilteredOnReplace' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#replacingViewKeepsCacheConsistent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#insertOnConflictWithOidsWorks' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#dmlsUpdatePgCache' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#pinnedObjectsCacheIsUpdated' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotent' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#upgradeIsIdempotentSingleConn' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migratingIsEquivalentToReinitdb' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationInGeoPartitionedSetup' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#migrationFilenameComment' ./yb_build.sh release --sj --java-test 'org.yb.pgsql.TestYsqlUpgrade#invalidUpgradeActions' (2) Download old release packages such as yugabyte-2.2.7.0 yugabyte-2.4.0.0 yugabyte-2.4.8.0 yugabyte-2.6.0.0, yugabyte-2.6.20.0 yugabyte-2.8.0.0 yugabyte-2.8.12.0 Run the following test manually against each of the old releases. For example: ``` ./bin/yb-ctl create --timeout-yb-admin-sec 180 --rf 1 --master_flags initial_sys_catalog_snapshot_path=$HOME/tmp/yugabyte-2.4.0.0/share/initial_sys_catalog_snapshot ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at yb-tserver.INFO ``` W0118 17:16:55.246408 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template1 I0118 17:16:55.246526 11555 ysql_upgrade.cc:473] template1: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.246542 11555 ysql_upgrade.cc:481] Found pg_global in migration file V1__3979__pg_yb_catalog_version.sql when applying to template1 I0118 17:16:55.341861 11555 ysql_upgrade.cc:517] template1: migration successfully applied without version bump W0118 17:16:55.478569 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in template0 I0118 17:16:55.478693 11555 ysql_upgrade.cc:473] template0: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.563812 11555 ysql_upgrade.cc:517] template0: migration successfully applied without version bump W0118 17:16:55.700487 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in postgres I0118 17:16:55.700606 11555 ysql_upgrade.cc:473] postgres: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.783318 11555 ysql_upgrade.cc:517] postgres: migration successfully applied without version bump W0118 17:16:55.802189 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in yugabyte I0118 17:16:55.802299 11555 ysql_upgrade.cc:473] yugabyte: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.826416 11555 ysql_upgrade.cc:517] yugabyte: migration successfully applied without version bump W0118 17:16:55.845325 11555 ysql_upgrade.cc:438] Function yb_catalog_version is missing in system_platform I0118 17:16:55.845443 11555 ysql_upgrade.cc:473] system_platform: applying migration 'V1__3979__pg_yb_catalog_version.sql' I0118 17:16:55.865423 11555 ysql_upgrade.cc:517] system_platform: migration successfully applied without version bump ``` Rerun the upgrade command ``` ./build/latest/bin/yb-admin --timeout_ms=720000 upgrade_ysql ``` Look at the yb-tserver.INFO again: ``` I0118 17:34:51.982327 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template1 I0118 17:34:51.985905 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in template0 I0118 17:34:51.989349 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in postgres I0118 17:34:51.992571 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in yugabyte I0118 17:34:51.997953 11558 ysql_upgrade.cc:443] Found function yb_catalog_version in system_platform ``` Reviewers: jason, tverona Reviewed By: jason Subscribers: yql Tags: #jenkins-ready Differential Revision: https://phorge.dev.yugabyte.com/D31939
Jira Link: DB-2461
Currently, a YSQL issued CREATE TABLE will need to update 2 different types of metadata:
While each of these 2 steps could be considered atomic, we need to guarantee consistency across both. Assume these are actually done sequentially right now, first the pg table queries, then the master CreateTable RPC. We consider 4 cases:
Both the pg system table operations and the CreateTable rpc work fine
Awesome, this is the happy path. Metadata is consistent.
Some failures during the pg system table operations.
The transaction can just be cleanly aborted and no side effects are left in the system.
All the pg system table operations go through, but the CreateTable rpc fails
We can hold off on committing the transaction until we actually get a success from the CreateTable RPC. That way, on a failed CreateTable, we can abort the txn, and that part is handled like in 1). As for the CreateTable rpc itself, it should guarantee its own fault tolerance semantics, so upon clear failure, we should be able to try again.
(really 3.1...) All the pg system table operations go through, but the CreateTable rpc times out
Time outs are tricky. Did the CreateTable even make it to the maser? Did it succeed just fine? Did it fail? Is it still being worked on? If we check for a created table, with the same ID, can we be confident that it was our session that generated it?
Handling this final case is the most problematic. To address this, we're considering the following approach. On timeout, rollback the pg transaction.
We already have a background thread in
CatalogManager
which does regular work every 1s (load balancing). We can augment this to also check for invalid tables. We can then tag the YSQL tables with the txn ID of the corresponding pg system table transaction that they were supposed to be consistent with. On cleanup, we can then check against the txn status tablet, by this txn ID, to see if:template1
to a new database name.Couple of open questions:
cc @mbautin @m-iancu @ndeodhar
The text was updated successfully, but these errors were encountered: