[YSQL] Propagate catalog changes more reliably #6238
Summary: To improve YSQL index backfill failure handling, switch over to a new design:

- Make the `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information)
- For the backfill step, add a postgres-to-master RPC, `BackfillIndex`

Now, `CREATE INDEX` with backfill enabled looks like:

1. postgres: send create index request to master
1. master: create index table
1. master: alter index table to have index info at `WRITE_AND_DELETE` perm
   - don't set fully applied schema
   - set schema to `WRITE_AND_DELETE`
   - set `ALTERING` state
1. tservers: apply the alter to the indexed table
1. master: _don't_ move on to the next permission
   - don't set fully applied schema
   - keep schema at `WRITE_AND_DELETE`
1. postgres: create postgres index at `indislive` permission
1. postgres: add `indisready` permission
1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions
1. master: move on to backfilling
1. master: get safe time for read
1. tservers: get safe time
1. master: send backfill requests to tservers
1. tservers: send backfill requests to postgres
1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`)
   - clear fully applied schema
   - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING`
   - clear `ALTERING` state
1. postgres: finish waiting and, on success, add `indisvalid` permission

If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index.
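As a rough mental model, the flow above can be sketched as an ordered sequence of (actor, event) steps with a success and a failure branch. This is a hypothetical illustration, not YugabyteDB code; the state and permission names mirror the summary.

```python
# Hypothetical sketch (not YugabyteDB code) of the CREATE INDEX flow above.
# pg_index flags (indislive/indisready/indisvalid) are the authority;
# DocDB permissions only persist state.

def create_index_with_backfill(backfill_succeeds=True):
    """Return the ordered (actor, event) steps for one CREATE INDEX."""
    steps = [
        ("postgres", "send create index request to master"),
        ("master",   "create index table"),
        ("master",   "alter indexed table: index info at WRITE_AND_DELETE, set ALTERING"),
        ("tservers", "apply the alter to the indexed table"),
        ("postgres", "create postgres index at indislive permission"),
        ("postgres", "add indisready permission"),
        ("postgres", "send BackfillIndex RPC; wait for READ_WRITE_AND_DELETE"),
        ("master",   "get safe read time from tservers"),
        ("master",   "send backfill requests to tservers"),
        ("tservers", "send backfill requests to postgres"),
    ]
    if backfill_succeeds:
        steps += [
            ("master",   "alter to READ_WRITE_AND_DELETE, clear ALTERING"),
            ("postgres", "add indisvalid permission"),
        ]
    else:
        steps += [
            ("master",   "alter to WRITE_AND_DELETE_WHILE_REMOVING, clear ALTERING"),
            ("postgres", "stop; index never marked valid, drop it separately"),
        ]
    return steps

for actor, event in create_index_with_backfill():
    print(f"{actor:>8}: {event}")
```

Note how the failure branch never reaches `indisvalid`, which is why a half-created index stays invisible to reads and can be dropped separately.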
Add some gflags:

- master:
  - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info; issue #6234)
  - `ysql_index_backfill_rpc_timeout_ms`: deadline for master-to-tserver backfill tablet RPC calls (useful for handling large tables because we currently don't respect the deadline and try to backfill the entire tablet at once; issue #5326)
- tserver:
  - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing a `pg_index` state flag change (currently helpful, but not foolproof for consistency, since, in the current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes; issue #6238)
  - `ysql_wait_until_index_permissions_timeout_ms`: timeout for the `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables; issue #5326)

Add some helper functions:

- `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused)
- `yb::pgwrapper::GetBool` for pgwrapper tests

Adjust some tests to the new backfill model:

- `PgLibPqTest.BackfillPermissions`
- `PgLibPqTest.BackfillReadTime`

Fix `BackfillIndexesForYsql` to better handle libpq connections and results.
Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master-to-tserver backfill RPC will not notice that a `DROP INDEX` removes index tablets
- master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218)

Close: #5325

Test Plan:

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: amitanand, mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9682
Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master-to-tserver backfill RPC will not notice that a `DROP INDEX` removes index tablets
- master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218)

Depends on D9803

Test Plan:

Jenkins: rebase: 2.3

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9804
I believe that the breaking catalog change system is flawed. Here's an example of what can happen:
This is unlikely because table schema versions catch it. However, for index backfill, which tries to do away with DocDB table schema changes, this issue is probably possible.
@jaki Can you give a more concrete example, or elaborate on the current one? I'm not sure I follow.
@m-iancu, first, postgres only checks for the catalog version on either

If postgres spends a long time on a query, it won't notice catalog version changes. Second, writes not part of a transaction get a write time assigned to them on the tserver side (see

Using those two, here's an example I can think of. I have not proven it or tried to produce it.

I fact-checked only some of these things, so please correct me if I'm wrong on a point.
The above concrete example is covered because of the table schema version mismatch: pg-2's table schema doesn't have the index while ts-2's table schema does. The above example can probably be shifted around to be a problem for
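To make the first ingredient concrete, here is a toy model (hypothetical, not YugabyteDB code) of why a backend that only re-checks the shared catalog version at statement start can keep executing against a schema that a concurrent DDL has already changed:

```python
# Toy model (not YugabyteDB code): a pg backend snapshots the catalog
# version at statement start and never re-checks it mid-statement.

class CatalogState:
    """Shared catalog version, bumped by every DDL."""
    def __init__(self):
        self.version = 1

class Backend:
    def __init__(self, catalog):
        self.catalog = catalog
        self.seen_version = None

    def begin_statement(self):
        # The version is only checked here, at statement start.
        self.seen_version = self.catalog.version

    def is_stale(self):
        return self.seen_version != self.catalog.version

catalog = CatalogState()
pg2 = Backend(catalog)

pg2.begin_statement()   # long-running query starts at version 1
catalog.version += 1    # concurrent DDL (e.g. CREATE INDEX) bumps the version
print(pg2.is_stale())   # True: pg2 won't notice until its next statement
```

The staleness window lasts for the whole statement, which is exactly why a long query is the interesting case.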
To add to the write time derivation (I previously only mentioned

```
(gdb) bt
#0  yb::tablet::MvccManager::AddPending (this=this@entry=0x42b8790, ht=..., op_id=..., is_follower_side=is_follower_side@entry=false) at ../../src/yb/tablet/mvcc.cc:335
#1  0x00007f5087710b49 in yb::tablet::MvccManager::AddLeaderPending (this=0x42b8790, op_id=...) at ../../src/yb/tablet/mvcc.cc:308
#2  0x00007f50876f2cc6 in yb::tablet::Operation::AddedToLeader (this=0x431d080, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation.cc:115
#3  0x00007f50876f6efb in yb::tablet::OperationDriver::AddedToLeader (this=0x1b2c480, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation_driver.cc:203
#4  0x00007f508732488b in yb::consensus::RaftConsensus::AppendNewRoundsToQueueUnlocked (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1259
#5  0x00007f508731fc76 in yb::consensus::RaftConsensus::DoReplicateBatch (this=this@entry=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=processed_rounds@entry=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1166
#6  0x00007f508731fdf4 in yb::consensus::RaftConsensus::ReplicateBatch (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}) at ../../src/yb/consensus/raft_consensus.cc:1134
#7  0x00007f508772aceb in yb::tablet::PreparerImpl::ReplicateSubBatch (this=this@entry=0x4267140, batch_begin=..., batch_begin@entry=0x1b2c480, batch_end=..., batch_end@entry=0x0) at ../../src/yb/tablet/preparer.cc:332
#8  0x00007f508772b274 in yb::tablet::PreparerImpl::ProcessAndClearLeaderSideBatch (this=this@entry=0x4267140) at ../../src/yb/tablet/preparer.cc:298
#9  0x00007f508772b710 in yb::tablet::PreparerImpl::Run (this=0x4267140) at ../../src/yb/tablet/preparer.cc:190
#10 0x00007f507ed55a94 in yb::ThreadPool::DispatchThread (this=0x1d95200, permanent=true) at ../../src/yb/util/threadpool.cc:611
#11 0x00007f507ed50b65 in operator() (this=0x1a3a058) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#12 yb::Thread::SuperviseThread (arg=0x1a3a000) at ../../src/yb/util/thread.cc:774
#13 0x00007f5079206694 in start_thread (arg=0x7f50764e5700) at pthread_create.c:333
#14 0x00007f5078f4841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

```cpp
HybridTime MvccManager::AddLeaderPending(const OpId& op_id) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto ht = clock_->Now();
  AtomicFlagSleepMs(&FLAGS_TEST_inject_mvcc_delay_add_leader_pending_ms);
  VLOG_WITH_PREFIX(1) << __func__ << "(" << op_id << "), time: " << ht;
  AddPending(ht, op_id, /* is_follower_side= */ false);
  ...
  return ht;
```

```cpp
void Operation::AddedToLeader(const OpId& op_id, const OpId& committed_op_id) {
  HybridTime hybrid_time;
  if (use_mvcc()) {
    hybrid_time = tablet_->mvcc_manager()->AddLeaderPending(op_id);
  } else {
    hybrid_time = tablet_->clock()->Now();
  }
  {
    std::lock_guard<simple_spinlock> l(mutex_);
    hybrid_time_ = hybrid_time;
```

```cpp
HybridTime hybrid_time() const {
  std::lock_guard<simple_spinlock> l(mutex_);
  DCHECK(hybrid_time_.is_valid());
  return hybrid_time_;
}
```

```cpp
HybridTime Operation::WriteHybridTime() const {
  return hybrid_time();
}
```

```cpp
HybridTime WriteOperation::WriteHybridTime() const {
  if (request()->has_external_hybrid_time()) {
    return HybridTime(request()->external_hybrid_time());
  }
  return Operation::WriteHybridTime();
}
```

```cpp
void TabletServiceImpl::Write(const WriteRequestPB* req,
                              WriteResponsePB* resp,
                              rpc::RpcContext context) {
  ...
  auto operation = std::make_unique<WriteOperation>(
      tablet.leader_term, context.GetClientDeadline(), tablet.peer.get(),
      tablet.peer->tablet(), resp);
```

```cpp
Status Tablet::ApplyOperation(
    const Operation& operation, int64_t batch_idx,
    const docdb::KeyValueWriteBatchPB& write_batch,
    AlreadyAppliedToRegularDB already_applied_to_regular_db) {
  auto hybrid_time = operation.WriteHybridTime();
```

Given tablets

On insert, tserver logs show

sst_dump shows

and

So it appears the
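Paraphrasing the call chain above in Python (a simplified sketch, not the real types or names): the write's hybrid time is the client-supplied external hybrid time if present, otherwise the time the leader's MVCC manager assigned when the operation was added to the leader.

```python
# Sketch of the WriteOperation::WriteHybridTime() dispatch quoted above.
# MockClock and the integer "times" are stand-ins for HybridTime/clocks.

class MockClock:
    def __init__(self):
        self.t = 100
    def now(self):
        self.t += 1
        return self.t

class WriteOperation:
    def __init__(self, clock, external_hybrid_time=None):
        self.external_hybrid_time = external_hybrid_time
        # Mirrors AddedToLeader() -> AddLeaderPending(): the leader assigns
        # clock->Now() when the operation is added to its pending set.
        self.leader_assigned_ht = clock.now()

    def write_hybrid_time(self):
        # Mirrors WriteOperation::WriteHybridTime(): prefer external time.
        if self.external_hybrid_time is not None:
            return self.external_hybrid_time
        return self.leader_assigned_ht

clock = MockClock()
op = WriteOperation(clock)            # non-transactional write: tserver-side time
print(op.write_hybrid_time())         # 101
ext = WriteOperation(clock, external_hybrid_time=42)
print(ext.write_hybrid_time())        # 42
```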
Don't forget that postgres catalog caches can be behind even if the tservers aren't. As far as I'm aware, the postgres catalog is only checked

A mismatch in the latter case will cause a catalog version mismatch and (if possible) a single transparent retry, but if the retry also fails, it's a bad user experience.
To the point of de-synchronization between postgres and tserver processes, this is one more reason to consider implementing #9008 in the long term.
Jira Link: DB-1446
This is a broad issue. Hopefully, I'm not creating a duplicate here.
The model of having a single catalog version has flaws, yes. Let's focus on the fact that tservers pull the catalog version changes via heartbeats.
Heartbeats take some time, and they can get lost. It's easy for the YSQL catalog version on tservers to lag behind master sometimes.
YSQL DDL transactions commit their changes, which causes writes to the system tables located on the master sys catalog tablet, and then wait one heartbeat (according to internal docs; I didn't verify). If tservers don't pull in the catalog version bump caused by those system table writes, they will be behind: the commit reached master but possibly not all the tservers.
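To make the lag concrete, here is a hypothetical simulation (not YugabyteDB code) of the heartbeat-pull model: the master bumps the catalog version on DDL commit, and each tserver only learns the new version when its next heartbeat response arrives, so a lost heartbeat leaves it behind.

```python
# Hypothetical model: tservers learn the YSQL catalog version only via
# heartbeat responses from the master, so a missed heartbeat means lag.

class Master:
    def __init__(self):
        self.catalog_version = 1
    def commit_ddl(self):
        self.catalog_version += 1   # DDL commit bumps the version on master

class TServer:
    def __init__(self, name):
        self.name = name
        self.catalog_version = 1
    def heartbeat(self, master, delivered=True):
        if delivered:               # a lost heartbeat leaves this tserver stale
            self.catalog_version = master.catalog_version

master = Master()
ts1, ts2 = TServer("ts-1"), TServer("ts-2")

master.commit_ddl()                     # commit lands on master's sys catalog
ts1.heartbeat(master)                   # ts-1 picks up version 2
ts2.heartbeat(master, delivered=False)  # ts-2's heartbeat is lost

print(ts1.catalog_version, ts2.catalog_version)  # 2 1 -> ts-2 is behind
```

The window closes on ts-2's next successful heartbeat, but until then the two tservers disagree about the catalog version even though the DDL already committed.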
One example of where this can be a problem is the new index design, which relies more on `pg_index` system table commits for determining the schema. We do not want nodes to get more than 2 schemas apart, and that may be possible if commits aren't robustly propagated to all the tservers.