[YSQL] Propagate catalog changes more reliably #6238
Summary: To improve YSQL index backfill failure handling, switch over to a new design:

- Make the `pg_index` permissions `indislive`, `indisready`, `indisvalid` the authority, and don't really use the DocDB permissions (only use three of them for persisting state information)
- For the backfill step, add a postgres-to-master RPC, `BackfillIndex`

Now, `CREATE INDEX` with backfill enabled looks like:

1. postgres: send create index request to master
1. master: create index table
1. master: alter index table to have index info at `WRITE_AND_DELETE` perm
   - don't set fully applied schema
   - set schema to `WRITE_AND_DELETE`
   - set `ALTERING` state
1. tservers: apply the alter to the indexed table
1. master: _don't_ move on to the next permission
   - don't set fully applied schema
   - keep schema at `WRITE_AND_DELETE`
1. postgres: create postgres index at `indislive` permission
1. postgres: add `indisready` permission
1. postgres: send backfill request and wait for at least `READ_WRITE_AND_DELETE` permissions
1. master: move on to backfilling
1. master: get safe time for read
1. tservers: get safe time
1. master: send backfill requests to tservers
1. tservers: send backfill requests to postgres
1. master: alter to success (`READ_WRITE_AND_DELETE`) or abort (`WRITE_AND_DELETE_WHILE_REMOVING`)
   - clear fully applied schema
   - set schema to `READ_WRITE_AND_DELETE` or `WRITE_AND_DELETE_WHILE_REMOVING`
   - clear `ALTERING` state
1. postgres: finish waiting and, on success, add `indisvalid` permission

If postgres dies before backfill, master isn't stuck, and a separate request can be made to drop the half-created index. If postgres dies during or after backfill, we can still drop the index, but you may need to kill any backfills in progress. If master fails to backfill, postgres will just stop at that point and not set the index public, so a separate request can be made to drop the index.
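As a rough mental model, the flow above can be sketched as an ordered sequence of (actor, event) steps with a success and a failure branch. This is a hypothetical illustration, not YugabyteDB code; the state and permission names mirror the summary.

```python
# Hypothetical sketch (not YugabyteDB code) of the CREATE INDEX flow above.
# pg_index flags (indislive/indisready/indisvalid) are the authority;
# DocDB permissions only persist state.

def create_index_with_backfill(backfill_succeeds=True):
    """Return the ordered (actor, event) steps for one CREATE INDEX."""
    steps = [
        ("postgres", "send create index request to master"),
        ("master",   "create index table"),
        ("master",   "alter indexed table: index info at WRITE_AND_DELETE, set ALTERING"),
        ("tservers", "apply the alter to the indexed table"),
        ("postgres", "create postgres index at indislive permission"),
        ("postgres", "add indisready permission"),
        ("postgres", "send BackfillIndex RPC; wait for READ_WRITE_AND_DELETE"),
        ("master",   "get safe read time from tservers"),
        ("master",   "send backfill requests to tservers"),
        ("tservers", "send backfill requests to postgres"),
    ]
    if backfill_succeeds:
        steps += [
            ("master",   "alter to READ_WRITE_AND_DELETE, clear ALTERING"),
            ("postgres", "add indisvalid permission"),
        ]
    else:
        steps += [
            ("master",   "alter to WRITE_AND_DELETE_WHILE_REMOVING, clear ALTERING"),
            ("postgres", "stop; index never marked valid, drop it separately"),
        ]
    return steps

for actor, event in create_index_with_backfill():
    print(f"{actor:>8}: {event}")
```

Note how the failure branch never reaches `indisvalid`, which is why a half-created index stays invisible to reads and can be dropped separately.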
Add some gflags:

- master:
  - `ysql_backfill_is_create_table_done_delay_ms`: delay after finding index info in table for YSQL `IsCreateTableDone` (currently helpful, but not foolproof, for preventing backfill from happening while a tserver doesn't have the index info; issue #6234)
  - `ysql_index_backfill_rpc_timeout_ms`: deadline for master-to-tserver backfill tablet RPC calls (useful for handling large tables because we currently don't respect the deadline and try to backfill the entire tablet at once; issue #5326)
- tserver:
  - `TEST_ysql_index_state_flags_update_delay_ms`: delay after committing a `pg_index` state flag change (currently helpful, but not foolproof for consistency, since, in the current design, commits aren't guaranteed to have been seen by all tservers by the time the commit finishes; issue #6238)
  - `ysql_wait_until_index_permissions_timeout_ms`: timeout for the `WaitUntilIndexPermissionsAtLeast` client function (again, useful for handling large tables; issue #5326)

Add some helper functions:

- `yb::ExternalMiniCluster::SetFlagOnMasters` for external minicluster tests (currently unused)
- `yb::pgwrapper::GetBool` for pgwrapper tests

Adjust some tests to the new backfill model:

- `PgLibPqTest.BackfillPermissions`
- `PgLibPqTest.BackfillReadTime`

Fix `BackfillIndexesForYsql` to better handle libpq connections and results.
Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master-to-tserver backfill RPC will not notice that a `DROP INDEX` removes index tablets
- master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218)

Close: #5325

Test Plan:

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: amitanand, mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9682
Other issues:

- `CREATE INDEX` can't be stopped with Ctrl+C, probably because of the long `YBCPgWaitUntilIndexPermissionsAtLeast` call
- An in-progress master-to-tserver backfill RPC will not notice that a `DROP INDEX` removes index tablets
- master leader failover during backfill will not cause backfill to resume, unlike YCQL (issue #6218)

Depends on D9803

Test Plan:

Jenkins: rebase: 2.3

- `./yb_build.sh --cxx-test pgwrapper_pg_libpq-test --gtest_filter`
  - `PgLibPqTest.BackfillWaitBackfillTimeout`
  - `PgLibPqTest.BackfillDropAfterFail`
  - `PgLibPqTest.BackfillMasterLeaderStepdown`
  - `PgLibPqTest.BackfillDropWhileBackfilling`
  - `PgLibPqTest.BackfillAuth`

Reviewers: mihnea

Reviewed By: mihnea

Subscribers: yql, bogdan

Differential Revision: https://phabricator.dev.yugabyte.com/D9804
I believe that the breaking catalog change system is flawed. Here's an example of what can happen:
This is unlikely because table schema versions catch it. However, for index backfill, which tries to do away with DocDB table schema changes, this issue is probably possible.
@jaki Can you give a more concrete example, or elaborate on the current one? I'm not sure I follow.
@m-iancu, first, postgres only checks for the catalog version on either

If postgres spends a long time on a query, it won't notice catalog version changes. Second, writes not part of a transaction get a write time assigned to them on the tserver side (see

Using those two, here's an example I can think of. I have not proven it or tried to produce it.

I fact-checked only some of these things, so please correct me if I'm wrong on a point.
The above concrete example is covered because of the table schema version mismatch: pg-2's table schema doesn't have the index while ts-2's table schema does. The above example can probably be shifted around to be a problem for
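To make the first ingredient concrete, here is a toy model (hypothetical, not YugabyteDB code) of why a backend that only re-checks the shared catalog version at statement start can keep executing against a schema that a concurrent DDL has already changed:

```python
# Toy model (not YugabyteDB code): a pg backend snapshots the catalog
# version at statement start and never re-checks it mid-statement.

class CatalogState:
    """Shared catalog version, bumped by every DDL."""
    def __init__(self):
        self.version = 1

class Backend:
    def __init__(self, catalog):
        self.catalog = catalog
        self.seen_version = None

    def begin_statement(self):
        # The version is only checked here, at statement start.
        self.seen_version = self.catalog.version

    def is_stale(self):
        return self.seen_version != self.catalog.version

catalog = CatalogState()
pg2 = Backend(catalog)

pg2.begin_statement()   # long-running query starts at version 1
catalog.version += 1    # concurrent DDL (e.g. CREATE INDEX) bumps the version
print(pg2.is_stale())   # True: pg2 won't notice until its next statement
```

The staleness window lasts for the whole statement, which is exactly why a long query is the interesting case.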
To add to the write time derivation (I previously only mentioned

```
(gdb) bt
#0  yb::tablet::MvccManager::AddPending (this=this@entry=0x42b8790, ht=..., op_id=..., is_follower_side=is_follower_side@entry=false) at ../../src/yb/tablet/mvcc.cc:335
#1  0x00007f5087710b49 in yb::tablet::MvccManager::AddLeaderPending (this=0x42b8790, op_id=...) at ../../src/yb/tablet/mvcc.cc:308
#2  0x00007f50876f2cc6 in yb::tablet::Operation::AddedToLeader (this=0x431d080, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation.cc:115
#3  0x00007f50876f6efb in yb::tablet::OperationDriver::AddedToLeader (this=0x1b2c480, op_id=..., committed_op_id=...) at ../../src/yb/tablet/operations/operation_driver.cc:203
#4  0x00007f508732488b in yb::consensus::RaftConsensus::AppendNewRoundsToQueueUnlocked (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1259
#5  0x00007f508731fc76 in yb::consensus::RaftConsensus::DoReplicateBatch (this=this@entry=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}, processed_rounds=processed_rounds@entry=0x7f50764dcd78) at ../../src/yb/consensus/raft_consensus.cc:1166
#6  0x00007f508731fdf4 in yb::consensus::RaftConsensus::ReplicateBatch (this=0x45e0250, rounds=std::vector of length 1, capacity 1 = {...}) at ../../src/yb/consensus/raft_consensus.cc:1134
#7  0x00007f508772aceb in yb::tablet::PreparerImpl::ReplicateSubBatch (this=this@entry=0x4267140, batch_begin=..., batch_begin@entry=0x1b2c480, batch_end=..., batch_end@entry=0x0) at ../../src/yb/tablet/preparer.cc:332
#8  0x00007f508772b274 in yb::tablet::PreparerImpl::ProcessAndClearLeaderSideBatch (this=this@entry=0x4267140) at ../../src/yb/tablet/preparer.cc:298
#9  0x00007f508772b710 in yb::tablet::PreparerImpl::Run (this=0x4267140) at ../../src/yb/tablet/preparer.cc:190
#10 0x00007f507ed55a94 in yb::ThreadPool::DispatchThread (this=0x1d95200, permanent=true) at ../../src/yb/util/threadpool.cc:611
#11 0x00007f507ed50b65 in operator() (this=0x1a3a058) at /opt/yb-build/brew/linuxbrew-20181203T161736v9-3ba4c2ed9b0587040949a4a9a95b576f520bae/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#12 yb::Thread::SuperviseThread (arg=0x1a3a000) at ../../src/yb/util/thread.cc:774
#13 0x00007f5079206694 in start_thread (arg=0x7f50764e5700) at pthread_create.c:333
#14 0x00007f5078f4841d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
```

```cpp
HybridTime MvccManager::AddLeaderPending(const OpId& op_id) {
  std::lock_guard<std::mutex> lock(mutex_);
  auto ht = clock_->Now();
  AtomicFlagSleepMs(&FLAGS_TEST_inject_mvcc_delay_add_leader_pending_ms);
  VLOG_WITH_PREFIX(1) << __func__ << "(" << op_id << "), time: " << ht;
  AddPending(ht, op_id, /* is_follower_side= */ false);
  ...
  return ht;
```

```cpp
void Operation::AddedToLeader(const OpId& op_id, const OpId& committed_op_id) {
  HybridTime hybrid_time;
  if (use_mvcc()) {
    hybrid_time = tablet_->mvcc_manager()->AddLeaderPending(op_id);
  } else {
    hybrid_time = tablet_->clock()->Now();
  }
  {
    std::lock_guard<simple_spinlock> l(mutex_);
    hybrid_time_ = hybrid_time;
```

```cpp
HybridTime hybrid_time() const {
  std::lock_guard<simple_spinlock> l(mutex_);
  DCHECK(hybrid_time_.is_valid());
  return hybrid_time_;
}
```

```cpp
HybridTime Operation::WriteHybridTime() const {
  return hybrid_time();
}
```

```cpp
HybridTime WriteOperation::WriteHybridTime() const {
  if (request()->has_external_hybrid_time()) {
    return HybridTime(request()->external_hybrid_time());
  }
  return Operation::WriteHybridTime();
}
```

```cpp
void TabletServiceImpl::Write(const WriteRequestPB* req,
                              WriteResponsePB* resp,
                              rpc::RpcContext context) {
  ...
  auto operation = std::make_unique<WriteOperation>(
      tablet.leader_term, context.GetClientDeadline(), tablet.peer.get(),
      tablet.peer->tablet(), resp);
```

```cpp
Status Tablet::ApplyOperation(
    const Operation& operation, int64_t batch_idx,
    const docdb::KeyValueWriteBatchPB& write_batch,
    AlreadyAppliedToRegularDB already_applied_to_regular_db) {
  auto hybrid_time = operation.WriteHybridTime();
```

Given tablets

On insert, tserver logs show

sst_dump shows

and

So it appears the
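Paraphrasing the call chain above in Python (a simplified sketch, not the real types or names): the write's hybrid time is the client-supplied external hybrid time if present, otherwise the time the leader's MVCC manager assigned when the operation was added to the leader.

```python
# Sketch of the WriteOperation::WriteHybridTime() dispatch quoted above.
# MockClock and the integer "times" are stand-ins for HybridTime/clocks.

class MockClock:
    def __init__(self):
        self.t = 100
    def now(self):
        self.t += 1
        return self.t

class WriteOperation:
    def __init__(self, clock, external_hybrid_time=None):
        self.external_hybrid_time = external_hybrid_time
        # Mirrors AddedToLeader() -> AddLeaderPending(): the leader assigns
        # clock->Now() when the operation is added to its pending set.
        self.leader_assigned_ht = clock.now()

    def write_hybrid_time(self):
        # Mirrors WriteOperation::WriteHybridTime(): prefer external time.
        if self.external_hybrid_time is not None:
            return self.external_hybrid_time
        return self.leader_assigned_ht

clock = MockClock()
op = WriteOperation(clock)            # non-transactional write: tserver-side time
print(op.write_hybrid_time())         # 101
ext = WriteOperation(clock, external_hybrid_time=42)
print(ext.write_hybrid_time())        # 42
```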
Don't forget that postgres catalog caches can be behind even if the tservers aren't. As far as I'm aware, the postgres catalog is only checked

A mismatch in the latter case will cause a catalog version mismatch and (if possible) a single transparent retry, but if the retry also fails, it's a bad user experience.
To the point of de-synchronization between postgres and tserver processes, this is one more reason to consider implementing #9008 in the long term.
Jira Link: DB-1446
This is a broad issue. Hopefully, I'm not creating a duplicate here.
The model of having a single catalog version has flaws, yes. Let's focus on the fact that tservers pull the catalog version changes via heartbeats.
Heartbeats take some time, and they can get lost. It's easy for the YSQL catalog version on tservers to lag behind master sometimes.
YSQL DDL transactions commit their changes, which causes writes to the system tables located on the master sys catalog tablet, and then wait one heartbeat (according to internal docs; I didn't verify). If tservers don't pull in the catalog version bump caused by those system table writes, they will be behind: the commit reached master but possibly not all the tservers.
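To make the lag concrete, here is a hypothetical simulation (not YugabyteDB code) of the heartbeat-pull model: the master bumps the catalog version on DDL commit, and each tserver only learns the new version when its next heartbeat response arrives, so a lost heartbeat leaves it behind.

```python
# Hypothetical model: tservers learn the YSQL catalog version only via
# heartbeat responses from the master, so a missed heartbeat means lag.

class Master:
    def __init__(self):
        self.catalog_version = 1
    def commit_ddl(self):
        self.catalog_version += 1   # DDL commit bumps the version on master

class TServer:
    def __init__(self, name):
        self.name = name
        self.catalog_version = 1
    def heartbeat(self, master, delivered=True):
        if delivered:               # a lost heartbeat leaves this tserver stale
            self.catalog_version = master.catalog_version

master = Master()
ts1, ts2 = TServer("ts-1"), TServer("ts-2")

master.commit_ddl()                     # commit lands on master's sys catalog
ts1.heartbeat(master)                   # ts-1 picks up version 2
ts2.heartbeat(master, delivered=False)  # ts-2's heartbeat is lost

print(ts1.catalog_version, ts2.catalog_version)  # 2 1 -> ts-2 is behind
```

The window closes on ts-2's next successful heartbeat, but until then the two tservers disagree about the catalog version even though the DDL already committed.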
One example of where this can be a problem is the new index design, which relies more on `pg_index` system table commits for determining the schema. We do not want nodes to get more than 2 schemas apart, and that may be possible if commits aren't robustly propagated to all the tservers.