Skip to content

Commit

Permalink
[#24549] docdb: Fix flaky YbAdminSnapshotSchedule restore tests
Browse files Browse the repository at this point in the history
Summary:
This diff re-adds the retry logic removed in https://phorge.dev.yugabyte.com/D35120 to
`GetSuitableSnapshotForRestore`. Without this logic, YbAdminSnapshotSchedule tests sometimes fail
with the following error:
```
Creating snapshot in progress: 0b84f78a-4a44-4f55-901e-78eccbefa8f0 (passed 0.003s): PARALLEL_SNAPSHOT_OPERATION (master error 25)
```
This mostly affects tests because we take snapshots much more frequently than in production
(currently, every 6s). I prefer the retry approach to checking for this error on the client because:
1. The approach was already used before I removed it
2. Checking the error on the client means we have to do so for clone and backup/restore separately,
and it's annoying to do so because the tests using the yb-admin interface for restore (so the error
output would have to be string-parsed)
3. This fixes the issue for production as well (even though there re-running is a simple fix).
Jira: DB-13584

Test Plan:
Created a YbAdminSnapshotScheduleTest which sets an artificial 1s delay in the create
snapshot path by setting `TEST_delay_tablet_export_metadata_ms=1000`, and with a snapshot interval
of 1s. The test just creates a table and restores to the current time. This reliably reproduced the
PARALLEL_SNAPSHOT_ERROR message.

```
TEST_P(YbAdminSnapshotScheduleTestWithYsqlColocationRestoreParam, Anubhav) {
  auto schedule_id = ASSERT_RESULT(PreparePgWithColocatedParam());
  ASSERT_OK(cluster_->SetFlagOnTServers("TEST_delay_tablet_export_metadata_ms", "1000"));
  auto conn = ASSERT_RESULT(PgConnect(client::kTableName.namespace_name()));
  ASSERT_OK(conn.Execute("CREATE TABLE test_table (key INT PRIMARY KEY, value TEXT)"));
  Timestamp time = ASSERT_RESULT(GetCurrentTime());
  ASSERT_OK(RestoreSnapshotSchedule(schedule_id, time));
}
```

This test passed -n 100 locally. I don't think it's worth committing this as a regression test.

Reviewers: zdrudi

Reviewed By: zdrudi

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39850
  • Loading branch information
SrivastavaAnubhav committed Nov 12, 2024
1 parent 3d0d0f9 commit b1b63ee
Showing 1 changed file with 25 additions and 13 deletions.
38 changes: 25 additions & 13 deletions src/yb/master/master_snapshot_coordinator.cc
Original file line number Diff line number Diff line change
Expand Up @@ -761,21 +761,33 @@ class MasterSnapshotCoordinator::Impl {
restore_at, GetCurrentHybridTime(), InvalidArgument,
Format("Cannot restore to $0 since it is in the future", restore_at));

TxnSnapshotId snapshot_id = TxnSnapshotId::Nil();

// First check if a suitable snapshot already exists.
auto snapshot_result = FindSnapshotSuitableForRestoreAt(schedule_id, restore_at);
if (snapshot_result.ok()) {
snapshot_id = *snapshot_result;
} else if (!snapshot_result.status().IsNotFound()) {
return snapshot_result.status();
} else {
auto try_get_suitable_snapshot = [this, &schedule_id, &restore_at, leader_term, &deadline]()
-> Result<TxnSnapshotId> {
// First check if a suitable snapshot already exists.
auto result = FindSnapshotSuitableForRestoreAt(schedule_id, restore_at);
if (result.ok()) return result;
if (!result.status().IsNotFound()) return result.status();

// If a suitable snapshot does not exist, try to create one.
VERIFY_RESULT(CreateForSchedule(schedule_id, leader_term, deadline));
snapshot_id = VERIFY_RESULT(FindSnapshotSuitableForRestoreAt(schedule_id, restore_at));
}
return CreateForSchedule(schedule_id, leader_term, deadline);
};

return WaitForSnapshotToComplete(snapshot_id, restore_at, deadline);
// Try to find a suitable snapshot for restore, retrying if a snapshot is currently being
// created.
while (CoarseMonoClock::now() < deadline) {
auto result = try_get_suitable_snapshot();
if (result.ok()) {
return WaitForSnapshotToComplete(*result, restore_at, deadline);
} else if (MasterError(result.status()) == MasterErrorPB::PARALLEL_SNAPSHOT_OPERATION) {
continue;
} else {
return result.status();
}
}
return STATUS_FORMAT(
TimedOut,
"Timed out trying to find or create a snapshot for schedule $0 that covers time $1",
schedule_id, restore_at);
}

Status RestoreSnapshotSchedule(
Expand Down

0 comments on commit b1b63ee

Please sign in to comment.