Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DocDB] Flacky test PgCloneColocationTest.NoColocatedChildTables #24549

Closed
1 task done
yamen-haddad opened this issue Oct 21, 2024 · 0 comments
Closed
1 task done

[DocDB] Flacky test PgCloneColocationTest.NoColocatedChildTables #24549

yamen-haddad opened this issue Oct 21, 2024 · 0 comments
Assignees
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue

Comments

@yamen-haddad
Copy link
Member

yamen-haddad commented Oct 21, 2024

Jira Link: DB-13584

Description

I saw this test being flacky in multiple builds with the following error:

../../src/yb/integration-tests/minicluster-snapshot-test.cc:1296
Failed
Bad status: Network error (yb/yql/pgwrapper/libpq_utils.cc:414): Execute of 'CREATE DATABASE testdb_clone1 TEMPLATE testdb AS OF 1729541942755653' failed: 7, message: ERROR:  Clone operation aborted: Illegal state (yb/master/snapshot_schedule_state.cc:166): Creating snapshot in progress: 021a5fb2-9537-4408-8711-75fe088688a7 (passed 0.002s): PARALLEL_SNAPSHOT_OPERATION (master error 25)
CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again. (pgsql error XX000) (aux msg ERROR:  Clone operation aborted: Illegal state (yb/master/snapshot_schedule_state.cc:166): Creating snapshot in progress: 021a5fb2-9537-4408-8711-75fe088688a7 (passed 0.002s): PARALLEL_SNAPSHOT_OPERATION (master error 25)
CONTEXT:  Catalog Version Mismatch: A DDL occurred while processing this query. Try again.)

I believe this error (despite very rare) is causing most clone tests to be flacky. We should address this.

Issue Type

kind/failing-test

Warning: Please confirm that this issue does not contain any sensitive information

  • I confirm this issue does not contain any sensitive information.
@yamen-haddad yamen-haddad added area/docdb YugabyteDB core features status/awaiting-triage Issue awaiting triage labels Oct 21, 2024
@yamen-haddad yamen-haddad self-assigned this Oct 21, 2024
@yugabyte-ci yugabyte-ci added kind/failing-test Tests and testing infra priority/medium Medium priority issue kind/enhancement This is an enhancement of an existing feature labels Oct 21, 2024
@yamen-haddad yamen-haddad removed kind/enhancement This is an enhancement of an existing feature status/awaiting-triage Issue awaiting triage labels Oct 21, 2024
@yugabyte-ci yugabyte-ci added the kind/enhancement This is an enhancement of an existing feature label Oct 21, 2024
SrivastavaAnubhav added a commit that referenced this issue Nov 12, 2024
Summary:
This diff re-adds the retry logic removed in https://phorge.dev.yugabyte.com/D35120 to
`GetSuitableSnapshotForRestore`. Without this logic, YbAdminSnapshotSchedule tests sometimes fail
with the following error:
```
Creating snapshot in progress: 0b84f78a-4a44-4f55-901e-78eccbefa8f0 (passed 0.003s): PARALLEL_SNAPSHOT_OPERATION (master error 25)
```
This mostly affects tests because we take snapshots much more frequently than in production
(currently, every 6s). I prefer the retry approach to checking for this error on the client because:
1. The approach was already used before I removed it
2. Checking the error on the client means we have to do so for clone and backup/restore separately,
and it's annoying to do so because the tests using the yb-admin interface for restore (so the error
output would have to be string-parsed)
3. This fixes the issue for production as well (even though there re-running is a simple fix).
Jira: DB-13584

Test Plan:
Created a YbAdminSnapshotScheduleTest which sets an artificial 1s delay in the create
snapshot path by setting `TEST_delay_tablet_export_metadata_ms=1000`, and with a snapshot interval
of 1s. The test just creates a table and restores to the current time. This reliably reproduced the
PARALLEL_SNAPSHOT_ERROR message.

```
TEST_P(YbAdminSnapshotScheduleTestWithYsqlColocationRestoreParam, Anubhav) {
  auto schedule_id = ASSERT_RESULT(PreparePgWithColocatedParam());
  ASSERT_OK(cluster_->SetFlagOnTServers("TEST_delay_tablet_export_metadata_ms", "1000"));
  auto conn = ASSERT_RESULT(PgConnect(client::kTableName.namespace_name()));
  ASSERT_OK(conn.Execute("CREATE TABLE test_table (key INT PRIMARY KEY, value TEXT)"));
  Timestamp time = ASSERT_RESULT(GetCurrentTime());
  ASSERT_OK(RestoreSnapshotSchedule(schedule_id, time));
}
```

This test passed -n 100 locally. I don't think it's worth committing this as a regression test.

Reviewers: zdrudi

Reviewed By: zdrudi

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D39850
SrivastavaAnubhav added a commit that referenced this issue Dec 1, 2024
…store tests

Summary:
Original commit: b1b63ee / D39850
This diff re-adds the retry logic removed in https://phorge.dev.yugabyte.com/D35120 to
`GetSuitableSnapshotForRestore`. Without this logic, YbAdminSnapshotSchedule tests sometimes fail
with the following error:
```
Creating snapshot in progress: 0b84f78a-4a44-4f55-901e-78eccbefa8f0 (passed 0.003s): PARALLEL_SNAPSHOT_OPERATION (master error 25)
```
This mostly affects tests because we take snapshots much more frequently than in production
(currently, every 6s). I prefer the retry approach to checking for this error on the client because:
1. The approach was already used before I removed it
2. Checking the error on the client means we have to do so for clone and backup/restore separately,
and it's annoying to do so because the tests using the yb-admin interface for restore (so the error
output would have to be string-parsed)
3. This fixes the issue for production as well (even though there re-running is a simple fix).
Jira: DB-13584

Test Plan:
Created a YbAdminSnapshotScheduleTest which sets an artificial 1s delay in the create
snapshot path by setting `TEST_delay_tablet_export_metadata_ms=1000`, and with a snapshot interval
of 1s. The test just creates a table and restores to the current time. This reliably reproduced the
PARALLEL_SNAPSHOT_ERROR message.

```
TEST_P(YbAdminSnapshotScheduleTestWithYsqlColocationRestoreParam, Anubhav) {
  auto schedule_id = ASSERT_RESULT(PreparePgWithColocatedParam());
  ASSERT_OK(cluster_->SetFlagOnTServers("TEST_delay_tablet_export_metadata_ms", "1000"));
  auto conn = ASSERT_RESULT(PgConnect(client::kTableName.namespace_name()));
  ASSERT_OK(conn.Execute("CREATE TABLE test_table (key INT PRIMARY KEY, value TEXT)"));
  Timestamp time = ASSERT_RESULT(GetCurrentTime());
  ASSERT_OK(RestoreSnapshotSchedule(schedule_id, time));
}
```

This test passed -n 100 locally. I don't think it's worth committing this as a regression test.

Reviewers: zdrudi

Reviewed By: zdrudi

Subscribers: ybase

Differential Revision: https://phorge.dev.yugabyte.com/D40254
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/docdb YugabyteDB core features kind/enhancement This is an enhancement of an existing feature kind/failing-test Tests and testing infra priority/medium Medium priority issue
Projects
None yet
Development

No branches or pull requests

3 participants