backupccl: TestExcludeDataFromBackupAndRestore is flaky #95350

Closed
pav-kv opened this issue Jan 17, 2023 · 1 comment · Fixed by #95528

@pav-kv
Collaborator

pav-kv commented Jan 17, 2023

Describe the problem

--- FAIL: TestExcludeDataFromBackupAndRestore (2.26s)
    test_log_scope.go:161: test logs captured to: /Users/pavel/go/src/github.com/cockroachdb/cockroach/tmp/_tmp/ab3eff6b7877b2d0e35a44d562d13cda/logTestExcludeDataFromBackupAndRestore620557206
    test_log_scope.go:79: use -show-logs to present logs inline
    backup_test.go:9278: 
                Error Trace:    /private/var/tmp/_bazel_pavel/9feea0c38530dc9f20105ecc8935971a/sandbox/darwin-sandbox/533/execroot/com_github_cockroachdb_cockroach/bazel-out/darwin_arm64-fastbuild/bin/pkg/ccl/backupccl/backupccl_test_/backupccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/backupccl/backup_test.go:9278
                Error:          "[]" should have 10 item(s), but has 0
                Test:           TestExcludeDataFromBackupAndRestore
    testutils.go:199: no Invalid Descriptors
    testutils.go:199: no Invalid Descriptors
    panic.go:522: -- test log scope end --
FAIL
I230117 12:07:18.028974 1 (gostd) testmain.go:468  [T1] 1  Test //pkg/ccl/backupccl:backupccl_test exited with error code 1

Example failure in CI here.

To Reproduce

./dev test pkg/ccl/backupccl --filter=TestExcludeDataFromBackupAndRestore --stress

Environment:

  • CockroachDB version: master @ 761cf72, also tried at an older commit 01032c2 to make sure the failure isn't too recent
  • Reproduces both locally on macOS and in CI

Jira issue: CRDB-23470

@pav-kv pav-kv added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Jan 17, 2023
@pav-kv
Collaborator Author

pav-kv commented Jan 17, 2023

@adityamaru Could you take a look? Added this test in #77406.

craig bot pushed a commit that referenced this issue Jan 21, 2023
94774: sql: fix mixed-version behaviour of create_tenant() r=postamar a=postamar

Previously, running create_tenant on the latest binary in
a mixed-version cluster would bootstrap the system schema using the
logic from the latest binary, in which all system database migration
upgrade steps have been "baked in".

This might not be compatible with the tenant cluster's version, which
is (correctly) set to the active host cluster version, the idea there
being that tenant cluster versions cannot be greater than host cluster
versions.

This commit fixes this by version-gating the tenant cluster system
schema bootstrapping, and using hard-coded values when bootstrapping
into the old version.

Informs #94773.

Release note: None

95355: server: always enable sql_instances maintenance r=knz a=dt

Fixes #95571.

Previously the system.sql_instances table was only maintained by SQL servers that were operating in "pod" mode, i.e. not in mixed KV and SQL process nodes, where KV-level liveness and gossip provide an alternative means of node discovery that the SQL layer can use when searching for other SQL instances. However, this inconsistency makes it difficult to write correct SQL-level code for remote-node discovery and interaction: in some cases such code needs to consult the instances list, and in others the KV liveness store. Combined with the complexities of doing so around initialization, dependency injection, etc., this can become hard to maintain.

Additionally, such a design precludes a cluster where some SQL instances are in mixed KV nodes and some are not, as the non-KV nodes would have no way to discover the KV ones. Such deployments are not currently possible but could be in the future.

Instead, this change enables maintenance of the sql_instances table by all SQL servers, whether running in their own processes or embedded in a KV storage node process. This paves the way for making the means of discovery of SQL servers uniform across all SQL server types: they will all be able to simply consult the instances list to find any other SQL servers, regardless of where those SQL servers are running.

A follow-up change could simplify DistSQLPhysicalPlanner, specifically the SetupAllNodesPlanning method that has two different implementations due to the previous inconsistency in the available APIs.

Release note: none.
Epic: CRDB-14537


95528: backupccl: fix flaky TestExcludeDataFromBackupAndRestore r=msbutler a=adityamaru

We don't need to wait for the table to split; inspecting the state of the leaseholder's replica is adequate and a more correct source of truth to rely on.

In some cases the test would not wait for `data.bar` to split into its own range, so it would incorrectly be excluded from the backup, resulting in 0 rows instead of 10 in the final assertion.

Fixes: #95350

Release note: None

Co-authored-by: Marius Posta <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in 8eaa89f Jan 21, 2023
2 participants