Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test: TestDataDriven failed #98038

Closed
cockroach-teamcity opened this issue Mar 5, 2023 · 1 comment · Fixed by #98082
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Mar 5, 2023

pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test.TestDataDriven failed with artifacts on master @ cf14ad694ee562676f53e36fa8495206c3aed61f:

        delete {source=1,target=2}
    datadriven_test.go:124: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3334/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test_/spanconfigreconcilerccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/testdata/protectedts:127:
        state [1 args]
        <no input to command>
        ----
        /{Min-System/NodeLiveness}                 ttl_seconds=3600 num_replicas=5
        /System/NodeLiveness{-Max}                 ttl_seconds=600 num_replicas=5
        ...
    datadriven_test.go:124: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3334/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test_/spanconfigreconcilerccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/testdata/protectedts:134:
        release [1 args]
        <no input to command>
        ----
    datadriven_test.go:124: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3334/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test_/spanconfigreconcilerccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/testdata/protectedts:137:
        release [1 args]
        <no input to command>
        ----
    datadriven_test.go:124: 
        /home/roach/.cache/bazel/_bazel_roach/c5a4e7d36696d9cd970af2045211a7df/sandbox/processwrapper-sandbox/3334/execroot/com_github_cockroachdb_cockroach/bazel-out/k8-fastbuild/bin/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/spanconfigreconcilerccl_test_/spanconfigreconcilerccl_test.runfiles/com_github_cockroachdb_cockroach/pkg/ccl/spanconfigccl/spanconfigreconcilerccl/testdata/protectedts:140:
         
        expected:
        delete /Table/10{6-7}
        upsert /Table/10{6-7}                      range default
        delete /Table/10{7-8}
        upsert /Table/10{7-8}                      range default
        
        found:
        delete /Table/10{6-7}
        upsert /Table/10{6-7}                      protection_policies=[{ts: 4}]
        delete /Table/10{7-8}
        upsert /Table/10{7-8}                      protection_policies=[{ts: 4}]
        delete /Table/10{6-7}
        upsert /Table/10{6-7}                      range default
        delete /Table/10{7-8}
        upsert /Table/10{7-8}                      range default
I230305 07:17:29.933256 6856383 testutils/testcluster/testcluster.go:148  [T1] 2905  TestCluster quiescing nodes
W230305 07:17:29.933747 6887737 jobs/registry.go:1080  [T1,n1] 2906  canceling all adopted jobs due to stopper quiescing
I230305 07:17:29.934606 6872550 server/start_listen.go:103  [T1,n1] 2907  server shutting down: instructing cmux to stop accepting
I230305 07:17:29.935300 7102142 jobs/registry.go:1495  [T1,n1] 2908  KEY VISUALIZER job 100: stepping through state succeeded
I230305 07:17:29.936114 6887421 sql/stats/automatic_stats.go:507  [T1,n1] 2909  quiescing auto stats refresher
W230305 07:17:29.937715 7510389 kv/kvserver/intentresolver/intent_resolver.go:826  [T1] 2910  failed to gc transaction record: could not GC completed transaction anchored at /Table/31/1/1/0: node unavailable; try another peer
W230305 07:17:29.941171 6881647 sql/sqlliveness/slinstance/slinstance.go:312  [T1,n1] 2911  exiting heartbeat loop
W230305 07:17:29.941313 6881647 sql/sqlliveness/slinstance/slinstance.go:297  [T1,n1] 2912  exiting heartbeat loop
E230305 07:17:29.941436 6881647 server/server_sql.go:477  [T1,n1] 2913  failed to run update of instance with new session ID: node unavailable; try another peer
E230305 07:17:29.948053 6887814 jobs/registry.go:1064  [T1,n1] 2914  error processing claimed jobs: could not query for claimed jobs: select-running/get-claimed-jobs: failed to read query result: query execution canceled
W230305 07:17:29.951681 7102142 jobs/adopt.go:457  [T1,n1] 2915  could not clear job claim: clear-job-claim: node unavailable; try another peer
W230305 07:17:29.953911 7129191 jobs/adopt.go:457  [T1,n1] 2916  could not clear job claim: clear-job-claim: failed to send RPC: sending to all replicas failed; last error: ba: Scan [/Table/15/1/101,/Table/15/1/102), [txn: 41f3b112], [can-forward-ts] RPC error: node unavailable; try another peer
    --- FAIL: TestDataDriven/protectedts (7.79s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/kv

This test on roachdash | Improve this report!

Jira issue: CRDB-25036

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Mar 5, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Mar 5, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Mar 5, 2023
@irfansharif irfansharif self-assigned this Mar 6, 2023
@irfansharif
Copy link
Contributor

Doesn't immediately reproduce under stress:

$ dev test pkg/ccl/spanconfigccl/spanconfigreconcilerccl -f TestDataDriven/protectedts --stress
...
560 runs so far, 0 failures, over 4m15s

But I think I see what's going on -- it's just raciness in the test itself rather an a real underlying bug. At some point in the test we have protection records over /Table/10{6-8} at ts=3 and ts=4:

/Table/10{6-7} protection_policies=[{ts: 3} {ts: 4}]
/Table/10{7-8} protection_policies=[{ts: 3} {ts: 4}]

Then we release those protection records:

# Release the schema object records.
release record-id=3
----
release record-id=4
----

And expect that for /Table/10{6-8}, we should observe mutations that replace the existing record with one that has no protected timestamps. That does happen, but its possible to observe the intermediate mutation getting rid of just one protected timestamp (the test is not getting rid of pts records atomically). We should just assert on the final state instead of intermediate mutations.

craig bot pushed a commit that referenced this issue Mar 6, 2023
98082: spanconfig: deflake spanconfigreconcilerccl/TestDataDriven r=irfansharif a=irfansharif

Fixes #98038. This test set up two protection records over two schema objects at two timestamps ts=3 and ts=4.
```
   /Table/10{6-7}  protection_policies=[{ts: 3} {ts: 4}]
   /Table/10{7-8}  protection_policies=[{ts: 3} {ts: 4}]
```
When it later released those protection records:
```
   release record-id=3
   release record-id=4
   ----
```
It asserted that the span config mutations showed that we did infact get rid of the protected state:
```
   mutations
   ----
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default
```
But since release of these protections was non-atomic, in #98038 we observed the following transition instead.
```
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      protection_policies=[{ts: 4}]
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      protection_policies=[{ts: 4}]
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default
```
That is, we first got rid of the record with ts=3 and only then got rid of ts=4. We just rewrite the test to assert on the final state of the records that show no remaining protections, instead of trying to add synchronization for mutations.

Release note: None

Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 0aee3d8 Mar 6, 2023
blathers-crl bot pushed a commit that referenced this issue Jun 16, 2023
Fixes #98038. This test set up two protection records over two schema
objects at two timestamps ts=3 and ts=4.

   /Table/10{6-7}  protection_policies=[{ts: 3} {ts: 4}]
   /Table/10{7-8}  protection_policies=[{ts: 3} {ts: 4}]

When it later released those protection records:

   release record-id=3
   release record-id=4
   ----

It asserted that the span config mutations showed that we did infact get
rid of the protected state:

   mutations
   ----
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default

But since release of these protections was non-atomic, in #98038 we
observed the following transition instead.

   delete /Table/10{6-7}
   upsert /Table/10{6-7}      protection_policies=[{ts: 4}]
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      protection_policies=[{ts: 4}]
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default

That is, we first got rid of the record with ts=3 and only then got rid
of ts=4. We just rewrite the test to assert on the final state of the
records that show no remaining protections, instead of trying to add
synchronization for mutations.

Release note: None
irfansharif added a commit that referenced this issue Jun 27, 2023
Fixes #98038. This test set up two protection records over two schema
objects at two timestamps ts=3 and ts=4.

   /Table/10{6-7}  protection_policies=[{ts: 3} {ts: 4}]
   /Table/10{7-8}  protection_policies=[{ts: 3} {ts: 4}]

When it later released those protection records:

   release record-id=3
   release record-id=4
   ----

It asserted that the span config mutations showed that we did infact get
rid of the protected state:

   mutations
   ----
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default

But since release of these protections was non-atomic, in #98038 we
observed the following transition instead.

   delete /Table/10{6-7}
   upsert /Table/10{6-7}      protection_policies=[{ts: 4}]
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      protection_policies=[{ts: 4}]
   delete /Table/10{6-7}
   upsert /Table/10{6-7}      range default
   delete /Table/10{7-8}
   upsert /Table/10{7-8}      range default

That is, we first got rid of the record with ts=3 and only then got rid
of ts=4. We just rewrite the test to assert on the final state of the
records that show no remaining protections, instead of trying to add
synchronization for mutations.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants