Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: TestPTSRecordProtectsTargetsAndSystemTables failed #135639

Open
andyyang890 opened this issue Nov 18, 2024 · 5 comments
Open

changefeedccl: TestPTSRecordProtectsTargetsAndSystemTables failed #135639

andyyang890 opened this issue Nov 18, 2024 · 5 comments
Assignees
Labels
A-cdc Change Data Capture branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). P-2 Issues/test failures with a fix SLA of 3 months T-cdc

Comments

@andyyang890
Copy link
Collaborator

andyyang890 commented Nov 18, 2024

This test failed under maybe_stressrace on Bazel Extended CI on #135613. Seems a little weird given we backported the PR protecting the zones table in #130622.

=== RUN   TestPTSRecordProtectsTargetsAndSystemTables
    test_log_scope.go:165: test logs captured to: /artifacts/tmp/_tmp/905465f697dc04c39f7d8a82cb057bfe/logTestPTSRecordProtectsTargetsAndSystemTables2369024512
    test_log_scope.go:76: use -show-logs to present logs inline
    protected_timestamps_test.go:500: updating PTS reader cache to 1731962155.517292301,0
    protected_timestamps_test.go:512: enqueuing range 7 for mvccGC
    protected_timestamps_test.go:500: updating PTS reader cache to 1731962158.896807831,0
    protected_timestamps_test.go:512: enqueuing range 9 for mvccGC
    protected_timestamps_test.go:500: updating PTS reader cache to 1731962162.143584093,0
    protected_timestamps_test.go:512: enqueuing range 27 for mvccGC
    protected_timestamps_test.go:500: updating PTS reader cache to 1731962165.419260368,0
    protected_timestamps_test.go:512: enqueuing range 26 for mvccGC
    protected_timestamps_test.go:500: updating PTS reader cache to 1731962168.663082802,0
    protected_timestamps_test.go:512: enqueuing range 8 for mvccGC
    protected_timestamps_test.go:544:
          Error Trace:  pkg/ccl/changefeedccl/protected_timestamps_test.go:544
          Error:        Received unexpected error:
                        batch timestamp 1731962150.678137806,0 must be after replica GC threshold 1731962159.471710070,0 (r9: /Table/{5-6})
          Test:         TestPTSRecordProtectsTargetsAndSystemTables
    panic.go:626: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/905465f697dc04c39f7d8a82cb057bfe/logTestPTSRecordProtectsTargetsAndSystemTables2369024512

See https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_BazelExtendedCi/17824471?hideProblemsFromDependencies=false&hideTestsFromDependencies=false&expandBuildProblemsSection=true&expandBuildChangesSection=true&expandBuildTestsSection=true

Jira issue: CRDB-44660

@andyyang890 andyyang890 added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). A-cdc Change Data Capture T-cdc branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels Nov 18, 2024
Copy link

blathers-crl bot commented Nov 18, 2024

cc @cockroachdb/cdc

@asg0451 asg0451 self-assigned this Nov 18, 2024
@stevendanna
Copy link
Collaborator

Took a look at this, here are some notes on what I found.

The error

     batch timestamp 1731962150.678137806,0 must be after       (Mon Nov 18 2024 20:35:50 GMT+0000)
replica GC threshold 1731962159.471710070,0 (r9: /Table/{5-6})  (Mon Nov 18 2024 20:35:59 GMT+0000)

We see that the span reconcilation job did start up:

I241118 20:35:45.268240 32893 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:241 â‹® [T1,Vsystem,n1,job=AUTO SPAN CONFIG RECONCILIATION id=1022006866334875649] 80  established range feed over system.descriptors starting at time 1731962144.750811125,0
I241118 20:35:45.268712 32893 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:300 â‹® [T1,Vsystem,n1,job=AUTO SPAN CONFIG RECONCILIATION id=1022006866334875649] 81  established range feed over system.zones starting at time 1731962144.750811125,0
I241118 20:35:45.269118 32893 spanconfig/spanconfigsqlwatcher/sqlwatcher.go:396 â‹® [T1,Vsystem,n1,job=AUTO SPAN CONFIG RECONCILIATION id=1022006866334875649] 82  established range feed over system.protected_ts_records starting at time 1731962144.750811125,0

Some evidence that we did run GC on table 23 (systemrole_members) as well:

I241118 20:36:08.574857 33793 kv/kvserver/mvcc_gc_queue.go:817 ⋮ [T1,Vsystem,n1,client=127.0.0.1:43304,hostssl,user=root,s1,r26/1:‹/Table/2{3-4}›] 304  GC still needed following GC, recomputing MVCC stats
I241118 20:36:08.575103 33793 kv/kvserver/mvcc_gc_queue.go:818 ⋮ [T1,Vsystem,n1,client=127.0.0.1:43304,hostssl,user=root,s1,r26/1:‹/Table/2{3-4}›] 305  old score ‹queue=false with 0.00/fuzz(0.97)=0.00=valScaleScore(0.00)*deadFrac(0.65)+intentScore(0.00)›
I241118 20:36:08.575103 33793 kv/kvserver/mvcc_gc_queue.go:818 ⋮ [T1,Vsystem,n1,client=127.0.0.1:43304,hostssl,user=root,s1,r26/1:‹/Table/2{3-4}›] 305 +‹likely last GC: never, 614 B non-live, curr. age -614 B*s, min exp. reduction: 0 B*s›
I241118 20:36:08.575428 33793 kv/kvserver/mvcc_gc_queue.go:819 ⋮ [T1,Vsystem,n1,client=127.0.0.1:43304,hostssl,user=root,s1,r26/1:‹/Table/2{3-4}›] 306  new score ‹queue=true with 10.07/fuzz(0.97)=10.35=valScaleScore(15.97)*deadFrac(0.65)+intentScore(0.00)›
I241118 20:36:08.575428 33793 kv/kvserver/mvcc_gc_queue.go:819 ⋮ [T1,Vsystem,n1,client=127.0.0.1:43304,hostssl,user=root,s1,r26/1:‹/Table/2{3-4}›] 306 +‹likely last GC: never, 614 B non-live, curr. age 9.6 KiB*s, min exp. reduction: 9.0 KiB*s›

It's not clear to me we know precisely what happened.

I have a feeling to get to the bottom of this further we'll need to
put in some increased logging. Namely, it would be good to enable logs
that show us:

  • When did GC run and what GC threshold was used.
  • When did updates to span_configuration happen and what was updated?
  • What was the replica's view of the protected timestamp?
  • What was the pts reader's view of the protected timestamp?

Now that I've written this out, it occurs to me that I think this test
is still subject to errors caused by the async processing of pts records.

Namely, this test calls TestingRefreshPTSState to pts state is
up-to-date, but I think this makes sure that the cache is up to date
from the perspective of the KVSubscriber. That is, it makes sure we
have an up-to-date view of the span_configuration table. But, that
doesn't ensure that span configuration table itself is up to date with
the state of the PTS tables.

@asg0451
Copy link
Contributor

asg0451 commented Nov 20, 2024

Thanks @stevendanna. Adding those logs is a bit out of CDC's comfort zone -- can we ask you or someone else more familiar to add them? Also, looks like we got a failure of this through regular CI last night -- #135805

Thanks!

@stevendanna
Copy link
Collaborator

@asg0451 I'm mostly thinking of two things here.

  1. Perhaps a vmodule=2 log message somewhere in this function:
    func updateSpanConfigRecords(
  2. Maybe make this also conditional on vmodule so that under some vmodule all span configuration updates are read:
    if found && curSpanConfig.HasConfigurationChange(nextSC) {
  3. Set the vmodule in the test so we definitely see this log message:
    log.VEventf(ctx, 2, "processing replica %s with score %s", repl.String(), r)

But, now that I'm looking at this again, perhaps we can make this test stable by installing a testing hook:

// JobPersistCheckpointInterceptor, if set, is invoked before the
// reconciliation job persists checkpoints.
JobOnCheckpointInterceptor func() error
so that we can wait for the span configuration reconciler to advance to timestamp where it should have seen our PTS update.

@wenyihu6 wenyihu6 added the P-2 Issues/test failures with a fix SLA of 3 months label Nov 25, 2024
@asg0451
Copy link
Contributor

asg0451 commented Jan 3, 2025

@stevendanna i've been poking at this again today
it does reliably fail with the sleeps you showed me. however, i've implemented what you recommended above and it still fails consistently -- though always for table 3.

changes in this branch: #138243

I notice that we DO see the changing the spanconfig for span .. logs now.

 fetching table descriptor 106: batch timestamp 1735939402.414327000,0 must be after replica GC threshold 1735939437.142929000,0 (r7: /Table/{3-4})
I250103 21:24:11.322169 493 13@spanconfig/spanconfigstore/store.go:366 ⋮ [T1,Vsystem,n1] 453  changing the spanconfig for span:‹/Table/{0-4}› from:{RangeMinBytes:134217728 RangeMaxBytes:536870912 GCPolicy:{TTLSeconds:1 ProtectionPolicies:[] IgnoreStrictEnforcement:true} GlobalReads:false NumReplicas:1 NumVoters:0 Constraints:[] VoterConstraints:[] LeasePreferences:[] RangefeedEnabled:true ExcludeDataFromBackup:false} to:{RangeMinBytes:134217728 RangeMaxBytes:536870912 GCPolicy:{TTLSeconds:1 ProtectionPolicies:[‹{ts: 1735939402414327000}›] IgnoreStrictEnforcement:true} GlobalReads:false NumReplicas:1 NumVoters:0 Constraints:[] VoterConstraints:[] LeasePreferences:[] RangefeedEnabled:true ExcludeDataFromBackup:false}

the protection policy ts mentioned is 1735939402.414327000, which is the same as the batch timestamp in the error msg. but the replica gc threshold in that msg is seconds later. any ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cdc Change Data Capture branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). P-2 Issues/test failures with a fix SLA of 3 months T-cdc
Projects
None yet
Development

No branches or pull requests

4 participants