
roachtest: clearrange/checks=true failed #81429

Closed
cockroach-teamcity opened this issue May 18, 2022 · 5 comments
Labels
C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-storage Storage Team X-unactionable This was closed because it was unactionable.
Milestone

Comments

@cockroach-teamcity
Member

cockroach-teamcity commented May 18, 2022

roachtest.clearrange/checks=true failed with artifacts on release-21.1 @ d6c2fe5a1b3a1e80f7cbaee4eff55235996ad4db:

		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitor).WaitE
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2676
		  | main.(*monitor).Wait
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2684
		  | main.runClearRange
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:187
		  | main.registerClearRange.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/clearrange.go:34
		  | main.(*testRunner).runTest.func2
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/test_runner.go:733
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (2) monitor failure
		Wraps: (3) unexpected node event: 2: dead (exit status 7)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *errors.errorString

	cluster.go:1667,context.go:91,cluster.go:1656,test_runner.go:820: dead node detection: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/bin/roachprod monitor teamcity-5190006-1652852628-30-n10cpu16 --oneshot --ignore-empty-nodes: exit status 1 2: dead (exit status 7)
		8: 11438
		9: 11758
		1: 12591
		7: 12003
		4: 11441
		3: 11828
		5: 11319
		10: 11398
		6: 11638
		Error: UNCLASSIFIED_PROBLEM: 2: dead (exit status 7)
		(1) UNCLASSIFIED_PROBLEM
		Wraps: (2) attached stack trace
		  -- stack trace:
		  | main.glob..func14
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1147
		  | main.wrap.func1
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:271
		  | github.com/spf13/cobra.(*Command).execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:830
		  | github.com/spf13/cobra.(*Command).ExecuteC
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:914
		  | github.com/spf13/cobra.(*Command).Execute
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/vendor/github.com/spf13/cobra/command.go:864
		  | main.main
		  | 	/home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachprod/main.go:1889
		  | runtime.main
		  | 	/usr/local/go/src/runtime/proc.go:255
		  | runtime.goexit
		  | 	/usr/local/go/src/runtime/asm_amd64.s:1581
		Wraps: (3) 2: dead (exit status 7)
		Error types: (1) errors.Unclassified (2) *withstack.withStack (3) *errutil.leafError
Reproduce

To reproduce, try:

# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh clearrange/checks=true
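Here caffeinate is macOS's keep-awake wrapper and can be dropped on non-macOS hosts; roachstress.sh, roughly, builds the test binaries and runs the named roachtest in a loop until the failure reproduces.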

/cc @cockroachdb/storage


Jira issue: CRDB-15192

@cockroach-teamcity cockroach-teamcity added branch-release-21.1 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 18, 2022
@cockroach-teamcity cockroach-teamcity added this to the 21.1 milestone May 18, 2022
@blathers-crl blathers-crl bot added the T-storage Storage Team label May 18, 2022
@nicktrav
Collaborator

nicktrav commented May 18, 2022

Looks like we picked up some replica inconsistency. Node 2 crashed with the following:

F220518 09:04:44.852680 4975995 kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1521  found a delta of {ContainsEstimates:0 LastUpdateNanos:1652863372086326722 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 SysBytes:56 SysCount:1 AbortSpanBytes:0}
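
For readers without storage context: the consistency checker recomputes a range's MVCC stats from the replica's data and diffs them against the stats persisted for that replica; any nonzero delta is fatal. A minimal sketch of that invariant, with made-up numbers and the fields trimmed to the two that are nonzero above (illustrative only, not CockroachDB's actual code):

    // Recompute stats, diff against persisted stats, crash on any delta.
    package main

    import "log"

    // mvccStats carries just the two fields that are nonzero in the delta above.
    type mvccStats struct {
        SysBytes, SysCount int64
    }

    func (a mvccStats) sub(b mvccStats) mvccStats {
        return mvccStats{a.SysBytes - b.SysBytes, a.SysCount - b.SysCount}
    }

    func main() {
        persisted := mvccStats{SysBytes: 1024, SysCount: 10} // hypothetical
        recomputed := mvccStats{SysBytes: 968, SysCount: 9}  // hypothetical
        if delta := persisted.sub(recomputed); delta != (mvccStats{}) {
            // In the real system this is a fatal log call, which is what took
            // node 2 down ("dead (exit status 7)" in the monitor output).
            log.Fatalf("found a delta of %+v", delta)
        }
    }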

@nicktrav
Collaborator

I'm piecing together the events leading up to the crash.

r37 wants to merge with r47. From the range event log:

    {
      "event": {
        "timestamp": "2022-05-18T09:04:43.410946Z",
        "range_id": 37,
        "store_id": 2,
        "event_type": 3,
        "other_range_id": 47,
        "info": {
          "UpdatedDesc": {
            "range_id": 37,
            "start_key": "vQ==",
            "end_key": "vYn3fyA=",
            "internal_replicas": [
              {
                "node_id": 1,
                "store_id": 1,
                "replica_id": 1
              },
              {
                "node_id": 2,
                "store_id": 2,
                "replica_id": 2
              },
              {
                "node_id": 3,
                "store_id": 3,
                "replica_id": 3
              }
            ],
            "next_replica_id": 4,
            "generation": 18
          },
          "RemovedDesc": {
            "range_id": 47,
            "start_key": "vYn3G1g=",
            "end_key": "vYn3fyA=",
            "internal_replicas": [
              {
                "node_id": 1,
                "store_id": 1,
                "replica_id": 1
              },
              {
                "node_id": 2,
                "store_id": 2,
                "replica_id": 2
              },
              {
                "node_id": 3,
                "store_id": 3,
                "replica_id": 3
              }
            ],
            "next_replica_id": 6,
            "generation": 17
          }
        }
      },
      "pretty_info": {
        "updated_desc": "r37:/Table/53{-/1/32544} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=18]"
      }
    }
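
As a quick cross-check, the base64 keys in the event decode to exactly the spans shown in pretty_info. The bytes follow CockroachDB's ascending-uvarint key encoding: small values encode as the single byte 136+v, and the marker byte 0xf7 means a 2-byte big-endian value follows. A small decoder:

    // Decode the start/end keys from the range event log above.
    package main

    import (
        "encoding/base64"
        "encoding/binary"
        "fmt"
    )

    func main() {
        for _, k := range []string{"vQ==", "vYn3G1g=", "vYn3fyA="} {
            b, err := base64.StdEncoding.DecodeString(k)
            if err != nil {
                panic(err)
            }
            switch {
            case len(b) == 1:
                // 0xbd = 136+53 -> /Table/53
                fmt.Printf("%-9s -> /Table/%d\n", k, b[0]-136)
            case len(b) == 5 && b[2] == 0xf7:
                // table byte, index byte, then a 2-byte big-endian key value
                fmt.Printf("%-9s -> /Table/%d/%d/%d\n", k, b[0]-136, b[1]-136, binary.BigEndian.Uint16(b[3:]))
            }
        }
        // Output:
        // vQ==      -> /Table/53
        // vYn3G1g=  -> /Table/53/1/7000
        // vYn3fyA=  -> /Table/53/1/32544
    }

That matches the post-merge r37 span /Table/53{-/1/32544} and the absorbed r47 span /Table/53/1/{7000-32544}.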

Logs on n2 - we see the pre-merge desc for r37 (n2,s2,r37/2:/Table/53{-/1/7000}), followed by the post-merge desc (n2,merge,s2,r37/2:/Table/53{-/1/32544}) which appears in the panic:

I220518 09:04:43.413751 4975995 kv/kvserver/replica_command.go:696 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/7000}›] 1499  initiating a merge of r47:‹/Table/53/1/{7000-32544}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=6, gen=17] into this range (‹lhs+rhs has (size=0 B+0 B=0 B qps=0.00+0.02=0.02qps) below threshold (size=128 MiB, qps=1250.00)›)
...snip...
I220518 09:04:43.437684 297 kv/kvserver/store_remove_replica.go:133 ⋮ [n2,s2,r37/2:‹/Table/53{-/1/7000}›] 1501  removing replica r47/2
I220518 09:04:44.851664 4975995 1@kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1502  the server is terminating due to a fatal error (see the DEV channel for details)
...snip...
F220518 09:04:44.852680 4975995 kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1521  found a delta of {ContainsEstimates:0 LastUpdateNanos:1652863372086326722 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 SysBytes:56 SysCount:1 AbortSpanBytes:0}

Logs on the other two nodes with a replica of r37 and r47:

n1:

I220518 09:04:43.440487 305 kv/kvserver/store_remove_replica.go:133 ⋮ [n1,s1,r37/1:‹/Table/53{-/1/7000}›] 3844  removing replica r47/1
E220518 09:04:43.441412 305 kv/kvserver/replica_proposal.go:317 ⋮ [n1,s1,r37/1:‹/Table/53{-/1/32544}›] 3845  could not run async checksum computation (ID = ‹e24d3421-ffa5-4657-8009-264210375d1e›): throttled on async limiting semaphore

n3:

I220518 09:04:43.438074 349 kv/kvserver/store_remove_replica.go:133 ⋮ [n3,s3,r37/3:‹/Table/53{-/1/7000}›] 2618  removing replica r47/3

I do see a lot of chatter in the logs of the form could not run async checksum computation, whether due to failed semaphore acquisition or rate limiting. I know we've had failure modes in the past (see #68303) caused by excessive consistency checker activity, so I wonder if that's somehow tickling a bug here. The throttling pattern is sketched below.
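
For reference, that n1 log line has the shape of a bounded-concurrency limiter that rejects work outright when saturated rather than queueing it. A rough sketch of the pattern using golang.org/x/sync/semaphore (the names and the limit are illustrative, not CockroachDB's implementation):

    // A bounded async limiter that drops, rather than queues, work when
    // saturated, the general pattern behind "throttled on async limiting
    // semaphore". Illustrative only.
    package main

    import (
        "fmt"
        "time"

        "golang.org/x/sync/semaphore"
    )

    var asyncLimiter = semaphore.NewWeighted(2) // allow 2 concurrent computations

    func runAsyncChecksum(id string, work func()) error {
        if !asyncLimiter.TryAcquire(1) {
            // Non-blocking acquire failed: reject instead of waiting.
            return fmt.Errorf("could not run async checksum computation (ID = %s): throttled on async limiting semaphore", id)
        }
        go func() {
            defer asyncLimiter.Release(1)
            work()
        }()
        return nil
    }

    func main() {
        for i := 0; i < 4; i++ {
            err := runAsyncChecksum(fmt.Sprintf("check-%d", i), func() { time.Sleep(time.Second) })
            fmt.Println(i, err) // the third and fourth calls are throttled
        }
    }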

I also did some archeology and found this comment pointing out that a diff in the stats is less cause for concern than a diff in the KVs themselves, which makes me less worried that this is a major issue. Based on that, I'm going to remove the release-blocker label here; we can add it back if necessary. Echoing the sentiment on that issue: we should probably still try to get to the bottom of this.

I'm a little out of my depth on this one, so I'm calling in the cavalry. cc: @tbg, @erikgrinaker - pinging y'all as I know you have context on previous clearrange test failures caused by the consistency checker, and I know we've made some improvements there recently. Tobi, there's also a hypothesis on the same issue I mentioned about this type of discrepancy manifesting during merges. If you could help me triage this a little further, that would be much appreciated!

@nicktrav nicktrav removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 18, 2022
@erikgrinaker
Contributor

Yeah, this could just as well be a bug in MVCC stats management in CRDB. But it's interesting that we're only seeing it on one node. I don't think the checksum computations are really relevant here.

I'll have a closer look when I get a chance, but I'm also tempted to punt on this unless we've seen it on a newer release, since 21.1 is no longer supported as of today.

@nicktrav
Collaborator

> I'll have a closer look when I get a chance

Thanks Erik. Tempted to also de-prioritize this one for the reason you mentioned. Let us know if you see anything fishy when you get a chance, otherwise we can just let this one age out.

@jlinder jlinder added sync-me and removed sync-me labels May 20, 2022
@nicktrav
Collaborator

Closing this one out, per the above.

@nicktrav nicktrav added the X-unactionable This was closed because it was unactionable. label Dec 12, 2022