roachtest: clearrange/checks=true failed #81429
Looks like we picked up some replica inconsistency. Node 2 crashed with the following:

F220518 09:04:44.852680 4975995 kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1521 found a delta of {ContainsEstimates:0 LastUpdateNanos:1652863372086326722 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 SysBytes:56 SysCount:1 AbortSpanBytes:0}
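For context on how to read that failure: as I understand it, the consistency check recomputes the range's MVCC stats from its data and compares them against the stats the replica has persisted, and the delta printed above is the difference between the two. The only non-zero fields here are SysBytes:56 and SysCount:1, so the discrepancy amounts to a single system key's worth of bytes rather than user data. A minimal sketch of that kind of comparison, using a simplified stats struct rather than CockroachDB's actual enginepb.MVCCStats, and with made-up numbers:

package main

import "fmt"

// rangeStats is a simplified stand-in for the MVCC stats a replica tracks
// (the real type, enginepb.MVCCStats, has many more fields).
type rangeStats struct {
	LiveBytes, LiveCount int64
	KeyBytes, KeyCount   int64
	SysBytes, SysCount   int64
}

// statsDelta returns persisted minus recomputed, field by field.
func statsDelta(persisted, recomputed rangeStats) rangeStats {
	return rangeStats{
		LiveBytes: persisted.LiveBytes - recomputed.LiveBytes,
		LiveCount: persisted.LiveCount - recomputed.LiveCount,
		KeyBytes:  persisted.KeyBytes - recomputed.KeyBytes,
		KeyCount:  persisted.KeyCount - recomputed.KeyCount,
		SysBytes:  persisted.SysBytes - recomputed.SysBytes,
		SysCount:  persisted.SysCount - recomputed.SysCount,
	}
}

func main() {
	// Hypothetical numbers: the persisted stats claim one more system key
	// (56 bytes more) than a fresh recomputation of the range's data finds.
	persisted := rangeStats{LiveBytes: 1 << 20, LiveCount: 32544, SysBytes: 512, SysCount: 9}
	recomputed := rangeStats{LiveBytes: 1 << 20, LiveCount: 32544, SysBytes: 456, SysCount: 8}

	if d := statsDelta(persisted, recomputed); d != (rangeStats{}) {
		// The real consistency check treats this as fatal; here we just print.
		fmt.Printf("found a delta of %+v\n", d)
	}
}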
I'm piecing together the events leading up to the crash.
{
"event": {
"timestamp": "2022-05-18T09:04:43.410946Z",
"range_id": 37,
"store_id": 2,
"event_type": 3,
"other_range_id": 47,
"info": {
"UpdatedDesc": {
"range_id": 37,
"start_key": "vQ==",
"end_key": "vYn3fyA=",
"internal_replicas": [
{
"node_id": 1,
"store_id": 1,
"replica_id": 1
},
{
"node_id": 2,
"store_id": 2,
"replica_id": 2
},
{
"node_id": 3,
"store_id": 3,
"replica_id": 3
}
],
"next_replica_id": 4,
"generation": 18
},
"RemovedDesc": {
"range_id": 47,
"start_key": "vYn3G1g=",
"end_key": "vYn3fyA=",
"internal_replicas": [
{
"node_id": 1,
"store_id": 1,
"replica_id": 1
},
{
"node_id": 2,
"store_id": 2,
"replica_id": 2
},
{
"node_id": 3,
"store_id": 3,
"replica_id": 3
}
],
"next_replica_id": 6,
"generation": 17
}
}
},
"pretty_info": {
"updated_desc": "r37:/Table/53{-/1/32544} [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=4, gen=18]"
}
}

Logs on n2:

I220518 09:04:43.413751 4975995 kv/kvserver/replica_command.go:696 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/7000}›] 1499 initiating a merge of r47:‹/Table/53/1/{7000-32544}› [(n1,s1):1, (n2,s2):2, (n3,s3):3, next=6, gen=17] into this range (‹lhs+rhs has (size=0 B+0 B=0 B qps=0.00+0.02=0.02qps) below threshold (size=128 MiB, qps=1250.00)›)
...snip...
I220518 09:04:43.437684 297 kv/kvserver/store_remove_replica.go:133 ⋮ [n2,s2,r37/2:‹/Table/53{-/1/7000}›] 1501 removing replica r47/2
I220518 09:04:44.851664 4975995 1@kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1502 the server is terminating due to a fatal error (see the DEV channel for details)
...snip...
F220518 09:04:44.852680 4975995 kv/kvserver/replica_consistency.go:246 ⋮ [n2,merge,s2,r37/2:‹/Table/53{-/1/32544}›] 1521 found a delta of {ContainsEstimates:0 LastUpdateNanos:1652863372086326722 IntentAge:0 GCBytesAge:0 LiveBytes:0 LiveCount:0 KeyBytes:0 KeyCount:0 ValBytes:0 ValCount:0 IntentBytes:0 IntentCount:0 SeparatedIntentCount:0 SysBytes:56 SysCount:1 AbortSpanBytes:0}

Logs on the other two nodes with a replica of
I220518 09:04:43.440487 305 kv/kvserver/store_remove_replica.go:133 ⋮ [n1,s1,r37/1:‹/Table/53{-/1/7000}›] 3844 removing replica r47/1
E220518 09:04:43.441412 305 kv/kvserver/replica_proposal.go:317 ⋮ [n1,s1,r37/1:‹/Table/53{-/1/32544}›] 3845 could not run async checksum computation (ID = ‹e24d3421-ffa5-4657-8009-264210375d1e›): throttled on async limiting semaphore
I220518 09:04:43.438074 349 kv/kvserver/store_remove_replica.go:133 ⋮ [n3,s3,r37/3:‹/Table/53{-/1/7000}›] 2618 removing replica r47/3

I do see a lot of chatter in the logs with

I also did some archeology and found this comment pointing out that diffs in stats are less of a cause for concern than a diff in the KVs themselves, which makes me less concerned that this is a major issue. Based on that, I'm going to remove the release-blocker label here; we can add it back if necessary. Echoing the sentiment on that issue, we should probably try to get to the bottom of this. I'm a little out of my depth on this one, so I'm calling in the cavalry. cc: @tbg, @erikgrinaker - pinging y'all as I know you have some context on previous
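One more note on the merge itself: the ‹lhs+rhs has ... below threshold› parenthetical in the merge-initiation line above is, as far as I can tell, the merge queue's eligibility check, i.e. the combined size and QPS of the two adjacent ranges have to be below the configured thresholds before a merge is attempted. A simplified sketch of that kind of check (illustrative only, not the actual implementation):

package main

import "fmt"

// mergeEligible mirrors the shape of the check described in the merge log
// line above: merge two adjacent ranges only if their combined size and QPS
// stay below the configured thresholds. Names and signature are illustrative,
// not CockroachDB's actual code.
func mergeEligible(lhsBytes, rhsBytes, maxBytes int64, lhsQPS, rhsQPS, maxQPS float64) bool {
	return lhsBytes+rhsBytes < maxBytes && lhsQPS+rhsQPS < maxQPS
}

func main() {
	// Values taken from the log line: 0 B + 0 B against a 128 MiB size
	// threshold, and 0.00 + 0.02 qps against a 1250 qps threshold.
	const maxBytes int64 = 128 << 20
	const maxQPS = 1250.0
	fmt.Println(mergeEligible(0, 0, maxBytes, 0.00, 0.02, maxQPS)) // true: the merge is initiated
}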
Yeah, this could just as well be a bug in MVCC stats management in CRDB. But it's interesting that we're only seeing it on one node. I don't think the checksum computations are really relevant here. I'll have a closer look when I get a chance, but I'm also tempted to punt this unless we've seen it on a newer release, since 21.1 is no longer supported as of today.
Thanks Erik. Tempted to also de-prioritize this one for the reason you mentioned. Let us know if you see anything fishy when you get a chance; otherwise we can just let this one age out.
Closing this one out, per the above.
roachtest.clearrange/checks=true failed with artifacts on release-21.1 @ d6c2fe5a1b3a1e80f7cbaee4eff55235996ad4db:
Reproduce
To reproduce, try:
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh clearrange/checks=true
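The caffeinate prefix only keeps a macOS workstation awake for the duration of the run and can be dropped on other platforms; roachstress.sh itself, per the linked page, builds the necessary binaries and runs the test repeatedly to try to reproduce the failure.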
Jira issue: CRDB-15192