Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv/kvserver: TestReplicaStateMachineChangeReplicas failed #69229

Closed
cockroach-teamcity opened this issue Aug 23, 2021 · 0 comments · Fixed by #69730
Closed

kv/kvserver: TestReplicaStateMachineChangeReplicas failed #69229

cockroach-teamcity opened this issue Aug 23, 2021 · 0 comments · Fixed by #69730
Assignees
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.

Comments

@cockroach-teamcity
Copy link
Member

kv/kvserver.TestReplicaStateMachineChangeReplicas failed with artifacts on master @ 61bd543ba7288c8f0eed6cddded7b219c9d1fcd4:

Fatal error:

panic: runtime error: invalid memory address or nil pointer dereference [recovered]
	panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x19a7299]

Stack:

goroutine 11681 [running]:
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).Recover(0xc000826e00, 0x5ce2578, 0xc000b85530)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:231 +0x126
panic(0x41d31c0, 0x7bb0a30)
	/usr/local/go/src/runtime/panic.go:965 +0x1b9
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).getBreaker(0xc0011043c0, 0x7f0100000002, 0xc00074d380)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:236 +0xb9
github.com/cockroachdb/cockroach/pkg/rpc/nodedialer.(*Dialer).GetCircuitBreaker(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/rpc/nodedialer/nodedialer.go:229
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*RaftTransport).SendAsync(0xc0014a01c0, 0xc000d3c540, 0x1, 0x499100)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/raft_transport.go:553 +0xf4
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendRaftMessageRequest(0xc000254000, 0x5ce2578, 0xc000ff9e90, 0xc000d3c540, 0x2)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:1335 +0x7c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendRaftMessage(0xc000254000, 0x5ce2578, 0xc000ff9e90, 0x3, 0x2, 0x1, 0x6, 0x6, 0xc, 0xc001012db0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:1306 +0x49b
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).sendRaftMessages(0xc000254000, 0x5ce2578, 0xc000ff9e90, 0xc000c1d8c0, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:1244 +0x1c5
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc000254000, 0x5ce2578, 0xc000ff9e90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:724 +0x84a
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc000254000, 0x5ce2578, 0xc000ff9e90, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:487 +0x11c
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc0003c0000, 0x5ce2578, 0xc000b85530, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:515 +0x145
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc0001ae280, 0x5ce2578, 0xc000b85530)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:284 +0x2c2
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx.func2(0xc000826e00, 0x5ce2578, 0xc000b85530, 0x0, 0x0, 0xc0008258e0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:442 +0xf3
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTaskEx
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:434 +0x22b
Log preceding fatal error

=== RUN   TestReplicaStateMachineChangeReplicas/add_replica=true/deprecated=false

Reproduce

To reproduce, try:

make stressrace TESTS=TestReplicaStateMachineChangeReplicas PKG=./pkg/kv/kvserver TESTTIMEOUT=5m STRESSFLAGS='-timeout 5m' 2>&1

Parameters in this failure:

  • GOFLAGS=-parallel=4

/cc @cockroachdb/kv nvanbenschoten

This test on roachdash | Improve this report!

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Aug 23, 2021
@nvanbenschoten nvanbenschoten self-assigned this Sep 2, 2021
craig bot pushed a commit that referenced this issue Sep 2, 2021
69720: build: increase pebble nightly metamorphic test runtime r=nicktrav a=nicktrav

Increase the runtime of the Pebble nightly metamorphic test run from 1h
to 3h. More test time increases the chance of catching a test failure.

See #69414.

Release justification: non-production code change

Release note: None

69730: kv: deflake TestReplicaStateMachineChangeReplicas r=nvanbenschoten a=nvanbenschoten

Fixes #69229.

The test was reaching below Raft and adding a peer on a node that does not
exist. With a large enough pause between `r.raftMu.Unlock()` and `testContext`
teardown, it was possible for the replica to be ticked, which would cause it to
attempt to reach out and heartbeat this node. This could cause issues. In fact,
the `testContext` uses a dummy `RaftTransport`, so any attempt to send traffic
to another Raft peer would panic.

This commit resolves the issue by giving testContext a slightly more real
version of a RaftTransport (i.e. one configured with a nodedialer). This feels a
bit wrong, as a testContext should really never need to use a RaftTransport, but
given how little impact this has on the dependency structure of a testContext
and given that the testContext was already doing something similar with gossip,
I think this is fine.

Release justification: test fix.

69780: ui: Combining Type for Aggregated Node Status r=nathanstilwell a=nathanstilwell

This change is to alleviate an uncaught error thown by when
applying an aggregated status to a group of nodes. When nodes are aggregated
into rows a summary status is applied to a row. To apply
the appropriate style to the badge that represents the status, the
function getBadgeTypeByNodeStatus is called. This function was being
called with a status of type AggregatedNodeStatus rather than
LivenessStatus causing the error.

To fix this I combined the two functions getBadgeTypeByNodeStatus and
getBadgeTypeByNodeStatusAggregated into a single function that
compares the status parameter to the two enum types to determine status.
Instead of throwing an error for the default case, I am simply returning
default for the badge type. While this may cause a visual bug under
the right circumstances, at least it won't crash the app.

While I was in here, I also added more TypeScript annontations to remove
warnings.

fixes #69767

69788: server: add NodeIds to DatabaseDetails and TableStats responses r=xinhaoz a=xinhaoz

Related to: #63391

Previously, DatabaseDetails and TableStats did not return a list
of nodes that store their data. This commit adds an ordered
list of node ids to each of the responses. This is necessary
to report node/region information in the DB console databases
pages. For the database details endpoint, we return the list
of nodes only if stats are requested.

Release justification: low risk, high benefit changes to existing
functionality

Release note (api change): A list of node ids representing the
nodes that store data for the database has been added to the
stats field in the databases details endpoint under `nodeIds`.
Database details must be requested with include_stats set to true,
e.g. `/_admin/v1/databases/{database}?include_stats=true`

Similarly, `nodeIds` has also been added to the table stats
endpoint, which is an ordered list of node ids which stores
the table data:
`/_admin/v1/databases/{database}/tables/{table}/stats`

69800: logictestccl: restore the flake fix r=yuzefovich a=yuzefovich

Restore the fix from fa4e16d which was
just lost in b86022e.

Addresses: #68395 (comment).

Release note: None

Release justification: testing only change.

69802: vendor: bump Pebble to 1f862845897e r=bananabrick a=jbowens

```
1f86284 vfs/atomicfs: add Marker type
8731fd6 sstable: Fix bug in dynamic readahead
085aaac sstable: support block checksum validation for entire SSTables
6236eba pebble: Unflake TestCacheEvict
38dd75e ci: remove support for Go v1.14; remove TravisCI
```

Release note: none

Release justification: non-production code changes, low-risk high
benefit to existing functionality (8731fd6)

Co-authored-by: Nick Travers <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Nathan Stilwell <[email protected]>
Co-authored-by: Xin Hao Zhang <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Jackson Owens <[email protected]>
@craig craig bot closed this as completed in afa908e Sep 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants