Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: replicate/wide failed #111084

Closed
cockroach-teamcity opened this issue Sep 21, 2023 · 7 comments · Fixed by #111206
Closed

roachtest: replicate/wide failed #111084

cockroach-teamcity opened this issue Sep 21, 2023 · 7 comments · Fixed by #111206
Assignees
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Sep 21, 2023

roachtest.replicate/wide failed with artifacts on master @ 49eef749235f15d57153cb00f48dfd3b27be6c54:

(allocator.go:390).func4: pq: node descriptor with node ID 7 was not found
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-31755

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Sep 21, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.2 milestone Sep 21, 2023
@kvoli kvoli self-assigned this Sep 22, 2023
@kvoli
Copy link
Collaborator

kvoli commented Sep 22, 2023

Fails on this query, similar to the previous failure:

countMisreplicated := func(width int) int {
var count int
if err := db.QueryRow(
"SELECT count(*) FROM crdb_internal.ranges WHERE array_length(replicas,1) != $1",
width,
).Scan(&count); err != nil {
t.Fatal(err)
}
return count
}

I'll hold off investigating further, but 9cc17d7 looks related.

@kvoli kvoli added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Sep 22, 2023
@cockroach-teamcity
Copy link
Member Author

roachtest.replicate/wide failed with artifacts on master @ 6a2097e4c2b6537be84f265bd55ac23d2cf83351:

(allocator.go:390).func4: pq: node descriptor with node ID 7 was not found
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.replicate/wide failed with artifacts on master @ b1a641f1d27653a4e93154926be57d0cae864bd6:

(allocator.go:396).func4: pq: node descriptor with node ID 7 was not found
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@cockroach-teamcity
Copy link
Member Author

roachtest.replicate/wide failed with artifacts on master @ 43326b23c9057251908704088cbb04854a6c4b57:

(allocator.go:396).func4: pq: node descriptor with node ID 7 was not found
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@kvoli
Copy link
Collaborator

kvoli commented Sep 25, 2023

This failed consistently on each run over the weekend. I'll dig further now.

@kvoli
Copy link
Collaborator

kvoli commented Sep 25, 2023

Posted a fix #111206. This was a bug with the error type comparison within crdb_internal.ranges.

@cockroach-teamcity
Copy link
Member Author

roachtest.replicate/wide failed with artifacts on master @ 43326b23c9057251908704088cbb04854a6c4b57:

(allocator.go:396).func4: pq: node descriptor with node ID 7 was not found
test artifacts and logs in: /artifacts/replicate/wide/run_1

Parameters: ROACHTEST_arch=amd64 , ROACHTEST_cloud=gce , ROACHTEST_cpu=1 , ROACHTEST_encrypted=false , ROACHTEST_fs=ext4 , ROACHTEST_localSSD=true , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

craig bot pushed a commit that referenced this issue Sep 25, 2023
…111232

110967: asim: enable random zone config event generation r=kvoli a=wenyihu6

Previously, zone config event generation used hardcoded span configurations.
This limits our ability to test the allocator more thoroughly.

To improve this, this patch enables random span configs to be generated and
applied as part of the simulation. These configurations are generated by
randomly selecting the primary region, region V.S. zone survival goal, and
leaseholder preference.

```
The following command is now supported:
"rand_events" [cycle_via_random_survival_goals]
```

Part of: #106192
Release Note: none
Epic: none

111192: bulk: allow caller-configurable priority in SSTBatcher r=adityamaru a=stevendanna

This adds the ability for some callers to use a higher admission priority in SSTBatcher. This is helpful for streaming where we want to run at a priority that isn't subject to the elastic admission regime.

Epic: none

Release note: None

111206: kv: fix (store|node) not found err checking r=erikgrinaker a=kvoli

`StoreNotFoundError` and `NodeNotFoundError` errors were moved to the `kvpb` pkg in #110374. As part of the move, `crdb_internal` functions which checked if the error were `DescNotFoundError` were also updated so that node/store not found errors would be recognized e.g.

```
errors.Is(kvpb.NewNodeNotFoundError(nodeID), &kvpb.DescNotFoundError{})
```

This didn't work, because the error doesn't match the reference error variable being given. It does match the type. Update these error assertions to use `HasType` instead.

Resolves: #111084
Epic: none
Release note: None

111214: release: fix roachtest artifacts name r=srosenberg a=rail

This fixes the roachtest artifacts directory name.

Epic: none
Release note: None

111217: cloud/azure: Fix azure schemes r=benbardin a=benbardin

Part of: https://cockroachlabs.atlassian.net/browse/CRDB-31120

Release note (bug fix): Fixes azure schemes in storage, kms and external conns.

111223: storage: populate timestamp field in lock table values r=nvanbenschoten a=nvanbenschoten

Informs #109645.

This commit starts writing the `Timestamp` field in lock table `MVCCMetadata` values for shared and exclusive locks. This mirrors the behavior of intent locks. This is not strictly needed, as the timestamp is always equal to `Txn.WriteTimestamp`, but it is cheap to do and helps unify some stats logic, which uses this field to compute "lock age".

Maybe we'll get rid of this for all lock strengths one day...

Release note: None

111230: authors: add xhesikam to authors r=xhesikam a=xhesikam

Release note: None
Epic: None

111231: backupccl: add missing ctx cancel check r=msbutler a=adityamaru

In #111159 we deduced from the stacks a
situation in which the goroutine draining `spanCh` had exited due to a context cancelation but the
writer was not listening for a ctx cancelation.
This manifests as a stuck restore when using
the non-default make simple import spans implementation.

Fixes: #111159
Release note: None

111232: kv: deflake TestLeaseExpirationBasedDrainTransferWithExtension r=nvanbenschoten a=nvanbenschoten

Informs #110715, which will be fixed by a non-clean backport (see 42e45b4) of this commit.

This commit deflakes `TestLeaseExpirationBasedDrainTransferWithExtension` by disabling the node suspect timer in leaseTransferTest tests. These tests manually control the clock and have a habit of inducing destabilizing clock jumps. In this case, if n2 looked at liveness immediately after one of these manual clock jumps, it would mark n1 as suspect and refuse to transfer it the lease for the 30 second `server.time_after_store_suspect`, which is longer than the 5 second `server.shutdown.lease_transfer_wait`. This would cause the test to fail.

Before this patch, the test would fail under stress race in about 8 minutes. Since the patch, it hasn't failed in over 30 minutes.

Release note: None

Co-authored-by: wenyihu6 <[email protected]>
Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Austen McClernon <[email protected]>
Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Ben Bardin <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: xhesikam <[email protected]>
Co-authored-by: adityamaru <[email protected]>
@craig craig bot closed this as completed in 49d0ffc Sep 25, 2023
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

2 participants