ccl/multiregionccl: TestMultiRegionDatabaseStats failed #96163

Closed
cockroach-teamcity opened this issue Jan 30, 2023 · 16 comments · Fixed by #108037

@cockroach-teamcity
Member

cockroach-teamcity commented Jan 30, 2023

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ 69dd453d0e61e258f402c5751de310405743cd18:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats3303760444
    test_log_scope.go:79: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: expected node-ids=[4 5 6], got [1 2 3 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats3303760444
--- FAIL: TestMultiRegionDatabaseStats (97.91s)

Parameters: TAGS=bazel,gss,deadlock

Help

See also: How To Investigate a Go Test Failure (internal)

/cc @cockroachdb/multiregion

This test on roachdash | Improve this report!

Jira issue: CRDB-23975

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. labels Jan 30, 2023
@cockroach-teamcity cockroach-teamcity added this to the 23.1 milestone Jan 30, 2023
@ajstorm ajstorm added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Jan 30, 2023
@rafiss rafiss removed the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Feb 1, 2023
@rafiss
Collaborator

rafiss commented Feb 1, 2023

Adding to @cockroachdb/sql-observability since that team wrote this test.

@ericharmeling
Contributor

Update: I'm unable to reproduce this test failure at the moment. I ran pkg/ccl/multiregionccl under stress and got through 50 runs with no failures until TestMrSystemDatabase/QueryByEnum failed (there's an open issue for that, #93863, and it's pretty rare). I suspect this failure is pretty rare as well.

@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ 286b3e235171a39b8f9910555affcc7ce310741a:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats3400295038
    test_log_scope.go:79: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats3400295038
--- FAIL: TestMultiRegionDatabaseStats (51.23s)

Parameters: TAGS=bazel,gss


@j82w
Contributor

j82w commented Feb 22, 2023

Can we modify the test to log the entire crdb_internal.ranges table when it fails? Seeing what the ranges are at that point would help isolate the issue.

@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ fcea283ebca17a6d923c5d4b0401697438b77dbd:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats2351839834
    test_log_scope.go:79: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats2351839834
--- FAIL: TestMultiRegionDatabaseStats (51.19s)

Parameters: TAGS=bazel,gss


@ericharmeling
Contributor

I'm having trouble reproducing this error. 0 failures after 15000+ runs.

@ericharmeling ericharmeling removed their assignment Mar 29, 2023
@THardy98 THardy98 assigned THardy98 and unassigned THardy98 Apr 5, 2023
@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ 85e41ca8d3d9edacf5ee3061a2591b159b9b0502:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:161: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats2626672983
    test_log_scope.go:79: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats2626672983
--- FAIL: TestMultiRegionDatabaseStats (97.44s)

Parameters: TAGS=bazel,gss,deadlock


@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ 8124bff03d073f35f4d2b6a2048c7f4417d757d9:


goroutine 99083522 lock 0xc015ab5db0
github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:575 status.(*MetricsRecorder).WriteNodeStatus ??? <<<<<
github.com/cockroachdb/cockroach/pkg/server/status/recorder.go:574 status.(*MetricsRecorder).WriteNodeStatus ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1087 server.(*Node).writeNodeStatus.func1 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:305 stop.(*Stopper).RunTask ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1088 server.(*Node).writeNodeStatus ???
github.com/cockroachdb/cockroach/pkg/server/node.go:1039 server.(*Node).startWriteNodeStatus.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 99431539 lock 0xc0061e19d0
github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/sidetransport/sender.go:350 sidetransport.(*Sender).publish ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/sidetransport/sender.go:349 sidetransport.(*Sender).publish ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/closedts/sidetransport/sender.go:251 sidetransport.(*Sender).Run.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 99192116 lock 0xc01a8912d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:712 kvserver.(*Replica).handleRaftReady ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:711 kvserver.(*Replica).handleRaftReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:645 kvserver.(*Store).processReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:394 kvserver.(*raftSchedulerShard).worker ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:299 kvserver.(*raftScheduler).Start.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 98485617 lock 0xc014ab80d8
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:712 kvserver.(*Replica).handleRaftReady ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:711 kvserver.(*Replica).handleRaftReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:645 kvserver.(*Store).processReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:394 kvserver.(*raftSchedulerShard).worker ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:299 kvserver.(*raftScheduler).Start.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 99192161 lock 0xc022d5ad58
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:712 kvserver.(*Replica).handleRaftReady ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:711 kvserver.(*Replica).handleRaftReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:645 kvserver.(*Store).processReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:394 kvserver.(*raftSchedulerShard).worker ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:299 kvserver.(*raftScheduler).Start.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???

goroutine 98995755 lock 0xc01b046d58
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:712 kvserver.(*Replica).handleRaftReady ??? <<<<<
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/replica_raft.go:711 kvserver.(*Replica).handleRaftReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/store_raft.go:645 kvserver.(*Store).processReady ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:394 kvserver.(*raftSchedulerShard).worker ???
github.com/cockroachdb/cockroach/pkg/kv/kvserver/pkg/kv/kvserver/scheduler.go:299 kvserver.(*raftScheduler).Start.func2 ???
github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:470 stop.(*Stopper).RunAsyncTaskEx.func2 ???



Parameters: TAGS=bazel,gss,deadlock


@THardy98

Closing as duplicate of: #102761

@DrewKimball DrewKimball reopened this Jul 20, 2023
@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ f295bd861a3a427652b19c2254d2401ebb4a3c8e:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1726902072
    test_log_scope.go:81: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: from region_test.go:90: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
--- FAIL: TestMultiRegionDatabaseStats (64.76s)

Parameters: TAGS=bazel,gss , stress=true


@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ fcfa15d9694c0b840ac0c5c1b1489fa90a62efd8:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1495832430
    test_log_scope.go:81: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: from region_test.go:90: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1495832430
--- FAIL: TestMultiRegionDatabaseStats (61.95s)

Parameters: TAGS=bazel,gss , stress=true


THardy98 pushed a commit to THardy98/cockroach that referenced this issue Aug 2, 2023
Resolves: cockroachdb#96163

This change makes the admin API endpoint getting database statistics
scan KV for span statistics instead of using the range descriptor cache.
This provides authoritative output, helping deflake
`TestMultiRegionDatabaseStats`.

Release note (sql change): admin API database details endpoint now
returns authoritative range statistics.
@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ 1382b26a97bf6f70a07d363dc319283c173359eb:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats551336208
    test_log_scope.go:81: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: from region_test.go:90: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats551336208
--- FAIL: TestMultiRegionDatabaseStats (63.93s)

Parameters: TAGS=bazel,gss , stress=true


@pav-kv
Collaborator

pav-kv commented Aug 8, 2023

Flaked again here.

@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ a598a75435f6426236dde569191fbd95871f24ae:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1686121623
    test_log_scope.go:81: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: from region_test.go:90: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1686121623
--- FAIL: TestMultiRegionDatabaseStats (62.51s)

Parameters: TAGS=bazel,gss , stress=true


THardy98 pushed a commit to THardy98/cockroach that referenced this issue Aug 11, 2023
@cockroach-teamcity
Member Author

ccl/multiregionccl.TestMultiRegionDatabaseStats failed with artifacts on master @ c13bf7633cbb416d9e43f8c57b1e309fab1110ce:

=== RUN   TestMultiRegionDatabaseStats
    test_log_scope.go:167: test logs captured to: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1072905535
    test_log_scope.go:81: use -show-logs to present logs inline
    region_test.go:66: condition failed to evaluate within 45s: from region_test.go:90: expected node-ids=[4 5 6], got [2 3 4 5 6]
    panic.go:522: -- test log scope end --
test logs left over in: /artifacts/tmp/_tmp/e558fc8050776f4c54ea39ba371b49da/logTestMultiRegionDatabaseStats1072905535
--- FAIL: TestMultiRegionDatabaseStats (61.44s)

Parameters: TAGS=bazel,gss , stress=true


craig bot pushed a commit that referenced this issue Aug 14, 2023
107965: roachtest: roachprod: add test name and run id vm labels for metrics r=herkolategan a=smg260

These 2 commits add labels to clusters running roachtests, so that metrics can be better filtered in various dashboards. 

1. Adds `test_name` label to each cluster, and removes the label at the end of the test. Thus, each cluster would have this label updated for each test that it runs during a particular roachtest invocation. The test name will be simplified to conform to cloud labelling rules `[a-zA-Z-]`

2. Adds `test_run_id` label to each VM, *once*, for the duration of the run. Thus, each cluster would have this label added once at the beginning of a roachtest run (which would include multiple tests), and removed only after deregistration at the end.
In TeamCity this would take the form `<TC_USER>-<TC_BUILD_ID>`, and when run locally `<USER>-<UNIX_TS>`.

These 2 labels combined will make it easy for a user to find metrics for a particular run of roachtest (e.g. a specific GCE nightly).
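As a rough illustration of the two label values described above, here is a minimal Go sketch. It is not the roachtest implementation: `sanitizeTestName` and `localRunID` are hypothetical helpers, the test name is made up, and replacing each disallowed character with `-` is only one assumed way to "simplify" a name to the `[a-zA-Z-]` set.

```go
package main

import (
	"fmt"
	"regexp"
)

// disallowed matches every character outside the [a-zA-Z-] set quoted above.
var disallowed = regexp.MustCompile(`[^a-zA-Z-]`)

// sanitizeTestName is a hypothetical helper: it replaces each disallowed
// character with '-'. The actual simplification in roachtest may differ.
func sanitizeTestName(name string) string {
	return disallowed.ReplaceAllString(name, "-")
}

// localRunID builds the <USER>-<UNIX_TS> form described for local runs.
func localRunID(user string, unixTS int64) string {
	return fmt.Sprintf("%s-%d", user, unixTS)
}

func main() {
	fmt.Println(sanitizeTestName("example/test=1")) // example-test--
	fmt.Println(localRunID("alice", 1700000000))    // alice-1700000000
}
```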

Here is a [copy of an existing dashboard](https://grafana.testeng.crdb.io/d/qdkBruq4k/crdb-console-runtime-by-test?orgId=1&from=now-3h&to=now), modified to utilise the new labels.

Epic: None
Fixes: #98658
Release note: None

108037: server: return authoritative span statistics for db details endpoint r=THardy98 a=THardy98

Resolves: #96163

This change makes the admin API endpoint getting database statistics scan KV for span statistics instead of using the range descriptor cache. This provides authoritative output, helping deflake `TestMultiRegionDatabaseStats`.

Release note (sql change): admin API database details endpoint now returns authoritative range statistics.

108711: upgrades: deflake TestRoleMembersIDMigration1500Users r=rafiss a=rafiss

TeamCity has a new machine type where this test has started to time out more, so this change will make it take less time.

fixes #108539
Release note: None

Co-authored-by: Miral Gadani <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
@craig craig bot closed this as completed in b2faa2b Aug 14, 2023
THardy98 pushed a commit to THardy98/cockroach that referenced this issue Aug 14, 2023