roachtest: kv/splits/nodes=3/quiesce=true failed #85112

Closed
cockroach-teamcity opened this issue Jul 27, 2022 · 4 comments · Fixed by #86592
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team

@cockroach-teamcity
Member

cockroach-teamcity commented Jul 27, 2022

roachtest.kv/splits/nodes=3/quiesce=true failed with artifacts on master @ 0f100f3a09246bc80ef2c3119e4ec1c24ef36801:

		  | I220727 05:58:30.260862 430 workload/workloadsql/workloadsql.go:208  [-] 283  finished 281000 of 300000 splits
		  | I220727 05:58:41.062888 430 workload/workloadsql/workloadsql.go:208  [-] 284  finished 282000 of 300000 splits
		  | I220727 05:58:47.422176 430 workload/workloadsql/workloadsql.go:208  [-] 285  finished 283000 of 300000 splits
		  | I220727 05:58:54.533832 430 workload/workloadsql/workloadsql.go:208  [-] 286  finished 284000 of 300000 splits
		  | I220727 05:59:02.495491 430 workload/workloadsql/workloadsql.go:208  [-] 287  finished 285000 of 300000 splits
		  | I220727 05:59:12.357460 430 workload/workloadsql/workloadsql.go:208  [-] 288  finished 286000 of 300000 splits
		  | I220727 05:59:20.338105 430 workload/workloadsql/workloadsql.go:208  [-] 289  finished 287000 of 300000 splits
		  | I220727 05:59:26.275056 430 workload/workloadsql/workloadsql.go:208  [-] 290  finished 288000 of 300000 splits
		  | I220727 05:59:34.791868 430 workload/workloadsql/workloadsql.go:208  [-] 291  finished 289000 of 300000 splits
		  | I220727 05:59:42.846466 430 workload/workloadsql/workloadsql.go:208  [-] 292  finished 290000 of 300000 splits
		  | I220727 05:59:50.475135 430 workload/workloadsql/workloadsql.go:208  [-] 293  finished 291000 of 300000 splits
		  | I220727 06:00:01.912166 430 workload/workloadsql/workloadsql.go:208  [-] 294  finished 292000 of 300000 splits
		  | I220727 06:00:06.923602 430 workload/workloadsql/workloadsql.go:208  [-] 295  finished 293000 of 300000 splits
		  | I220727 06:00:15.769178 430 workload/workloadsql/workloadsql.go:208  [-] 296  finished 294000 of 300000 splits
		  | W220727 06:00:20.479610 156 workload/workloadsql/workloadsql.go:179  [-] 297  ALTER TABLE kv SCATTER FROM (-4903650887037796352) TO (-4903650887037796352): dial tcp 10.142.0.228:26257: connect: connection refused
		  | W220727 06:00:20.479736 108 workload/workloadsql/workloadsql.go:179  [-] 298  ALTER TABLE kv SCATTER FROM (-4915518252833996800) TO (-4915518252833996800): dial tcp 10.142.0.228:26257: connect: connection refused
		  |
		  | stdout:
		Wraps: (4) secondary error attachment
		  | UNCLASSIFIED_PROBLEM: context canceled
		  | (1) UNCLASSIFIED_PROBLEM
		  | Wraps: (2) Node 4. Command with error:
		  |   | ``````
		  |   | ./workload run kv --init --max-ops=1 --concurrency=192 --splits=300000 {pgurl:1-3}
		  |   | ``````
		  | Wraps: (3) context canceled
		  | Error types: (1) errors.Unclassified (2) *hintdetail.withDetail (3) *errors.errorString
		Wraps: (5) context canceled
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *cluster.WithCommandDetails (4) *secondary.withSecondaryError (5) *errors.errorString

	monitor.go:127,kv.go:713,test_runner.go:896: monitor failure: monitor command failure: unexpected node event: 1: dead (exit status 137)
		(1) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).WaitE
		  | 	main/pkg/cmd/roachtest/monitor.go:115
		  | main.(*monitorImpl).Wait
		  | 	main/pkg/cmd/roachtest/monitor.go:123
		  | github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests.registerKVSplits.func1
		  | 	github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/kv.go:713
		  | [...repeated from below...]
		Wraps: (2) monitor failure
		Wraps: (3) attached stack trace
		  -- stack trace:
		  | main.(*monitorImpl).wait.func3
		  | 	main/pkg/cmd/roachtest/monitor.go:202
		  | runtime.goexit
		  | 	GOROOT/src/runtime/asm_amd64.s:1571
		Wraps: (4) monitor command failure
		Wraps: (5) unexpected node event: 1: dead (exit status 137)
		Error types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *errors.errorString
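A note on reading `exit status 137` in the monitor failure above: by shell convention, a status above 128 usually means the process was terminated by a signal numbered `status - 128`. For 137 that is signal 9 (SIGKILL on Linux), which is what the kernel OOM killer sends, though a SIGKILL alone does not prove an OOM. The helper below is a small illustrative sketch; `signalFromExitStatus` is a hypothetical name and is not part of roachtest.

```go
package main

// signalFromExitStatus decodes a shell-style exit status. Statuses in the
// range 129..255 conventionally encode "terminated by signal (status - 128)".
// It returns the signal number and true when the status follows that
// convention, or 0 and false for an ordinary exit code.
func signalFromExitStatus(status int) (sig int, ok bool) {
	if status > 128 && status < 256 {
		return status - 128, true
	}
	return 0, false
}
```

Calling `signalFromExitStatus(137)` yields signal 9; confirming an actual OOM kill would still require the kernel log (for example `dmesg`) from the affected node.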

Parameters: ROACHTEST_cloud=gce , ROACHTEST_cpu=4 , ROACHTEST_ssd=0

Help

See: roachtest README

See: How To Investigate (internal)

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-18046

Epic CRDB-18656

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels Jul 27, 2022
@cockroach-teamcity cockroach-teamcity added this to the 22.2 milestone Jul 27, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Jul 27, 2022
@irfansharif
Contributor

Artifacts are lost. Is this a resurgence of #82334 (closed by stale bot)? Going off the analysis there, I'm guessing this is an OOM. @kvoli, let's add a sync.Pool around this for now and call it a day. To me this is as likely a culprit as any other.

func newReplicaStatsRecord() *replicaStatsRecord {
	return &replicaStatsRecord{
		localityCounts: make(PerLocalityCounts),
		max:            -math.MaxFloat64,
		min:            math.MaxFloat64,
	}
}
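For illustration, a `sync.Pool` around the constructor quoted above might look roughly like the sketch below. This is not the code that landed. It assumes `PerLocalityCounts` is a map keyed by locality string and models only the three fields visible in the snippet; `replicaStatsRecordPool`, `reset`, and `releaseReplicaStatsRecord` are hypothetical names.

```go
package main

import (
	"math"
	"sync"
)

// PerLocalityCounts is ASSUMED to be a map from locality to a count; the
// real definition in the CockroachDB source may differ.
type PerLocalityCounts map[string]float64

// replicaStatsRecord mirrors only the fields shown in the quoted snippet.
type replicaStatsRecord struct {
	localityCounts PerLocalityCounts
	max, min       float64
}

// reset returns a recycled record to the state the original constructor
// produces, reusing the existing map instead of allocating a new one.
func (r *replicaStatsRecord) reset() {
	for k := range r.localityCounts {
		delete(r.localityCounts, k)
	}
	r.max = -math.MaxFloat64
	r.min = math.MaxFloat64
}

// replicaStatsRecordPool (hypothetical) hands out recycled records and only
// allocates when the pool is empty.
var replicaStatsRecordPool = sync.Pool{
	New: func() interface{} {
		return &replicaStatsRecord{localityCounts: make(PerLocalityCounts)}
	},
}

// newReplicaStatsRecord fetches a record from the pool and resets it.
func newReplicaStatsRecord() *replicaStatsRecord {
	r := replicaStatsRecordPool.Get().(*replicaStatsRecord)
	r.reset()
	return r
}

// releaseReplicaStatsRecord (hypothetical) returns a record to the pool.
// The caller must not use r after this call.
func releaseReplicaStatsRecord(r *replicaStatsRecord) {
	replicaStatsRecordPool.Put(r)
}
```

A pool mainly reduces allocation churn and GC pressure for short-lived records. It does not shrink the steady-state footprint if every replica holds its records for its whole lifetime.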

@irfansharif
Contributor

Better yet, see if you can avoid the heap allocations altogether.
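One way to read this suggestion is to store the per-window records by value in a fixed-size array inside the parent object, so that rotating windows never allocates. The sketch below is only an illustration under assumptions: `numWindows`, `statsRecord`, `replicaStats`, and all of the method names are hypothetical, and the record drops the locality map so that it contains no pointers.

```go
package main

import "math"

// numWindows is a HYPOTHETICAL number of rotating windows kept per replica.
const numWindows = 6

// statsRecord holds one window's aggregates. It has no maps or pointers, so
// an array of records is laid out inline in the parent struct.
type statsRecord struct {
	sum, max, min float64
	count         int64
	active        bool
}

// activate overwrites the record in place with a fresh, empty window.
func (r *statsRecord) activate() {
	*r = statsRecord{max: -math.MaxFloat64, min: math.MaxFloat64, active: true}
}

// replicaStats (hypothetical) owns all window records by value. Creating it
// is a single allocation; rotating windows afterwards allocates nothing.
type replicaStats struct {
	idx     int
	records [numWindows]statsRecord
}

func newReplicaStats() *replicaStats {
	rs := &replicaStats{}
	rs.records[0].activate()
	return rs
}

// add records one observation in the current window.
func (rs *replicaStats) add(v float64) {
	r := &rs.records[rs.idx]
	r.sum += v
	r.count++
	r.max = math.Max(r.max, v)
	r.min = math.Min(r.min, v)
}

// rotate advances to the next window, reusing its slot in the array.
func (rs *replicaStats) rotate() {
	rs.idx = (rs.idx + 1) % numWindows
	rs.records[rs.idx].activate()
}

// total sums the observations across all active windows.
func (rs *replicaStats) total() float64 {
	var t float64
	for i := range rs.records {
		if rs.records[i].active {
			t += rs.records[i].sum
		}
	}
	return t
}
```

Keeping the records contiguous also gives the GC fewer pointers to trace per replica, which matters when a store holds hundreds of thousands of ranges as in this test.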

@kvoli
Collaborator

kvoli commented Aug 18, 2022

In #85178 I slightly reduced the number of replica stats structs (by 2).

I'll take a look and rethink/rework the memory alloc for this.

@irfansharif
Contributor

It's possible that PR is why we're no longer seeing failures in this roachtest. But yes, reducing allocs anyway seems like a good idea.

kvoli added a commit to kvoli/cockroach that referenced this issue Aug 23, 2022
This patch removes some unused fields within the replica stats object.
It also opts to allocate all the memory needed upfront for a replica
stats object for better cache locality and less GC overhead.

resolves cockroachdb#85112

Release justification: low risk, lowers memory footprint to avoid oom.
Release note: None
kvoli added a commit to kvoli/cockroach that referenced this issue Aug 25, 2022
This patch removes some unused fields within the replica stats object.
It also opts to allocate all the memory needed upfront for a replica
stats object for better cache locality and less GC overhead.

resolves cockroachdb#85112

Release justification: low risk, lowers memory footprint to avoid oom.
Release note: None
craig bot pushed a commit that referenced this issue Aug 31, 2022
…87158

85354: sql: notices for NotVisible Indexes r=wenyihu6 a=wenyihu6

The optimizer now supports creating invisible indexes as of this
[PR](#85794). An important use case
for not visible indexes is to test the behaviour of dropping an index by marking
the index invisible. However, there are certain cases where users cannot expect
dropping an index to behave exactly the same as marking an index invisible. More
specifically, NotVisible indexes may still be used to police unique or foreign
key constraint checks behind the scenes. In those cases, dropping the index might
behave differently from marking the index invisible. Prior to this commit, users
would not know about this without reading the documentation. This commit adds some
user-friendly notices when users are dropping or changing a not visible index
that might still be used for constraint checks.

There are two cases where we give this notice: 1. if the index is unique.
2. if the index is on a child table and may help with FK checks.

More details on how this decision was made in
docs/RFCS/20220628_invisible_index.md.

Assists: #72576

See also: #85794

Release justification: low risk to the existing functionality; this commit just
adds notices.

Release note: none

86592: kvserver: rework memory allocation in replicastats r=kvoli a=kvoli

This patch removes some unused fields within the replica stats object.
It also opts to allocate all the memory needed upfront for a replica
stats object for better cache locality and less GC overhead.

This patch also removes locality tracking for the other throughput trackers
to reduce per-replica memory footprint.

resolves #85112

Release justification: low risk, lowers memory footprint to avoid oom.
Release note: None

87024: sql: Prevent primary region being same as secondary region r=rafiss a=e-mbrown

fixes #86879

We found that the primary region could be assigned the same region as the secondary region. This commit adds an error to prevent that.

Release justification: Low risk high benefit change to existing functionality
Release note: None

87110: ui: fixes to high contention copy in insight workload pages r=ericharmeling a=ericharmeling

Previously, the High Contention insight type was labeled
"High Contention Time", and the waiting transactions list
was labeled in the incorrect tense. This commit fixes those
typos.

Release justification: bug fix
Release note: None

87135: build: remove newly-added node_modules/ trees in ui-maintainer-clean r=rickystewart a=sjbarag

A few recent features [1, 2] introduced new node_modules/ trees for
dependencies, but didn't update the ui-maintainer-clean Make target to
remove them. This allowed those directories to leak between TeamCity
builds with Docker user permissions, preventing a `yarn install` in
those packages from properly laying out a `node_modules/.bin` directory
for executables like `tsc`. Remove the recently-introduced
`node_modules/` directories as part of `make ui-maintainer-clean`, to
restore a clean state between jobs.

[1] d28c072 (ui: add eslint-plugin-crdb package with custom eslint rules, 2022-05-27)
[2] c58279d (ui: reintroduce end-to-end UI tests with cypress, 2022-08-12)

Release justification: Non-production code changes

87149: sql: clean up physical planning for system tenant r=yuzefovich a=yuzefovich

This commit audits a couple of methods around the health and version of
DistSQL nodes that are used only for the system tenant to make that more
explicit. Additionally, it unexports `NodeStatuses` map from the
planning context as well as removes some unnecessary short-circuiting
behavior around checking the node health and version (it was unnecessary
because we already short-circuit in
`checkInstanceHealthAndVersionSystem`).

Release justification: low-risk cleanup.

Release note: None

87153: ui: ux improvements on stmt details page r=maryliag a=maryliag

This commit adds a few improvements and bug fixes:

- Handles the case where we hit a
timeout on statement details, so it doesn't crash
anymore and you can still see the time picker to
be able to select a new time interval.

- Updates the error message to
clarify that it was a timeout error, and increases the
timeout from 30s to 30m on the details endpoint.
Fixes #78979

- Updates the last error for statement
details with the proper value; previously it
used the error from the all-statements endpoint
instead of the one specific to that fingerprint ID.

- Adds a message when page takes longer to load.

- Uses a proper count formatting for
execution count.

Release justification: bug fixes and smaller improvements
Release note (ui change): Proper formatting of execution count
under Statement Details page.
Increase timeout for Statement Details page and shows
proper timeout error when it happens, no longer
crashing the page.

87155: github-post: allow for finding the test in a parent directory of the pkg r=srosenberg,rail a=rickystewart

In some cases the Bazel test runner "incorrectly" reports the package
path for tests. For example, we have [issues](#85376) where the name of
the test is reported as `pkg/.../package/package_test` rather than
`pkg/.../package` as we might expect. I suspect this is confusing
`github-post` when it tries to find tests in the `package_test`
directory rather than the `package` directory.

We address this by allowing `github-post` to search up the directory
tree for the test rather than expecting it to be in one particular
directory.

Also update a repro command to use `dev test` rather than
`make stressrace`.

Closes #85420.

Release justification: Non-production code changes
Release note: None

87156: ci: disable sharding in random syntax tests r=srosenberg a=rickystewart

The different shards were trampling each other's test.json.txt,
preventing failures from being reported accurately.

Release justification: Non-production code changes
Release note: None

87158: sql: clean up node dialer fields r=yuzefovich a=yuzefovich

This commit removes the no-longer-used `nodeDialer` field (for SQL - KV
communication) and renames some of the similarly named fields to
`podNodeDialer` to indicate that it is only a SQL - SQL dialer.

Release justification: low-risk cleanup.

Release note: None

Co-authored-by: wenyihu3
Co-authored-by: Austen McClernon
Co-authored-by: e-mbrown
Co-authored-by: Eric Harmeling
Co-authored-by: Sean Barag
Co-authored-by: Yahor Yuzefovich
Co-authored-by: Marylia Gutierrez
Co-authored-by: Ricky Stewart
@craig craig bot closed this as completed in 8b56916 Aug 31, 2022