Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

admission: fix overflow bug in ioLoadListener #69567

Merged
merged 1 commit into from
Aug 30, 2021

Conversation

sumeerbhola
Copy link
Collaborator

Fixes #69461,#69463

Release justification: High-severity bug fix to new funtionality.

Release note: None

Fixes cockroachdb#69461,cockroachdb#69463

Release justification: High-severity bug fix to new funtionality.

Release note: None
@sumeerbhola sumeerbhola requested review from tbg and RaduBerinde August 30, 2021 13:44
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you say a few words here about how the bug triggered? Feels like when the code was originally written, we assumed that we would only hit the MaxInt64 endpoint and not MaxInt64 - small_number. (I lack the understanding of what the context of this code is).

Also, I think this fixes all of these: https://github.com/cockroachdb/cockroach/issues?q=is%3Aopen+is%3Aissue+label%3AC-test-failure+assignee%3Asumeerbhola+

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde)

@sumeerbhola
Copy link
Collaborator Author

TFTR!

Could you say a few words here about how the bug triggered? Feels like when the code was originally written, we assumed that we would only hit the MaxInt64 endpoint and not MaxInt64 - small_number. (I lack the understanding of what the context of this code is).

I still don't know. I've been trying to reproduce the roachtest failures with additional instrumentation, but haven't been successful so far. I'll keep trying this afternoon, since it might uncover some other badness. Meanwhile I'll merge this.

@sumeerbhola
Copy link
Collaborator Author

bors r+

@sumeerbhola
Copy link
Collaborator Author

btw, do you know what exit code 134 is? I could not find it in cli/exit/codes.go and the roachtest consistently fails with that code. When I look in the stderr log, all nodes failed with 134, but there is nothing in cockroach.log indicating any problem.

16:04:03 sysbench.go:115: test status: preparing workload
1: 11603
2: 11126
3: 11141
3: dead (exit status 134)
16:10:51 test_impl.go:323: test failure: 	cluster.go:1893,sysbench.go:117,monitor.go:106,errgroup.go:57: output in run_160405.550649000_n4_sysbench: /Users/sumeer/go/src/github.com/cockroachdb/cockroach/bin/roachprod run sumeer-1630339241-01-n4cpu32:4 -- sysbench \: context canceled

@tbg
Copy link
Member

tbg commented Aug 30, 2021

You need to look at cockroach-stderr.log (or cockroach.stderr.log, I forget which one it is). I'm not sure exactly where exit status 134 comes from but the process crashes because of a runtime panic (the one you're fixing now) and so it exits nonzero.

@tbg
Copy link
Member

tbg commented Aug 30, 2021

Ah, 134 is just 128+SIGABRT

@RaduBerinde
Copy link
Member

I don't get it.. How would totalTokens be close to but not quite MaxInt64 (aka unlimitedTokens)? We're either setting it to unlimitedTokens or we're calculating what should be a reasonable value.

@RaduBerinde
Copy link
Member

I'd add more assertions: check that bytesAdded >= 0 and that io.totalTokens > 0 in adjustTokens.

@sumeerbhola
Copy link
Collaborator Author

sumeerbhola commented Aug 30, 2021

Ah, 134 is just 128+SIGABRT

So is it code 7 which is FatalError? I don't see where/whether these are being used as bit masks.

I'd add more assertions: check that bytesAdded >= 0 and that io.totalTokens > 0 in adjustTokens.

I can't even get the current assertion to fire, at least based on looking at the logs. And the extra logging I added shows sane values.

@craig
Copy link
Contributor

craig bot commented Aug 30, 2021

Build succeeded:

@craig craig bot merged commit e946415 into cockroachdb:master Aug 30, 2021
@tbg
Copy link
Member

tbg commented Aug 30, 2021

I'm not sure where the exit code comes from. On a panic like the one hit in these tests, the exit code should be 2. We invoke CRDB through systemd via a shell script, this may make a difference, but the exit code grabbed from the crdb process within that script is also 134 as evidenced by the log files.

"${BINARY}" "${START_CMD}" "${ARGS[@]}" >> "${LOG_DIR}/cockroach.stdout.log" 2>> "${LOG_DIR}/cockroach.stderr.log" || CODE="$?"
if [[ -z "${LOCAL}" || "${CODE}" -ne 0 ]]; then
echo "cockroach exited with code ${CODE}: $(date)" | tee -a "${LOG_DIR}"/{roachprod,cockroach.{exit,std{out,err}}}.log
fi
exit "${CODE}"

@sumeerbhola
Copy link
Collaborator Author

I have a reproduction -- this PR is not going to fix it since ioLoadListener.totalTokens is negative.

@RaduBerinde
Copy link
Member

Is it because bytesAdded is negative? That was the only possibility I could find in the code when I looked earlier.

@sumeerbhola
Copy link
Collaborator Author

Is it because bytesAdded is negative?

It was simpler than that -- bytesAdded can be 0, so we get +inf for numAdmit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: sysbench/oltp_delete/nodes=3/cpu=32/conc=128 failed
4 participants