Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Require conn_quota::units stay on a single shard #10681

Merged
merged 3 commits into from
Jun 2, 2023

Conversation

rockwotj
Copy link
Contributor

@rockwotj rockwotj commented May 10, 2023

This makes the API contract more explicit since we now require the
connection quota units doesn't change shards.

Related: #10544

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.1.x
  • v22.3.x
  • v22.2.x

Release Notes

  • none

@rockwotj rockwotj force-pushed the rockwood/conn_quota branch 3 times, most recently from dbaa003 to 6baa484 Compare May 11, 2023 03:53
@rockwotj rockwotj requested a review from dlex May 11, 2023 03:53
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
@rockwotj
Copy link
Contributor Author

/ci-repeat 5
skip-units
dt-repeat=100
rptest.tests.services_self_test.py::KgoVerifierSelfTest.test_kgo_verifier

src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
@rockwotj
Copy link
Contributor Author

/ci-repeat 10
debug
skip-units
dt-repeat=30
tests/rptest/tests/services_self_test.py

@rockwotj
Copy link
Contributor Author

/ci-repeat 1
release
skip-units
dt-repeat=10
dt-log-level=TRACE
tests/rptest/tests/services_self_test.py

@vbotbuildovich
Copy link
Collaborator

Invalid ducktape log level TRACE. Must be one of trace,debug,info,warn,error.

Workflow run logs.

@rockwotj
Copy link
Contributor Author

/ci-repeat 1
release
skip-units
dt-repeat=10
dt-log-level=trace
tests/rptest/tests/services_self_test.py

@rockwotj
Copy link
Contributor Author

/ci-repeat 5
skip-units
dt-repeat=20
rptest/tests/services_self_test.py::KgoVerifierSelfTest.test_kgo_verifier

Copy link
Contributor

@dlex dlex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please correct me from wrong, do_put() explicitly supports returning units cross-shard. If there is a suspiction that this is somehow correlated with segfaults, maybe it's safer to add warns/errors to the log about that.

src/v/net/conn_quota.h Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
Copy link
Contributor Author

@rockwotj rockwotj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The while conn_quota supports cross shard, it does not support being used on a different shard than it's home shard. The stacktrace in the linked bug makes it look like total_home is a bad pointer or pointing to junk. conn_quota::units captures a reference to a single shard's conn_quota which is a way the stacktrace could happen.

total_home is only defined on shard0, if another shard's service is called on shard0 then it will assume that total_home is a valid pointer and then dereference nullptr. The log statement I added should help determine this too.

src/v/net/conn_quota.h Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
src/v/net/conn_quota.cc Outdated Show resolved Hide resolved
@rockwotj rockwotj requested a review from dlex May 26, 2023 19:34
@rockwotj rockwotj force-pushed the rockwood/conn_quota branch 3 times, most recently from da6ba58 to 65c048b Compare May 29, 2023 02:09
src/v/vlog.h Outdated Show resolved Hide resolved
Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a theory total_home (only set on shard0's service) could be ran
using a service from another shard, but on shard0.

do you have a candidate execution schedule that would lead to the violation? i recall thinking the same thing a long time ago, but was never able to find a counterexample. i still remained a bit suspicious.

src/v/bytes/oncore.h Outdated Show resolved Hide resolved
@rockwotj
Copy link
Contributor Author

The stack trace in #10544 (comment) hints that the total_home pointer on shard0 is bad from the logs. And we do grab a reference in units to a local shard, so seems like it could be a smoking gun. From my reading of the code it doesn't look like we switch shards however.

@rockwotj rockwotj force-pushed the rockwood/conn_quota branch 8 times, most recently from 35d7a10 to 7445914 Compare May 31, 2023 02:13
@dotnwat
Copy link
Member

dotnwat commented May 31, 2023

The stack trace in #10544 (comment) hints that the total_home pointer on shard0 is bad from the logs. And we do grab a reference in units to a local shard, so seems like it could be a smoking gun. From my reading of the code it doesn't look like we switch shards however.

this is wild. do you think maybe we aren't switching shards, but a pointer is leaking across shards some how?

@rockwotj
Copy link
Contributor Author

rockwotj commented May 31, 2023

@rockwotj
Copy link
Contributor Author

this is wild. do you think maybe we aren't switching shards, but a pointer is leaking across shards some how?

Hrm, that seems unlikely? I'm not thinking that directly I supposed. This is more to prevent this from accidentally happening in the future, and guarding against any weird bugs here.

@rockwotj rockwotj requested a review from dotnwat May 31, 2023 12:01
Upgrades oncore and vlog to use std::source_location, removing
__FILE__ and __LINE__ macro usage. This means it's easier to use oncore
without a macro.

Signed-off-by: Tyler Rockwood <[email protected]>
This makes the API contract more explicit since we now require the
connection quota doesn't change shards.

Related: redpanda-data#10544

Signed-off-by: Tyler Rockwood <[email protected]>
I have a theory total_home (only set on shard0's service) could be ran
using a service from another shard, but on shard0. We should add logs
for what total_allowance is because that's what the stacktrace reported
as crashing.

Signed-off-by: Tyler Rockwood <[email protected]>
@rockwotj
Copy link
Contributor Author

rockwotj commented Jun 1, 2023

/ci-repeat 10
skip-units
dt-repeat=10
rptest/tests/services_self_test.py::KgoVerifierSelfTest.test_kgo_verifier

@rockwotj
Copy link
Contributor Author

rockwotj commented Jun 1, 2023

/ci-repeat 10
skip-units
skip-rebase
dt-repeat=10
tests/rptest/tests/services_self_test.py::KgoVerifierSelfTest.test_kgo_verifier

@rockwotj
Copy link
Contributor Author

rockwotj commented Jun 1, 2023

/ci-repeat 10
skip-units
skip-rebase
dt-repeat=200
tests/rptest/tests/services_self_test.py::KgoVerifierSelfTest.test_kgo_verifier

Copy link
Member

@dotnwat dotnwat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm!

src/v/vlog.h Show resolved Hide resolved
@rockwotj rockwotj merged commit 24dddb8 into redpanda-data:dev Jun 2, 2023
@rockwotj rockwotj deleted the rockwood/conn_quota branch June 2, 2023 13:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants