Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: introduce limit on number of retries in IntExec.Exec{Ex} methods #114398

Merged
merged 1 commit into from
Nov 16, 2023

Conversation

yuzefovich
Copy link
Member

@yuzefovich yuzefovich commented Nov 14, 2023

This commit introduces a limit on the number of times the InternalExecutor machinery can retry errors in Exec{Ex} methods. That logic was introduced in c09860b in order to reduce the impact of fixes in other commits in #101477. However, there is no limit in the number of retries, and we hit the stack overflow twice in our own testing recently seemingly in this code path. Thus, this commit adds a limit, 5 by default.

Note that I'm not sure exactly what bug, if any, can lead to the stack overflow. One seemingly-unlikely theory is that there is no bug, meaning that we were simply retrying forever because the stmts were pushed by higher priority txns every time. Still, this change seems beneficial on its own and should prevent stack overflows even if we don't fully understand the root cause.

An additional improvement is that we now track the depth of recursion, and once it exceeds 1000, we'll return an error. This should prevent the stack overflows due to other reasons.

There is no release note given we've only seen this twice in our own testing and it involved cluster-to-cluster streaming.

Fixes: #109197.

Release note: None

@yuzefovich yuzefovich added the backport-23.2.x Flags PRs that need to be backported to 23.2. label Nov 14, 2023
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich yuzefovich marked this pull request as ready for review November 14, 2023 06:36
@yuzefovich yuzefovich requested review from a team as code owners November 14, 2023 06:36
@yuzefovich yuzefovich requested review from rharding6373 and rafiss and removed request for a team November 14, 2023 06:36
Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)

@yuzefovich
Copy link
Member Author

TFTR Rachael!

@rafiss could you double-check my understanding that we do need to have a cluster setting to be able to tweak the session variable value used by the internal executors?

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/settings/registry.go line 240 at r1 (raw file):

// changed with ALTER ROLE ... SET.
var sqlDefaultSettings = map[InternalKey]struct{}{
	// PLEASE DO NOT ADD NEW SETTINGS TO THIS MAP. THANK YOU.

curious if there's a reason to add the setting in spite of this warning?

Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)


pkg/settings/registry.go line 240 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

curious if there's a reason to add the setting in spite of this warning?

That's what I wanted to confirm with you: IIUC we currently need to have a cluster setting to influence the default value (i.e. GlobalDefault) for session variables that only impact the internal executor. It doesn't have to have a sql.defaults. prefix (I can change that, and then we won't have to add it to the registry), but I think doing ALTER ROLE ALL SET internal_executor_rows_affected_retry_limit = ... doesn't do anything for session data used for the internal executors. Does that sound right?

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

code

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/settings/registry.go line 240 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

That's what I wanted to confirm with you: IIUC we currently need to have a cluster setting to influence the default value (i.e. GlobalDefault) for session variables that only impact the internal executor. It doesn't have to have a sql.defaults. prefix (I can change that, and then we won't have to add it to the registry), but I think doing ALTER ROLE ALL SET internal_executor_rows_affected_retry_limit = ... doesn't do anything for session data used for the internal executors. Does that sound right?

ah sorry for missing your question

yeah, that is a great point. for internal executor usages that are invoked by a background job, and not delegated by a user session, there is no session variable that could get configured ahead of time.

i agree with your suggestion to avoid a sql.defaults name. maybe we can add a new namespace like sql.internal_executor.rows_affected_retry_limit to cover future settings that might fall under this. since it's hidden, i don't feel strongly about the name

rest of the code lgtm!


pkg/sql/internal.go line 1332 at r1 (raw file):

	// rowsAffectedState is only used in rowsAffectedIEExecutionMode.
	rowsAffectedState struct {

sanity check: is this struct only accessed from one goroutine? my read of the code says yes, but given how tricky the internal executor can get, i figured i'd ask

@yuzefovich yuzefovich force-pushed the ie-exec-limit branch 2 times, most recently from 1b61254 to dfef014 Compare November 15, 2023 23:13
@yuzefovich yuzefovich requested a review from rafiss November 15, 2023 23:14
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another minor change (based on a suggestion from Stan) to add a limit to the recursion depth, would appreciate another quick look.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss)


pkg/sql/internal.go line 1332 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

sanity check: is this struct only accessed from one goroutine? my read of the code says yes, but given how tricky the internal executor can get, i figured i'd ask

Yes, this state is only accessed by the connExecutor goroutine (i.e. the separate goroutine that is spawned up to evaluate an internal query, the "writer").


pkg/settings/registry.go line 240 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

ah sorry for missing your question

yeah, that is a great point. for internal executor usages that are invoked by a background job, and not delegated by a user session, there is no session variable that could get configured ahead of time.

i agree with your suggestion to avoid a sql.defaults name. maybe we can add a new namespace like sql.internal_executor.rows_affected_retry_limit to cover future settings that might fall under this. since it's hidden, i don't feel strongly about the name

rest of the code lgtm!

Makes sense, thanks, renamed.

Copy link
Collaborator

@rafiss rafiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the new check also lgtm

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)


pkg/settings/registry.go line 240 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

Makes sense, thanks, renamed.

oh hm, i guess since this is all hidden, we may as well have only the cluster setting, and not have the session variable. i don't feel strongly; it was just an idea to reduce the amount of code, but if keeping the session variable feels natural, i'm fine with that.

This commit introduces a limit on the number of times the
InternalExecutor machinery can retry errors in `Exec{Ex}` methods. That
logic was introduced in c09860b in
order to reduce the impact of fixes in other commits in cockroachdb#101477.
However, there is no limit in the number of retries, and we hit the
stack overflow twice in our own testing recently seemingly in this code
path. Thus, this commit adds a limit, 5 by default.

Note that I'm not sure exactly what bug, if any, can lead to the stack
overflow. One seemingly-unlikely theory is that there is no bug, meaning
that we were simply retrying forever because the stmts were pushed by
higher priority txns every time. Still, this change seems beneficial on
its own and should prevent stack overflows even if we don't fully
understand the root cause.

An additional improvement is that we now track the depth of recursion,
and once it exceeds 1000, we'll return an error. This should prevent the
stack overflows due to other reasons.

There is no release note given we've only seen this twice in our own
testing and it involved cluster-to-cluster streaming.

Release note: None
Copy link
Member Author

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss)


pkg/settings/registry.go line 240 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

oh hm, i guess since this is all hidden, we may as well have only the cluster setting, and not have the session variable. i don't feel strongly; it was just an idea to reduce the amount of code, but if keeping the session variable feels natural, i'm fine with that.

Oh, yeah, good point. I thought we didn't have access to the settings.SV object and was just following the example with overrides, but we do have it easily accessible. Kept only the cluster setting.

@craig
Copy link
Contributor

craig bot commented Nov 16, 2023

Build succeeded:

@craig craig bot merged commit f94233d into cockroachdb:master Nov 16, 2023
3 checks passed
@yuzefovich yuzefovich deleted the ie-exec-limit branch November 16, 2023 00:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-23.2.x Flags PRs that need to be backported to 23.2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: stack overflow in InternalExecutor.Exec{Ex}
4 participants