sql: introduce limit on number of retries in IntExec.Exec{Ex} methods #114398
Conversation
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)
TFTR Rachael! @rafiss could you double-check my understanding that we do need to have a cluster setting to be able to tweak the session variable value used by the internal executors?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/settings/registry.go
line 240 at r1 (raw file):
// changed with ALTER ROLE ... SET.
var sqlDefaultSettings = map[InternalKey]struct{}{
	// PLEASE DO NOT ADD NEW SETTINGS TO THIS MAP. THANK YOU.
curious if there's a reason to add the setting in spite of this warning?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @rafiss)
pkg/settings/registry.go
line 240 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
curious if there's a reason to add the setting in spite of this warning?
That's what I wanted to confirm with you: IIUC we currently need to have a cluster setting to influence the default value (i.e. GlobalDefault) for session variables that only impact the internal executor. It doesn't have to have a sql.defaults. prefix (I can change that, and then we won't have to add it to the registry), but I think doing ALTER ROLE ALL SET internal_executor_rows_affected_retry_limit = ... doesn't do anything for session data used for the internal executors. Does that sound right?
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/settings/registry.go
line 240 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
That's what I wanted to confirm with you: IIUC we currently need to have a cluster setting to influence the default value (i.e. GlobalDefault) for session variables that only impact the internal executor. It doesn't have to have a sql.defaults. prefix (I can change that, and then we won't have to add it to the registry), but I think doing ALTER ROLE ALL SET internal_executor_rows_affected_retry_limit = ... doesn't do anything for session data used for the internal executors. Does that sound right?
ah sorry for missing your question
yeah, that is a great point. for internal executor usages that are invoked by a background job, and not delegated by a user session, there is no session variable that could get configured ahead of time.
i agree with your suggestion to avoid a sql.defaults name. maybe we can add a new namespace like sql.internal_executor.rows_affected_retry_limit to cover future settings that might fall under this. since it's hidden, i don't feel strongly about the name
rest of the code lgtm!
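[Editor's note: for illustration, a rough sketch of how a hidden cluster setting under the suggested namespace could be registered. This is not the PR's actual code; the package, settings class constant, description, and default value below are assumptions.]

package sql

import "github.com/cockroachdb/cockroach/pkg/settings"

// Hypothetical registration of the retry-limit setting discussed above.
// The key follows the suggested sql.internal_executor. namespace; the class
// constant, description, and default are placeholders, not code from the PR.
var internalExecutorRowsAffectedRetryLimit = settings.RegisterIntSetting(
	settings.TenantWritable,
	"sql.internal_executor.rows_affected_retry_limit",
	"limit on the number of transparent retries performed by internal executors",
	5, // assumed default, matching the limit mentioned in the commit message below
)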
pkg/sql/internal.go
line 1332 at r1 (raw file):
// rowsAffectedState is only used in rowsAffectedIEExecutionMode.
rowsAffectedState struct {
sanity check: is this struct only accessed from one goroutine? my read of the code says yes, but given how tricky the internal executor can get, i figured i'd ask
1b61254 to dfef014
I made another minor change (based on a suggestion from Stan) to add a limit to the recursion depth; would appreciate another quick look.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss)
pkg/sql/internal.go
line 1332 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
sanity check: is this struct only accessed from one goroutine? my read of the code says yes, but given how tricky the internal executor can get, i figured i'd ask
Yes, this state is only accessed by the connExecutor goroutine (i.e. the separate goroutine that is spawned up to evaluate an internal query, the "writer").
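[Editor's note: to illustrate the ownership model described in this answer, here is a small, generic Go sketch, not the internal executor's actual types: the spawned writer goroutine owns the mutable state, and the caller only ever observes values sent over a channel.]

package main

import "fmt"

// rowsAffected is a stand-in for per-query mutable state (like
// rowsAffectedState): it is created inside the writer goroutine and never
// shared, so it needs no synchronization.
type rowsAffected struct {
	count int
}

func main() {
	results := make(chan int, 1)

	// The "writer" goroutine owns the state exclusively; the caller (the
	// "reader") only ever sees values sent over the channel.
	go func() {
		defer close(results)
		state := &rowsAffected{}
		for i := 0; i < 3; i++ {
			state.count++ // safe: only this goroutine touches state
		}
		results <- state.count
	}()

	for n := range results {
		fmt.Println("rows affected:", n)
	}
}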
pkg/settings/registry.go
line 240 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
ah sorry for missing your question
yeah, that is a great point. for internal executor usages that are invoked by a background job, and not delegated by a user session, there is no session variable that could get configured ahead of time.
i agree with your suggestion to avoid a sql.defaults name. maybe we can add a new namespace like sql.internal_executor.rows_affected_retry_limit to cover future settings that might fall under this. since it's hidden, i don't feel strongly about the name
rest of the code lgtm!
Makes sense, thanks, renamed.
the new check also lgtm
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
pkg/settings/registry.go
line 240 at r1 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
Makes sense, thanks, renamed.
oh hm, i guess since this is all hidden, we may as well have only the cluster setting, and not have the session variable. i don't feel strongly; it was just an idea to reduce the amount of code, but if keeping the session variable feels natural, i'm fine with that.
This commit introduces a limit on the number of times the InternalExecutor machinery can retry errors in `Exec{Ex}` methods. That logic was introduced in c09860b in order to reduce the impact of fixes in other commits in cockroachdb#101477. However, there is no limit on the number of retries, and we hit the stack overflow twice in our own testing recently, seemingly in this code path. Thus, this commit adds a limit, 5 by default.

Note that I'm not sure exactly what bug, if any, can lead to the stack overflow. One seemingly-unlikely theory is that there is no bug, meaning that we were simply retrying forever because the stmts were pushed by higher-priority txns every time. Still, this change seems beneficial on its own and should prevent stack overflows even if we don't fully understand the root cause.

An additional improvement is that we now track the depth of recursion, and once it exceeds 1000, we'll return an error. This should prevent stack overflows due to other reasons.

There is no release note given we've only seen this twice in our own testing and it involved cluster-to-cluster streaming.

Release note: None
dfef014 to f36bb29
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @rafiss)
pkg/settings/registry.go
line 240 at r1 (raw file):
Previously, rafiss (Rafi Shamim) wrote…
oh hm, i guess since this is all hidden, we may as well have only the cluster setting, and not have the session variable. i don't feel strongly; it was just an idea to reduce the amount of code, but if keeping the session variable feels natural, i'm fine with that.
Oh, yeah, good point. I thought we didn't have access to the settings.SV object and was just following the example with overrides, but we do have it easily accessible. Kept only the cluster setting.
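[Editor's note: a minimal sketch of the final shape described here, reading the value off the settings object at the point of use instead of plumbing a session variable. The helper and setting names are hypothetical and reuse the registration sketch above; errors refers to github.com/cockroachdb/errors.]

// maybeRetry is a hypothetical helper showing where such a read could live;
// sv is the settings.Values referenced above.
func maybeRetry(sv *settings.Values, numRetries int) error {
	limit := internalExecutorRowsAffectedRetryLimit.Get(sv)
	if int64(numRetries) >= limit {
		return errors.Newf("internal executor hit the retry limit of %d", limit)
	}
	return nil
}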
Build succeeded.
This commit introduces a limit on the number of times the InternalExecutor machinery can retry errors in `Exec{Ex}` methods. That logic was introduced in c09860b in order to reduce the impact of fixes in other commits in #101477. However, there is no limit on the number of retries, and we hit the stack overflow twice in our own testing recently, seemingly in this code path. Thus, this commit adds a limit, 5 by default.

Note that I'm not sure exactly what bug, if any, can lead to the stack overflow. One seemingly-unlikely theory is that there is no bug, meaning that we were simply retrying forever because the stmts were pushed by higher-priority txns every time. Still, this change seems beneficial on its own and should prevent stack overflows even if we don't fully understand the root cause.

An additional improvement is that we now track the depth of recursion, and once it exceeds 1000, we'll return an error. This should prevent stack overflows due to other reasons.

There is no release note given we've only seen this twice in our own testing and it involved cluster-to-cluster streaming.
Fixes: #109197.
Release note: None
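[Editor's note: to make the mechanism described in the commit message concrete, below is a minimal, self-contained Go sketch of the two guards it mentions: a cap on transparent retries (5 by default) and a cap on recursion depth (1000). The names and error classification are hypothetical; this is a sketch of the idea, not the PR's implementation.]

package main

import (
	"errors"
	"fmt"
)

const (
	maxRetries        = 5    // assumed default retry limit from the commit message
	maxRecursionDepth = 1000 // recursion-depth guard from the commit message
)

// errRetriable stands in for the retriable errors the internal executor
// transparently retries (e.g. a statement pushed by a higher-priority txn).
var errRetriable = errors.New("retriable error")

// execInternal models retry-via-recursion: each retriable error triggers a
// recursive call, so both a retry counter and a recursion-depth counter are
// needed to guarantee termination instead of a stack overflow.
func execInternal(fn func() error, numRetries, depth int) error {
	if depth > maxRecursionDepth {
		return fmt.Errorf("exceeded recursion depth limit of %d", maxRecursionDepth)
	}
	err := fn()
	if err == nil || !errors.Is(err, errRetriable) {
		return err
	}
	if numRetries >= maxRetries {
		return fmt.Errorf("giving up after %d retries: %w", numRetries, err)
	}
	return execInternal(fn, numRetries+1, depth+1)
}

func main() {
	// A statement that is "pushed" every time: instead of recursing until the
	// stack overflows, execution now stops after maxRetries attempts.
	fmt.Println(execInternal(func() error { return errRetriable }, 0, 0))
}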