sql: re-execute distributed query as local for some errors #105451
Conversation
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)
pkg/sql/distsql_physical_planner.go
line 4590 at r2 (raw file):
) *PlanningCtx {
	distribute := distributionType == DistributionTypeAlways ||
		(distributionType == DistributionTypeSystemTenantOnly && evalCtx.Codec.ForSystemTenant())
	planCtx := dsp.newPlanningCtxForLocal(evalCtx, planner, localityFiler)
Why does this always build a local planning context, even when `distribute` is true?
pkg/sql/distsql_running.go
line 1238 at r2 (raw file):
r.status = execinfra.NeedMoreRows
r.closed = false
r.stats = stats
Please reset `r.progressAtomic`. Maybe something like:

atomic.StoreUint64(r.progressAtomic, 0)

Also, if `r.discardRows` is true, we could be retrying with a non-initialized egress counter, so maybe add something like this:

if r.egressCounter != nil {
	r.egressCounter = NewTenantNetworkEgressCounter()
}

It appears `discardRows` is only used for testing, but we'd still want the counter to be correct for tests.
pkg/sql/distsql_running.go
line 1963 at r2 (raw file):
// brand-new processors that aren't affiliated to the distributed plan
// that was just cleaned up. It's worth mentioning that the planNode
// tree couldn't haven been reused in this way, but if we needed to
nit: "couldn't haven" --> "couldn't have"
pkg/sql/distsql_running.go
line 1996 at r2 (raw file):
	return
}
finalizePlanWithRowCount(ctx, localPlanCtx, localPhysPlan, localPlanCtx.planner.curPlan.mainRowCount)
Would it be useful to check again for context cancellation here before calling `dsp.Run`, in case it happened while generating the new physical plan?

Also, please add the line:

recv.expectedRowsRead = int64(localPhysPlan.TotalEstimatedScannedRows)
This commit fixes the `json_populate_record` builtin in an edge case. In particular, this generator builtin calls `eval.PopulateRecordWithJSON`, which modifies the passed-in tuple in place, and right now the builtin passes the input tuple. This leads to modification of the Datum, which is not allowed. However, this is mostly a philosophical bug that doesn't lead to any actual issues, since from a single input tuple the builtin only generates a single output tuple. I noticed this problem when I tried to re-execute the distributed query as local, but the tuple was corrupted for that second local execution.

Release note: None
Force-pushed from 52bc423 to 336ec0f
Thanks for taking a look!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)
pkg/sql/distsql_running.go
line 1238 at r2 (raw file):
Previously, msirek (Mark Sirek) wrote…
Please reset `r.progressAtomic`. Maybe something like:

atomic.StoreUint64(r.progressAtomic, 0)

Also, if `r.discardRows` is true, we could be retrying with a non-initialized egress counter, so maybe add something like this:

if r.egressCounter != nil {
	r.egressCounter = NewTenantNetworkEgressCounter()
}

It appears `discardRows` is only used for testing, but we'd still want the counter to be correct for tests.
Nice catch, done. For `progressAtomic` I was somewhat deliberately being lazy (this progress tracking is broken in the vectorized engine due to #55758), but it's better to be thorough here.
pkg/sql/distsql_running.go
line 1996 at r2 (raw file):
Previously, msirek (Mark Sirek) wrote…
Would it be useful to check again for context cancellation here before calling `dsp.Run`, in case it happened while generating the new physical plan?

Also, please add the line:

recv.expectedRowsRead = int64(localPhysPlan.TotalEstimatedScannedRows)
Updated `expectedRowsRead` (I think it shouldn't matter since this field remains unchanged from the distributed run and I'd guess the local estimate would be the same, but in any case it's better to be explicit).
Checking context cancellation doesn't seem that useful - we don't do that in the distributed run, and this local execution will detect it sooner if the context has just been canceled during the plan generation.
pkg/sql/distsql_physical_planner.go
line 4590 at r2 (raw file):
Previously, msirek (Mark Sirek) wrote…
Why does this always build a local planning context, even when `distribute` is true?
I thought it would be more clear to have this helper method that sets up the planning context for local execution, and then the caller can extend it for distributed execution if necessary. It seems like it is confusing, so I removed the new helper in favor of using the existing constructor method.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)
Will this fix #102839 I wonder?
I think it'll fix it, Tommy. I'll wait for your review here too since this change deserves extra scrutiny.
@@ -1486,6 +1504,7 @@ func (r *DistSQLReceiver) Push(
	if commErr := r.resultWriterMu.row.AddRow(r.ctx, r.row); commErr != nil {
		r.handleCommErr(commErr)
	}
	r.dataPushed = true
What about the pushMeta early return path above?
// TestDistributedQueryErrorIsRetriedLocally verifies that if a query with a
// distributed plan results in a SQL retryable error, then it is rerun as local
// transparently.
func TestDistributedQueryErrorIsRetriedLocally(t *testing.T) {
Can we add a test that makes sure the cluster setting works to disable this?
Force-pushed from 336ec0f to ea85049
Thank you for updating your pull request. Before a member of our team reviews your PR, I have some potential action items for you:
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Force-pushed from ea85049 to 5214763
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)
pkg/sql/distsql_running.go
line 1507 at r4 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
What about the pushMeta early return path above?
I audited the code, and I think it'd be ok to ignore all metadata types for the purposes of the retry since it doesn't make it to the result writer; the only exception is this `MetadataResultWriter`, and users of that don't call `PlanAndRun`, so they aren't getting this automatic retry. That said, it seems like a good idea to consider any piece of non-error metadata as "data pushed" as long as it makes it to the result writer, so I adjusted the `dataPushed` boolean in that case. We need to exclude the error metadata because we know it'll be communicated when we want this retry to kick in, and we'll override it with the result of the local execution.
pkg/sql/distsql_running_test.go
line 980 at r4 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Can we add a test that makes sure the cluster setting works to disable this?
Done.
Force-pushed from 5214763 to 7e22342
	// successful local execution.
	return
}
log.VEventf(ctx, 1, "encountered an error when running the distributed plan, re-running it as local: %v", distributedErr)
Is this enough observability to see how frequently this is happening or should we have some telemetry too?
Force-pushed from 7e22342 to 1afdc09
This commit teaches the main query code path (i.e. ignoring sub- and post-queries) to retry distributed plans as local in some cases. In particular, we use this retry mechanism if:
- the error is SQL retryable (i.e. it'll have a high chance of success during the local execution)
- no data has been pushed to the result writer by the distributed query (this shouldn't be a frequent scenario since most SQL retryable errors are likely to occur during the plan setup / before any data can be produced by the query).

This retry mechanism allows us to hide transient network problems, and - more importantly - in the multi-tenant model it allows us to go around the problem when a "not ready" SQL instance is being used for DistSQL planning (e.g. the instance might have been brought down, but the cache on top of the system table hasn't been updated accordingly). I believe that no matter the improvements that we can make to the instance cache, there will also be a window (which should hopefully get smaller - according to David T it's currently 45s but he hopes to bring it down to 7s or so) during which the instance cache is stale, so the DistSQL planner could use "not ready" instances.

The rationale for why it is ok to do this retry is that we create brand-new processors that aren't affiliated to the distributed plan that was just cleaned up. It's worth mentioning that the planNode tree couldn't have been reused in this way, but if we needed to execute any planNodes directly, then we would have to run such a plan in a local fashion. In other words, the fact that we had a distributed plan initially guarantees that we don't have any planNodes to be concerned about.

A possible downside to this approach is that it increases overall query latency, so ideally we wouldn't plan on "not ready" instances in the first place (and we have issues to improve there), but given that we now fully parallelize the setup of distributed plans, the latency increase should be bounded, assuming that most retryable errors occur during the distributed plan setup.

Release note: None
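The retry condition in this commit message can be sketched as the following control flow. This is a minimal illustration under stated assumptions: `planAndRunWithLocalRetry`, the `receiver` type, `isSQLRetryableError`, and the function parameters are hypothetical stand-ins for the real DistSQL machinery, not CockroachDB's actual API:

```go
package main

import (
	"errors"
	"fmt"
)

// errRetryable stands in for a SQL retryable error (e.g. a transient
// network failure during distributed plan setup).
var errRetryable = errors.New("rpc error: connection refused")

// receiver is a simplified stand-in for the result receiver.
type receiver struct{ dataPushed bool }

func isSQLRetryableError(err error) bool { return errors.Is(err, errRetryable) }

// planAndRunWithLocalRetry runs the distributed plan and, if it fails
// with a retryable error before any data reached the result writer,
// transparently re-runs the query locally.
func planAndRunWithLocalRetry(recv *receiver,
	runDistributed, runLocally func(*receiver) error) error {
	err := runDistributed(recv)
	if err == nil {
		return nil
	}
	// Only retry when the error is retryable and nothing has been
	// pushed to the result writer yet.
	if !isSQLRetryableError(err) || recv.dataPushed {
		return err
	}
	return runLocally(recv)
}

func main() {
	recv := &receiver{}
	err := planAndRunWithLocalRetry(recv,
		func(r *receiver) error { return errRetryable }, // distributed run fails
		func(r *receiver) error { return nil },          // local rerun succeeds
	)
	fmt.Println(err)
}
```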
Force-pushed from 1afdc09 to a85dbd9
TFTRs!
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)
pkg/sql/distsql_running.go
line 1993 at r5 (raw file):
Previously, cucaroach (Tommy Reilly) wrote…
Is this enough observability to see how frequently this is happening or should we have some telemetry too?
Hm, not sure how we're going to use it, but why not, added a couple of counters.
Build succeeded:
Should we backport this to 23.1?
Seems a bit risky, what's the rationale?
I was suffering under the delusion that 23.1 had distsql turned on for the internal executor and that this was the cause of some 23.1 flakes, but I've recovered ;-)
In cockroachdb#105451, we added logic to locally retry a distributed query after an error. However, the retry logic unconditionally updated a field of `DistSQLReceiver` that may be nil, which could cause a nil-pointer error in some code paths (e.g. apply-join). This patch adds a check that the field is non-nil, as is done for other places where it is updated. There is no release note because the change has not yet made it into a release.

Fixes cockroachdb#111327

Release note: None
111713: sql: fix nil-pointer error in local retry r=DrewKimball a=DrewKimball

#### tree: return correct parse error for pg_lsn

This patch changes the error returned upon failing to parse a PG_LSN value to match postgres. Previously, the error was an internal error.

Informs #111327

Release note: None

#### sql: fix nil-pointer error in local retry

In #105451, we added logic to locally retry a distributed query after an error. However, the retry logic unconditionally updated a field of `DistSQLReceiver` that may be nil, which could cause a nil-pointer error in some code paths (e.g. apply-join). This patch adds a check that the field is non-nil, as is done for other places where it is updated. There is no release note because the change has not yet made it into a release.

Fixes #111327

Release note: None

112654: opt: fix inverted index constrained scans for equality filters r=mgartner a=mgartner

#### opt: fix inverted index constrained scans for equality filters

This commit fixes a bug introduced in #101178 that allowed the optimizer to generate inverted index scans on columns that are not filtered by the query. For example, an inverted index over the column `j1` could be scanned for a filter involving a different column, like `j2 = '5'`. The bug is caused by a simple omission of code that must check that the column in the filter is an indexed column.

Fixes #111963

There is no release note because this bug is not present in any releases.

Release note: None

#### randgen: generate single-column indexes more often

This commit makes `randgen` more likely to generate single-column indexes. It is motivated by the bug #111963, which surprisingly lived on the master branch for six months without being detected. It's not entirely clear why TLP or other randomized tests did not catch the bug, which has such a simple reproduction.

One theory is that indexes tend to be multi-column, and constrained scans on multi-column inverted indexes are not commonly planned for randomly generated queries because the set of requirements to generate the scan is very specific: the query must hold each prefix column constant, e.g. `a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating such an expression may be so low that the bug was not caught. By making 10% of indexes single-column, this bug may have been more likely to be caught because only the inverted index column needs to be constrained by an equality filter.

Release note: None

112690: sql: disallow invocation of procedures outside of CALL r=mgartner a=mgartner

#### sql: disallow invocation of procedures outside of CALL

This commit adds some missing checks to ensure that procedures cannot be invoked in any context besides as the root expression in `CALL` statements.

Epic: CRDB-25388

Release note: None

#### sql: add tests with function invocation in procedure argument

This commit adds a couple of tests that show that functions can be used in procedure argument expressions.

Release note: None

112698: sql: clarify comments/naming of descriptorChanged flag r=rafiss a=rafiss

fixes #110727

Release note: None

112701: sql/logictest: fix flakes in select_for_update_read_committed r=mgartner a=mgartner

The `select_for_update_read_committed` tests were flaking because not all statements were being run under READ COMMITTED isolation. The logic test infrastructure does not allow fine-grained control of sessions, and setting the isolation level in one statement would only apply to a single session. Subsequent statements are not guaranteed to run in the same session because they could run in any session in the connection pool. This commit wraps each statement in an explicit transaction with an explicit isolation level to ensure READ COMMITTED is used.

In the future, we should investigate allowing fine-grained and explicit control of sessions in logic tests.

Fixes #112677

Release note: None

112726: sql: make tests error if a leaf txn is not created when expected r=rharding6373 a=rharding6373

This adds a test-only error if a leaf transaction is expected to be used by a plan but a root transaction is used instead.

Epic: none

Informs: #111097

Release note: None

112767: log: fix stacktrace test goroutine counts r=rickystewart a=dhartunian

Previously, we would use the count of the string `goroutine ` as a proxy for the number of goroutines in the stacktrace. This stopped working in go 1.21 due to this change: golang/go@51225f6

We should consider using a stacktrace parser in the future.

Supports #112088

Epic: None

Release note: None

Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
This commit teaches the main query code path (i.e. ignoring sub- and
post-queries) to retry distributed plans as local in some cases. In
particular, we use this retry mechanism if:
- the error is SQL retryable (i.e. it'll have a high chance of success
  during the local execution)
- no data has been pushed to the result writer by the distributed query
  (this shouldn't be a frequent scenario since most SQL retryable errors
  are likely to occur during the plan setup / before any data can be
  produced by the query).
This retry mechanism allows us to hide transient network problems,
and - more importantly - in the multi-tenant model it allows us to go
around the problem when a "not ready" SQL instance is being used for
DistSQL planning (e.g. the instance might have been brought down, but
the cache on top of the system table hasn't been updated accordingly).
I believe that no matter the improvements that we can make to the
instance cache, there will also be a window (which should hopefully
get smaller - according to David T it's currently 45s but he hopes
to bring it down to 7s or so) during which the instance cache is stale,
so DistSQL planner could use "not ready" instances.
The rationale for why it is ok to do this retry is that we create
brand-new processors that aren't affiliated to the distributed plan
that was just cleaned up. It's worth mentioning that the planNode
tree couldn't have been reused in this way, but if we needed to
execute any planNodes directly, then we would have to run such a plan
in a local fashion. In other words, the fact that we had a
distributed plan initially guarantees that we don't have any
planNodes to be concerned about.
A possible downside to this approach is that it increases overall query
latency, so ideally we wouldn't plan on "not ready" instances in the
first place (and we have issues to improve there), but given that we now
fully parallelize the setup of distributed plans, the latency increase
should be bounded, assuming that most retryable errors occur during the
distributed plan setup.
Addresses: #100578.
Epic: None
Release note: None