
sql: re-execute distributed query as local for some errors #105451

Merged: 2 commits into cockroachdb:master on Jun 28, 2023

Conversation

yuzefovich (Member)

This commit teaches the main query code path (i.e. ignoring sub- and
post-queries) to retry distributed plans as local in some cases. In
particular, we use this retry mechanism (see the sketch after this
list) if:

  • the error is SQL retryable (i.e. it'll have a high chance of success
    during the local execution)
  • no data has been pushed to the result writer by the distributed query
    (this shouldn't be a frequent scenario since most SQL retryable errors
    are likely to occur during the plan setup / before any data can be
    produced by the query).
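For illustration, here is a minimal sketch of that gate. Everything in
it is a stand-in: the function and parameter names are hypothetical,
and the two inputs are assumed to be computed elsewhere (the error
classification and the receiver's "data pushed" flag).

// shouldRerunAsLocal sketches the two retry conditions listed above.
// isSQLRetryable: the error classification from the first bullet.
// dataPushed: whether the result writer has already received output.
func shouldRerunAsLocal(isSQLRetryable, dataPushed bool) bool {
	// Retry only when local execution has a high chance of success and
	// a transparent re-run cannot duplicate rows already sent to the
	// client.
	return isSQLRetryable && !dataPushed
}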

This retry mechanism allows us to hide transient network problems,
and - more importantly - in the multi-tenant model it allows us to
work around the problem of a "not ready" SQL instance being used for
DistSQL planning (e.g. the instance might have been brought down, but
the cache on top of the system table hasn't been updated accordingly).
I believe that no matter what improvements we make to the instance
cache, there will always be a window (which should hopefully keep
getting smaller - according to David T it's currently 45s, but he hopes
to bring it down to 7s or so) during which the instance cache is stale,
so the DistSQL planner could use "not ready" instances.

The rationale for why it is ok to do this retry is that we create
brand-new processors that aren't affiliated with the distributed plan
that was just cleaned up. It's worth mentioning that the planNode
tree couldn't have been reused in this way, but any plan that needs
to execute planNodes directly would have had to run locally in the
first place. In other words, the fact that we had a distributed plan
initially guarantees that we don't have any planNodes to be concerned
about.

A possible downside of this approach is that it increases overall query
latency, so ideally we wouldn't plan on "not ready" instances in the
first place (and we have open issues to improve there), but given that
we now fully parallelize the setup of distributed plans, the latency
increase should be bounded, assuming that most retryable errors occur
during the distributed plan setup.

Addresses: #100578.
Epic: None

Release note: None

@yuzefovich yuzefovich requested review from msirek, cucaroach and a team June 23, 2023 18:20
@yuzefovich yuzefovich requested review from a team as code owners June 23, 2023 18:20

@msirek (Contributor) left a comment

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @yuzefovich)


pkg/sql/distsql_physical_planner.go line 4590 at r2 (raw file):

) *PlanningCtx {
	distribute := distributionType == DistributionTypeAlways || (distributionType == DistributionTypeSystemTenantOnly && evalCtx.Codec.ForSystemTenant())
	planCtx := dsp.newPlanningCtxForLocal(evalCtx, planner, localityFiler)

Why does this always build a local planning context, even when distribute is true?


pkg/sql/distsql_running.go line 1238 at r2 (raw file):

	r.status = execinfra.NeedMoreRows
	r.closed = false
	r.stats = stats

Please reset r.progressAtomic. Maybe something like:
atomic.StoreUint64(r.progressAtomic, 0)

Also, if r.discardRows is true, we could be retrying with a non-initialized egress counter, so maybe add something like this:

		if r.egressCounter != nil {
			r.egressCounter = NewTenantNetworkEgressCounter()
		}

It appears discardRows is only used for testing, but we'd still want the counter to be correct for tests.
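Taken together, the suggested reset might look like the following
self-contained sketch (stub types throughout; the real DistSQLReceiver
fields and constructor live in pkg/sql and differ in detail):

package sketch

import "sync/atomic"

// egressCounter stands in for the tenant network egress counter.
type egressCounter struct{}

func newTenantNetworkEgressCounter() *egressCounter { return &egressCounter{} }

// receiver stands in for DistSQLReceiver; only the fields touched by
// the reset are shown.
type receiver struct {
	closed         bool
	progressAtomic *uint64
	egressCounter  *egressCounter
}

// resetForLocalRerun clears per-run state before the local retry.
func (r *receiver) resetForLocalRerun() {
	r.closed = false
	if r.progressAtomic != nil {
		atomic.StoreUint64(r.progressAtomic, 0) // progress restarts from zero
	}
	if r.egressCounter != nil {
		// Replace rather than reuse the counter so egress accounting
		// for the re-run starts fresh.
		r.egressCounter = newTenantNetworkEgressCounter()
	}
}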


pkg/sql/distsql_running.go line 1963 at r2 (raw file):

		// brand-new processors that aren't affiliated to the distributed plan
		// that was just cleaned up. It's worth mentioning that the planNode
		// tree couldn't haven been reused in this way, but if we needed to

nit: "couldn't haven" --> "couldn't have"


pkg/sql/distsql_running.go line 1996 at r2 (raw file):

			return
		}
		finalizePlanWithRowCount(ctx, localPlanCtx, localPhysPlan, localPlanCtx.planner.curPlan.mainRowCount)

Would it be useful to check again for context cancellation here before calling dsp.Run, in case it happened while generating the new physical plan?
Also, please add the line:
recv.expectedRowsRead = int64(localPhysPlan.TotalEstimatedScannedRows)

This commit fixes the `json_populate_record` builtin in an edge case. In
particular, this generator builtin calls `eval.PopulateRecordWithJSON`,
which modifies the passed-in tuple in place, and right now the builtin
passes the input tuple directly. This leads to modification of the Datum,
which is not allowed. However, this is mostly a philosophical bug that
doesn't lead to any actual issues, since from a single input tuple the
builtin only generates a single output tuple. I noticed this problem when
I tried to re-execute the distributed query as local, but the tuple was
corrupted for that second local execution.

Release note: None
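The shape of the fix, as a simplified sketch (datum stands in for
tree.Datum, and populate for eval.PopulateRecordWithJSON; neither is
the actual signature):

package sketch

type datum any

// populateRecord returns a populated copy of the input tuple instead
// of mutating it, so the original Datum stays intact if the query is
// re-executed.
func populateRecord(input []datum, populate func([]datum) error) ([]datum, error) {
	out := make([]datum, len(input))
	copy(out, input) // Datums must not be modified in place
	if err := populate(out); err != nil {
		return nil, err
	}
	return out, nil
}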
@yuzefovich (Member, Author) left a comment

Thanks for taking a look!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1238 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…


Nice catch, done. For progressAtomic I somewhat deliberately was being lazy (this progress tracking is broken in the vectorized engine due to #55758), but it's better to be thorough here.


pkg/sql/distsql_running.go line 1996 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…


Updated expectedRowsRead (I think it shouldn't matter since this field remains unchanged from the distributed run and I'd guess the local estimate would be the same, but in any case it's better to be explicit).

Checking context cancellation doesn't seem that useful - we don't do that in the distributed run, and this local execution will detect it sooner if the context has just been canceled during the plan generation.


pkg/sql/distsql_physical_planner.go line 4590 at r2 (raw file):

Previously, msirek (Mark Sirek) wrote…


I thought it would be more clear to have this helper method that sets up the planning context for local execution, and then the caller can extend it for distributed execution if necessary. It seems like it is confusing, so I removed the new helper in favor of using the existing constructor method.

@msirek (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @cucaroach)

@cucaroach (Contributor)

Will this fix #102839 I wonder?

@yuzefovich (Member, Author)

yuzefovich commented Jun 27, 2023

I think it'll fix select-job failures (where the internal executor uses DistSQL to run a query and it fails), but this failure is just wrong (perhaps it was another manifestation of the bug in the IE already fixed by #101477, although that seems unlikely).

Tommy, I'll wait for your review here too since this change deserves extra scrutiny.

@@ -1486,6 +1504,7 @@ func (r *DistSQLReceiver) Push(
if commErr := r.resultWriterMu.row.AddRow(r.ctx, r.row); commErr != nil {
r.handleCommErr(commErr)
}
r.dataPushed = true
@cucaroach (Contributor) left a comment

What about the pushMeta early return path above?

// TestDistributedQueryErrorIsRetriedLocally verifies that if a query with a
// distributed plan results in a SQL retryable error, then it is rerun as local
// transparently.
func TestDistributedQueryErrorIsRetriedLocally(t *testing.T) {
@cucaroach (Contributor) left a comment

Can we add a test that makes sure the cluster setting works to disable this?

blathers-crl bot commented Jun 27, 2023

Thank you for updating your pull request.

Before a member of our team reviews your PR, I have some potential action items for you:

  • We notice you have more than one commit in your PR. We try to break logical changes into separate commits, but commits such as "fix typo" or "address review comments" should be squashed into one commit and pushed with --force
  • When CI has completed, please ensure no errors have appeared.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@blathers-crl blathers-crl bot added the O-community Originated from the community label Jun 27, 2023
@yuzefovich yuzefovich removed the O-community Originated from the community label Jun 27, 2023
@yuzefovich (Member, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1507 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…


I audited the code, and I think it'd be ok to ignore all metadata types for the purposes of the retry, since metadata doesn't make it to the result writer; the only exception is MetadataResultWriter, and users of that don't call PlanAndRun, so they don't get this automatic retry. That said, it seems like a good idea to consider any piece of non-error metadata as "data pushed" as long as it makes it to the result writer, so I adjusted the dataPushed boolean in that case. We need to exclude the error metadata because we know it'll be communicated when we want this retry to kick in, and we'll override it with the result of the local execution.
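In sketch form, the adjusted metadata handling described above (all
names here are illustrative, not the PR's exact code):

// meta stands in for execinfrapb.ProducerMetadata.
type meta struct{ err error }

type metaWriter interface{ addMeta(m *meta) }

// pushMeta marks dataPushed only for non-error metadata that actually
// reaches the result writer; error metadata is excluded because the
// retry path will override it with the local run's outcome.
func pushMeta(mw metaWriter, m *meta, dataPushed *bool) {
	if m.err != nil {
		return // expected when the retry kicks in; will be overridden
	}
	if mw != nil {
		mw.addMeta(m)
		*dataPushed = true // a transparent re-run is no longer safe
	}
}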


pkg/sql/distsql_running_test.go line 980 at r4 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…


Done.
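The gating setting presumably looks something like this; the exact
name and default are assumptions following CockroachDB's settings
conventions, not necessarily what the PR registered:

package sketch

import "github.com/cockroachdb/cockroach/pkg/settings"

// rerunAsLocalEnabled gates the transparent local re-run of
// distributed queries that hit a retryable error.
var rerunAsLocalEnabled = settings.RegisterBoolSetting(
	settings.TenantWritable,
	"sql.distsql.distributed_query_rerun_as_local.enabled",
	"determines whether distributed queries are re-run locally on retryable errors",
	true,
)

A test can then flip it with SET CLUSTER SETTING before issuing the
query that would otherwise be retried.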

// successful local execution.
return
}
log.VEventf(ctx, 1, "encountered an error when running the distributed plan, re-running it as local: %v", distributedErr)
@cucaroach (Contributor) left a comment

Is this enough observability to see how frequently this is happening or should we have some telemetry too?

@yuzefovich (Member, Author) left a comment

TFTRs!

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @cucaroach and @msirek)


pkg/sql/distsql_running.go line 1993 at r5 (raw file):

Previously, cucaroach (Tommy Reilly) wrote…


Hm, not sure how we're going to use it, but why not; added a couple of counters.
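Presumably something along these lines, using the telemetry package
(the counter names are assumptions):

package sketch

import "github.com/cockroachdb/cockroach/pkg/server/telemetry"

var (
	// Incremented when a distributed error triggers a local re-run.
	rerunAsLocalAttempt = telemetry.GetCounterOnce("sql.distsql.rerun-as-local.attempt")
	// Incremented when the local re-run fails as well.
	rerunAsLocalFailure = telemetry.GetCounterOnce("sql.distsql.rerun-as-local.failure")
)

At the retry site, the code would call telemetry.Inc(rerunAsLocalAttempt),
and telemetry.Inc(rerunAsLocalFailure) if the local run errors too.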


craig bot commented Jun 28, 2023

Build succeeded.

@craig craig bot merged commit 06a051f into cockroachdb:master Jun 28, 2023
@cucaroach (Contributor)

Should we backport this to 23.1?

@yuzefovich (Member, Author)

Seems a bit risky, what's the rationale?

@cucaroach (Contributor)

Seems a bit risky, what's the rationale?

I was suffering under the delusion that 23.1 had distsql turned on for the internal executor and this was the cause of some 23.1 flakes, but I've recovered ;-)

DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 4, 2023
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 18, 2023
DrewKimball added a commit to DrewKimball/cockroach that referenced this pull request Oct 19, 2023
In cockroachdb#105451, we added logic to locally retry a distributed query
after an error. However, the retry logic unconditionally updated a
field of `DistSQLReceiver` that may be nil, which could cause a
nil-pointer error in some code paths (e.g. apply-join). This patch
adds a check that the field is non-nil, as is done for other places
where it is updated.

There is no release note because the change has not yet made it into
a release.

Fixes cockroachdb#111327

Release note: None
craig bot pushed a commit that referenced this pull request Oct 20, 2023
111713: sql: fix nil-pointer error in local retry r=DrewKimball a=DrewKimball

#### tree: return correct parse error for pg_lsn

This patch changes the error returned upon failing to parse a PG_LSN
value to match postgres. Previously, the error was an internal error.

Informs #111327

Release note: None

#### sql: fix nil-pointer error in local retry

In #105451, we added logic to locally retry a distributed query
after an error. However, the retry logic unconditionally updated a
field of `DistSQLReceiver` that may be nil, which could cause a
nil-pointer error in some code paths (e.g. apply-join). This patch
adds a check that the field is non-nil, as is done for other places
where it is updated.

There is no release note because the change has not yet made it into
a release.

Fixes #111327

Release note: None

112654: opt: fix inverted index constrained scans for equality filters r=mgartner a=mgartner

#### opt: fix inverted index constrained scans for equality filters

This commit fixes a bug introduced in #101178 that allows the optimizer
to generate inverted index scans on columns that are not filtered by
the query. For example, an inverted index over the column `j1` could be
scanned for a filter involving a different column, like `j2 = '5'`. The
bug is caused by a simple omission of code that must check that the
column in the filter is an indexed column.

Fixes #111963

There is no release note because this bug is not present in any
releases.

Release note: None

#### randgen: generate single-column indexes more often

This commit makes `randgen` more likely to generate single-column
indexes. It is motivated by the bug #111963, which surprisingly lived on
the master branch for six months without being detected. It's not
entirely clear why TLP or other randomized tests did not catch the bug,
which has such a simple reproduction.

One theory is that indexes tend to be multi-column and constrained scans
on multi-column inverted indexes are not commonly planned for randomly
generated queries because the set of requirements to generate the scan
are very specific: the query must hold each prefix column constant, e.g.
`a=1 AND b=2 AND j='5'::JSON`. The likelihood of randomly generating
such an expression may be so low that the bug was not caught.

By making 10% of indexes single-column, this bug may have been more
likely to be caught because only the inverted index column needs to be
constrained by an equality filter.

Release note: None


112690: sql: disallow invocation of procedures outside of CALL r=mgartner a=mgartner

#### sql: disallow invocation of procedures outside of CALL

This commit adds some missing checks to ensure that procedures cannot be
invoked in any context besides as the root expression in `CALL`
statements.

Epic: CRDB-25388

Release note: None

#### sql: add tests with function invocation in procedure argument

This commit adds a couple of tests that show that functions can be used
in procedure argument expressions.

Release note: None


112698: sql: clarify comments/naming of descriptorChanged flag r=rafiss a=rafiss

fixes #110727
Release note: None

112701: sql/logictest: fix flakes in select_for_update_read_committed r=mgartner a=mgartner

The `select_for_update_read_committed` tests were flaking because not
all statements were being run under READ COMMITTED isolation. The logic
test infrastructure does not allow fine-grained control of sessions, and
setting the isolation level in one statement would only apply to a
single session. Subsequent statements are not guaranteed to run in the
same session because they could run in any session in the connection
pool. This commit wraps each statement in an explicit transaction with
an explicit isolation level to ensure READ COMMITTED is used.

In the future, we should investigate allowing fine-grained and explicit
control of sessions in logic tests.

Fixes #112677

Release note: None


112726: sql: make tests error if a leaf txn is not created when expected r=rharding6373 a=rharding6373

This adds a test-only error if a leaf transaction is expected to be used by a plan but a root transaction is used instead.

Epic: none
Informs: #111097

Release note: None

112767: log: fix stacktrace test goroutine counts r=rickystewart a=dhartunian

Previously, we would use the count of the string `goroutine ` as a proxy for the number of goroutines in the stacktrace. This stopped working in go 1.21 due to this change:
golang/go@51225f6

We should consider using a stacktrace parser in the future.

Supports #112088

Epic: None
Release note: None

Co-authored-by: Drew Kimball <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Rafi Shamim <[email protected]>
Co-authored-by: rharding6373 <[email protected]>
Co-authored-by: David Hartunian <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Oct 20, 2023