Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: optimizer panics during index recommendation generation should not crash nodes #88246

Closed
mgartner opened this issue Sep 20, 2022 · 2 comments · Fixed by #91457
Closed

sql: optimizer panics during index recommendation generation should not crash nodes #88246

mgartner opened this issue Sep 20, 2022 · 2 comments · Fixed by #91457
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-sql-queries SQL Queries Team

Comments

@mgartner
Copy link
Collaborator

mgartner commented Sep 20, 2022

See #88037.

If the optimizer panics while generating index recommendations, a node can crash. Below is the crash log from #88037 that shows an optimizer panic crashing the node.

E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125  encountered internal error:
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +expected required columns to be a subset of output columns
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +(1) assertion failure
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +Wraps: (2) attached stack trace
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  -- stack trace:
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/ordering.projectBuildProvided
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/ordering/project.go:87
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/ordering.BuildProvided
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/ordering/ordering.go:66
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/xform.(*Optimizer).setLowestCostTree
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/xform/optimizer.go:778
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/xform.(*Optimizer).setLowestCostTree
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/xform/optimizer.go:765
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/xform.(*Optimizer).setLowestCostTree
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/xform/optimizer.go:765
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/opt/xform.(*Optimizer).Optimize
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/opt/xform/optimizer.go:266
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).makeQueryIndexRecommendation
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:745
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).buildExecMemo
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:570
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*planner).makeOptimizerPlan
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/plan_opt.go:233
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).makeExecPlan
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1432
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).dispatchToExecutionEngine
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:1058
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmtInOpenState
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:687
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt.func1
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:129
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execWithProfiling
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:2382
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execStmt
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor_exec.go:128
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd.func1
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1919
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).execCmd
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1923
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*connExecutor).run
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:1845
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql.(*Server).ServeConn
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/conn_executor.go:827
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | github.com/cockroachdb/cockroach/pkg/sql/pgwire.(*conn).processCommandsAsync.func1
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 	github.com/cockroachdb/cockroach/pkg/sql/pgwire/conn.go:728
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | runtime.goexit
E220916 06:44:56.948222 960 sql/sqltelemetry/report.go:57 ⋮ [n1,client=34.74.3.67:51630,user=root] 125 +  | 

Notice the line:

... | github.com/cockroachdb/cockroach/pkg/sql.(*optPlanningCtx).makeQueryIndexRecommendation

We need to correctly catch optimizer panics so that they do not crash nodes.

Jira issue: CRDB-19745

@mgartner mgartner added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss labels Sep 20, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Sep 20, 2022
@mgartner mgartner self-assigned this Sep 20, 2022
@rytaft
Copy link
Collaborator

rytaft commented Oct 27, 2022

We should address this soon.

@mgartner
Copy link
Collaborator Author

I think we need panic-catching code in makeQueryIndexRecommendation because functions like Factory.CopyAndReplace can panic. So the panic-catcher in Optimizer.Optimize isn't enough.

michae2 added a commit to michae2/cockroach that referenced this issue Nov 11, 2022
In the optimizer we have the convention of using panics for error
propagation, to avoid adding error handling code everywhere in deeply
recursive call stacks. This is safe because the optimizer is single-
threaded and not manipulating locks, shared state, etc. But it means
that wherever we call into the optimizer we need to add panic recovery.

We were duplicating the same basic block of panic recovery code in all
these places. Factor it into a "catch" function, somewhat similar to the
vectorized engine's `CatchVectorizedRuntimeError`.`

Assists: cockroachdb#88246

Release note: None
craig bot pushed a commit that referenced this issue Nov 11, 2022
91457: sql: catch panics in makeQueryIndexRecommendation r=DrewKimball,yuzefovich a=michae2

In the optimizer we use the convention of panic-based error handling for a performance advantage. This means at the boundaries we must remember to recover and return errors instead of letting panics escape.

`sql.(*optPlanningCtx).makeQueryIndexRecommendation` calls some optimizer helper functions in addition to calling
`xform.(*Optimizer).Optimize`. While `Optimize` has the panic-recovery code, these helper functions do not, so we need to add panic recovery to `makeQueryIndexRecommendation` as well.

Fixes: #88246

Release note: None

Co-authored-by: Michael Erickson <[email protected]>
@craig craig bot closed this as completed in e7871ea Nov 11, 2022
blathers-crl bot pushed a commit that referenced this issue Nov 11, 2022
In the optimizer we use the convention of panic-based error handling for
a performance advantage. This means at the boundaries we must remember
to recover and return errors instead of letting panics escape.

`sql.(*optPlanningCtx).makeQueryIndexRecommendation` calls some
optimizer helper functions in addition to calling
`xform.(*Optimizer).Optimize`. While `Optimize` has the panic-recovery
code, these helper functions do not, so we need to add panic recovery
to `makeQueryIndexRecommendation` as well.

Fixes: #88246

Release note: None
@mgartner mgartner moved this to Done in SQL Queries Jul 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. S-1 High impact: many users impacted, serious risk of high unavailability or data loss T-sql-queries SQL Queries Team
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants