opt: remove race-causing lazy calculation of properties #36148
I haven't investigated this at all, but is there some way we can rig this up to capture more info? Like having each goroutine running queries keep the last …
I don't think such a rig will help much, as it's probably pretty random when something like this triggers. It looks like a case of memory reuse gone wrong.
Is there even a way to get a callback into our own Go code after the data race detector triggers? It doesn't panic the program; it keeps on running. That is, even if we kept this log, how would we know when to print it? Alternatively, we could just always write queries and timestamps to files and then backtrack from the time the data race occurred to work out what was running.
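For what it's worth, the race detector can be made to stop the process on its first report with GORACE="halt_on_error=1"; combined with per-goroutine query logs that are flushed eagerly, the tail of each log would show what was running when the race fired. A minimal sketch, with all names hypothetical:

package main

import (
	"fmt"
	"os"
	"time"
)

// queryLogger appends timestamped statements to its own file, one logger
// per query goroutine, and flushes eagerly so the tail survives an exit.
type queryLogger struct{ f *os.File }

func newQueryLogger(id int) (*queryLogger, error) {
	f, err := os.Create(fmt.Sprintf("queries-%d.log", id))
	if err != nil {
		return nil, err
	}
	return &queryLogger{f: f}, nil
}

func (l *queryLogger) record(sql string) {
	fmt.Fprintf(l.f, "%s %s\n", time.Now().Format(time.RFC3339Nano), sql)
	l.f.Sync() // flush per statement: slow, but the log ends at the crash
}

func main() {
	l, err := newQueryLogger(1)
	if err != nil {
		panic(err)
	}
	defer l.f.Close()
	l.record("SELECT 1")
}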
@mjibson do you know what sha this was on?
We are lazily building the scalar props for …. In any case, the code path is through this case (in plan_opt.go:425), so we aren't using a cached memo:
I ran this last night on master.
My guess is that the race is caused by two queries getting planned on the same …
This code was reworked in #35776. I'm not sure if the problem remains.
I'll run the test again with that PR and see what happens.
After talking to Andrei and thinking about it some more, I don't think that PR would make a difference. This is where the executor goroutine is expected to be spawned from. The question is why these two instances would have anything in common with each other.
Indeed. I confirmed the race is still present after that PR.
If we had a normal goroutine dump from this time, we could compare the …
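As an aside, a dump in that form can be captured in-process with runtime/pprof (debug=2 prints goroutine stacks in the same format as an unrecovered panic), for example from a signal handler. A minimal sketch:

package main

import (
	"os"
	"runtime/pprof"
)

// dumpGoroutines writes a panic-style dump of all goroutine stacks to
// stderr; wiring it to a signal handler would let us grab a dump right
// after a race report appears in the logs.
func dumpGoroutines() {
	_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
}

func main() {
	dumpGoroutines()
}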
I've been trying to repro but had no luck. I wanted to try reproing with the diff below to see if we are indeed planning with the same memo:

diff --git a/pkg/sql/opt/memo/memo.go b/pkg/sql/opt/memo/memo.go
index 80b5015..b9df367 100644
--- a/pkg/sql/opt/memo/memo.go
+++ b/pkg/sql/opt/memo/memo.go
@@ -16,6 +16,8 @@ package memo
 import (
 	"context"
+	"fmt"
+	"sync/atomic"

 	"github.com/cockroachdb/cockroach/pkg/sql/opt"
 	"github.com/cockroachdb/cockroach/pkg/sql/opt/cat"
@@ -144,6 +146,21 @@ type Memo struct {
 	curID opt.ScalarID

 	// WARNING: if you add more members, add initialization code in Init.
+	useCount int32
+}
+
+func (m *Memo) Lock() {
+	val := atomic.AddInt32(&m.useCount, 1)
+	if val != 1 {
+		panic(fmt.Sprintf("already in use!! val: %d", val))
+	}
+}
+
+func (m *Memo) Unlock() {
+	val := atomic.AddInt32(&m.useCount, -1)
+	if val != 0 {
+		panic(fmt.Sprintf("unexpected value %d after unlock", val))
+	}
 }
diff --git a/pkg/sql/plan_opt.go b/pkg/sql/plan_opt.go
index e943177..ae73627 100644
--- a/pkg/sql/plan_opt.go
+++ b/pkg/sql/plan_opt.go
@@ -150,6 +150,9 @@ func (p *planner) makeOptimizerPlan(ctx context.Context) (_ *planTop, isCorrelat
 	opc := &p.optPlanningCtx
 	opc.reset()
+	m := opc.optimizer.Factory().Memo()
+	m.Lock()
+	defer m.Unlock()
 	execMemo, isCorrelated, err := opc.buildExecMemo(ctx)
 	if err != nil {
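The use count in this diff is deliberately a tripwire rather than a real lock: a mutex would serialize the two planners and hide the bug, while the atomic.AddInt32 check turns concurrent use of the same memo into an immediate panic with a stack trace at the point of the second entry.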
Still can't repro. Ran for 1hr on gceworker.
This only reproduced twice for me; one time was immediately on startup. Maybe run the test with a 1m timeout in a loop? Sometimes sqlsmith generates very long-running queries and all goroutines get stuck waiting for those. Running it in a loop might help things? Unclear.
Good call, I was able to hit it once. This was with some additional checks, so I ruled out that we're sharing planners.
I hit it again, and we are definitely not using the same memo. I think this race is due to the singletons like …. The race is innocuous, so this shouldn't be a problem in production.
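To illustrate why per-query memos don't help once interned singletons are involved, here is a minimal sketch (all names hypothetical, not the optimizer's real types): lazily computing properties mutates shared state on what is logically a read, so two planners with private memos still race. Run with go run -race to see the report:

package main

import "sync"

// constraintSet stands in for a lazily derived scalar property.
type constraintSet struct{ unconstrained bool }

// scalarProps holds derived properties; populated guards lazy init.
type scalarProps struct {
	populated   bool
	constraints *constraintSet
}

// scalarExpr stands in for a scalar operator.
type scalarExpr struct {
	props scalarProps
}

// trueSingleton is interned and shared by every memo, so two planners
// that each own a private memo still touch this one struct.
var trueSingleton = &scalarExpr{}

// lazyProps mimics lazy property calculation: the first reader writes.
func lazyProps(e *scalarExpr) *scalarProps {
	if !e.props.populated { // unsynchronized read
		e.props.constraints = &constraintSet{unconstrained: true} // unsynchronized write
		e.props.populated = true
	}
	return &e.props
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 2; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = lazyProps(trueSingleton) // both goroutines race here
		}()
	}
	wg.Wait()
}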
37972: opt: fix data race when building filters item props r=rytaft a=rytaft

This commit fixes a race condition where two threads could simultaneously try to build the logical properties of a filters item, stepping on each other's toes. In particular, one thread could set `scalar.Constraints` to nil, causing a panic when another thread tries to check `scalar.Constraints.IsUnconstrained()`. This commit fixes the issue by using a local variable to check whether the constraint set is unconstrained.

Fixes #37951
Informs #37073
Informs #36148

Release note (bug fix): Fixed a race condition that could cause a panic during query planning.

Co-authored-by: Rebecca Taft <[email protected]>
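The load-once-into-a-local pattern the commit describes, sketched here with hypothetical stand-in types: it removes the nil-pointer panic, though the underlying racy write remains until the properties are built eagerly (the follow-up change below).

package main

import "fmt"

// Hypothetical stand-ins for the optimizer types in the commit message.
type constraintSet struct{}

func (c *constraintSet) IsUnconstrained() bool { return true }

type scalarProps struct {
	Constraints *constraintSet
}

// racyCheck can panic: another goroutine may set p.Constraints to nil
// between the nil check and the method call (two loads of the field).
func racyCheck(p *scalarProps) bool {
	return p.Constraints != nil && p.Constraints.IsUnconstrained()
}

// safeCheck loads the pointer once into a local; the local cannot be
// changed by another goroutine, so the nil check stays valid.
func safeCheck(p *scalarProps) bool {
	constraints := p.Constraints
	return constraints != nil && constraints.IsUnconstrained()
}

func main() {
	p := &scalarProps{Constraints: &constraintSet{}}
	fmt.Println(racyCheck(p), safeCheck(p))
}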
See some discussion in #37974 (comment).
Previously, ListItem scalar operators had a method like this:

ScalarProps(mem *memo.Memo)

It lazily constructed the scalar properties when called. The problem was that the Memo had to be passed to many contexts simply so that ScalarProps could be called, which was inconvenient. In addition, there would be thread-safety issues if we ever called this method after the expression tree became immutable (e.g. after it has been added to the query cache).

To fix this, this patch constructs the ScalarProps eagerly rather than lazily. All locations that directly constructed ListItem structs now call into new Factory methods that both construct the structs and populate them with scalar properties. This way, the scalar properties are immutable and always available.

Fixes cockroachdb#36148

Release note: none
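A sketch of the shape of that change, with hypothetical stand-ins for the real types: the factory populates scalar props at construction time, so the finished item is never written during reads and ScalarProps no longer needs a Memo.

package main

import "fmt"

type scalarProps struct{ constrained bool }

// filtersItem is a list item whose props used to be built lazily.
type filtersItem struct {
	condition string
	scalar    *scalarProps
}

// Before: lazy construction. A read mutates the item, which is unsafe
// once the expression is shared (e.g. via a query cache).
func (f *filtersItem) lazyScalarProps() *scalarProps {
	if f.scalar == nil {
		f.scalar = &scalarProps{} // write during a logical read
	}
	return f.scalar
}

// After: the factory builds the item and its props in one step, so the
// item is immutable from then on and ScalarProps needs no memo argument.
type factory struct{}

func (factory) ConstructFiltersItem(cond string) filtersItem {
	return filtersItem{condition: cond, scalar: &scalarProps{constrained: true}}
}

func (f *filtersItem) ScalarProps() *scalarProps { return f.scalar }

func main() {
	item := factory{}.ConstructFiltersItem("x > 1")
	fmt.Println(item.ScalarProps().constrained)
}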
Found while running multiple sqlsmiths: