-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changefeedccl: add schema change events to nemeses #41842
changefeedccl: add schema change events to nemeses #41842
Conversation
c98c944
to
fbc668e
Compare
fbc668e
to
2f2497a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
high level first pass
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
eventMix: map[fsm.Event]int{ // eventOpenTxn opens an UPSERT or DELETE transaction. eventOpenTxn{}: 10,
why split this up? (not objecting, i just don't see offhand why it was necessary)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
// eventFeedMessage reads a message from the feed, or if the state machine // thinks there will be no message available, it falls back to // eventTransact.
nit: eventTransact is gone
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
// eventResume RESUMEs the changefeed. eventResume{}: 1,
ditto here, why split this up?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
} } else { if err := abort(fsm.Args{Ctx: ctx, Extended: ns}); err != nil {
abort is not quite the opposite of commit, we should add a rollback event too
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
if ns.currentTestColumnCount >= ns.maxTestColumnCount { // Do nothing if the table already has the maximum allowed columns.
bummer, i wonder if we can avoid generating this event by adding more state. IIRC there's some crazy pattern matching stuff in the fsm library that helps with the transition explosion, but I don't remember if it would be useful here. if that doesn't work out then i agree it's probably not worth the extra state
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
// Adding a column should trigger a full table scan. ns.db.QueryRow(`SELECT count(*) FROM foo`).Scan(&rows) ns.availableRows += rows
it should be twice this, right? the schema change itself does a kv backfill (which generates uninteresting rows at the changefeed level that we don't have a good way to filter) and then the changefeed runs a backfill after the schema change finishes
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 452 at r2 (raw file):
} if ns.currentTestColumnCount == 0 { // Do nothing if the table has no test columns.
ditto unfortunate
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 471 at r2 (raw file):
if ns.availableRows <= 0 { // No-op if there are no available rows to read.
if we're seeing this, it's because the availableRows accounting is wrong. we should probably look into that
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
ns := a.Extended.(*nemeses) defer func() { ns.txn = nil }() return ns.txn.Rollback()
hmm, i don't think we want to do this change? it was intentionally aborting a txn without its knowledge before
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
// table and then apply the row update. We do it this way to ensure that // the set of non-null columns in the scratch table is equal to the set of // columns in the original table.
I don't quite follow, do you have an example?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
// Ignore the fingerprint mismatch if there was an in-progress schema change job // on the table. // TODO(aayush): This is pretty hacky, how do we improve this?
it'd be nice to address this TODO before merge
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 291 at r2 (raw file):
func (eventFinished) Event() {} var compatibleStateTransitions = fsm.Pattern{
nit: just call this stateTransitions
?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 300 at r2 (raw file):
stateRunning{Paused: fsm.False, TxnOpen: fsm.False}: { eventPause{}: { Next: stateRunning{Paused: fsm.True, TxnOpen: fsm.False},
This comment is more for my own edification than asking for a change. I feel like we lack principle about which transitions we're making explicitly possible in the FSM vs checking inside the implementation. For example, we're keeping track of row counts and column counts inside the extended
state and then changing behavior inside of actions based on that extended state.
Moving more things into the state machine itself is heavy-weight due to the transition explosion but not moving it into the state machine makes the logic more implicit and harder to understand.
I wonder if it would be good to have some sort of mechanism to create predicates over the extended state and use them as pre-conditions for transitions.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 356 at r2 (raw file):
} var txnStateTransitions = fsm.Compile(compatibleStateTransitions)
nit: I know this is the old name but I don't love it. What about compiledStateTransitions
?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
if deleteID == noDeleteSentinel { if err := txn.QueryRow( `UPSERT INTO foo VALUES ((random() * $1)::int, cluster_logical_timestamp()::string) RETURNING id, ts`,
This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be cluster_logical_timestamp()
? Could it just be now()
?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 514 at r2 (raw file):
// Don't error out if we got pushed, but don't increment availableRows no // matter what error was hit. if strings.Contains(err.Error(), `restart transaction`) {
Just out of curiosity I wonder if there's a more principled way to check if this is a serializable restart. Maybe by unwrapping the error into something which could hand you an error code? I have no idea. This comment is mostly just my own curiosity.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
it'd be nice to address this TODO before merge
It's honestly surprising to me (in a good way) that this works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
Previously, ajwerner wrote…
This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be
cluster_logical_timestamp()
? Could it just benow()
?
it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, ajwerner wrote…
It's honestly surprising to me (in a good way) that this works.
yeah, ditto. the workaround is incredibly clever
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
why split this up? (not objecting, i just don't see offhand why it was necessary)
It seemed easier to reason about if we split transact
into an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
ditto here, why split this up?
This is just so I can avoid some code duplication. In nextEvent
I'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from this eventMix
map. Adding eventResume
here allows me to do the same with pause/resume without separate logic.
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I don't quite follow, do you have an example?
Imagine that we have 4 columns in foo
and 4 (non-null) columns in fprint
. Now there is a removeColumn
event that drops one of the columns in foo
. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simply UPSERT
them into fprint
because that won't nullify the 4th column. So, we instead do a transactional DELETE
, followed by an UPSERT
to apply every row update.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
yeah, ditto. the workaround is incredibly clever
We could go one step further here and make sure that the table in question is indeed foo
, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.
I explored ways to perhaps look at foo
's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.
Are there any other ideas I can explore?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I thought those "uninteresting" rows were being filtered out in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
It seemed easier to reason about if we split
transact
into an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.
wfm. I had this originally and for some reason switched it to this, but I can't remember why. As long as this doesn't cause any other compromised I prefer it to
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
This is just so I can avoid some code duplication. In
nextEvent
I'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from thiseventMix
map. AddingeventResume
here allows me to do the same with pause/resume without separate logic.
👍
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I thought those "uninteresting" rows were being filtered out in
cloudFeed.Next()
(assuming they're not being emitted at a new timestamp), is this incorrect?
Oh! Yes you're right. That's really a bug in testFeed, so sounds like a good reminder/TODO to add to this comment.
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Imagine that we have 4 columns in
foo
and 4 (non-null) columns infprint
. Now there is aremoveColumn
event that drops one of the columns infoo
. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simplyUPSERT
them intofprint
because that won't nullify the 4th column. So, we instead do a transactionalDELETE
, followed by anUPSERT
to apply every row update.
we can't UPSERT (id, ts, col1, col2, col3) VALUES (...)
?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
We could go one step further here and make sure that the table in question is indeed
foo
, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.I explored ways to perhaps look at
foo
's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.Are there any other ideas I can explore?
I'm missing some background. Why do the fingerprints not match?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
we can't
UPSERT (id, ts, col1, col2, col3) VALUES (...)
?
So if fprint
contains col1, col2, col3, col4
and we UPSERT (id, ts, col1, col2, col3) VALUES (...)
this only performs an update, and doesn't nullify col4
. This would cause the validation to fail, even though it shouldn't.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I'm missing some background. Why do the fingerprints not match?
Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the fprint
table still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.
The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
Could you explain why it's a bug? |
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
I see. Would we say that the resultant state has |
2f2497a
to
d56d02d
Compare
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file): Previously, danhhz (Daniel Harrison) wrote…
So there is a Let's say we add a boolean attribute to @ajwerner any ideas? |
d56d02d
to
ae651a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
So there is a
fsm.Var("some_variable_name")
that we can use for something like this. SeeeventPause
andeventResume
. The problem is that the fsm library only supports bools.Let's say we add a boolean attribute to
stateRunning
to keep track of whether test columns exist, call itTestColumnsExist
. TheeventRemoveColumn
, wouldn't always flip theTestColumnsExist
tofalse
. So the result state becomes non-deterministic.@ajwerner any ideas?
can we use fsm.Any somehow?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Could you explain why it's a bug?
the testFeeds are intended to remove (key,ts) duplicates, but I was lazy and removed (key, value) duplicates. see the numerous // TODO(dan): Track duplicates more precisely in sinklessFeed/tableFeed.
comments in TestChangefeedSchemaChangeAllowBackfill
pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
So if
fprint
containscol1, col2, col3, col4
and weUPSERT (id, ts, col1, col2, col3) VALUES (...)
this only performs an update, and doesn't nullifycol4
. This would cause the validation to fail, even though it shouldn't.
oh because fprint is a fixed number of columns, right. Then how about filling out all of col4...col10 with NULLs in the UPSERT?
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the
fprint
table still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.
I'm worried about the hack because we lose coverage in a really tricky spot (during schema change backfill) so it'd be nice to address it if we can.
I'm still trying to wrap my head around why these don't match. If the original table is at version 3 and hasn't had the column dropped, shouldn't the fprint table also still have the column, and thus a matching fingerprint.
Side node: I'm incredibly surprised that the fingerprint code is actually resilient to the two tables having a different number of columns (as long as the one with extra columns has them filled with NULLs?). I did not intend that when writing it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
can we use fsm.Any somehow?
We can make this work by providing variables in the event. For example we can include in eventAddColumn
and eventRemoveColumn
whether they will change the state.
For example:
type stateRunning struct {
...
CanRemoveColumn fsm.Bool
CanAddColumn fsm.Bool
}
type eventAddColumn struct {
CanAddColumnAfter fsm.Bool
}
type eventRemoveColumn struct {
CanRemoveColumnAfter fsm.Bool
}
Then in the transitions:
stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: {
eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: {
Next: stateRunning{
Paused: fsm.Var("paused"),
TxnOpen: fsm.False,
CanRemoveColumn: fsm.True,
CanAddColumn: fsm.Var("can_add_after"),
},
Action: logEvent(addColumn),
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, ajwerner wrote…
We can make this work by providing variables in the event. For example we can include in
eventAddColumn
andeventRemoveColumn
whether they will change the state.For example:
type stateRunning struct { ... CanRemoveColumn fsm.Bool CanAddColumn fsm.Bool } type eventAddColumn struct { CanAddColumnAfter fsm.Bool } type eventRemoveColumn struct { CanRemoveColumnAfter fsm.Bool }
Then in the transitions:
stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: { eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: { Next: stateRunning{ Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.True, CanAddColumn: fsm.Var("can_add_after"), }, Action: logEvent(addColumn), } }
This works because we can know when we create the event whether we'll reach the threshold
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r1, 1 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: eventTransact is gone
nit: eventOpenTxn or eventCommit
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
abort is not quite the opposite of commit, we should add a rollback event too
+1
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review
I don't have any memory of this, but it does seem likely that this is to prevent the transaction from being auto-retried when being pushed.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, ajwerner wrote…
This works because we can know when we create the event whether we'll reach the threshold
Yeah, that would work. It still might be too heavyweight to justify though.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I see. Would we say that the resultant state has
TxnOpen = false
still?
You mean TxnOpen = true
? I think we would leave the transaction open.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):
eventResume{}: 1, // eventCommit commits the outstanding transaction.
nit: do you mind grouping all of the transaction-related events together?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 142 at r3 (raw file):
return nil, err } // Now push everything to make sure the initial scan can complete, otherwise
Why were we able to get rid of this?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 143 at r3 (raw file):
for i := 0; i < ns.maxTestColumnCount; i++ { testCol := fmt.Sprintf(", test%d STRING DEFAULT NULL", i) createFprintStmtBuf.WriteString(testCol)
fmt.Fprintf
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):
} // Filter out `eventFinished` and restore at the end of the method.
Woah, this is super gross. If eventFinished
has a weight of 0, do we even need this?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 261 at r3 (raw file):
} // If there are no available rows, openTxn or commit outstanding txn instead of
nit: move this into the if _, ok := event.(eventFeedMessage); ok {
block and panic("unreachable")
below the for loop.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 264 at r3 (raw file):
// reading. if ns.availableRows < 1 { if ns.txn == nil {
Can't we get this from the state
? Seems cleaner.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 273 at r3 (raw file):
type stateRunning struct { Paused fsm.Bool
FeedPaused
?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 321 at r3 (raw file):
Action: logEvent(noteFeedMessage), }, eventSplit{}: {
I'm surprised we don't want to allow some of these while a transaction is ongoing.
pkg/ccl/changefeedccl/cdctest/validator.go, line 133 at r3 (raw file):
failures []string numColumns int
nit: we seem to keep failures []string
separate, so much this up to the first group of fields.
pkg/sql/lease.go, line 989 at r2 (raw file):
// because of its use of `singleflight.Group`. See issue #41780 for how this has // happened. newCtx, cancel := m.stopper.WithCancelOnQuiesce(logtags.WithTags(context.Background(), logtags.FromContext(ctx)))
WithCancelOnQuiesce
is a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?
pkg/util/fsm/fsm.go, line 145 at r3 (raw file):
} // GetExpandedStateTransitions returns the underlying expanded pattern map.
Doesn't this belong as a method on Transitions
?
ae651a0
to
2a6c897
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Woah, this is super gross. If
eventFinished
has a weight of 0, do we even need this?
Ah I didn't think of that. Done.
pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):
If the original table is at version 3 and hasn't had the column dropped
At version 3, the original table will have had its last column dropped, but fprint
would still have non-null values in that column. Those values for fprint
would get nullified once the backfill is complete. Hence the mismatch.
2a6c897
to
2297b9a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit:
eventOpenTxn or eventCommit
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
+1
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Yeah, that would work. It still might be too heavyweight to justify though.
I added this, I personally think this is pretty clean.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: do you mind grouping all of the transaction-related events together?
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 245 at r4 (raw file):
} log.Infof(context.TODO(), "curstate: %#v", state)
remove this debug line?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 269 at r4 (raw file):
// If there are no available rows, openTxn or commit outstanding txn instead // of reading. if ns.availableRows < 1 {
I don't think that you should do this as it's pretty heavy but we could avoid all of this special casing if we made the state machine aware of whether or not there were available rows. Unfortunately to do that we'd need to track even more things like probably turn openTransaction into explicit delete and upsert events and only delete if you have rows and well ... you get the picture.
pkg/sql/lease.go, line 989 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
WithCancelOnQuiesce
is a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?
The old context was the clients context. That was the bug. Dan and I didn't have any suggestions on what a good timeout would be but if these calls hang (due to say the cluster being unavailable) we'll hang on shutdown. How do you see with cancel on quiesce as being problematic / a footgun here?
f025137
to
dc8c39c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
You mean
TxnOpen = true
? I think we would leave the transaction open.
Done.
pkg/util/fsm/fsm.go, line 145 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Doesn't this belong as a method on
Transitions
?
I wanted to avoid reCompile
ing the state transitions map. I restructured this bit so its a little bit cleaner. Let me know if this still seems lacking.
fc03b87
to
ec538d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
unnecessary now right?
also, why can't we add/remove columns while a txn is open? seems like a nice case to have tested if we can
I believe a schema change will block forever if a transaction is open.
3964be4
to
2f69548
Compare
This is RFAL! |
2f69548
to
9cb3a13
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modulo the one substantive comment about fingerprint(row.updated.Prev())
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I believe a schema change will block forever if a transaction is open.
gotcha
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):
eventMix: map[fsm.Event]int{ // We don't want `eventFinished` to ever be returned by `nextEvent` so we set // it's weight to 0.
nit: its
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):
} var txnOpenBeforeInitialScan bool
nit: slightly more obvious to put txnOpenBeforeInitialScan := false
here
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
}, stateRunning{ FeedPaused: fsm.False,
any reason this one and the next can't work while feed is paused?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):
Action: logEvent(abort), }, eventRollback{}: {
group this with eventCommit
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):
} func push(a fsm.Args) error {
move this down right above/below abort
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):
return err } ns.availableRows += 2 * rows
comment plz, something out each row once for kv backfill once for changefeed backfill
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):
return err } ns.availableRows += 2 * rows
ditto comment here plz
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):
} func commit(a fsm.Args) error {
group this and rollback up under openTxn
pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):
// exist before calling this constructor. func NewFingerprintValidator( sqlDB *gosql.DB, origTable, fprintTable string, partitions []string, testColumns int,
testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc
pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):
} // If we have processed all the row updates belonging to a certain timestamp,
"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."
pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):
if len(v.buffer) == 0 || v.buffer[0].updated != row.updated { lastFingerprintedAt = row.updated if err := v.fingerprint(row.updated); err != nil {
I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?
8f05236
to
3a50e29
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: its
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: slightly more obvious to put
txnOpenBeforeInitialScan := false
here
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
any reason this one and the next can't work while feed is paused?
I'd disabled it because it made each individual run of the nemeses last substantially longer (up to like 2mins in some cases, as opposed to ~5s before), and there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
I suspect the reason it takes this takes much longer is because it now stays in FeedPaused=true
much longer (because eventAddColumn
and eventRemoveColumn
now contend with eventResume
in that state). We could tune the event weights to fix this but I'm not sure this is even a problem as of now. What do you think?
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
group this with eventCommit
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
move this down right above/below abort
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
comment plz, something out each row once for kv backfill once for changefeed backfill
Done
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
ditto comment here plz
Done.
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
group this and rollback up under openTxn
Done.
pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc
Changed the name and added some explanation around what that parameter is. Let me know if it seems lacking in any way.
pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."
Done
pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?
Done
This is ready for a final look. Please let me know if the way I've handled the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO
there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should
pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):
// If we've processed all row updates belonging to the previous row's timestamp, // we fingerprint at `updated.Prev()` since want to catch cases where one or more
nit: since we want
1576f7a
to
2ce3894
Compare
Currently, the changefeed nemeses system does not support schema changes. This has previously allowed some fairly critical bugs to creep past our testing. This PR adds support for schema change events to the changefeed nemeses. Release note: None.
2ce3894
to
c82d570
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO
there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?
all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should
It seems to be taking less time after increasing the weight for eventResume
, which makes sense.
pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):
Previously, danhhz (Daniel Harrison) wrote…
nit: since we want
Done.
pkg/sql/lease.go, line 989 at r2 (raw file):
Previously, ajwerner wrote…
In that case we should either use context.Background or hang a context off of the lease manager so there's only one that it tied to the stopper.
Going to send out a new PR to address this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to see this going!
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)
TFTRs! bors r+ |
41842: changefeedccl: add schema change events to nemeses r=aayushshah15 a=aayushshah15 Currently, the changefeed nemeses system does not support schema changes. This has previously allowed some fairly critical bugs to creep past our testing. This PR adds support for schema change events to the changefeed nemeses. Release note: None. Co-authored-by: Aayush Shah <[email protected]>
Build succeeded |
\o/ |
Currently, the changefeed nemeses system does not support schema
changes. This has previously allowed some fairly critical bugs to creep past our testing.
This PR adds support for schema change events to the changefeed nemeses.
Release note: None.