Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changefeedccl: add schema change events to nemeses #41842

Merged

Conversation

aayushshah15
Copy link
Contributor

Currently, the changefeed nemeses system does not support schema
changes. This has previously allowed some fairly critical bugs to creep past our testing.

This PR adds support for schema change events to the changefeed nemeses.

Release note: None.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from c98c944 to fbc668e Compare October 23, 2019 03:29
@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from fbc668e to 2f2497a Compare October 23, 2019 16:00
Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high level first pass

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):

		eventMix: map[fsm.Event]int{
			// eventOpenTxn opens an UPSERT or DELETE transaction.
			eventOpenTxn{}: 10,

why split this up? (not objecting, i just don't see offhand why it was necessary)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):

			// eventFeedMessage reads a message from the feed, or if the state machine
			// thinks there will be no message available, it falls back to
			// eventTransact.

nit: eventTransact is gone


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):

			// eventResume RESUMEs the changefeed.
			eventResume{}: 1,

ditto here, why split this up?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):

			}
		} else {
			if err := abort(fsm.Args{Ctx: ctx, Extended: ns}); err != nil {

abort is not quite the opposite of commit, we should add a rollback event too


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

	if ns.currentTestColumnCount >= ns.maxTestColumnCount {
		// Do nothing if the table already has the maximum allowed columns.

bummer, i wonder if we can avoid generating this event by adding more state. IIRC there's some crazy pattern matching stuff in the fsm library that helps with the transition explosion, but I don't remember if it would be useful here. if that doesn't work out then i agree it's probably not worth the extra state


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):

	// Adding a column should trigger a full table scan.
	ns.db.QueryRow(`SELECT count(*) FROM foo`).Scan(&rows)
	ns.availableRows += rows

it should be twice this, right? the schema change itself does a kv backfill (which generates uninteresting rows at the changefeed level that we don't have a good way to filter) and then the changefeed runs a backfill after the schema change finishes


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 452 at r2 (raw file):

	}
	if ns.currentTestColumnCount == 0 {
		// Do nothing if the table has no test columns.

ditto unfortunate


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 471 at r2 (raw file):

	if ns.availableRows <= 0 {
		// No-op if there are no available rows to read.

if we're seeing this, it's because the availableRows accounting is wrong. we should probably look into that


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):

	ns := a.Extended.(*nemeses)
	defer func() { ns.txn = nil }()
	return ns.txn.Rollback()

hmm, i don't think we want to do this change? it was intentionally aborting a txn without its knowledge before


pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):

	// table and then apply the row update. We do it this way to ensure that
	// the set of non-null columns in the scratch table is equal to the set of
	// columns in the original table.

I don't quite follow, do you have an example?


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

		// Ignore the fingerprint mismatch if there was an in-progress schema change job
		// on the table.
		// TODO(aayush): This is pretty hacky, how do we improve this?

it'd be nice to address this TODO before merge

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, and @danhhz)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 291 at r2 (raw file):

func (eventFinished) Event()     {}

var compatibleStateTransitions = fsm.Pattern{

nit: just call this stateTransitions?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 300 at r2 (raw file):

	stateRunning{Paused: fsm.False, TxnOpen: fsm.False}: {
		eventPause{}: {
			Next:   stateRunning{Paused: fsm.True, TxnOpen: fsm.False},

This comment is more for my own edification than asking for a change. I feel like we lack principle about which transitions we're making explicitly possible in the FSM vs checking inside the implementation. For example, we're keeping track of row counts and column counts inside the extended state and then changing behavior inside of actions based on that extended state.

Moving more things into the state machine itself is heavy-weight due to the transition explosion but not moving it into the state machine makes the logic more implicit and harder to understand.

I wonder if it would be good to have some sort of mechanism to create predicates over the extended state and use them as pre-conditions for transitions.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 356 at r2 (raw file):

}

var txnStateTransitions = fsm.Compile(compatibleStateTransitions)

nit: I know this is the old name but I don't love it. What about compiledStateTransitions?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):

	if deleteID == noDeleteSentinel {
		if err := txn.QueryRow(
			`UPSERT INTO foo VALUES ((random() * $1)::int, cluster_logical_timestamp()::string) RETURNING id, ts`,

This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be cluster_logical_timestamp()? Could it just be now()?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 514 at r2 (raw file):

		// Don't error out if we got pushed, but don't increment availableRows no
		// matter what error was hit.
		if strings.Contains(err.Error(), `restart transaction`) {

Just out of curiosity I wonder if there's a more principled way to check if this is a serializable restart. Maybe by unwrapping the error into something which could hand you an error code? I have no idea. This comment is mostly just my own curiosity.


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

it'd be nice to address this TODO before merge

It's honestly surprising to me (in a good way) that this works.

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):

Previously, ajwerner wrote…

This is interesting to me. Doesn't it mean that all updates are unpushable? Is there a reason that the timestamp needs to be cluster_logical_timestamp()? Could it just be now()?

it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, ajwerner wrote…

It's honestly surprising to me (in a good way) that this works.

yeah, ditto. the workaround is incredibly clever

Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

why split this up? (not objecting, i just don't see offhand why it was necessary)

It seemed easier to reason about if we split transact into an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

ditto here, why split this up?

This is just so I can avoid some code duplication. In nextEvent I'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from this eventMix map. Adding eventResume here allows me to do the same with pause/resume without separate logic.


pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I don't quite follow, do you have an example?

Imagine that we have 4 columns in foo and 4 (non-null) columns in fprint. Now there is a removeColumn event that drops one of the columns in foo. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simply UPSERT them into fprint because that won't nullify the 4th column. So, we instead do a transactional DELETE, followed by an UPSERT to apply every row update.


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

yeah, ditto. the workaround is incredibly clever

We could go one step further here and make sure that the table in question is indeed foo, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.

I explored ways to perhaps look at foo's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.

Are there any other ideas I can explore?

@aayushshah15
Copy link
Contributor Author


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

it should be twice this, right? the schema change itself does a kv backfill (which generates uninteresting rows at the changefeed level that we don't have a good way to filter) and then the changefeed runs a backfill after the schema change finishes

I thought those "uninteresting" rows were being filtered out in cloudFeed.Next() (assuming they're not being emitted at a new timestamp), is this incorrect?

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

It seemed easier to reason about if we split transact into an event that simply opens a transaction and then events that committed or rolled-back the transaction. I'm not super tied to this idea but I'd argue its a bit cleaner this way.

wfm. I had this originally and for some reason switched it to this, but I can't remember why. As long as this doesn't cause any other compromised I prefer it to


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 74 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

This is just so I can avoid some code duplication. In nextEvent I'm trying to use the state transitions map to get all the "compatible" state transitions and then randomly selecting one among them by using weights from this eventMix map. Adding eventResume here allows me to do the same with pause/resume without separate logic.

👍


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I thought those "uninteresting" rows were being filtered out in cloudFeed.Next() (assuming they're not being emitted at a new timestamp), is this incorrect?

Oh! Yes you're right. That's really a bug in testFeed, so sounds like a good reminder/TODO to add to this comment.


pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Imagine that we have 4 columns in foo and 4 (non-null) columns in fprint. Now there is a removeColumn event that drops one of the columns in foo. Now, the row updates that are streamed because of this will only contain 3 columns, but we can't simply UPSERT them into fprint because that won't nullify the 4th column. So, we instead do a transactional DELETE, followed by an UPSERT to apply every row update.

we can't UPSERT (id, ts, col1, col2, col3) VALUES (...)?


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

We could go one step further here and make sure that the table in question is indeed foo, but that would be a bit redundant since this is the only table in this test cluster that undergoes schema changes.

I explored ways to perhaps look at foo's table descriptor somehow and check if it has pending mutations, but that seems even worse. The challenging part is that any approach we pick here must disambiguate false negatives and "real" failures.

Are there any other ideas I can explore?

I'm missing some background. Why do the fingerprints not match?

Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

we can't UPSERT (id, ts, col1, col2, col3) VALUES (...)?

So if fprint contains col1, col2, col3, col4 and we UPSERT (id, ts, col1, col2, col3) VALUES (...) this only performs an update, and doesn't nullify col4. This would cause the validation to fail, even though it shouldn't.


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I'm missing some background. Why do the fingerprints not match?

Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the fprint table still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.

The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.

@aayushshah15
Copy link
Contributor Author


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

Oh! Yes you're right. That's really a bug in testFeed, so sounds like a good reminder/TODO to add to this comment.

Could you explain why it's a bug?

@aayushshah15
Copy link
Contributor Author


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

hmm, i don't think we want to do this change? it was intentionally aborting a txn without its knowledge before

I see. Would we say that the resultant state has TxnOpen = false still?

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from 2f2497a to d56d02d Compare October 23, 2019 22:09
@aayushshah15
Copy link
Contributor Author


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

bummer, i wonder if we can avoid generating this event by adding more state. IIRC there's some crazy pattern matching stuff in the fsm library that helps with the transition explosion, but I don't remember if it would be useful here. if that doesn't work out then i agree it's probably not worth the extra state

So there is a fsm.Var("some_variable_name") that we can use for something like this. See eventPause and eventResume. The problem is that the fsm library only supports bools.

Let's say we add a boolean attribute to stateRunning to keep track of whether test columns exist, call it TestColumnsExist. The eventRemoveColumn, wouldn't always flip the TestColumnsExist to false. So the result state becomes non-deterministic.

@ajwerner any ideas?

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from d56d02d to ae651a0 Compare October 23, 2019 22:17
Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

So there is a fsm.Var("some_variable_name") that we can use for something like this. See eventPause and eventResume. The problem is that the fsm library only supports bools.

Let's say we add a boolean attribute to stateRunning to keep track of whether test columns exist, call it TestColumnsExist. The eventRemoveColumn, wouldn't always flip the TestColumnsExist to false. So the result state becomes non-deterministic.

@ajwerner any ideas?

can we use fsm.Any somehow?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 441 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Could you explain why it's a bug?

the testFeeds are intended to remove (key,ts) duplicates, but I was lazy and removed (key, value) duplicates. see the numerous // TODO(dan): Track duplicates more precisely in sinklessFeed/tableFeed. comments in TestChangefeedSchemaChangeAllowBackfill


pkg/ccl/changefeedccl/cdctest/validator.go, line 216 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

So if fprint contains col1, col2, col3, col4 and we UPSERT (id, ts, col1, col2, col3) VALUES (...) this only performs an update, and doesn't nullify col4. This would cause the validation to fail, even though it shouldn't.

oh because fprint is a fixed number of columns, right. Then how about filling out all of col4...col10 with NULLs in the UPSERT?


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

Essentially, because of the fact that schema changes with backfill increment the table descriptor version 3 times (the last one happening after the backfill is "complete"). Lets say we're at table descriptor version 1, a schema change (with backfill) begins and it increments it to version 2, then to version 3 (but not to version 4, which it would do once the backfill is complete). If there is a resolved timestamp when the original table is at version 3, it would not contain the dropped (or added) column, but since the backfill has not finished yet, the fprint table still contains that about-to-be-dropped (or doesn't contain the newly added) column. So when we validate at this aforementioned resolved timestamp, we get a mismatch.

The "hack" here ignores the validation failure if it happened at a time where there was a pending schema change on the original table.

I'm worried about the hack because we lose coverage in a really tricky spot (during schema change backfill) so it'd be nice to address it if we can.

I'm still trying to wrap my head around why these don't match. If the original table is at version 3 and hasn't had the column dropped, shouldn't the fprint table also still have the column, and thus a matching fingerprint.

Side node: I'm incredibly surprised that the fingerprint code is actually resilient to the two tables having a different number of columns (as long as the one with extra columns has them filled with NULLs?). I did not intend that when writing it

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

can we use fsm.Any somehow?

We can make this work by providing variables in the event. For example we can include in eventAddColumn and eventRemoveColumn whether they will change the state.

For example:

type stateRunning struct {
    ...
    CanRemoveColumn fsm.Bool
    CanAddColumn fsm.Bool
}

type eventAddColumn struct {
    CanAddColumnAfter fsm.Bool
}

type eventRemoveColumn struct {
    CanRemoveColumnAfter fsm.Bool
}

Then in the transitions:

stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: {
    eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: { 
        Next:   stateRunning{
            Paused: fsm.Var("paused"), 
            TxnOpen: fsm.False, 
            CanRemoveColumn: fsm.True,
            CanAddColumn: fsm.Var("can_add_after"),
        },
        Action: logEvent(addColumn),
    } 
}

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, ajwerner wrote…

We can make this work by providing variables in the event. For example we can include in eventAddColumn and eventRemoveColumn whether they will change the state.

For example:

type stateRunning struct {
    ...
    CanRemoveColumn fsm.Bool
    CanAddColumn fsm.Bool
}

type eventAddColumn struct {
    CanAddColumnAfter fsm.Bool
}

type eventRemoveColumn struct {
    CanRemoveColumnAfter fsm.Bool
}

Then in the transitions:

stateRunning{Paused: fsm.Var("paused"), TxnOpen: fsm.False, CanRemoveColumn: fsm.Var("can_remove"), CanAddColumn: fsm.True}: {
    eventAddColumn{CanAddColumnAfter: fsm.Var("can_add_after")}: { 
        Next:   stateRunning{
            Paused: fsm.Var("paused"), 
            TxnOpen: fsm.False, 
            CanRemoveColumn: fsm.True,
            CanAddColumn: fsm.Var("can_add_after"),
        },
        Action: logEvent(addColumn),
    } 
}

This works because we can know when we create the event whether we'll reach the threshold

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r1, 1 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: eventTransact is gone

nit: eventOpenTxn or eventCommit


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

abort is not quite the opposite of commit, we should add a rollback event too

+1


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 403 at r2 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

it appears i neglected to write this down. @tbg @nvanbenschoten do you remember why we did this? I feel like an obviously unnecessary use of cluster_logical_timestamp() would have come up in code review

I don't have any memory of this, but it does seem likely that this is to prevent the transaction from being auto-retried when being pushed.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, ajwerner wrote…

This works because we can know when we create the event whether we'll reach the threshold

Yeah, that would work. It still might be too heavyweight to justify though.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I see. Would we say that the resultant state has TxnOpen = false still?

You mean TxnOpen = true? I think we would leave the transaction open.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):

			eventResume{}: 1,

			// eventCommit commits the outstanding transaction.

nit: do you mind grouping all of the transaction-related events together?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 142 at r3 (raw file):

		return nil, err
	}
	// Now push everything to make sure the initial scan can complete, otherwise

Why were we able to get rid of this?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 143 at r3 (raw file):

	for i := 0; i < ns.maxTestColumnCount; i++ {
		testCol := fmt.Sprintf(", test%d STRING DEFAULT NULL", i)
		createFprintStmtBuf.WriteString(testCol)

fmt.Fprintf


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):

	}

	// Filter out `eventFinished` and restore at the end of the method.

Woah, this is super gross. If eventFinished has a weight of 0, do we even need this?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 261 at r3 (raw file):

	}

	// If there are no available rows, openTxn or commit outstanding txn instead of

nit: move this into the if _, ok := event.(eventFeedMessage); ok { block and panic("unreachable") below the for loop.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 264 at r3 (raw file):

	// reading.
	if ns.availableRows < 1 {
		if ns.txn == nil {

Can't we get this from the state? Seems cleaner.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 273 at r3 (raw file):

type stateRunning struct {
	Paused  fsm.Bool

FeedPaused?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 321 at r3 (raw file):

			Action: logEvent(noteFeedMessage),
		},
		eventSplit{}: {

I'm surprised we don't want to allow some of these while a transaction is ongoing.


pkg/ccl/changefeedccl/cdctest/validator.go, line 133 at r3 (raw file):

	failures   []string
	numColumns int

nit: we seem to keep failures []string separate, so much this up to the first group of fields.


pkg/sql/lease.go, line 989 at r2 (raw file):

		// because of its use of `singleflight.Group`. See issue #41780 for how this has
		// happened.
		newCtx, cancel := m.stopper.WithCancelOnQuiesce(logtags.WithTags(context.Background(), logtags.FromContext(ctx)))

WithCancelOnQuiesce is a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?


pkg/util/fsm/fsm.go, line 145 at r3 (raw file):

}

// GetExpandedStateTransitions returns the underlying expanded pattern map.

Doesn't this belong as a method on Transitions?

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from ae651a0 to 2a6c897 Compare October 24, 2019 19:35
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 231 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Woah, this is super gross. If eventFinished has a weight of 0, do we even need this?

Ah I didn't think of that. Done.


pkg/ccl/changefeedccl/cdctest/validator.go, line 350 at r2 (raw file):

If the original table is at version 3 and hasn't had the column dropped

At version 3, the original table will have had its last column dropped, but fprint would still have non-null values in that column. Those values for fprint would get nullified once the backfill is complete. Hence the mismatch.

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from 2a6c897 to 2297b9a Compare October 24, 2019 19:40
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 67 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: eventOpenTxn or eventCommit

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 123 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

+1

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 429 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Yeah, that would work. It still might be too heavyweight to justify though.

I added this, I personally think this is pretty clean.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 76 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: do you mind grouping all of the transaction-related events together?

Done.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 245 at r4 (raw file):

	}

	log.Infof(context.TODO(), "curstate: %#v", state)

remove this debug line?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 269 at r4 (raw file):

			// If there are no available rows, openTxn or commit outstanding txn instead
			// of reading.
			if ns.availableRows < 1 {

I don't think that you should do this as it's pretty heavy but we could avoid all of this special casing if we made the state machine aware of whether or not there were available rows. Unfortunately to do that we'd need to track even more things like probably turn openTransaction into explicit delete and upsert events and only delete if you have rows and well ... you get the picture.


pkg/sql/lease.go, line 989 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

WithCancelOnQuiesce is a pretty big footgun and I'm not sure that we even need it. Was the old context being canceled on quiesce?

The old context was the clients context. That was the bug. Dan and I didn't have any suggestions on what a good timeout would be but if these calls hang (due to say the cluster being unavailable) we'll hang on shutdown. How do you see with cancel on quiesce as being problematic / a footgun here?

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch 3 times, most recently from f025137 to dc8c39c Compare October 24, 2019 20:14
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 529 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You mean TxnOpen = true? I think we would leave the transaction open.

Done.


pkg/util/fsm/fsm.go, line 145 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Doesn't this belong as a method on Transitions?

I wanted to avoid reCompileing the state transitions map. I restructured this bit so its a little bit cleaner. Let me know if this still seems lacking.

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from fc03b87 to ec538d3 Compare October 29, 2019 15:57
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

unnecessary now right?

also, why can't we add/remove columns while a txn is open? seems like a nice case to have tested if we can

I believe a schema change will block forever if a transaction is open.

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch 4 times, most recently from 3964be4 to 2f69548 Compare October 31, 2019 18:13
@aayushshah15
Copy link
Contributor Author

This is RFAL!

@aayushshah15 aayushshah15 self-assigned this Oct 31, 2019
@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from 2f69548 to 9cb3a13 Compare October 31, 2019 18:57
Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm_strong: modulo the one substantive comment about fingerprint(row.updated.Prev())

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 479 at r5 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I believe a schema change will block forever if a transaction is open.

gotcha


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):

		eventMix: map[fsm.Event]int{
			// We don't want `eventFinished` to ever be returned by `nextEvent` so we set
			// it's weight to 0.

nit: its


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):

	}

	var txnOpenBeforeInitialScan bool

nit: slightly more obvious to put txnOpenBeforeInitialScan := false here


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):

	},
	stateRunning{
		FeedPaused:      fsm.False,

any reason this one and the next can't work while feed is paused?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):

			Action: logEvent(abort),
		},
		eventRollback{}: {

group this with eventCommit


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):

}

func push(a fsm.Args) error {

move this down right above/below abort


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):

		return err
	}
	ns.availableRows += 2 * rows

comment plz, something out each row once for kv backfill once for changefeed backfill


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):

		return err
	}
	ns.availableRows += 2 * rows

ditto comment here plz


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):

}

func commit(a fsm.Args) error {

group this and rollback up under openTxn


pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):

// exist before calling this constructor.
func NewFingerprintValidator(
	sqlDB *gosql.DB, origTable, fprintTable string, partitions []string, testColumns int,

testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc


pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):

		}

		// If we have processed all the row updates belonging to a certain timestamp,

"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."


pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):

		if len(v.buffer) == 0 || v.buffer[0].updated != row.updated {
			lastFingerprintedAt = row.updated
			if err := v.fingerprint(row.updated); err != nil {

I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch 2 times, most recently from 8f05236 to 3a50e29 Compare November 1, 2019 19:16
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 63 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: its

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 182 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: slightly more obvious to put txnOpenBeforeInitialScan := false here

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

any reason this one and the next can't work while feed is paused?

I'd disabled it because it made each individual run of the nemeses last substantially longer (up to like 2mins in some cases, as opposed to ~5s before), and there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?

I suspect the reason it takes this takes much longer is because it now stays in FeedPaused=true much longer (because eventAddColumn and eventRemoveColumn now contend with eventResume in that state). We could tune the event weights to fix this but I'm not sure this is even a problem as of now. What do you think?


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 438 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

group this with eventCommit

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 505 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

move this down right above/below abort

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 573 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

comment plz, something out each row once for kv backfill once for changefeed backfill

Done


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 594 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

ditto comment here plz

Done.


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 639 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

group this and rollback up under openTxn

Done.


pkg/ccl/changefeedccl/cdctest/validator.go, line 142 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

testColumns is pretty vague. maybe something about variable table width? even if we don't end up with a better name, this deserves a mention in the godoc

Changed the name and added some explanation around what that parameter is. Let me know if it seems lacking in any way.


pkg/ccl/changefeedccl/cdctest/validator.go, line 356 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

"If any updates have exactly equal timestamps, we have to apply them all before fingerprinting."

Done


pkg/ccl/changefeedccl/cdctest/validator.go, line 360 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

I think you lost the fingerprint check for row.updated.Prev(). Without that, if we missed a row update we wouldn't catch it, right? Say k1 was written at t1, t2, t3. If we only output t1 and t3 would anything catch that right now?

Done

@aayushshah15
Copy link
Contributor Author

aayushshah15 commented Nov 1, 2019

This is ready for a final look. Please let me know if the way I've handled the row.updated.Prev() fingerprinting makes sense.

Copy link
Contributor

@danhhz danhhz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):
hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO

there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?

all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should


pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):

		// If we've processed all row updates belonging to the previous row's timestamp,
		// we fingerprint at `updated.Prev()` since want to catch cases where one or more

nit: since we want

@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch 3 times, most recently from 1576f7a to 2ce3894 Compare November 1, 2019 20:05
Currently, the changefeed nemeses system does not support schema
changes. This has previously allowed some fairly critical bugs to creep
past our testing.

This PR adds support for schema change events to the changefeed nemeses.

Release note: None.
@aayushshah15 aayushshah15 force-pushed the changefeed_schema_change_nemeses branch from 2ce3894 to c82d570 Compare November 1, 2019 20:28
Copy link
Contributor Author

@aayushshah15 aayushshah15 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)


pkg/ccl/changefeedccl/cdctest/nemeses.go, line 353 at r6 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

hmm yeah, i really do like it taking ~5s. 2m is far too long. what happens if you just give eventResume a really high weight? something like 100x what add/remove have? also totally fine just leaving this as a TODO

there didn't seem to be any reason why the feed being paused-unpaused would change anything wrt schema changes on the original table, maybe I'm wrong here?

all the bugs you've already found around schema changes were things we thought worked. my takeaway is that anything we can reasonably include in nemesis without diluting it, we should

It seems to be taking less time after increasing the weight for eventResume, which makes sense.


pkg/ccl/changefeedccl/cdctest/validator.go, line 362 at r7 (raw file):

Previously, danhhz (Daniel Harrison) wrote…

nit: since we want

Done.


pkg/sql/lease.go, line 989 at r2 (raw file):

Previously, ajwerner wrote…

In that case we should either use context.Background or hang a context off of the lease manager so there's only one that it tied to the stopper.

Going to send out a new PR to address this.

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Happy to see this going!

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @aayushshah15, @ajwerner, @danhhz, @nvanbenschoten, and @tbg)

@aayushshah15
Copy link
Contributor Author

TFTRs!

bors r+

craig bot pushed a commit that referenced this pull request Nov 1, 2019
41842: changefeedccl: add schema change events to nemeses r=aayushshah15 a=aayushshah15

Currently, the changefeed nemeses system does not support schema
changes. This has previously allowed some fairly critical bugs to creep past our testing. 

This PR adds support for schema change events to the changefeed nemeses.

Release note: None.

Co-authored-by: Aayush Shah <[email protected]>
@craig
Copy link
Contributor

craig bot commented Nov 1, 2019

Build succeeded

@craig craig bot merged commit c82d570 into cockroachdb:master Nov 1, 2019
@danhhz
Copy link
Contributor

danhhz commented Nov 4, 2019

\o/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants