Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ccl/changeedccl: Add changefeed options into nemesis tests #137947

Merged
merged 1 commit into from
Jan 14, 2025

Conversation

aerfrei
Copy link
Contributor

@aerfrei aerfrei commented Dec 23, 2024

This work makes sure our nemesis tests for changefeeds randomize
over the options we use upon changefeed creation. This randomly adds
the key_in_value option (see below) and full_table_name option half
of the time and checks that the changefeed messages respect them in
the beforeAfter validator.

Note the following limitations: the full_table_name option, when on,
asserts that the topic in the output will be d.public.{table_name}
instead of checking for the actual name of the database/schema.

This change also does not add the key_in_value option when for the
webhook and cloudstorage sinks. Even before this change, since
key_in_value is on by default for those sinks, we remove the key
from the value in those testfeed messages for ease of testing.
Unfortunately, this makes these cases hard to test, so we leave them
out for now.

See also: #134119

Epic: CRDB-42866

Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@aerfrei aerfrei force-pushed the aerin/nemeses-ftn_and_kiv branch from 570a234 to 08d12f6 Compare December 23, 2024 22:42
@aerfrei aerfrei force-pushed the aerin/nemeses-ftn_and_kiv branch from 08d12f6 to abdb56f Compare January 7, 2025 15:51
@aerfrei aerfrei marked this pull request as ready for review January 7, 2025 16:26
@aerfrei aerfrei requested review from a team as code owners January 7, 2025 16:26
@aerfrei aerfrei requested review from dt and wenyihu6 and removed request for a team January 7, 2025 16:26
Copy link
Contributor Author

@aerfrei aerfrei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some notes about some of the decisions I made here:

  • We choose to generate only valid changefeed options as opposed to randomly generating each option independently and retrying if they're not compatible with each other. This helps us a) make sure we have sufficient likelihood of getting a valid changefeed and b) we don't have to distinguish retryable errors from concerning errors

  • We changed the NoteRow signature since otherwise none of the NoteRow arguments contain the table name. Seems preferable to moving validation logic into noteFeedMessage. I don't see another viable option here.

  • Use testName in RunNemesis in order to determine the sink type. This was already being done to compute isSinkless, isCloudstorage in nemesis_test file (to disable `eventPause` in sinkless feeds and to randomly test parquet format for cloudstorage respectively) before this change. Now, that logic is moved into the RunNemesis function itself and to ensure we don’t set key_in_value for webhook and cloudstorage (as mentioned above). This seems messy but also seems to be the only way the sink type is available to us inside the test. The sink type will also be needed for other options that depend on the sink type: kafka_sink_config / pubsub_sink_config / webhook_sink_config, format (certain formats only compatible with certain sinks)

For this project more generally I'm keeping track of some known limitations here (these are mentioned in the initial comment).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @wenyihu6)

@aerfrei aerfrei force-pushed the aerin/nemeses-ftn_and_kiv branch 2 times, most recently from 3128c32 to c70abab Compare January 7, 2025 21:13
))
}
} else {
if topic != v.table {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

follow up: we can add more test coverage here for topic_name and when topic names are not table names

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does that happen? Are there certain settings or options that trigger that?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

// the key in the value is extracted and removed from the test feed
// messages (see extractKeyFromJSONValue function).
// TODO: enable testing key_in_value for cloudstorage and webhook sinks
KeyInValue: !isCloudstorage && !isWebhook && rand.Intn(2) < 1,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we link the limitation github issue you created here in side todo

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding that now

@@ -33,10 +79,15 @@ var NemesesOptions = []NemesesOption{
EnableFpValidator: false,
EnableSQLSmith: true,
},
{
EnableFpValidator: false,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this needed here already? I thought we wanted this just for initial_scan = no / only?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing this from the PR

@@ -95,7 +95,7 @@ func (sv *streamClientValidator) noteRow(
) error {
sv.mu.Lock()
defer sv.mu.Unlock()
return sv.NoteRow(partition, key, value, updated)
return sv.NoteRow(partition, key, value, updated, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

q: why is their topic empty?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a better value to put here but this is what I was thinking. 1) This test is doing something pretty different with the validators. I don't think we're in the context of a changefeed on a normal table as far as I can tell. I would like to understand what's happening here better. 2) This validator is not a before/after validator so it will not be using the topic value anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it seems fine leaving as it is but perhaps we can add another item to our follow up issue to add a new validator for its sole purposes being validating topic names and other things and choose to opt out that validator here. wdyt?

Copy link
Contributor Author

@aerfrei aerfrei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt and @wenyihu6)

// the key in the value is extracted and removed from the test feed
// messages (see extractKeyFromJSONValue function).
// TODO: enable testing key_in_value for cloudstorage and webhook sinks
KeyInValue: !isCloudstorage && !isWebhook && rand.Intn(2) < 1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding that now

@@ -33,10 +79,15 @@ var NemesesOptions = []NemesesOption{
EnableFpValidator: false,
EnableSQLSmith: true,
},
{
EnableFpValidator: false,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm removing this from the PR

))
}
} else {
if topic != v.table {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does that happen? Are there certain settings or options that trigger that?

@@ -95,7 +95,7 @@ func (sv *streamClientValidator) noteRow(
) error {
sv.mu.Lock()
defer sv.mu.Unlock()
return sv.NoteRow(partition, key, value, updated)
return sv.NoteRow(partition, key, value, updated, "")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a better value to put here but this is what I was thinking. 1) This test is doing something pretty different with the validators. I don't think we're in the context of a changefeed on a normal table as far as I can tell. I would like to understand what's happening here better. 2) This validator is not a before/after validator so it will not be using the topic value anyway.

@aerfrei aerfrei force-pushed the aerin/nemeses-ftn_and_kiv branch from c70abab to c9cafc0 Compare January 9, 2025 16:17
@aerfrei aerfrei requested a review from wenyihu6 January 13, 2025 15:23
))
}
} else {
if topic != v.table {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -95,7 +95,7 @@ func (sv *streamClientValidator) noteRow(
) error {
sv.mu.Lock()
defer sv.mu.Unlock()
return sv.NoteRow(partition, key, value, updated)
return sv.NoteRow(partition, key, value, updated, "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, it seems fine leaving as it is but perhaps we can add another item to our follow up issue to add a new validator for its sole purposes being validating topic names and other things and choose to opt out that validator here. wdyt?

Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have some drive-by minor nits. Nice work!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aerfrei, @dt, and @wenyihu6)


pkg/ccl/changefeedccl/cdctest/nemeses.go line 120 at r2 (raw file):

	eventPauseCount := 10

	// TODO(dan): Ugly hack to disable `eventPause` in sinkless feeds.

nit: Does this need the TODO? If so, is there an issue we can reference here instead?


pkg/ccl/changefeedccl/cdctest/nemeses.go line 253 at r2 (raw file):

	cfo := newChangefeedOption(testName)
	log.Infof(ctx, "Using changefeed options: %s", cfo.String())

optional nit: put the entire CREATE CHANGEFEED statement in a var, log it here, then use it below.


pkg/crosscluster/physical/replication_random_client_test.go line 98 at r2 (raw file):

	sv.mu.Lock()
	defer sv.mu.Unlock()
	return sv.NoteRow(partition, key, value, updated, "")

nit: add the parameter name in a comment following empty parameters, i.e. "" /* topic */


pkg/ccl/changefeedccl/cdctest/validator.go line 226 at r2 (raw file):

) error {
	if v.fullTableName {
		// TODO: fetch the actual database and schema name for the full table name

For TODOs, please either add an issue # for future follow-up or, if expected to fast follow, add your name. We discourage plain TODOs since it's harder to find the context or follow up on them.

This work makes sure our nemesis tests for changefeeds randomize
over the options we use upon changefeed creation. This randomly adds
the key_in_value option (see below) and full_table_name option half
of the time and checks that the changefeed messages respect them in
the beforeAfter validator.

Note the following limitations: the full_table_name option, when on,
asserts that the topic in the output will be d.public.{table_name}
instead of checking for the actual name of the database/schema.

This change also does not add the key_in_value option when for the
webhook and cloudstorage sinks. Even before this change, since
key_in_value is on by default for those sinks, we remove the key
from the value in those testfeed messages for ease of testing.
Unfortunately, this makes these cases hard to test, so we leave them
out for now.

See also: cockroachdb#134119

Epic: CRDB-42866

Release note: None
Copy link
Contributor Author

@aerfrei aerfrei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/cdctest/nemeses.go line 120 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

nit: Does this need the TODO? If so, is there an issue we can reference here instead?

I'm removing the TODO in favor of just having this issue. #139035. I didn't want to overload us with issues but it seems like that's a good way to keep track of some of these loose ends. Does that sound right?


pkg/ccl/changefeedccl/cdctest/validator.go line 226 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

For TODOs, please either add an issue # for future follow-up or, if expected to fast follow, add your name. We discourage plain TODOs since it's harder to find the context or follow up on them.

Removing this TODO in favor of keeping track of this with an issue as well. #139033


pkg/crosscluster/physical/replication_random_client_test.go line 98 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

nit: add the parameter name in a comment following empty parameters, i.e. "" /* topic */

Done, thanks!

Copy link
Contributor Author

@aerfrei aerfrei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dt, @rharding6373, and @wenyihu6)


pkg/ccl/changefeedccl/cdctest/nemeses.go line 253 at r2 (raw file):

Previously, rharding6373 (Rachael Harding) wrote…

optional nit: put the entire CREATE CHANGEFEED statement in a var, log it here, then use it below.

Done.

@aerfrei aerfrei force-pushed the aerin/nemeses-ftn_and_kiv branch from c9cafc0 to 7d9b214 Compare January 14, 2025 15:55
Copy link
Collaborator

@rharding6373 rharding6373 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aerfrei, @dt, and @wenyihu6)


pkg/ccl/changefeedccl/cdctest/nemeses.go line 120 at r2 (raw file):

Previously, aerfrei (Aerin Freilich) wrote…

I'm removing the TODO in favor of just having this issue. #139035. I didn't want to overload us with issues but it seems like that's a good way to keep track of some of these loose ends. Does that sound right?

You could add TODO(#139035): <short text here about what needs to change> so this piece of code is searchable using the issue. Same with the TODO you added below.

If we ever feel like we're overloading ourselves with issues, we can find a way to adjust this policy.

@aerfrei
Copy link
Contributor Author

aerfrei commented Jan 14, 2025

bors r=wenyihu6

@craig craig bot merged commit 9d7b5cb into cockroachdb:master Jan 14, 2025
22 checks passed
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jan 17, 2025
This patch skips the cdc/bank test, as it has been broken since cockroachdb#137947. This
patch disables the test while we work on the fix.

Release note: none
Epic: none
wenyihu6 added a commit to wenyihu6/cockroach that referenced this pull request Jan 17, 2025
This patch skips the cdc/bank test, as it has been broken since cockroachdb#137947. This
patch disables the test while we work on the fix.

Release note: none
Epic: none
craig bot pushed a commit that referenced this pull request Jan 17, 2025
137743: stop: grow stacks for async tasks r=tbg a=tbg

Surprisingly, this seems to have helped with alloc/op, but not with CPU time, at least in the microbenchmark.

```
name                                   old time/op    new time/op    delta
Sysbench/SQL/3node/oltp_read_write-24    11.6ms ± 1%    11.6ms ± 2%    ~     (p=0.201 n=14+15)

name                                   old alloc/op   new alloc/op   delta
Sysbench/SQL/3node/oltp_read_write-24    2.20MB ± 3%    2.16MB ± 1%  -1.88%  (p=0.047 n=15+12)

name                                   old allocs/op  new allocs/op  delta
Sysbench/SQL/3node/oltp_read_write-24     11.0k ± 2%     10.8k ± 0%    ~     (p=0.069 n=15+12)
```

End-to-end testing via the `sysbench-settings` roachtest (10 runs) shows a significant improvement in qps[^1]:

  | max | mean | median | % change max | % change mean | % change median
-- | -- | -- | -- | -- | -- | --
old | 23673.17 | 22775.40 | 22684.17 |   |   |  
new | 23955.21 | 23154.03 | 23256.25 | 1.19% | 1.66% | 2.52%

[^1]: https://docs.google.com/spreadsheets/d/1QUBZHllhk5CtDfcsJqc5eOIqUO4jVdhoYXMWj4CRQaQ/edit?gid=0#gid=0

Closes #130663.

Epic: CRDB-43584

139327: roachtest/cdc: skip cdc/bank r=rharding6373 a=wenyihu6

This patch skips the cdc/bank test, as it has been broken since #137947. This
patch disables the test while we work on the fix.

Informs: #139109
Release note: none
Epic: none

Co-authored-by: Tobias Grieger <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
craig bot pushed a commit that referenced this pull request Jan 17, 2025
138372: roachtest: task error groups r=DarrylWong a=herkolategan

Previously, the wait call on a task group would not return anything and any
error, that occurred in a task, would be reported to the test framework and fail
the test.

This change adds functionality to allow test authors more control and create an
error group if it is necessary to wait on errors, and not have the test
framework handle the errors directly.

A context can also now be specified to override the default context passed by
the test framework for the task manager.

Informs: #118214

Epic: None
Release note: None

139327: roachtest/cdc: skip cdc/bank r=rharding6373 a=wenyihu6

This patch skips the cdc/bank test, as it has been broken since #137947. This
patch disables the test while we work on the fix.

Informs: #139109
Release note: none
Epic: none

Co-authored-by: Herko Lategan <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
craig bot pushed a commit that referenced this pull request Jan 17, 2025
139026: licences: update THIRD-PARTY-NOTICES.txt r=celiala a=rail

Fixes: REL-1744
Release note: None

139192: roachtest: disable 4K block size intent resolution test r=pav-kv a=andrewbaptist

With a 4K block size, the intent resolution test will cause unacceptable slowdowns. This commit changes the perturbation/*/intents to only test up to 1024 block sizes.

This also reduces the max perturbation duration to 10 minutes.

Informs: #139187
Informs: #139188
Informs: #137590 
Informs: #135934
Fixes: #137590

Epic: none

Release note: None

139200: roachprod: Create/Destroy should avoid listing _all_ providers r=golgeek,RaduBerinde a=srosenberg

Previously, both `Create` and `Destroy` would attempt to list VMs across _all_ active cloud providers. Not only is it inefficient, but `Create` may also fail if
the user isn't re-authenticated to an unrelated
provider. E.g., `create --clouds gce` may fail
if AWS SSO token expired.

This change derives the set of required providers
from either the user-specified `--clouds` CLI option or the local cluster cache. The set is then used with `ListCloud` to skip listing unrelated providers.
The use of the local cache is sound at this time;
cluster's providers are immutable.

Epic: none

Release note: None

139246: ui: show job messages on the job detail page r=dt a=dt

Release note (ui change): Jobs can now choose to emit messages that are shown on the job detail page in 25.1+.

Epic: none.

<img width="1513" alt="Screenshot 2025-01-16 at 14 46 12" src="https://github.com/user-attachments/assets/1718ccdd-e8a1-4ec2-a032-a646f410f918" />


139327: roachtest/cdc: skip cdc/bank r=rharding6373 a=wenyihu6

This patch skips the cdc/bank test, as it has been broken since #137947. This
patch disables the test while we work on the fix.

Informs: #139109
Release note: none
Epic: none

Co-authored-by: Rail Aliiev <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: Stan Rosenberg <[email protected]>
Co-authored-by: David Taylor <[email protected]>
Co-authored-by: Wenyi Hu <[email protected]>
blathers-crl bot pushed a commit that referenced this pull request Jan 17, 2025
This patch skips the cdc/bank test, as it has been broken since #137947. This
patch disables the test while we work on the fix.

Release note: none
Epic: none
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants