Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

logictestccl: Fix regional_by_row_query_behavior flake #87391

Merged
merged 1 commit into from
Sep 18, 2022

Conversation

msirek
Copy link
Contributor

@msirek msirek commented Sep 5, 2022

Fixes #86284

This fixes a test flake in regional_by_row_query_behavior which is
caused by 2 issues:

  1. Zone configs not becoming available in SystemConfig between
    table creation time and query execution time.

  2. Metamorphic testing modify the coldata-batch-size setting,
    causing a different number of kv reads to be done, which affects
    the kv trace results in this test.

The flake can be reproduced via the command:

./dev testlogic ccl --files=regional_by_row_query_behavior \
                    --config=multiregion-9node-3region-3azs --stress

For point 1, the retry logic test directive is enhanced to purge the
zone config cache in the SystemConfig when a failing test is retried,
allowing the updated zone config to be read.

For point 2, the logic test blocking flag !metamorphic which resets
batch size related settings that were modified in metamorphic mode back
to their default production values, is updated to also cover the
coldata-batch-size setting. The flag is renamed to
!metamorphic-batch-sizes to better reflect its actual effect.

Release justification: low risk fix for test flake

Release note: None

@msirek msirek requested review from yuzefovich, michae2 and a team September 5, 2022 21:53
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing!

How did you determine the sleep time?

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @yuzefovich)


pkg/ccl/logictestccl/testdata/logic_test/regional_by_row_query_behavior line 2463 at r1 (raw file):

SET vectorize = "on"

statement ok nodeidx=0

why was this removed?

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw the following which appears to address similar issues:

# Set the closed timestamp interval to be short to shorten the amount of time
# we need to wait for the system config to propagate.
statement ok
SET CLUSTER SETTING kv.closed_timestamp.side_transport_interval = '10ms';
statement ok
SET CLUSTER SETTING kv.closed_timestamp.target_duration = '10ms';

It's using a 10ms time, so I tried running this as a stress test with 1 second sleep, which passed. I ran into a similar issue recently, which required adding a 5s sleep for stress tests to pass, but that test was lacking the above settings.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


pkg/ccl/logictestccl/testdata/logic_test/regional_by_row_query_behavior line 2463 at r1 (raw file):

Previously, rytaft (Rebecca Taft) wrote…

why was this removed?

In case the other node we're logging into has some additional latency in receiving the config. This part is not essential to the test. Node 0 might be the default anyway.

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just tried running stress tests for an extended period, and did hit the error with a 1s sleep. Retesting with 2s...

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead use retry option on a new query that checks that the zone configs are as expected? It might be worth pinging @irfansharif to see whether he has suggestions for the best way to wait until the expected state of system config is reached.

Reviewed all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. Upon further debugging it appears the problem actually stems from logic tests not honoring the !metamorphic test directive, which I'm currently investigating.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we instead use retry option on a new query that checks that the zone configs are as expected?

I believe the zone configs are cached, so if we do the lookup before they're available we'll cache a config with no lease preferences. We might need a test directive which retries and purges cached configs.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the fix and description to address both the unavailable zone configs problem and changing query plans in metamorphic mode.

RFAL

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2 and @rytaft)

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)


-- commits line 12 at r2:
My instinct is to extend the existing metamorphic directive of forcing the production batch sizes to also force the usage of the LOS if it is the only metamorphic variable that is causing the flakes. The rationale for this is that we don't lose the test coverage whereas if we skip the test under metamorphic build, we'll be skipping it 80% of the time (since each build is metamorphic with 0.8 probability in tests), so it's more likely that the flakes will be merged.

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


-- commits line 12 at r2:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

My instinct is to extend the existing metamorphic directive of forcing the production batch sizes to also force the usage of the LOS if it is the only metamorphic variable that is causing the flakes. The rationale for this is that we don't lose the test coverage whereas if we skip the test under metamorphic build, we'll be skipping it 80% of the time (since each build is metamorphic with 0.8 probability in tests), so it's more likely that the flakes will be merged.

OK, it sounds like you're not in favor of any directive which would allow skipping of a logic test in metamorphic mode. To force the usage of LOS we'd have to reduce the costs of it in the optimizer, so maybe a new session flag could control that or maybe a new flag in eval.TestingKnobs. The current metamorphic directive is confusingly-named. Recent changes to the logic tests, in function processConfigs added a comment saying that flag checks "whether the test works only in non-metamorphic setting", and I too thought at first the directive avoids use of any metamorphic settings or skips the test in metamorphic mode. I changed the name to metamorphic-batch-sizes. Maybe I need to change the name again if we're going to make this set a "force LOS" setting. Or, maybe a new directive should be created.

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @michae2, @rytaft, and @yuzefovich)


-- commits line 12 at r2:

Previously, msirek (Mark Sirek) wrote…

OK, it sounds like you're not in favor of any directive which would allow skipping of a logic test in metamorphic mode. To force the usage of LOS we'd have to reduce the costs of it in the optimizer, so maybe a new session flag could control that or maybe a new flag in eval.TestingKnobs. The current metamorphic directive is confusingly-named. Recent changes to the logic tests, in function processConfigs added a comment saying that flag checks "whether the test works only in non-metamorphic setting", and I too thought at first the directive avoids use of any metamorphic settings or skips the test in metamorphic mode. I changed the name to metamorphic-batch-sizes. Maybe I need to change the name again if we're going to make this set a "force LOS" setting. Or, maybe a new directive should be created.

OK, I thought it was related to LOS, but it turns out it was really the coldata-batch-size setting. I've pushed an update.

@msirek msirek force-pushed the 86284 branch 2 times, most recently from 6c70033 to 59703b5 Compare September 8, 2022 14:41
@msirek msirek requested a review from a team September 8, 2022 14:41
Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for figuring this out! :lgtm:

Reviewed 6 of 11 files at r2, 9 of 9 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @michae2, @msirek, and @rytaft)


-- commits line 21 at r4:
It might be good to get a quick LGTM on this from @irfansharif.


pkg/config/system.go line 413 at r3 (raw file):

	s.mu.RLock()
	if len(s.mu.zoneCache) != 0 {
		s.mu.zoneCache = map[ObjectID]zoneEntry{}

nit: not sure how important this is, but we can optimize the allocations a bit by clearing the old map and reusing it with

for k := range m {
  delete(m, k)
}

In other parts of the codebase I've seen comments that this pattern translates to very efficient runtime.mapclear.

Now that I've typed it out, the optimization is probably not worth it here, so feel free to ignore the comment - I decided to keep this in just as an FYI.


pkg/config/system.go line 418 at r3 (raw file):

		s.mu.shouldSplitCache = map[ObjectID]bool{}
	}
	s.mu.RUnlock()

nit: we tend to unlock in a defer in cases like this.


pkg/sql/logictest/logic.go line 3929 at r3 (raw file):

	if serverArgsCopy.ForceProductionValues {
		if err := coldata.SetBatchSizeForTests(coldata.DefaultColdataBatchSize); err != nil {
			panic(errors.Wrapf(err, "Could not set batch size for test."))

nit: I think somewhere in our style guide we say to not use capital letters nor periods for the error messages (unless they contain multiple sentences).

@msirek msirek requested a review from irfansharif September 8, 2022 14:51
Fixes cockroachdb#86284

This fixes a test flake in regional_by_row_query_behavior which is
caused by 2 issues:

  1. Zone configs not becoming available in `SystemConfig` between
     table creation time and query execution time.

  2. Metamorphic testing modify the `coldata-batch-size` setting,
     causing a different number of kv reads to be done, which affects
     the kv trace results in this test.

The flake can be reproduced via the command:
```
./dev testlogic ccl --files=regional_by_row_query_behavior \
                    --config=multiregion-9node-3region-3azs --stress
```
For point 1, the `retry` logic test directive is enhanced to purge the
zone config cache in the SystemConfig when a failing test is retried,
allowing the updated zone config to be read.

For point 2, the logic test blocking flag `!metamorphic` which resets
batch size related settings that were modified in metamorphic mode back
to their default production values, is updated to also cover the
`coldata-batch-size` setting. The flag is renamed to
`!metamorphic-batch-sizes` to better reflect its actual effect.

Release justification: low risk fix for test flake

Release note: None
Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @michae2, @rytaft, and @yuzefovich)


-- commits line 21 at r4:

Previously, yuzefovich (Yahor Yuzefovich) wrote…

It might be good to get a quick LGTM on this from @irfansharif.

Thanks, added him as a reviewer.


pkg/config/system.go line 413 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: not sure how important this is, but we can optimize the allocations a bit by clearing the old map and reusing it with

for k := range m {
  delete(m, k)
}

In other parts of the codebase I've seen comments that this pattern translates to very efficient runtime.mapclear.

Now that I've typed it out, the optimization is probably not worth it here, so feel free to ignore the comment - I decided to keep this in just as an FYI.

OK, thanks. I'll leave as-is then. Thanks for the tip.


pkg/config/system.go line 418 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: we tend to unlock in a defer in cases like this.

Modified as suggested

Code quote:

PurgeZoneConfigCache() {

pkg/sql/logictest/logic.go line 3929 at r3 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

nit: I think somewhere in our style guide we say to not use capital letters nor periods for the error messages (unless they contain multiple sentences).

fixed

Code quote:

panic(errors.Wrapf(err, "Could not set batch size for test."))

Copy link
Contributor Author

@msirek msirek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR!
bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @michae2, @rytaft, and @yuzefovich)

@craig
Copy link
Contributor

craig bot commented Sep 18, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants