roachtest: multitenant-upgrade failed #69920
In tenant 11's logs:
so it took about five minutes to migrate, well in excess of the 45s the test restricts it to. I think the test is just using a vanilla SucceedsSoon, which of course isn't appropriate now that we have actual long-running migrations in that path. Nevertheless, it seems to mostly pass, so something was extraordinarily slow this time around. The vast majority of the time (think: all of it) is spent here:
1114 is the fix_descriptor_migration.go migration. Here's tenant 11's log grepped for 1114:
It sure seems to take long breaks. Will assign to @cockroachdb/sql-experience to investigate further, as they seem to own the migration, best I can tell. What I'm noticing is this recurring error message in tenant 11's logs:
@RaduBerinde is that something to worry about or does the message's severity need to be reduced? Could this contribute to the slowdown? |
The message suggests that the tenant_usage table doesn't exist on the host, maybe something's not set up right in the test? In any case, we only recently (1-2 days ago) merged tenant-side throttling so there shouldn't have been any related slowdown in the test. |
The only nonstandard thing the tenant does is to upgrade the host cluster. Tenant 11 is created when the predecessor is running. But I actually got something wrong in my initial analysis. I don't think tenant 11 matters here. The timeout is on the host cluster upgrade: cockroach/pkg/cmd/roachtest/tests/multitenant_upgrade.go Lines 321 to 325 in 642e44a
so I need to look at its logs. |
In the cockroach logs:
The only slow movement there is from 1112 to 1113 (fence), and it's not that slow: 9s. However, it seems to take 32s from the "test status" log line to the first signs of the migration on the cluster. It's also relying on the auto-upgrade loop here, though, which has a sensible MaxBackoff of 30s, so this isn't surprising either: cockroach/pkg/server/auto_upgrade.go Lines 32 to 37 in dbc8676
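For reference, here is a plain-Go sketch of an exponential-backoff loop shaped like the one referenced above (everything other than the 30s cap is an assumption rather than the exact code at that commit); the point is that once the backoff reaches its 30s ceiling, a single pause between attempts is enough to account for the ~32s gap:

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	const (
		initialBackoff = 1 * time.Second
		maxBackoff     = 30 * time.Second // the MaxBackoff mentioned above
		multiplier     = 2
	)

	backoff := initialBackoff
	for attempt := 0; attempt < 6; attempt++ {
		if upgradeAllowed() {
			fmt.Println("starting auto upgrade on attempt", attempt)
			return
		}
		// Worst case, this single sleep is a full 30s pause between checks.
		time.Sleep(backoff)
		backoff *= multiplier
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	fmt.Println("giving up for now; the real loop in auto_upgrade.go keeps retrying")
}

// upgradeAllowed stands in for the real check of whether the cluster
// version can be finalized.
func upgradeAllowed() bool { return false }
```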
So I think we need to just bump how long the test is willing to wait. I still think the tenant 11 logs merit a look from sql-experience as this migration does seem very slow in this case. @RaduBerinde tenant 11 is upgraded before the host cluster, so the host cluster will be at 21.1 and the tenant will run |
The auto-upgrade retry loop on the host cluster alone can eat 30s of SucceedsSoon budget in at least one place in the test, plus the migrations can be "long-running". We're not putting lots of data into this cluster, so it can't really be "that" long-running, but 45s seems pretty tight either way. Give it a few minutes to complete instead. Touches cockroachdb#69920. Leaving the issue open since there was a question about why a descriptor migration on the tenant was taking a few minutes. Release note: None
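A minimal sketch of the shape of that change, with a hand-rolled helper and a simple version check as stand-ins (the real test uses cockroach's testutils, whose SucceedsSoon budget is the 45s referenced above, and its own readiness conditions):

```go
package tests

import (
	"database/sql"
	"fmt"
	"testing"
	"time"
)

// succeedsWithin retries fn until it succeeds or the deadline passes. It
// plays the role SucceedsSoon plays in the test, but with an explicit,
// larger budget instead of the default ~45s.
func succeedsWithin(t *testing.T, fn func() error, d time.Duration) {
	t.Helper()
	deadline := time.Now().Add(d)
	var err error
	for time.Now().Before(deadline) {
		if err = fn(); err == nil {
			return
		}
		time.Sleep(time.Second)
	}
	t.Fatalf("condition not met within %s: %v", d, err)
}

// waitForClusterVersion is a hypothetical stand-in for the check the test
// performs after kicking off the host-cluster upgrade.
func waitForClusterVersion(t *testing.T, db *sql.DB, want string) {
	succeedsWithin(t, func() error {
		var v string
		if err := db.QueryRow("SHOW CLUSTER SETTING version").Scan(&v); err != nil {
			return err
		}
		if v != want {
			return fmt.Errorf("cluster version is %s, waiting for %s", v, want)
		}
		return nil
	}, 5*time.Minute) // a few minutes rather than the 45s default
}
```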
I looked at the migration; I couldn't reproduce it taking 5 minutes on master (even after changing the timeout back to 45 seconds). The thing that stands out to me is that it looks like at time
|
roachtest.multitenant-upgrade failed with artifacts on master @ cc6296c24ddb048215dabe5cc41339f306db4f41:
Same failure on other branches
|
This just failed over on the 21.2 branch as well with the same error. Odd. |
roachtest.multitenant-upgrade failed with artifacts on master @ 7e4ba61845bb47cc2d7146d1bbd70fd53eff5457:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ fb67c7f24a05673bf0bb06f72a4c697679d1fff4:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ afb9f8dd5c4b66c39a15dc16e1b9ca07c09816bd:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ fffd9274a986fcbff880b9d503334f117ba17515:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ de5c5d344e6c27b29d0ac9e5503a762aad67fb52:
Same failure on other branches
|
We definitely broke something here: the test fails at cockroach/pkg/cmd/roachtest/multitenant_upgrade.go Lines 246 to 250 in 6b22b74
Last good build: https://teamcity.cockroachdb.com/viewLog.html?buildId=3544338 @ aef9aa0
First bad build: https://teamcity.cockroachdb.com/viewLog.html?buildId=3550313 @ cc6296c
That's only a few commits.
af5a5a5 is of particular interest, as it introduces the line the failure arises from. I'd bet some amount of money that the problem is how we stop the tenant server: cockroach/pkg/cmd/roachtest/multitenant_upgrade.go Lines 75 to 85 in 6b22b74
That hardly seems foolproof. I'll send a PR to use the |
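For illustration only (the actual PR is cut off above, and this is not the roachtest code): one more robust pattern is to keep a handle to the specific tenant process the test started and stop it through that handle, rather than matching on a process name after the fact:

```go
package main

import (
	"context"
	"os/exec"
	"time"
)

// startTenant launches a long-running process and returns a stop function
// that is tied to that specific process.
func startTenant(ctx context.Context, name string, args ...string) (stop func() error, err error) {
	ctx, cancel := context.WithCancel(ctx)
	cmd := exec.CommandContext(ctx, name, args...)
	if err := cmd.Start(); err != nil {
		cancel()
		return nil, err
	}
	return func() error {
		cancel() // kills this process, and only this process
		// Wait reaps the process; an error is expected here since we just
		// killed it, so callers can ignore "signal: killed".
		return cmd.Wait()
	}, nil
}

func main() {
	// "sleep 600" is a stand-in for `cockroach mt start-sql ...`.
	stop, err := startTenant(context.Background(), "sleep", "600")
	if err != nil {
		panic(err)
	}
	time.Sleep(time.Second)
	_ = stop()
}
```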
roachtest.multitenant-upgrade failed with artifacts on master @ 5753803308f9b2e0c3f3b61e77a96782b99c15f3:
Same failure on other branches
|
How is this not a release blocker? |
Could you elaborate? Are you commenting on the original test failure or the recent flakiness which is likely due to an inefficiency in the test? |
I guess the recent flakiness. I'm anxious that we haven't been tracking skipped and flaky tests with nearly the zeal of earlier releases. I clicked through the email and commented out of panic given the whole upgrade debacle yesterday. Sorry for the noise given your active investigation. What is interesting about "first bad build" is that it has PRs which rely on the borked migration we fixed yesterday with #71492. Perhaps this test was a clue that we had a problem. For what it's worth, that issue was egregious and would have popped out immediately if we had ever actually looked at a cluster which we upgraded. I miss the days of |
I looked into this more just now and I don't blame the
manually, the tenant server appears to get past that point, and I can execute SQL through it. |
roachtest.multitenant-upgrade failed with artifacts on master @ 24d632ccabb7d03b887d399e705f7e1b6b9f7435:
Same failure on other branches
|
In cockroachdb#71040 we added disk spilling, which in particular added the following call to the `mt start-sql` code path: https://github.com/cockroachdb/cockroach/blob/af5a5a5065ce80c5e6568b4b422bf5c3a179e173/pkg/cli/mt_start_sql.go#L90-L89

The tenant doesn't support the `--store` flag, and so this will always be the default of `cockroach-data`. This has the unfortunate effect of trying to clean up the temp dirs for that directory, even if `--temp-dir` is supplied: https://github.com/cockroachdb/cockroach/blob/6999e5fded43f59eb5839dc8b943fd1e2a33a3fd/pkg/cli/start.go#L223-L227

In the `multitenant-upgrade` roachtest, as it happens, there was actually a cockroach host instance running under `cockroach-data`, and so the tenant would fail when trying to remove its (locked) temp dirs. This commit fixes that issue by making start-sql use an in-memory store spec.

This fixes the test flake, but I wonder whether the temp storage feature for tenants is working properly. I worry about this because the concept of a "temp engine" always seems to require a store: https://github.com/cockroachdb/cockroach/blob/6999e5fded43f59eb5839dc8b943fd1e2a33a3fd/pkg/cli/start.go#L274-L280

and I am not sure how deep this goes. Frankly, I don't understand why, if you are providing a Path, you also need to provide a StoreSpec. I will leave untangling this to @knz and @jaylim-crl, who know this code better than I do.

Fixes cockroachdb#69920. Release note: None
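A hypothetical illustration of the failure mode that commit message describes (the helper, directory naming, and lock-file convention are assumptions, not the cockroach implementation): temp-dir cleanup keyed off a shared default store path trips over directories owned by another live process:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// defaultStorePath stands in for the value the tenant falls back to when
// --store is not (and cannot be) supplied.
const defaultStorePath = "cockroach-data"

// cleanTempDirs removes leftover temp directories found under a store path.
// A directory containing a lock file is treated as owned by a live process,
// which is roughly what the tenant server tripped over in the roachtest.
func cleanTempDirs(storePath string) error {
	dirs, err := filepath.Glob(filepath.Join(storePath, "cockroach-temp*"))
	if err != nil {
		return err
	}
	for _, dir := range dirs {
		if _, err := os.Stat(filepath.Join(dir, "LOCK")); err == nil {
			return fmt.Errorf("%s is locked by another process", dir)
		}
		if err := os.RemoveAll(dir); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	// The fix above sidesteps this by giving the tenant an in-memory store
	// spec, so no on-disk store path is shared with the host node.
	if err := cleanTempDirs(defaultStorePath); err != nil {
		fmt.Println("tenant startup would fail here:", err)
	}
}
```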
roachtest.multitenant-upgrade failed with artifacts on master @ 0984f873c6170ab34afe6fee4661fc5f76ac0dee:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ b3af96b0686773c78325d2b8b0623a8fcd3e9bf2:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ 5d972a683c531326bc0af403c6d3373d3f4b2267:
Same failure on other branches
|
roachtest.multitenant-upgrade failed with artifacts on master @ 642e44afbe6098f25022618be76c6f7b6b97df45: