ccl/logictestccl: TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup timed out #60773
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@83e70ce84b740e27e721c3b73c38a4b8b515094a:
Parameters:
See this test on roachdash
May be a storage/kv/logging/server shutdown issue.
@otan As a complete noob at debugging these types of issues, what about the above snippet indicates kv/storage?
My guess is it relates to logs not cleanly exiting, but there are also a lot of kv/storage-related goroutines further down. Hard to read, though, and it's Saturday ;)
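For readers following along: the dump referenced above is a goroutine stack dump emitted when the test times out. A minimal sketch of capturing a comparable dump with only the Go standard library (the helper name and the timeout trigger are illustrative, not CockroachDB's actual test harness):

```go
package main

import (
	"os"
	"runtime/pprof"
	"time"
)

// dumpGoroutinesAfter writes a full goroutine stack dump to stderr if the
// program is still running after the given timeout. Test harnesses use the
// same mechanism to show what every goroutine was blocked on when a test
// hangs; kv/storage frames in the dump point at where work got stuck.
func dumpGoroutinesAfter(timeout time.Duration) {
	go func() {
		time.Sleep(timeout)
		// Debug level 2 prints one stack trace per goroutine.
		_ = pprof.Lookup("goroutine").WriteTo(os.Stderr, 2)
		os.Exit(1) // mimic a test-timeout failure after dumping
	}()
}

func main() {
	dumpGoroutinesAfter(30 * time.Second)
	select {} // stand-in for the hung test body
}
```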
Thanks for the help on a Saturday. Will ping #engineering on Monday if I can't figure this out over the weekend.
For what it's worth, I spent a bit of time trying to look into this, and it seems like the test has gotten stuck here multiple times when I stressed it:
In the logs I see the job (or at least what I think is that job) start, and then just stop emitting logs. No idea what is getting stuck specifically.
I should clarify that this was on #60804 (specifically 0c5bed9, https://teamcity.cockroachdb.com/viewLog.html?buildId=2695724&tab=buildResultsDiv&buildTypeId=Cockroach_UnitTests_Test), which does make changes to jobs (but I wouldn't expect this RESTORE statement to be affected, since there aren't any tables/types undergoing schema changes involved). So it's possible I was seeing something different, but it looks like the same thing.
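As an aside, one way to tell whether the restore job is stuck rather than slow is to poll its progress while the test hangs. A hedged sketch in Go using the lib/pq driver; the connection string and port assume a default local CockroachDB node and are illustrative only:

```go
package main

import (
	"database/sql"
	"fmt"
	"log"

	_ "github.com/lib/pq" // Postgres-wire driver; CockroachDB speaks pgwire.
)

func main() {
	// Connection string is illustrative; point it at whatever node the
	// hung test cluster is listening on.
	db, err := sql.Open("postgres",
		"postgresql://root@localhost:26257/defaultdb?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// SHOW JOBS can be used as a relational data source in CockroachDB.
	rows, err := db.Query(
		`SELECT job_id, status, fraction_completed
		 FROM [SHOW JOBS] WHERE job_type = 'RESTORE'`)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	for rows.Next() {
		var id int64
		var status string
		var frac sql.NullFloat64
		if err := rows.Scan(&id, &status, &frac); err != nil {
			log.Fatal(err)
		}
		// A fraction that stops advancing suggests a hang, not slowness.
		fmt.Printf("job %d: %s (%.0f%%)\n", id, status, frac.Float64*100)
	}
	if err := rows.Err(); err != nil {
		log.Fatal(err)
	}
}
```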
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@8b6f3c84cc256debeeb4d4055c5f0d5c9a481213:
Parameters:
See this test on roachdash
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@64c4aef909f4382523cd9248341ca9f4448d841a:
Parameters:
See this test on roachdash
A brief curiosity glance suggests it's stuck executing a statement:
The logs have a lot of
It seems it is stuck because the restore never finishes. I don't think it's multi-region code that's causing the regression; my theory is that it's multi-node related, as the test starts up a cluster of 9 nodes and I'm not sure other restore CCL logic tests do the same, so this may be something new that has been unearthed. It seems to be stuck at a kvserver rate limit. You can see the plan node waiting:
It seems to be a restore:
and
spawned by a split and scatter:
It looks like the split and scatter is blocked waiting for KV: something is rate limited?!
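To make the rate-limit theory concrete: a goroutine waiting on a token-bucket limiter parks inside the limiter's Wait call, which is the shape of the blocked stacks described above. A generic sketch using golang.org/x/time/rate, not CockroachDB's actual kvserver limiter:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One token per second, burst of 1: the third request must wait ~2s.
	// If the refill rate were effectively zero, Wait would block forever
	// and the calling goroutine would look "stuck" in a goroutine dump.
	limiter := rate.NewLimiter(rate.Limit(1), 1)

	ctx := context.Background()
	for i := 0; i < 3; i++ {
		start := time.Now()
		if err := limiter.Wait(ctx); err != nil {
			fmt.Println("wait aborted:", err)
			return
		}
		fmt.Printf("request %d admitted after %v\n",
			i, time.Since(start).Round(time.Millisecond))
	}
}
```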
Refs: cockroachdb#60773
Release note: None
60855: logictestccl: split rename column into separate test r=arulajmani a=otan

This hasn't flaked on a master build for a while, but people have reported it locally. Going to try splitting the rename test into a separate logic test with its own cluster and see if we see similar symptoms.

Release note: None

60902: logictestccl: skip multi_region_backup r=ajstorm a=otan

Refs: #60773

Release note: None

Co-authored-by: Oliver Tan <[email protected]>
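For context, "skip multi_region_backup" mechanically amounts to skipping the Go test with a pointer at this tracking issue. The sketch below uses the standard library's t.Skip; the real repo routes this through its own skip helpers, and the test name here is hypothetical:

```go
package logictestccl

import "testing"

func TestMultiRegionBackup(t *testing.T) {
	// Skip with a link to the tracking issue so the skip is easy to audit
	// and to revert once the underlying hang is fixed.
	t.Skip("flaky; see https://github.com/cockroachdb/cockroach/issues/60773")

	// ... test body would run the multi_region_backup logic test file ...
}
```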
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@682582fd65a512d90f187a3f7c8e368a43bd89e9:
Parameters:
See this test on roachdash
I think #60642 might be to blame.
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@2572200f7612c6508a52735a6a18767cfb7cc09d:
Parameters:
See this test on roachdash
Can't seem to repro this anymore, and I'm hoping that means the underlying hang has been resolved. Trying to re-enable 🤞.
Did you ensure multi_region_backup has sufficiently beefy machines?
Was this only hit on beefy machines? I thought Yahor reproed locally.
Maybe I was mistaken.
Re-enabling this test exposed the fact that restore is currently broken for multi-region databases. Will be addressed with #62215.
Needs #60835 to merge before we can re-enable this test case.
62954: sql: Re-enable multi_region_backup test r=arulajmani a=ajstorm

With #60835 merged, this test no longer flakes. I've stressed it on my GCE worker for a while now and it's all good.

Resolves #60773.

Release note: None

62959: sql: lease acquisition of OFFLINE descs may starve bulk operations r=ajwerner a=fqazi

Fixes: #61798

Previously, offline descriptors would never have their leases cached, and the leases would be released once the reference count hit zero. This was inadequate because, when attempting to bring these tables back online, the lease acquisition could be pushed back by other operations, leading to starvation / live locks. To address this, this patch allows the leases of offline descriptors to be cached.

Release note (bug fix): Lease acquisitions of descriptors in an offline state may starve out bulk operations (backup / restore).

Co-authored-by: Adam Storm <[email protected]>
Co-authored-by: Faizan Qazi <[email protected]>
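To illustrate the caching change described in 62959: the fix retains a lease entry in the cache when its reference count drops to zero, instead of releasing it, so a later acquisition is a cache hit rather than a fresh acquisition that can be starved by other operations. A heavily simplified sketch; none of these types or names are CockroachDB's actual lease manager:

```go
package main

import (
	"fmt"
	"sync"
)

// lease is a toy stand-in for a cached descriptor lease.
type lease struct {
	descID int64
	refs   int
}

// leaseCache retains entries at zero refs (the fix) instead of deleting
// them (the old behavior), so re-acquisition is a cache hit.
type leaseCache struct {
	mu              sync.Mutex
	entries         map[int64]*lease
	retainAtZeroRef bool
}

func (c *leaseCache) acquire(descID int64) *lease {
	c.mu.Lock()
	defer c.mu.Unlock()
	l, ok := c.entries[descID]
	if !ok {
		l = &lease{descID: descID} // cache miss: would go to KV here
		c.entries[descID] = l
	}
	l.refs++
	return l
}

func (c *leaseCache) release(l *lease) {
	c.mu.Lock()
	defer c.mu.Unlock()
	l.refs--
	if l.refs == 0 && !c.retainAtZeroRef {
		delete(c.entries, l.descID) // old behavior: drop immediately
	}
}

func main() {
	c := &leaseCache{entries: map[int64]*lease{}, retainAtZeroRef: true}
	l := c.acquire(52)
	c.release(l)
	_, stillCached := c.entries[52]
	fmt.Println("lease retained at zero refs:", stillCached) // true
}
```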
With cockroachdb#60835 merged, this test no longer flakes. I've stressed it on my GCE worker for a while now and it's all good.
Resolves cockroachdb#60773.
Release note: None
(ccl/logictestccl).TestCCLLogic/multiregion-9node-3region-3azs/multi_region_backup failed on master@3e4d80ddb4b913d8af5d66a364b3870e1f7a49fa:
Parameters:
See this test on roachdash
powered by pkg/cmd/internal/issues