roachtest: acceptance/version-upgrade is flaky #87104
Comments
cc @cockroachdb/test-eng
Skipping the flaky roachtest while we stabilize it. Informs: cockroachdb#87104 Release note: None Release justification: testing only change
Marking as release-blocker to reflect the gravity of this flake - afaict it's likely a problem that customers' workloads would encounter while upgrading to 22.2. I suggest someone from SQL Queries own this. @yuzefovich, can you think of someone appropriate and facilitate the assignment? Thank you!
FWIW, I remember seeing the second error message when I was trying to reduce the flakiness of this test about a month ago (#84382), so I don't think it's new. However, it was a fairly rare occurrence, and maybe it's become more frequent since then.
86563: ts: fix the pretty-printing of tsd keys r=abarganier a=knz

Found while working on #86524.

Release justification: bug fix
Release note (bug fix): When printing keys and range start/end boundaries for time series, the displayed structure of keys was incorrect. This is now fixed.

86904: sql: allow mismatched type numbers in `PREPARE` statement r=rafiss a=ZhouXing19

Previously, we only allowed the same number of parameters and placeholders in a `PREPARE` statement. This is not compatible with Postgres 14's behavior. This commit loosens the restriction and enables this compatibility. We now take `max(#placeholders, #parameters)` as the true number of parameters of the prepared statement. For each parameter, we first look at the type deduced from the query statement. If we can't deduce it, we take the type hint for this parameter. I.e. we now allow queries such as

```
PREPARE args_test_many(int, int) as select $1      // 2 parameters, but only 1 placeholder in the query.
PREPARE args_test_few(int) as select $1, $2::int   // 1 parameter, but 2 placeholders in the query.
```

Fixes #86375.

Release justification: Low risk, high benefit changes to existing functionality
Release note: allow mismatched type numbers in `PREPARE` statement

87105: roachtest: skip flaky acceptance/version-upgrade r=tbg a=adityamaru

Skipping the flaky roachtest while we stabilize it.

Informs: #87104

Release note: None
Release justification: testing only change

87117: bazci: fix output path computation r=rail a=rickystewart

These updates were happening in-place, so `bazci` was constructing big, silly paths like `backupccl_test/shard_6_of_16/shard_7_of_16/shard_13_of_16/...`. We just need to copy the variable here.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Action item here may be to do a bisect.
I believe I identified the root cause in #87154 (it's a test issue, not an actual bug), so removing the release blocker label.
I saw it flake on one of my PRs with a different error at a different time in the test. I'll need to take another look.
So here is what's happening in this flake:
To me this looks like a dup of #44101. I'm not sure what to do here though. I don't have much context of how we would go about fixing #44101. Couple of options for fixing this flake in particular:
Curious what others think, especially @tbg, on the feasibility of addressing #44101 for good.
I don't see the breaker error in the output you pasted. Rather, this is the error raised at lines 1548 to 1558 in 34de5fb, meaning that a previous attempt to dial failed, and legitimately failed (i.e. it wasn't stopped by the breaker, as that wouldn't waste the onlyOnceDialer). Could you pull up a bit more of the log to see if you can find the true reason n1 couldn't talk to n4?
re: "real" fix, see #44101 (comment)
I copied the logs from here. Do you mean getting more verbose logging output than what is printed by default?
The logging just doesn't corroborate the scenario you've outlined; you say:

That last part doesn't seem true - it looks more like the connection it pulled from the node dialer here (cockroach/pkg/sql/execinfra/outboxbase.go, lines 32 to 53 in 711e228) is somehow unhealthy? Is it possible that the DistSQL request somehow straddles the restart, and that n4 legitimately was down (or hadn't fully restarted yet) when that query was run? The reason I suspect this is that there's lots of code you're hitting that tries to establish this connection as healthy (cockroach/pkg/rpc/nodedialer/nodedialer.go, lines 195 to 227 in 50c1196).
That doesn't seem possible because the query is issued only after n4 is restarted.
That seems plausible. Here are the things that I'm confident in:
Let's take a closer look at the logs of n4 after the restart.
If we can rely on the clocks of n1 and n4 being in sync, then we can see that n4 is still running through the upgrade migrations at the time when n1 tries to perform (see lines 489 to 548 in 1af6635).
Since n4 is restarted, the relevant line is this one (cockroach/pkg/server/server.go, line 1275 in 2675c7c), which is invoked at the top of the diff here (cockroach/pkg/server/server.go, lines 1403 to 1495 in 2675c7c). Note the bottom of the diff, which sets gRPC to "operational" (meaning it'll stop refusing incoming requests). The listener is opened a few pages before this, so a dial to n4 should have succeeded (i.e. no conn refused or the like); before the
Unfortunately, the error we really want is the one "before" that; this error here only tells us that a previous dial failed. Why did it fail? With what? That is unclear.
@yuzefovich I made a separate issue for this problem: #87634. For now, let's introduce a 4s sleep after each node restart; that should reliably paper over it. Not great, but I don't think this is a new problem - I think we're seeing it now because we now drain the nodes, so there is no range unavailability after downtime, which probably used to paper over it very reliably. Would you be able to send that PR, Yahor, and close this issue out if it passes a couple of runs?
Thanks Tobi! I'll send a patch.
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page that was incorrectly mapping the waiting transaction statement fingerprints to the blocking transaction statements. The commit also aligns the summary cards in the details page, and removes the contended key from the details page while we look for a more user-friendly format to display row contention.

Before: ![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)
After: ![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of this is tracked in #87634). For now, we're papering over these flakes with a 4 second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790.

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253. When a rangefeed gets stuck and the server is local, the server might notice the cancellation before the client, and may send a cancellation error back in a rangefeed event. We now handle this the same as the other case (where the stream client errors out due to the cancellation).

This also checks in the test from #87253 (which is on release-22.1). Fixes #87370. No release note since this will be backported to release-22.2.

Release note: None

Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
Closing since it's been passing for close to a week now. If it fails again, better to open a new issue.
In the past few days the acceptance/version-upgrade roachtest has been failing in various ways; some of the error modes are:
The last one is the most common failure mode at the moment, where the test fails at this step:
cockroach/pkg/cmd/roachtest/tests/versionupgrade.go, line 149 in b87e290
Build examples:
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278345&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_Tests_LocalRoachtest
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278433&buildTypeId=Cockroach_BazelEssentialCi
Jira issue: CRDB-19153