roachtest: acceptance/version-upgrade is flaky #87104

Closed
adityamaru opened this issue Aug 30, 2022 · 17 comments · Fixed by #87154
Labels: A-testing, branch-master, C-bug, T-testeng

Comments

@adityamaru
Contributor

adityamaru commented Aug 30, 2022

Over the past few days, the acceptance/version-upgrade roachtest has been failing in various ways. Some of the error modes are:

pq: failed to run backup: exporting 112 ranges: unable to dial n2: breaker open

dial tcp 127.0.0.1:26259: connect: connection refused

pq: version mismatch in flow request: 65; this node accepts 69 through 69

The last one is currently the most common failure mode; the test fails at this step -

which is when node 1 is running the current binary version, while the other nodes are on the predecessor binary versions.
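
To make the third error message concrete, here is a minimal sketch of the kind of range check that could produce it. This is not the actual DistSQL code: `flowVersion` and `checkFlowVersion` are hypothetical names, and only the message format is taken from the error quoted above.

```go
package main

import "fmt"

// flowVersion is a hypothetical stand-in for the flow version number carried
// in a flow request.
type flowVersion uint32

// checkFlowVersion rejects a request whose version falls outside the
// [minAccepted, current] range of the receiving node. (Assumed logic; only
// the error text mirrors the message seen in the test failure.)
func checkFlowVersion(requested, minAccepted, current flowVersion) error {
	if requested < minAccepted || requested > current {
		return fmt.Errorf(
			"version mismatch in flow request: %d; this node accepts %d through %d",
			requested, minAccepted, current)
	}
	return nil
}

func main() {
	// A gateway on the predecessor binary (version 65) sends a flow to a node
	// on the current binary, which only accepts version 69.
	fmt.Println(checkFlowVersion(65, 69, 69))
	// Two nodes on the same binary agree.
	fmt.Println(checkFlowVersion(69, 69, 69))
}
```

In a mixed-version cluster such a check can only pass if distributed flows are planned solely on nodes whose accepted range includes the gateway's version.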

Build examples:
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278345&tab=buildResultsDiv&buildTypeId=Cockroach_Ci_Tests_LocalRoachtest
https://teamcity.cockroachdb.com/viewLog.html?buildId=6278433&buildTypeId=Cockroach_BazelEssentialCi

Jira issue: CRDB-19153

@adityamaru adityamaru added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Aug 30, 2022
@adityamaru adityamaru added A-testing Testing tools and infrastructure T-testeng TestEng Team labels Aug 30, 2022
@blathers-crl

blathers-crl bot commented Aug 30, 2022

cc @cockroachdb/test-eng

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 30, 2022
Skipping the flaky roachtest while we stabilize it.

Informs: cockroachdb#87104

Release note: None

Release justification: testing only change
@tbg tbg added the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 30, 2022
@blathers-crl

This comment was marked as resolved.

@tbg tbg added the branch-master Failures and bugs on the master branch. label Aug 30, 2022
@tbg
Member

tbg commented Aug 30, 2022

Marking as release-blocker to reflect the gravity of this flake - afaict it's likely a problem that would be encountered by customers' workloads while upgrading to 22.2.

I suggest that someone from SQL Queries own this. @yuzefovich can you think of someone appropriate and facilitate the assignment? Thank you!

@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Aug 30, 2022
@renatolabs
Contributor

FWIW, I remember seeing the second error message when I was trying to reduce the flakiness of this test about a month ago (#84382), so I don't think it's new. However, it was a fairly rare occurrence, and maybe it's become more frequent since then.

craig bot pushed a commit that referenced this issue Aug 30, 2022
86563: ts: fix the pretty-printing of tsd keys r=abarganier a=knz

Found while working on #86524.

Release justification: bug fix

Release note (bug fix): When printing keys and range start/end
boundaries for time series, the displayed structure of keys
was incorrect. This is now fixed.

86904: sql: allow mismatch type numbers in `PREPARE` statement r=rafiss a=ZhouXing19

Previously, we only allow having the same number of parameters and placeholders
in a `PREPARE` statement. This is not compatible with Postgres14's behavior.

This commit is to loosen the restriction and enable this compatibility.
We now take `max(#placeholders, #parameters)` as the true length
 of parameters of the prepare statement. For each parameter, we first
look at the type deduced from the query stmt. If we can't deduce it, 
we take the type hint for this param.

I.e. we now allow queries such as 

```
PREPARE args_test_many(int, int) as select $1
// 2 parameters, but only 1 placeholder in the query.

PREPARE args_test_few(int) as select $1, $2::int
// 1 parameter, but 2 placeholders in the query.
```

fixes #86375

Release justification: Low risk, high benefit changes to existing functionality
Release note: allow mismatch type numbers in `PREPARE` statement

87105: roachtest: skip flaky acceptance/version-upgrade r=tbg a=adityamaru

Skipping the flaky roachtest while we stabilize it.

Informs: #87104

Release note: None

Release justification: testing only change

87117: bazci: fix output path computation r=rail a=rickystewart

These updates were happening in-place so `bazci` was constructing big,
silly paths like `backupccl_test/shard_6_of_16/shard_7_of_16/shard_13_of_16/...`
We just need to copy the variable here.

Release justification: Non-production code changes
Release note: None

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: adityamaru <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
@msirek
Contributor

msirek commented Aug 30, 2022

Action item here may be to do a bisect.

@yuzefovich
Member

I believe I identified the root cause in #87154 (it's a test issue, not an actual bug), so removing the release blocker label.

@yuzefovich yuzefovich removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Aug 30, 2022
@craig craig bot closed this as completed in da00c3a Aug 31, 2022
@yuzefovich yuzefovich reopened this Sep 1, 2022
@yuzefovich
Member

I saw it flake on one of my PRs with a different error at a different time in the test. I'll need to take another look.

@yuzefovich
Member

So here is what's happening in this flake:

  • n4 has just been re-upgraded from 22.1.6 to current
  • n4 is the gateway for a SELECT query of Object Access feature
  • n4 thinks n1 is the leaseholder for the relevant range, so n4 issues SetupFlow RPC to n1
  • n1 receives that request and needs to perform FlowStream RPC to stream data back to n4.
  • n1 is able to get a connection (because we ignore the breaker), but then n1 fails to perform FlowStream RPC because of the breaker
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109  ‹[core]›‹[Channel #22 SubChannel #23] grpc: addrConn.createTransport failed to connect to {›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Addr": "127.0.0.1:26263",›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "ServerName": "127.0.0.1:26263",›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Attributes": null,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "BalancerAttributes": null,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Type": 0,›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹  "Metadata": null›
W220901 00:36:29.692490 2795 google.golang.org/grpc/grpclog/component.go:41 ⋮ [-] 109 +‹}. Err: connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110  Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +(1) ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›

To me this looks like a dup of #44101.

I'm not sure what to do here though. I don't have much context of how we would go about fixing #44101. Couple of options for fixing this flake in particular:

Curious what others think, especially @tbg on the feasibility of addressing #44101 for good.

@tbg
Member

tbg commented Sep 5, 2022

> because of the breaker

I don't see the breaker error in the output you pasted. Rather, this is the onlyOnceDialer:

cockroach/pkg/rpc/context.go

Lines 1548 to 1558 in 34de5fb

// onlyOnceDialer implements the grpc.WithDialer interface but only
// allows a single connection attempt. If a reconnection is attempted,
// redialChan is closed to signal a higher-level retry loop. This
// ensures that our initial heartbeat (and its version/clusterID
// validation) occurs on every new connection.
type onlyOnceDialer struct {
	syncutil.Mutex
	dialed     bool
	closed     bool
	redialChan chan struct{}
}

meaning that a previous attempt to dial failed, and legitimately failed (i.e. wasn't stopped by the breaker, as this wouldn't waste the onlyOnceDialer). Could you pull up a bit more of the log to see if you can find the true reason n1 couldn't talk to n4?

@tbg
Member

tbg commented Sep 5, 2022

re: "real" fix, see #44101 (comment)

@yuzefovich
Member

I copied the logs from here. Do you mean getting more verbose logging output than what is printed by default?

@tbg
Member

tbg commented Sep 7, 2022

The logging just doesn't corroborate the scenario you've outlined, you say

> n1 is able to get a connection (because we ignore the breaker), but then n1 fails to perform FlowStream RPC because of the breaker

That last part doesn't seem true - it looks more like the connection it pulled from the node dialer here

// GetConnForOutbox is a shared function between the rowexec and colexec
// outboxes. It attempts to dial the destination ignoring the breaker, up to the
// given timeout and returns the connection or an error.
// This connection attempt is retried since failure results in a query error. In
// the past, we have seen cases where a gateway node, n1, would send a flow
// request to n2, but n2 would be unable to connect back to n1 due to this
// connection attempt failing.
// Retrying here alleviates these flakes and causes no impact to the end
// user, since the receiver at the other end will hang for
// SettingFlowStreamTimeout waiting for a successful connection attempt.
func GetConnForOutbox(
	ctx context.Context, dialer Dialer, sqlInstanceID base.SQLInstanceID, timeout time.Duration,
) (conn *grpc.ClientConn, err error) {
	firstConnectionAttempt := timeutil.Now()
	for r := retry.StartWithCtx(ctx, base.DefaultRetryOptions()); r.Next(); {
		conn, err = dialer.DialNoBreaker(ctx, roachpb.NodeID(sqlInstanceID), rpc.DefaultClass)
		if err == nil || timeutil.Since(firstConnectionAttempt) > timeout {
			break
		}
	}
	return
}

is somehow unhealthy? Is it possible that the DistSQL request somehow straddles the restart and that n4 legit was down (or hadn't fully restarted yet) when that query was run? The reason I suspect this is because there's lots of code that you're hitting that tries to establish this connection as healthy,

conn, err := n.rpcContext.GRPCDialNode(addr.String(), nodeID, class).Connect(ctx)
if err != nil {
	// If we were canceled during the dial, don't trip the breaker.
	if ctxErr := ctx.Err(); ctxErr != nil {
		return nil, ctxErr
	}
	err = errors.Wrapf(err, "failed to connect to n%d at %v", nodeID, addr)
	if breaker != nil {
		breaker.Fail(err)
	}
	return nil, err
}
// Check to see if the connection is in the transient failure state. This can
// happen if the connection already existed, but a recent heartbeat has
// failed and we haven't yet torn down the connection.
if err := grpcutil.ConnectionReady(conn); err != nil {
	err = errors.Wrapf(err, "failed to check for ready connection to n%d at %v", nodeID, addr)
	if breaker != nil {
		breaker.Fail(err)
	}
	return nil, err
}
// TODO(bdarnell): Reconcile the different health checks and circuit breaker
// behavior in this file. Note that this different behavior causes problems
// for higher-levels in the system. For example, DistSQL checks for
// ConnHealth when scheduling processors, but can then see attempts to send
// RPCs fail when dial fails due to an open breaker. Reset the breaker here
// as a stop-gap before the reconciliation occurs.
if breaker != nil {
	breaker.Success()
}
return conn, nil

@yuzefovich
Member

> Is it possible that the DistSQL request somehow straddles the restart

That doesn't seem possible because the query is issued only after n4 is restarted.

> n4 hadn't fully restarted yet

That seems plausible.


Here are the things that I'm confident in:

  • n4 has just been upgraded
  • n4 is the gateway for "Object Access" query and performs SetupFlow RPC against n1, which succeeds
  • n1 serves SetupFlow RPC and creates an outbox
  • that outbox is able to get a connection via GetConnForOutbox, but then FlowStream RPC against n4 fails with
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110  Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +(1) ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›
W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 +Error types: (1) *status.Error
  • n4 keeps on waiting for n1 to dial back in for 10 seconds, after which it times out the query with
E220901 00:36:39.693231 2632 sql/flowinfra/flow_registry.go:336 ⋮ [n4,client=127.0.0.1:37748,user=root,f‹ab65bb47›] 148  flow id:‹ab65bb47-63ac-4f0f-85e0-0a0a088503c0› : 1 inbound streams timed out after 10s; propagated error throughout flow

Let's take a closer look at the logs of n4 after the restart.

I220901 00:36:29.645727 132 1@server/server_sql.go:1415 ⋮ [n4] 52  serving sql connections
...
I220901 00:36:29.648057 1012 upgrade/upgrademanager/manager.go:115 ⋮ [n4,intExec=‹set-version›,migration-mgr] 54  migrating cluster from 22.1 to 1000022.1-68 (stepping through [1000022.1-2 1000022.1-4 1000022.1-6 1000022.1-8 1000022.1-10 1000022.1-12 1000022.1-14 1000022.1-16 1000022.1-18 1000022.1-20 1000022.1-22 1000022.1-24 1000022.1-26 1000022.1-28 1000022.1-30 1000022.1-32 1000022.1-34 1000022.1-36 1000022.1-38 1000022.1-40 1000022.1-42 1000022.1-44 1000022.1-46 1000022.1-48 1000022.1-50 1000022.1-52 1000022.1-54 1000022.1-56 1000022.1-58 1000022.1-60 1000022.1-62 1000022.1-64 1000022.1-66 1000022.1-68])
I220901 00:36:29.649797 1012 upgrade/upgradecluster/cluster.go:118 ⋮ [n4,intExec=‹set-version›,migration-mgr] 55  executing validate-cluster-version=1000022.1-68 on nodes n{1,2,3,4}
I220901 00:36:29.695749 1012 upgrade/upgrademanager/manager.go:135 ⋮ [n4,intExec=‹set-version›,migration-mgr] 56  stepping through 1000022.1-2
I220901 00:36:29.725805 1188 jobs/adopt.go:243 ⋮ [n4,intExec=‹set-version›,migration-mgr] 57  job 792730631195164676: resuming execution
I220901 00:36:29.741825 1190 jobs/registry.go:1206 ⋮ [n4] 58  MIGRATION job 792730631195164676: stepping through state running with error: <nil>
I220901 00:36:29.763583 1190 jobs/registry.go:1206 ⋮ [n4] 59  MIGRATION job 792730631195164676: stepping through state succeeded with error: <nil>
...
I220901 00:36:29.990979 1474 jobs/registry.go:1206 ⋮ [n4] 99  MIGRATION job 792730632062795780: stepping through state running with error: <nil>

If we can rely on the clocks of n1 and n4 being in sync, then we can see that n4 is still running through the upgrade migrations at the time when n1 tries to perform the FlowStream RPC, which seems to corroborate the theory that n4 wasn't fully "up" yet. Do you know whether running migrations on n4 would somehow prevent other nodes from dialing into it? Are we starting to serve SQL connections too early (i.e. should we wait for migrations to complete)?

@tbg
Member

tbg commented Sep 8, 2022

cockroach start (as deployed by roachprod) will return when it hits the sdnotify line at the end of this method:

cockroach/pkg/cli/start.go

Lines 489 to 548 in 1af6635

serverCfg.ReadyFn = func(waitForInit bool) {
	// Inform the user if the network settings are suspicious. We need
	// to do that after starting to listen because we need to know
	// which advertise address NewServer() has decided.
	hintServerCmdFlags(ctx, cmd)
	// If another process was waiting on the PID (e.g. using a FIFO),
	// this is when we can tell them the node has started listening.
	if startCtx.pidFile != "" {
		log.Ops.Infof(ctx, "PID file: %s", startCtx.pidFile)
		if err := os.WriteFile(startCtx.pidFile, []byte(fmt.Sprintf("%d\n", os.Getpid())), 0644); err != nil {
			log.Ops.Errorf(ctx, "failed writing the PID: %v", err)
		}
	}
	// If the invoker has requested an URL update, do it now that
	// the server is ready to accept SQL connections.
	// (Note: as stated above, ReadyFn is called after the server
	// has started listening on its socket, but possibly before
	// the cluster has been initialized and can start processing requests.
	// This is OK for SQL clients, as the connection will be accepted
	// by the network listener and will just wait/suspend until
	// the cluster initializes, at which point it will be picked up
	// and let the client go through, transparently.)
	if startCtx.listeningURLFile != "" {
		log.Ops.Infof(ctx, "listening URL file: %s", startCtx.listeningURLFile)
		// (Re-)compute the client connection URL. We cannot do this
		// earlier (e.g. above, in the runStart function) because
		// at this time the address and port have not been resolved yet.
		clientConnOptions, serverParams := makeServerOptionsForURL(&serverCfg)
		pgURL, err := clientsecopts.MakeURLForServer(clientConnOptions, serverParams, url.User(username.RootUser))
		if err != nil {
			log.Errorf(ctx, "failed computing the URL: %v", err)
			return
		}
		if err = os.WriteFile(startCtx.listeningURLFile, []byte(fmt.Sprintf("%s\n", pgURL.ToPQ())), 0644); err != nil {
			log.Ops.Errorf(ctx, "failed writing the URL: %v", err)
		}
	}
	if waitForInit {
		log.Ops.Shout(ctx, severity.INFO,
			"initial startup completed.\n"+
				"Node will now attempt to join a running cluster, or wait for `cockroach init`.\n"+
				"Client connections will be accepted after this completes successfully.\n"+
				"Check the log file(s) for progress. ")
	}
	// Ensure the configuration logging is written to disk in case a
	// process is waiting for the sdnotify readiness to read important
	// information from there.
	log.Flush()
	// Signal readiness. This unblocks the process when running with
	// --background or under systemd.
	if err := sdnotify.Ready(); err != nil {
		log.Ops.Errorf(ctx, "failed to signal readiness using systemd protocol: %s", err)
	}
}

Since n4 is restarted, the relevant line is this:

onSuccessfulReturnFn = func() { readyFn(false /* waitForInit */) }

which is invoked at the top of the diff here:

cockroach/pkg/server/server.go

Lines 1403 to 1495 in 2675c7c

onSuccessfulReturnFn()
// NB: This needs to come after `startListenRPCAndSQL`, which determines
// what the advertised addr is going to be if nothing is explicitly
// provided.
advAddrU := util.NewUnresolvedAddr("tcp", s.cfg.AdvertiseAddr)
// We're going to need to start gossip before we spin up Node below.
s.gossip.Start(advAddrU, filtered)
log.Event(ctx, "started gossip")
// Now that we have a monotonic HLC wrt previous incarnations of the process,
// init all the replicas. At this point *some* store has been initialized or
// we're joining an existing cluster for the first time.
advSQLAddrU := util.NewUnresolvedAddr("tcp", s.cfg.SQLAdvertiseAddr)
advHTTPAddrU := util.NewUnresolvedAddr("tcp", s.cfg.HTTPAdvertiseAddr)
if err := s.node.start(
	ctx,
	advAddrU,
	advSQLAddrU,
	advHTTPAddrU,
	*state,
	initialStart,
	s.cfg.ClusterName,
	s.cfg.NodeAttributes,
	s.cfg.Locality,
	s.cfg.LocalityAddresses,
); err != nil {
	return err
}
log.Event(ctx, "started node")
if err := s.startPersistingHLCUpperBound(ctx, hlcUpperBoundExists); err != nil {
	return err
}
s.replicationReporter.Start(ctx, s.stopper)
sentry.ConfigureScope(func(scope *sentry.Scope) {
	scope.SetTags(map[string]string{
		"cluster":         s.StorageClusterID().String(),
		"node":            s.NodeID().String(),
		"server_id":       fmt.Sprintf("%s-%s", s.StorageClusterID().Short(), s.NodeID()),
		"engine_type":     s.cfg.StorageEngine.String(),
		"encrypted_store": strconv.FormatBool(encryptedStore),
	})
})
// We can now add the node registry.
s.recorder.AddNode(
	s.registry,
	s.node.Descriptor,
	s.node.startedAt,
	s.cfg.AdvertiseAddr,
	s.cfg.HTTPAdvertiseAddr,
	s.cfg.SQLAdvertiseAddr,
)
// Begin recording runtime statistics.
if err := startSampleEnvironment(s.AnnotateCtx(ctx),
	s.ClusterSettings(),
	s.stopper,
	s.cfg.GoroutineDumpDirName,
	s.cfg.HeapProfileDirName,
	s.runtime,
	s.status.sessionRegistry,
); err != nil {
	return err
}
var graphiteOnce sync.Once
graphiteEndpoint.SetOnChange(&s.st.SV, func(context.Context) {
	if graphiteEndpoint.Get(&s.st.SV) != "" {
		graphiteOnce.Do(func() {
			s.node.startGraphiteStatsExporter(s.st)
		})
	}
})
// Start the protected timestamp subsystem. Note that this needs to happen
// before the modeOperational switch below, as the protected timestamps
// subsystem will crash if accessed before being Started (and serving general
// traffic may access it).
//
// See https://github.com/cockroachdb/cockroach/issues/73897.
if err := s.protectedtsProvider.Start(ctx, s.stopper); err != nil {
	return err
}
// After setting modeOperational, we can block until all stores are fully
// initialized.
s.grpc.setMode(modeOperational)

and note the bottom of the diff which sets grpc to "operational" (meaning it'll stop refusing incoming requests).

The listener is opened a few pages before this, so a dial to n4 should have succeeded (i.e. no conn refused or the like); before the operational line, RPCs would have been refused, but we are getting a different error which indicates that an attempt to dial failed:

W220901 00:36:29.692553 2826 sql/colflow/colrpc/outbox.go:194 ⋮ [n1,f‹ab65bb47›,streamID=‹0›] 110 Outbox FlowStream connection error, distributed query will fail: ‹rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing cannot reuse client connection"›

Unfortunately, the error we really want is the one "before" that; this error here only tells us that a previous dial failed. Why did it fail? With what? That is unclear.

@tbg
Member

tbg commented Sep 8, 2022

@yuzefovich I made a separate issue for this problem: #87634

For now, let's introduce a 4s sleep after each node restart, that should reliably paper over it. Not great, but I don't think this is a new problem - I think we're seeing it now because we are now draining the nodes and so there is no range unavailability after downtime, which probably papered over it very reliably. Would you be able to send that PR, Yahor, and close this issue out if it passes a couple of runs?

@yuzefovich
Member

Thanks Tobi! I'll send a patch.

craig bot pushed a commit that referenced this issue Sep 12, 2022
87645: ui: fix txn insight query bug, align summary card, remove contended keys in details page r=ericharmeling a=ericharmeling

This commit fixes a small bug on the transaction insight details page
that was incorrectly mapping the waiting transaction statement
fingerprints to the blocking transaction statements. The commit also
aligns the summary cards in the details page. The commit also removes
the contended key from the details page while we look for a more user-
friendly format to display row contention.

Before:

![image](https://user-images.githubusercontent.com/27286675/189216476-8211d598-5d4e-4255-846f-82c785764016.png)


After:

![image](https://user-images.githubusercontent.com/27286675/189216006-f01edeb6-ab2f-42ac-9978-6fce85b9a79a.png)

Fixes #87838.

Release note: None
Release justification: bug fix

87715: roachtest: add 4s of sleep after restart when upgrading nodes r=yuzefovich a=yuzefovich

We have seen cases where a transient error could occur when a newly-upgraded node serves as a gateway for a distributed query due to remote nodes not being able to dial back to the gateway for some reason (investigation of it is tracked in #87634). For now, we're papering over these flakes with a 4-second sleep.

Addresses: #87104.

Release note: None

87840: roachtest: do not generate division ops in costfuzz and unoptimized tests r=mgartner a=mgartner

The division (`/`) and floor division (`//`) operators were making costfuzz and unoptimized-query-oracle tests flaky. This commit disables generation of these operators as a temporary mitigation for these flakes.

Informs #86790

Release note: None

87854: kvcoord: reliably handle stuck watcher error r=erikgrinaker a=tbg

Front-ports parts of #87253.

When a rangefeed gets stuck, and the server is local, the server might
notice the cancellation before the client, and may send a cancellation
error back in a rangefeed event.

We now handle this the same as the other case (where the stream client
errors out due to the cancellation).

This also checks in the test from
#87253 (which is on
release-22.1).

Fixes #87370.

No release note since this will be backported to release-22.2
Release note: None


Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
@tbg
Member

tbg commented Sep 19, 2022

Closing since it's been passing for close to a week now. If it fails again, better to open a new issue.
