Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: acceptance/gossip/restart failed #48423

Closed
cockroach-teamcity opened this issue May 5, 2020 · 8 comments
Closed

roachtest: acceptance/gossip/restart failed #48423

cockroach-teamcity opened this issue May 5, 2020 · 8 comments
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented May 5, 2020

(roachtest).acceptance/gossip/restart failed on master@425eaa8fb05fc32b2c42827b85338daa52f4177c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/gossip/restart/run_1
	gossip.go:225,gossip.go:290,acceptance.go:91,test_runner.go:753: pq: result is ambiguous (error=unable to dial n3: breaker open [exhausted])

More

Artifacts: /acceptance/gossip/restart
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

Jira issue: CRDB-4327

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. labels May 5, 2020
@cockroach-teamcity cockroach-teamcity added this to the 20.1 milestone May 5, 2020
@tbg tbg assigned asubiotto and unassigned andreimatei May 12, 2020
@tbg
Copy link
Member

tbg commented May 12, 2020

I think the statement that failed is

if _, err := db.Exec(`UPSERT INTO test.kv (k, v) VALUES (1, 0)`); err != nil {
t.Fatal(err)
}

I assume we're getting an ambiguous result all the way to the client because the upsert decomposes into a read-write cycle... somehow? But this should never go back to the client in this case, no?

cc @asubiotto

@asubiotto
Copy link
Contributor

I think the optimizer plans a check separately from the actual upsert operation. I'm not very familiar with this code, but why do you think this shouldn't go back to the client if the transport is exhausted? Based on the ClientVisibleAmbiguousError interface, looks like we do expect to return these to the client sometimes.

@andreimatei
Copy link
Contributor

@tbg what were you thinking here?

@tbg
Copy link
Member

tbg commented May 19, 2020

Oh - looks like I wasn't thinking at all. Sorry about the noise. I guess my thinking was that this particular UPSERT is idempotent, and so it should never have to return an ambiguous result. But at the same time I know that we're not smart enough to do that (and it may not even be desirable, not sure).

Unclear what exactly to do with that failure. Do we really need to return an ambiguous result when a breaker is open? We didn't even attempt to send.

@andreimatei andreimatei assigned andreimatei and unassigned asubiotto May 19, 2020
@cockroach-teamcity
Copy link
Member Author

(roachtest).acceptance/gossip/restart failed on master@436a82a518b142c3a3212f49bb595956b57ac68c:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/gossip/restart/run_1
	cluster.go:2138,cluster.go:2157,cluster.go:2261,gossip.go:215,gossip.go:290,acceptance.go:94,test_runner.go:753: pgurl for nodes [] empty: :1 from
		stdout:
		
		stderr:

More

Artifacts: /acceptance/gossip/restart
Related:

See this test on roachdash
powered by pkg/cmd/internal/issues

@knz
Copy link
Contributor

knz commented Jun 4, 2020

looked at this in this round of triaging - the latest failure is still with the same symptoms.

@tbg
Copy link
Member

tbg commented Jun 16, 2020

Looking at this again as part of triage. @asubiotto I think there is something for SQL to do here yet - I believe the error comes from

conn, err := dialer.Dial(ctx, nodeID, rpc.DefaultClass)
if err != nil {
log.Warningf(
ctx,
"Outbox Dial connection error, distributed query will fail: %+v",
err,
)
return err
}

I may be wrong- but it seems to come from SQL, it's this one here:

func (n *Dialer) dial(
ctx context.Context,
nodeID roachpb.NodeID,
addr net.Addr,
breaker *wrappedBreaker,
class rpc.ConnectionClass,
) (_ *grpc.ClientConn, err error) {
// Don't trip the breaker if we're already canceled.
if ctxErr := ctx.Err(); ctxErr != nil {
return nil, ctxErr
}
if breaker != nil && !breaker.Ready() {
err = errors.Wrapf(circuit.ErrBreakerOpen, "unable to dial n%d", nodeID)
return nil, err

This error should not become ambiguous since no connection attempt has been made (also, this is in the read portion of the query? Maybe I am wrong about the callsite here).

If we see this again, we should grab the logs before they're gone to make sure.

@cockroach-teamcity
Copy link
Member Author

(roachtest).acceptance/gossip/restart failed on master@e6c1a2abe0bb9008d904aad0f23eeff9ef217430:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: artifacts/acceptance/gossip/restart/run_1
	gossip.go:225,gossip.go:290,acceptance.go:96,test_runner.go:754: pq: result is ambiguous (error=unable to dial n4: breaker open [exhausted])

More

Artifacts: /acceptance/gossip/restart

See this test on roachdash
powered by pkg/cmd/internal/issues

@yuzefovich yuzefovich removed their assignment Aug 11, 2020
irfansharif added a commit to irfansharif/cockroach that referenced this issue Aug 12, 2020
- TestFollowerReadsWithStaleDescriptor (cockroachdb#56281)
- TestDistSQLRangeCachesIntegrationTest (cockroachdb#56282)
- acceptance/gossip/restart (cockroachdb#48423)

Release note: None
craig bot pushed a commit that referenced this issue Aug 12, 2020
52688: *: skip a couple of flakey tests r=irfansharif a=irfansharif

- TestFollowerReadsWithStaleDescriptor (#52681)
- TestDistSQLRangeCachesIntegrationTest (#52682)
- acceptance/gossip/restart (#48423)

Release note: None

---

+cc @andreimatei, @asubiotto who are the current assignees on those tests.

Co-authored-by: irfan sharif <[email protected]>
@dt dt removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label Sep 2, 2020
@jlinder jlinder added the T-sql-queries SQL Queries Team label Jun 16, 2021
@jordanlewis jordanlewis reopened this Jan 4, 2023
@blathers-crl blathers-crl bot added the T-kv KV Team label Jan 4, 2023
@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Jan 4, 2023
craig bot pushed a commit that referenced this issue Apr 10, 2023
100584: roachtest: fix `gossip/restart` tests r=erikgrinaker a=erikgrinaker

**roachtest: add `WaitForReady` helper**

This patch adds a `WaitForReady()` helper that will wait until the given nodes report ready via health checks.

Epic: none
Release note: None

**roachtest: fix `gossip/restart` tests**

This patch fixes `acceptance/gossip/restart` and `gossip/restart` by waiting for all nodes to report ready before restarting nodes, and unskips them.

Resolves #96091.
Touches #48423.

Epic: none
Release note: None



Co-authored-by: Erik Grinaker <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-kv KV Team
Projects
None yet
Development

No branches or pull requests

9 participants