Skip to content

Commit

Permalink
Browse files Browse the repository at this point in the history
119650: roachtest: make failure recovery independent r=nvanbenschoten a=andrewbaptist

Previously, the multiple failures were started and finished
independently. This caused a problem if the ability to recover from one
failure depended on a different failure recovering first. To mitigate
this, recover each failure in a separate goroutine. This will allow the
"most important" failure to recover first so that the others can recover
if they depend on each other.

This is more important today while we don't recover from all the failure
modes that chaos implements. Specifically we don't handle partial
partitions fully with epoch leases.

Epic: none
Fixes: #119085
Fixes: #119347
Fixes: #119361
Fixes: #119454

Release note: None

122283: server: don't log for missing locality r=yuzefovich a=andrewbaptist

Previously we would always log a message that the locality was unknown for requests from sql gateways. We should remove unnecessary logs from traces.

Epic: none

Release note: None

Co-authored-by: Andrew Baptist <[email protected]>
  • Loading branch information
craig[bot] and andrewbaptist committed Apr 12, 2024
3 parents c1fcb54 + 31a39cf + aa89e09 commit cd9661a
Show file tree
Hide file tree
Showing 3 changed files with 26 additions and 5 deletions.
18 changes: 16 additions & 2 deletions pkg/cmd/roachtest/tests/failover.go
Original file line number Diff line number Diff line change
Expand Up @@ -15,6 +15,7 @@ import (
gosql "database/sql"
"fmt"
"math/rand"
"sync"
"time"

"github.com/cockroachdb/cockroach/pkg/base"
Expand Down Expand Up @@ -339,10 +340,23 @@ func runFailoverChaos(ctx context.Context, t test.Test, c cluster.Cluster, readO

sleepFor(ctx, t, time.Minute)

// Recover the failers on different goroutines. Otherwise, they
// might interact as certain failures can prevent other failures
// from recovering.
var wg sync.WaitGroup
for node, failer := range nodeFailers {
t.L().Printf("recovering n%d (%s)", node, failer)
failer.Recover(ctx, node)
wg.Add(1)
node := node
failer := failer
m.Go(func(ctx context.Context) error {
defer wg.Done()
t.L().Printf("recovering n%d (%s)", node, failer)
failer.Recover(ctx, node)

return nil
})
}
wg.Wait()
}

sleepFor(ctx, t, time.Minute) // let cluster recover
Expand Down
7 changes: 4 additions & 3 deletions pkg/kv/kvpb/api.proto
Original file line number Diff line number Diff line change
Expand Up @@ -2873,9 +2873,10 @@ message Header {
// current batch. When set, it allows the server to handle pushes and write
// too old conditions locally.
bool can_forward_read_timestamp = 16;
// gateway_node_id is the ID of the gateway node where the request originated.
// For requests from tenants, this is set to the NodeID of the KV node handling
// the BatchRequest.
// gateway_node_id is the ID of the gateway node where the request
// originated. For requests from a shared-process cluster, this is set to the
// NodeID of the KV node which created the BatchRequest. For requests from a
// separate-process SQL Pod or a CLI, this is set to 0.
int32 gateway_node_id = 11 [(gogoproto.customname) = "GatewayNodeID",
(gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.NodeID"];
// client_range_info represents the kvclient's knowledge about the state of
Expand Down
6 changes: 6 additions & 0 deletions pkg/server/node.go
Original file line number Diff line number Diff line change
Expand Up @@ -1561,6 +1561,12 @@ func (n *Node) getLocalityComparison(
return roachpb.LocalityComparisonType_UNDEFINED
}

// In separate-process multi-tenant mode the gatewayNodeID is 0.
if gatewayNodeID == 0 {
if _, ok := roachpb.ClientTenantFromContext(ctx); ok {
return roachpb.LocalityComparisonType_UNDEFINED
}
}
gatewayNodeDesc, err := gossip.GetNodeDescriptor(gatewayNodeID)
if err != nil {
log.VInfof(ctx, 2,
Expand Down

0 comments on commit cd9661a

Please sign in to comment.