roachtest: make failure recovery independent
Previously, when multiple failures were injected, they were recovered
one at a time in a single loop. This caused a problem if the ability to
recover from one failure depended on a different failure recovering
first. To mitigate this, recover each failure in a separate goroutine.
This allows the "most important" failure to recover first so that any
failures that depend on it can then recover.

This is more important today, while we don't recover from all the failure
modes that chaos implements. Specifically, we don't fully handle partial
partitions with epoch leases.

Epic: none
Fixes: #119085
Fixes: #119347
Fixes: #119361
Fixes: #119454

Release note: None
andrewbaptist committed Feb 26, 2024
1 parent 59ec42d commit e51a4c4
Showing 1 changed file with 16 additions and 2 deletions.
18 changes: 16 additions & 2 deletions pkg/cmd/roachtest/tests/failover.go
@@ -15,6 +15,7 @@ import (
 	gosql "database/sql"
 	"fmt"
 	"math/rand"
+	"sync"
 	"time"
 
 	"github.com/cockroachdb/cockroach/pkg/base"
@@ -340,10 +341,23 @@ func runFailoverChaos(ctx context.Context, t test.Test, c cluster.Cluster, readO
 
 		sleepFor(ctx, t, time.Minute)
 
+		// Recover the failers on different goroutines. Otherwise, they
+		// might interact as certain failures can prevent other failures
+		// from recovering.
+		var wg sync.WaitGroup
 		for node, failer := range nodeFailers {
-			t.L().Printf("recovering n%d (%s)", node, failer)
-			failer.Recover(ctx, node)
+			wg.Add(1)
+			node := node
+			failer := failer
+			m.Go(func(ctx context.Context) error {
+				defer wg.Done()
+				t.L().Printf("recovering n%d (%s)", node, failer)
+				failer.Recover(ctx, node)
+
+				return nil
+			})
 		}
+		wg.Wait()
 	}
 
 	sleepFor(ctx, t, time.Minute) // let cluster recover