flaky test: TestSnapshotWithPruning #11923

p0mvn · 2022-05-10T17:14:11Z

Summary of Bug

A test-race failure was observed in CI: https://github.com/cosmos/cosmos-sdk/runs/6372160448?check_suite_focus=true

Upon initial investigation, it was concluded that the test is likely to be flaky. Tried rerunning the test 5 times locally and was not able to reproduce.

go test -count=5 -timeout 30m -race -run ^TestSnapshotWithPruning$ github.com/cosmos/cosmos-sdk/baseapp
ok      github.com/cosmos/cosmos-sdk/baseapp    372.122s

Making this issue to keep track of the problem, in case it happens again.

This is the conclusion from the initial investigation:

According to the logs, everything works as expected besides snapshot store pruning.

cosmos-sdk/snapshots/manager.go

Lines 446 to 458 in 6872cae

    
           m.logger.Info("completed state snapshot", "height", height, "format", snapshot.Format) 
        
           if m.opts.KeepRecent > 0 { 
        
           	m.logger.Debug("pruning state snapshots") 
        
           	pruned, err := m.Prune(m.opts.KeepRecent) 
        
           	if err != nil { 
        
           		m.logger.Error("Failed to prune state snapshots", "err", err) 
        
           		return 
        
           	} 
        
           	m.logger.Debug("pruned state snapshots", "pruned", pruned) 
        
           }

When we list snapshots in tests, we expect only the latest one (at height 20) to be returned. However, it returns both height 15 and 20, indicating that the one at height 15 is still present when it shouldn't.

If you check out the logs and compare them to the code linked above, there is a log for:

2022-05-10T14:54:21Z INFO completed state snapshot format=2 height=20 module=sdk/app

but no log saying:

m.logger.Error("Failed to prune state snapshots", "err", err)

which should have showed up at the info level if there was indeed an error. I think that the test is simply flaky. More specifically, we end up listing snapshots before the pruning completes. As a result, we get 2 snapshots rather than one. Most likely, if there was a 200-500ms sleep, this wouldn't happen.

So to mitigate this from happening again we can:

add a sleep statement before listing snapshots here:

cosmos-sdk/baseapp/baseapp_test.go

Line 1987 in 6872cae

config: &setupConfig{

also, configure Debug logs in the test. It could be helpful for seeing what's happening here:

cosmos-sdk/snapshots/manager.go

Lines 446 to 458 in 6872cae

    
           m.logger.Info("completed state snapshot", "height", height, "format", snapshot.Format) 
        
           if m.opts.KeepRecent > 0 { 
        
           	m.logger.Debug("pruning state snapshots") 
        
           	pruned, err := m.Prune(m.opts.KeepRecent) 
        
           	if err != nil { 
        
           		m.logger.Error("Failed to prune state snapshots", "err", err) 
        
           		return 
        
           	} 
        
           	m.logger.Debug("pruned state snapshots", "pruned", pruned) 
        
           }

Steps to Reproduce

Wasn't able to reproduce but linked the failed job.

For Admin Use

Not duplicate issue
Appropriate labels applied
Appropriate contributors tagged
Contributor assigned/self-assigned

The text was updated successfully, but these errors were encountered:

tac0turtle · 2023-08-18T15:03:59Z

we havent seen this in a while, closing for now. We can open a new issue if we encounter it

p0mvn added the T:Bug label May 10, 2022

p0mvn mentioned this issue May 10, 2022

Fix flaky CI tests #9010

Closed

17 tasks

tac0turtle closed this as completed Aug 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flaky test: TestSnapshotWithPruning #11923

flaky test: TestSnapshotWithPruning #11923

p0mvn commented May 10, 2022

tac0turtle commented Aug 18, 2023

flaky test: TestSnapshotWithPruning #11923

flaky test: TestSnapshotWithPruning #11923

Comments

p0mvn commented May 10, 2022

Summary of Bug

Steps to Reproduce

For Admin Use

tac0turtle commented Aug 18, 2023