Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flaky test: TestSnapshotWithPruning #11923

Closed
4 tasks
Tracked by #9010
p0mvn opened this issue May 10, 2022 · 1 comment
Closed
4 tasks
Tracked by #9010

flaky test: TestSnapshotWithPruning #11923

p0mvn opened this issue May 10, 2022 · 1 comment
Labels

Comments

@p0mvn
Copy link
Member

p0mvn commented May 10, 2022

Summary of Bug

A test-race failure was observed in CI: https://github.com/cosmos/cosmos-sdk/runs/6372160448?check_suite_focus=true

Upon initial investigation, it was concluded that the test is likely to be flaky. Tried rerunning the test 5 times locally and was not able to reproduce.

go test -count=5 -timeout 30m -race -run ^TestSnapshotWithPruning$ github.com/cosmos/cosmos-sdk/baseapp
ok      github.com/cosmos/cosmos-sdk/baseapp    372.122s

Making this issue to keep track of the problem, in case it happens again.

This is the conclusion from the initial investigation:

According to the logs, everything works as expected besides snapshot store pruning.

m.logger.Info("completed state snapshot", "height", height, "format", snapshot.Format)
if m.opts.KeepRecent > 0 {
m.logger.Debug("pruning state snapshots")
pruned, err := m.Prune(m.opts.KeepRecent)
if err != nil {
m.logger.Error("Failed to prune state snapshots", "err", err)
return
}
m.logger.Debug("pruned state snapshots", "pruned", pruned)
}

When we list snapshots in tests, we expect only the latest one (at height 20) to be returned. However, it returns both height 15 and 20, indicating that the one at height 15 is still present when it shouldn't.

If you check out the logs and compare them to the code linked above, there is a log for:

2022-05-10T14:54:21Z INFO completed state snapshot format=2 height=20 module=sdk/app

but no log saying:

m.logger.Error("Failed to prune state snapshots", "err", err)

which should have showed up at the info level if there was indeed an error. I think that the test is simply flaky. More specifically, we end up listing snapshots before the pruning completes. As a result, we get 2 snapshots rather than one. Most likely, if there was a 200-500ms sleep, this wouldn't happen.

So to mitigate this from happening again we can:

  • add a sleep statement before listing snapshots here:
    config: &setupConfig{
  • also, configure Debug logs in the test. It could be helpful for seeing what's happening here:
    m.logger.Info("completed state snapshot", "height", height, "format", snapshot.Format)
    if m.opts.KeepRecent > 0 {
    m.logger.Debug("pruning state snapshots")
    pruned, err := m.Prune(m.opts.KeepRecent)
    if err != nil {
    m.logger.Error("Failed to prune state snapshots", "err", err)
    return
    }
    m.logger.Debug("pruned state snapshots", "pruned", pruned)
    }

Steps to Reproduce

Wasn't able to reproduce but linked the failed job.

For Admin Use

  • Not duplicate issue
  • Appropriate labels applied
  • Appropriate contributors tagged
  • Contributor assigned/self-assigned
@p0mvn p0mvn added the T:Bug label May 10, 2022
@p0mvn p0mvn mentioned this issue May 10, 2022
17 tasks
@tac0turtle
Copy link
Member

we havent seen this in a while, closing for now. We can open a new issue if we encounter it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants