raft/rafttest: introduce datadriven testing #11005

tbg · 2019-08-07T17:15:46Z

It has often been tedious to test the interactions between the members of a
multi-member Raft group, especially when many steps were required to reach
a certain scenario. Often, this boilerplate was as boring as it was hard to
write and maintain, making it attractive to resort to shortcuts whenever
possible, which in turn tended to undercut how meaningful and maintainable
the tests ended up being. That is, if the tests were even written, which
sometimes they weren't.

This change introduces a datadriven framework specifically for
deterministically testing the interaction between multiple members of a
raft group, with the goal of reducing the friction for writing these tests
to near zero.
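
To make the "datadriven" idea concrete, here is a minimal, stdlib-only sketch of such a runner. It is not the code in this PR and does not use the datadriven package. The script layout (a command line, a `----` separator, then the expected output), the `add-nodes` and `status` commands, and all type and function names are assumptions made up for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// directive is one block of a script: a command, its arguments, and the
// output the test author expects the command to produce.
type directive struct {
	cmd      string
	args     []string
	expected string
}

// parseScript splits a script into directives. Blocks are separated by blank
// lines; within a block, everything after the "----" line is the expected
// output. This toy format cannot express expected output with blank lines.
func parseScript(script string) ([]directive, error) {
	var out []directive
	for _, block := range strings.Split(strings.TrimSpace(script), "\n\n") {
		parts := strings.SplitN(block+"\n", "\n----\n", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("missing ---- separator in %q", block)
		}
		fields := strings.Fields(parts[0])
		if len(fields) == 0 {
			return nil, fmt.Errorf("empty command in %q", block)
		}
		out = append(out, directive{fields[0], fields[1:], strings.TrimSpace(parts[1])})
	}
	return out, nil
}

// toyEnv is a hypothetical stand-in for the system under test. It only
// tracks node names and keeps state across directives.
type toyEnv struct{ nodes []string }

// handle executes one command and returns its textual output.
func (e *toyEnv) handle(cmd string, args []string) string {
	switch cmd {
	case "add-nodes":
		if len(args) != 1 {
			return "error: add-nodes takes one argument"
		}
		n, err := strconv.Atoi(args[0])
		if err != nil || n < 0 {
			return "error: bad node count"
		}
		for i := 0; i < n; i++ {
			e.nodes = append(e.nodes, fmt.Sprintf("n%d", len(e.nodes)+1))
		}
		return strings.Join(e.nodes, " ")
	case "status":
		return fmt.Sprintf("%d nodes", len(e.nodes))
	default:
		return "error: unknown command " + cmd
	}
}

// runScript feeds every directive to handle, in order, and describes each
// case where the actual output differs from the expected output.
func runScript(script string, handle func(string, []string) string) ([]string, error) {
	ds, err := parseScript(script)
	if err != nil {
		return nil, err
	}
	var mismatches []string
	for _, d := range ds {
		if got := handle(d.cmd, d.args); got != d.expected {
			mismatches = append(mismatches, fmt.Sprintf("%s: got %q, want %q", d.cmd, got, d.expected))
		}
	}
	return mismatches, nil
}

func main() {
	script := "add-nodes 2\n----\nn1 n2\n\nadd-nodes 1\n----\nn1 n2 n3\n\nstatus\n----\n3 nodes\n"
	mismatches, err := runScript(script, (&toyEnv{}).handle)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	if len(mismatches) == 0 {
		fmt.Println("all directives matched")
	}
	for _, m := range mismatches {
		fmt.Println("mismatch:", m)
	}
}
```

The point of the shape is that a new scenario costs a few lines of script rather than a new Go test function, and the expected output sits right next to the command that produced it.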

In the near term, this will be used to add thorough testing for joint
consensus (which is already available today, but wildly undertested). But
just converting an existing test into this framework has shown that the
concise representation and built-in inspection of log messages highlight
unexpected behavior much more readily than the previous unit tests did (the
test in question is snapshot_succeed_via_app_resp; the reader is invited
to compare the old and new versions of it).

The main building block is InteractionEnv, which holds on to the state of
the whole system and exposes various relevant methods for manipulating it,
including but not limited to adding nodes, delivering and dropping
messages, and proposing configuration changes. All of this is extensible,
and in the future I hope to use it to explore the phenomena discussed in

#7625 (comment)

which requires injecting appropriate "crash points" in the Ready handling
loop. Discussions of the form "what if X happened in state Y" can quickly
be made concrete by "scripting up an interaction test".
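
As a rough illustration of what "holding the state of the whole system" can look like, here is a self-contained toy environment with explicit message delivery and dropping. It is a sketch only: `toyEnv` and its methods are hypothetical names, not the InteractionEnv API added in this PR, and no raft logic is involved.

```go
package main

import (
	"fmt"
	"strings"
)

// message is an in-flight message between two nodes of the toy system.
type message struct {
	from, to int
	body     string
}

// toyEnv owns all state: the nodes, the undelivered messages, and a log of
// everything that happened. Nothing runs concurrently, so a given sequence
// of calls always produces the same log.
type toyEnv struct {
	nodes   map[int][]string // node ID -> message bodies received so far
	pending []message        // sent but neither delivered nor dropped
	log     strings.Builder
}

func newToyEnv() *toyEnv { return &toyEnv{nodes: map[int][]string{}} }

// addNodes creates n nodes with consecutive IDs.
func (e *toyEnv) addNodes(n int) {
	for i := 0; i < n; i++ {
		id := len(e.nodes) + 1
		e.nodes[id] = nil
		fmt.Fprintf(&e.log, "added node %d\n", id)
	}
}

// send queues a message. It is not delivered until the test says so.
func (e *toyEnv) send(from, to int, body string) {
	e.pending = append(e.pending, message{from, to, body})
	fmt.Fprintf(&e.log, "%d->%d queued %q\n", from, to, body)
}

// process removes every pending message addressed to node `to` and either
// delivers or drops it. It returns how many messages were affected.
func (e *toyEnv) process(to int, deliver bool) int {
	var kept []message
	n := 0
	for _, m := range e.pending {
		if m.to != to {
			kept = append(kept, m)
			continue
		}
		n++
		if deliver {
			e.nodes[to] = append(e.nodes[to], m.body)
			fmt.Fprintf(&e.log, "%d->%d delivered %q\n", m.from, m.to, m.body)
		} else {
			fmt.Fprintf(&e.log, "%d->%d dropped %q\n", m.from, m.to, m.body)
		}
	}
	e.pending = kept
	return n
}

func main() {
	env := newToyEnv()
	env.addNodes(3)
	env.send(1, 2, "heartbeat")
	env.send(1, 3, "heartbeat")
	env.process(3, false) // simulate losing the message to node 3
	env.process(2, true)  // node 2 receives its message
	fmt.Print(env.log.String())
}
```

Because the test decides when each message is delivered or dropped, a "what if X happened in state Y" question turns into a short, reproducible sequence of calls whose log can be inspected afterwards.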

Additionally, this framework is intentionally not kept internal to the raft
package. Though this is in its infancy, a goal is for applications to be
able to validate that their Storage implementation behaves as raft expects,
simply by running a raft-provided interaction suite against their Storage.
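
The general pattern behind that goal, one fixed suite run against any implementation of an interface, might look like the sketch below. The `logStore` interface here is a hypothetical, heavily reduced stand-in and is not raft's Storage interface; `memStore` and `checkStore` are likewise invented for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// logStore is a hypothetical miniature storage interface (not raft.Storage).
type logStore interface {
	Append(term uint64) uint64         // appends an entry, returns its index
	LastIndex() uint64                 // 0 when the log is empty
	Term(index uint64) (uint64, error) // term of the entry at index
}

// memStore is a trivial in-memory implementation used as the example target.
type memStore struct{ terms []uint64 }

func (s *memStore) Append(term uint64) uint64 {
	s.terms = append(s.terms, term)
	return uint64(len(s.terms))
}

func (s *memStore) LastIndex() uint64 { return uint64(len(s.terms)) }

func (s *memStore) Term(i uint64) (uint64, error) {
	if i == 0 || i > uint64(len(s.terms)) {
		return 0, errors.New("index out of range")
	}
	return s.terms[i-1], nil
}

// checkStore runs the same fixed sequence of operations against any
// implementation and returns the first violated expectation, if any.
// It assumes it is handed an empty store.
func checkStore(s logStore) error {
	if got := s.LastIndex(); got != 0 {
		return fmt.Errorf("empty store: LastIndex = %d, want 0", got)
	}
	for i, term := range []uint64{1, 1, 2} {
		want := uint64(i + 1)
		if got := s.Append(term); got != want {
			return fmt.Errorf("Append returned index %d, want %d", got, want)
		}
		if got, err := s.Term(want); err != nil || got != term {
			return fmt.Errorf("Term(%d) = %d, %v; want %d, nil", want, got, err, term)
		}
	}
	if _, err := s.Term(s.LastIndex() + 1); err == nil {
		return errors.New("Term past LastIndex: want an error, got nil")
	}
	return nil
}

func main() {
	if err := checkStore(&memStore{}); err != nil {
		fmt.Println("violation:", err)
		return
	}
	fmt.Println("store passed the suite")
}
```

An application would pass its own implementation to the suite in place of `memStore`.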

raft/rafttest/interaction_env_handler.go

raft/rafttest/interaction_env_handler_campaign.go

raft/testdata/confchange_v1.txt

raft/rafttest/interaction_env_handler_compact.go

raft/rafttest/interaction_env_logger.go

raft/rafttest/interaction_env_handler_handle_ready.go

raft/rafttest/interaction_env.go

raft/rafttest/interaction_env_handler_propose_conf_change.go

raft/rafttest/interaction_env_handler_stabilize.go

raft/testdata/campaign.txt

raft/testdata/snapshot_succeed_via_app_resp.txt

tbg · 2019-08-09T22:58:42Z

Thanks for the reviews so far. I still need to address a few more comments and do some more cleanup; will ping when it's ready.

codecov-io · 2019-08-09T23:28:50Z

Codecov Report

Merging #11005 into master will increase coverage by 0.2%.
The diff coverage is 83.33%.

@@            Coverage Diff            @@
##           master   #11005     +/-   ##
=========================================
+ Coverage   63.98%   64.18%   +0.2%     
=========================================
  Files         402      402             
  Lines       37646    37712     +66     
=========================================
+ Hits        24087    24205    +118     
+ Misses      11955    11897     -58     
- Partials     1604     1610      +6

| Impacted Files | Coverage Δ | |
|---|---|---|
| raft/tracker/progress.go | `97.72% <ø> (ø)` | ⬆️ |
| raft/raft.go | `90.51% <100%> (+0.34%)` | ⬆️ |
| raft/util.go | `82.17% <81.53%> (+9.7%)` | ⬆️ |
| pkg/netutil/netutil.go | `63.11% <0%> (-7.38%)` | ⬇️ |
| pkg/fileutil/purge.go | `65.9% <0%> (-6.82%)` | ⬇️ |
| etcdserver/api/v3rpc/lease.go | `69.31% <0%> (-5.69%)` | ⬇️ |
| clientv3/concurrency/election.go | `79.68% <0%> (-2.35%)` | ⬇️ |
| pkg/mock/mockserver/mockserver.go | `72.72% <0%> (-2.28%)` | ⬇️ |
| clientv3/balancer/balancer.go | `86.36% <0%> (-2.28%)` | ⬇️ |
| clientv3/maintenance.go | `40.81% <0%> (-2.05%)` | ⬇️ |
| ... and 16 more | | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 4a4629f...09788c9.

tbg · 2019-08-12T09:52:49Z

Going to merge this now, but happy to address any follow-up comments.

tbg requested a review from bdarnell August 7, 2019 17:15

tbg mentioned this pull request Aug 8, 2019

raft: add TODO about broadcasting commit index after conf change #11002

Closed

bdarnell approved these changes Aug 8, 2019

raft/rafttest/interaction_env_handler.go (resolved)

raft/rafttest/interaction_env_handler_campaign.go (outdated, resolved)

raft/testdata/confchange_v1.txt (resolved)

raft/testdata/confchange_v1.txt (outdated, resolved)

nvanbenschoten reviewed Aug 8, 2019

vendor: bump datadriven

f57c16c

Picks up some fixes for papercuts.

tbg force-pushed the interactiontest branch from 5f1e2f7 to 09788c9 August 9, 2019 22:57

tbg force-pushed the interactiontest branch from 09788c9 to e8090e5 August 12, 2019 09:13

tbg changed the title ~~[wip] raft/rafttest: introduce datadriven testing~~ raft/rafttest: introduce datadriven testing Aug 12, 2019

tbg merged commit 029401a into etcd-io:master Aug 12, 2019

tbg deleted the interactiontest branch August 12, 2019 09:52

BusyJay mentioned this pull request Jun 19, 2020

port etcd joint consensus tikv/raft-rs#378

Open

8 tasks
