# nebula-chaos

This repository has been deprecated; it is no longer maintained.

Chaos framework for the Storage Service

## Plan Intro

There are some built-in plans in nebula-chaos. Each plan is a json file in the conf directory. A plan needs to specify some instances (usually including nebula graph/meta/storage services) and some actions. The actions field is a collection of actions of different types, which together form a DAG. Dependencies between actions are specified in the depends field, and most actions need to name the related nebula instance in the inst_index field. You can add your own plans based on these rules.
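
As a rough sketch only (the schema details and action type names here are guesses; the json files under conf/ are the authoritative reference), a plan has this general shape:

```json
{
    "name": "example_plan",
    "instances": [
        {"type": "storage", "host": "192.168.0.1"}
    ],
    "actions": [
        {"index": 0, "type": "StartAction",       "inst_index": 0, "depends": []},
        {"index": 1, "type": "WriteCircleAction", "inst_index": 0, "depends": [0]}
    ]
}
```

Here action 1 depends on action 0, so the framework runs the first action to completion before starting the second.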

A utility for drawing a flow chart of a plan is included; use it like this: `python3 src/tools/FlowChart.py conf/scale_up_and_down.json`.

Start all services, write data, create a checkpoint, write some more data, then restore from the checkpoint. Finally, we verify correctness by checking that the data is identical to what existed when the checkpoint was created.
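
For reference, the checkpoint steps correspond to nGQL statements along these lines (a sketch; exact syntax depends on the Nebula version):

```ngql
# Create a cluster-wide checkpoint (snapshot), then list existing snapshots.
CREATE SNAPSHOT;
SHOW SNAPSHOTS;
```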

Clean all WALs of the specified space, then start all services, write a circle, then check data integrity.

Start all services, disturb (randomly kill a storage service, clean its data path, restart it) while writing a circle, then check data integrity.

Start all services, disturb (randomly kill a storage service, truncate some bytes from the last WAL of the specified space and part, restart it) while writing a circle, then check data integrity.

Use integer vids: start all services, disturb (randomly kill and restart a storage service) while writing and reading using integer vids.

Use string vids: start all services, disturb (randomly kill and restart a storage service) while writing and reading using string vids.

Start all services, then kill all storage services and restart them while writing.

Start 3 storage services, add a 4th storage service using balance data while writing a circle, then check data integrity. Then stop the 1st storage service, remove it using balance data while writing a circle, then check data integrity. Likewise, add the 1st storage service back and remove the 4th storage service.
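
The scale up/down steps correspond to nGQL balance commands roughly like these (the address is made up; syntax follows the 1.x era):

```ngql
# Move partitions onto a newly added storage host.
BALANCE DATA;
# Drain and remove a storage host before taking it offline.
BALANCE DATA REMOVE 192.168.0.4:44500;
```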

Start all services, disturb (randomly drop all packets of a storage service, recover it later) while writing a circle, then check data integrity. The network partition is implemented with iptables; a sketch of the rules follows the note below. Make sure the user has sudo rights and can run iptables without a password.

PS: all storage services in random_network_partition and random_traffic_control must be deployed on different IPs. The reason is that we don't know a storage service's source port, so we can only identify the service by its IP.
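
A minimal sketch of the kind of iptables rules such a partition relies on (the address is hypothetical):

```bash
# Isolate the storage host 192.168.0.3: drop all packets to and from it.
sudo iptables -A INPUT  -s 192.168.0.3 -j DROP
sudo iptables -A OUTPUT -d 192.168.0.3 -j DROP
# Recover later by deleting the same rules.
sudo iptables -D INPUT  -s 192.168.0.3 -j DROP
sudo iptables -D OUTPUT -d 192.168.0.3 -j DROP
```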

Start all services, disturb (randomly delay all packets of a storage service, recover it later) while writing a circle, then check data integrity. The traffic shaping is based on tcconfig, a tc command wrapper. Install it first; since it uses the tc and ip commands, run the following so they can be used without being the super user:

```bash
setcap cap_net_admin+ep /usr/sbin/tc
setcap cap_net_raw,cap_net_admin+ep /usr/sbin/ip
```
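
For illustration, injecting and clearing a delay with tcconfig looks roughly like this (device and delay are examples; flags may differ across tcconfig versions):

```bash
# Delay all outgoing packets on eth0 by 100ms.
tcset eth0 --delay 100ms
# Remove all tc rules from eth0 to recover.
tcdel eth0 --all
```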

Start all services, disturb (cat /dev/zero until the disk is full) while writing a circle; the storage services that use the directory should crash. We then clean up the mock file, restart them, and check data integrity at the end.

Use a ramdisk or tmpfs with limited size to test this plan, otherwise the whole disk will be filled.
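
A sketch of such a setup, assuming the storage data path is placed on a small tmpfs (mount point and size are examples):

```bash
# Mount a 512 MB tmpfs so "disk full" is reached quickly and safely.
sudo mount -t tmpfs -o size=512m tmpfs /mnt/nebula-data
# The disturb step amounts to filling the mount with zeros:
cat /dev/zero > /mnt/nebula-data/mock_file
# Clean up the mock file afterwards so the service can restart.
rm /mnt/nebula-data/mock_file
```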

Start all services, disturb (simulate slow disk I/O) while writing a circle, then check data integrity. We use SystemTap to simulate slow disk I/O. The major and minor fields are the MAJOR/MINOR device ids of the disk on which the storage service's data path is mounted.

```bash
yum install systemtap
```

You may need to install kernel-devel and kernel-debuginfo as well (their versions must match the kernel).
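
To fill in the major and minor fields, you can look up the device ids like this (the data path is an example):

```bash
# Find the device backing the storage data path...
df /path/to/nebula/data
# ...then read its MAJ:MIN ids.
lsblk -o NAME,MAJ:MIN,MOUNTPOINT
```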

Start all services, balance leaders, turn off auto_compactions, set wal_ttl to 60s, write about 10 GB of data with five concurrent threads, record the leader distribution of the current space, trigger a forced compaction, turn auto_compactions back on, wait a while, then record the leader distribution again and compare the two to see whether the leaders have changed.
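
The configuration steps map to nGQL along these lines (a sketch; whether these exact options are changeable at runtime depends on the Nebula version):

```ngql
# Disable automatic compactions and shorten the WAL TTL.
UPDATE CONFIGS storage:rocksdb_column_family_options = { disable_auto_compactions = true };
UPDATE CONFIGS storage:wal_ttl = 60;
# Rebalance leaders, then inspect the per-partition leader distribution.
BALANCE LEADER;
SHOW PARTS;
```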

Start all services, balance leaders, turn off auto_compactions, set wal_ttl to 60s, write data using storage perf, stop writing after the specified time, record the leader distribution of the current space, trigger a forced compaction, turn auto_compactions back on, wait a while, then record the leader distribution again and compare the two to see whether the leaders have changed. The storage perf binary must be supplied by the user; stable version git: 1cd031fa.

Start all services, write some data with an index, then check that the index is consistent with the data.

Start all services, write some data, rebuild the index, then check that the index is consistent with the data.

Start all services, write some data, then overwrite the previous data while rebuilding the index at the same time, and check that the index is consistent with the data.
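
These index plans boil down to statements of this shape (tag, index, and property names are made up; exact syntax varies by Nebula version):

```ngql
# Rebuild the index after the data is written or overwritten...
REBUILD TAG INDEX idx_person;
# ...then read through the index and compare with the underlying data.
LOOKUP ON person WHERE person.name == "Tom";
```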