Partitioning a cluster of data centers running AntidoteDB can cause :ok g-set adds to not be fully replicated, or in some cases appear on other nodes only to not be present in the final read.
# multiple DCs with no faults: OK
lein run test --topology dcs --workload g-set --nemesis none
# intra-DC partitioning: OK
lein run test --topology nodes --workload g-set --nemesis partition
# inter-DC partitioning: fails
lein run test --topology dcs --workload g-set --nemesis partition
# property-driven tests don't fail on every run, so the test can be run multiple times
lein run test --topology dcs --workload g-set --nemesis partition --test-count 5
The best way to initially interact with the test results is through the web server as described in jepsen-docker-workaround.
Here's a sample workflow tracing an anomaly:
click on invalid test from summary screen
click on results.edn
see that 81 elements are missing from the final reads; pick one, e.g. 136
open history.txt, scroll to the bottom, and see that 136 is only present on the original node
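The same trace can be done from a shell instead of scrolling. This is a minimal sketch, not part of the test suite: `trace_value` is a hypothetical helper name, and it assumes history.txt is plain whitespace-separated text in which the element appears as a standalone token.

```shell
# Hypothetical helper: print every line of a history file that mentions a value.
# $1 = value to trace (e.g. 136), $2 = path to history.txt
trace_value() {
  # -n prefixes each match with its line number
  # -w matches whole words only, so 136 does not also match 1366
  grep -nw -- "$1" "$2"
}

# Example usage; the path assumes Jepsen's usual store/latest symlink:
# trace_value 136 store/latest/history.txt
```

If the value only shows up in lines for the original node, that matches the anomaly described in the steps above.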
Now let's look at an AntidoteDB log file for a node:
from the test summary screen
click on a node name to see all log files from that node
click on the AntidoteDB log of interest
scroll to bottom to observe message loss recovery caused by partitioning
The timeline.html can also be used:
see :ok add for value 136 by worker 4
see it was replicated in read by worker 3 a few transactions later:
But missing from final read by worker 3:
Please ask if there are any questions or desired changes to the test, environment, etc.
P.S. a good way to get a representative feel for what happens during inter dc partitioning:
# run test multiple times regardless of valid? true/false
lein run test-all --topology dcs --workload g-set --nemesis partition --test-count 10
Most will be invalid. Take a quick look at the test summary pages, latency-raw.png to see partition timing/duration and any failed transactions (red/orange), results.edn for total :ok adds missing from final reads, and the general feel in jepsen.log.
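To get a quick list of which runs were invalid without opening each summary page, something like the following may help. It is a rough sketch under assumptions: `invalid_runs` is a hypothetical helper name, the store directory is assumed to follow Jepsen's usual layout with one results.edn per run, and any `:valid? false` in a results file is treated as an invalid run.

```shell
# Hypothetical helper: list the results.edn files that report an invalid check.
# $1 = Jepsen store directory (assumed layout: <store>/<test name>/<timestamp>/results.edn)
invalid_runs() {
  # find does not follow symlinks by default, so symlinked runs are not listed twice
  # grep -l prints only the names of matching files; -F treats the pattern literally
  find "$1" -type f -name results.edn -exec grep -lF ':valid? false' {} +
}

# Example usage:
# invalid_runs store
```

Each path printed corresponds to one run directory, which is where the latency-raw.png, results.edn, and jepsen.log for that run can be found.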
Test failures do seem to group into several patterns:
several sequential adds not fully replicating
adds replicating to a node and then being lost on that node
ZeroMQ getting disrupted, with no further replication for the remainder of the test
Details of the Jepsen test: https://github.com/nurturenature/fuzz_dist/blob/main/doc/antidotedb.md
Jepsen environment configured for AntidoteDB: https://github.com/nurturenature/jepsen-docker-workaround