Create a test that stress tests the network layer in the case of three or more intermittently failing peers. A failing peer is defined as a peer that fails to send, receive, or persist network messages.
To achieve this, we want to prove through chaos engineering that the off-chain protocol can survive and recover from network failures.
A script that sets up the environment with the required components and infrastructure specifications.
it should prepare:
1 shared cardano node.
slot length should be configurable via a script argument.
3 hydra nodes, each with its own volume and hydraw instance.
each with a custom event-sink configured.
the hydra explorer
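As a rough illustration of the environment described above, the bootstrap script's inputs could be modelled as a small specification built from command-line arguments. This is only a sketch: the flag names (`--slot-length`, `--hydra-nodes`), the defaults, and the naming scheme for nodes, volumes, hydraw instances, and event sinks are all hypothetical and not taken from the Hydra codebase.

```python
import argparse
from dataclasses import dataclass, field
from typing import List


@dataclass
class HydraNodeSpec:
    name: str
    volume: str      # each hydra node gets its own volume
    hydraw: str      # the hydraw instance paired with this node
    event_sink: str  # identifier of the custom event sink (hypothetical naming)


@dataclass
class ClusterSpec:
    slot_length: float               # passed through to the shared cardano node
    hydra_nodes: List[HydraNodeSpec] = field(default_factory=list)
    cardano_nodes: int = 1           # a single shared cardano node
    explorer: bool = True            # run the hydra explorer alongside


def build_cluster_spec(argv: List[str]) -> ClusterSpec:
    """Parse script arguments and describe the cluster to bootstrap."""
    parser = argparse.ArgumentParser(description="cluster bootstrap (sketch)")
    # Both flags and their defaults are assumptions made for this example.
    parser.add_argument("--slot-length", type=float, default=1.0)
    parser.add_argument("--hydra-nodes", type=int, default=3)
    args = parser.parse_args(argv)
    if args.slot_length <= 0:
        parser.error("--slot-length must be positive")

    nodes = [
        HydraNodeSpec(
            name=f"hydra-node-{i}",
            volume=f"hydra-node-{i}-data",
            hydraw=f"hydraw-{i}",
            event_sink=f"event-sink-{i}",
        )
        for i in range(1, args.hydra_nodes + 1)
    ]
    return ClusterSpec(slot_length=args.slot_length, hydra_nodes=nodes)
```

Calling `build_cluster_spec(["--slot-length", "0.1"])` yields a specification with three hydra nodes, each with a distinct volume; the actual provisioning (containers, networks, keys) is left out of the sketch.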
test driver
an HTTP server that waits for operator commands to execute a plan.
holds a copy of the signing keys.
runs execution plans (orchestration scripts) on demand.
plans send HTTP requests to each hydraw instance.
plans can be configured to wait until the last submitted transaction is confirmed in a snapshot before processing the next step (one step at a time); keeping this configurable makes plans reusable for stress testing.
plans can be empty.
plans can contain failure instructions that, during execution, change the infrastructure to cause network failures (e.g., delayed, lost, duplicated, or re-ordered packets) or shut down nodes to cause service unavailability and restart them after some period of time.
a plan stops when it cannot make progress after several attempts/retries.
client inputs and failure instructions can be introduced manually by the operator during execution.
that means plans maintain WS connections to each hydra node.
failure instructions are executed using one of these tools:
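To make the plan semantics described for the test driver more concrete, here is a minimal sketch of a plan runner. It is an illustration under stated assumptions, not the intended implementation: the step types (`SubmitTx`, `InjectFailure`), the `PlanStalled` exception, and the callback parameters (`submit`, `is_confirmed`, `inject`) are hypothetical names, and the concrete tooling behind `inject` is deliberately left open.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class SubmitTx:
    node: str      # hydraw instance that receives the HTTP request
    payload: dict  # request body (shape is an assumption)


@dataclass
class InjectFailure:
    node: str
    kind: str        # e.g. "delay", "loss", "duplicate", "reorder", "shutdown"
    duration: float  # seconds before the failure is lifted or the node restarts


Step = Union[SubmitTx, InjectFailure]


class PlanStalled(Exception):
    """Raised when a step cannot progress after the allowed attempts."""


def run_plan(
    steps: List[Step],
    submit: Callable[[SubmitTx], str],
    is_confirmed: Callable[[str], bool],
    inject: Callable[[InjectFailure], None],
    wait_for_confirmation: bool = True,
    max_attempts: int = 5,
    retry_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run the steps in order and return how many were completed."""
    completed = 0
    for step in steps:
        if isinstance(step, InjectFailure):
            inject(step)  # delegate to whatever failure tooling is in use
            completed += 1
            continue
        tx_id = submit(step)
        if wait_for_confirmation:
            # Poll until the transaction shows up in a confirmed snapshot.
            for _ in range(max_attempts):
                if is_confirmed(tx_id):
                    break
                sleep(retry_delay)
            else:
                raise PlanStalled(f"{tx_id} not confirmed after {max_attempts} attempts")
        completed += 1
    return completed
```

An empty plan simply returns `0`. Setting `wait_for_confirmation=False` submits transactions back to back, which is the mode a stress test would use; the `sleep` parameter is injectable only so the runner can be exercised without real delays.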
Why
Currently, in our test strategy hierarchy, we are not covering scenarios where real networking failures can occur.
We strive for long-living heads, so providing proofs of resiliency and fault tolerance is essential.
This will open future opportunities to explore different network protocols (such as UDP) and implementations.
What
Motivated by #1436, we aim to prove through chaos engineering that the off-chain protocol can survive and recover from network failures.
For that, we need to:
Prepare a local demo.
Run it on a real cluster in the cloud.
Both require us to build:
cluster bootstrap script
test driver
test observer