Exhaustively list fatal exceptions in the error policy #1553
1554: Detect clock changes r=edsko a=edsko

Closes #759.

Code is tested using mock time; result of the test labelling:

```
# cabal run test-consensus -- -p delayClockShift --quickcheck-replay=680184 --quickcheck-tests 10000
Up to date
ouroboros-consensus
  WallClock
    delayClockShift: OK (18.51s)
      +++ OK, passed 10000 tests.

      schedule goes back (10000 in total):
      63.07% False
      36.93% True

      schedule length (10000 in total):
      81.03% R_Gt 20
       9.26% R_Btwn (10,20)
       4.08% R_Btwn (5,10)
       1.04% R_Btwn (4,5)
       1.03% R_Eq 1
       0.96% R_Eq 2
       0.94% R_Eq 4
       0.87% R_Eq 0
       0.79% R_Eq 3

      schedule skips (10000 in total):
      38.66% R_Btwn (10,20)
      24.96% R_Btwn (5,10)
       5.59% R_Eq 0
       5.43% R_Eq 2
       5.27% R_Btwn (4,5)
       5.12% R_Eq 1
       5.08% R_Eq 3
       4.99% R_Eq 4
       4.90% R_Gt 20

All 1 tests passed (18.52s)
```

I also verified that the exception bubbles up to the node. Ran the latest node with this PR, set my clock back an hour, and got

```
cardano-node: ExceptionInLinkedThread "ThreadId 20" (SystemClockMovedBack 2020-01-31 11:15:31.00184141 UTC (SlotNo {unSlotNo = 3713312}) (SlotNo {unSlotNo = 3713132}))
```

There is no need to define a custom error policy for this due to #1553, nor a custom exit failure due to #1551 (comment).

Co-authored-by: Edsko de Vries <[email protected]>
The following comment in https://github.com/input-output-hk/ouroboros-network/blob/master/ouroboros-consensus/src/Ouroboros/Consensus/Node/ErrorPolicy.hs#L38-L43 actually states the opposite:
@coot Is that comment correct? If so, would it make sense to make "shut down" the default?

Yes, that's correct, and the default was chosen because we don't want to accidentally shut down the node, e.g. one might mount an attack that triggers an exception somewhere in the code. If the default were to shut down, an attacker could shut down a possibly large portion of the network. We want to avoid this, which is why the default is to persist. @edsko if we miss a chance to shut down because of an error, what can go wrong? The node will keep running, and it might corrupt its own DB. Another note: all exceptions are logged (I hope with high severity - I will double check in

@mrBliss so that means this ticket turns into "Let's make sure we really do exhaustively list all exceptions that should cause a DB revalidation", right?

Indeed, I'll update the ticket.
One exception that is missing:

    -- | Failed to strip off the envelope from an encoded header
    --
    -- This indicates either a bug or disk corruption.
    data DropEncodedSizeException =
        DropEncodedSizeError CBOR.DeserialiseFailure
      | DropEncodedSizeTrailingBytes Lazy.ByteString
      deriving (Show)
There is no point listing all exceptions that should shut down the node; we have zero guarantees that this list is exhaustive, and so it is inevitable that some PR at some point is going to introduce an exception not in the list. The default should simply be that the node shuts down and is restarted.

EDIT:
The default case in the error policy is not shutdown, but log + disconnect from peer.
So the new goal of this ticket is to exhaustively list fatal exceptions in the consensus error policy.
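
To illustrate the shape of such a policy, here is a minimal sketch: an explicit list of fatal exception types that shut down the node, with every unlisted exception falling through to the default of disconnecting from the peer. All names here (`Verdict`, `Fatal`, `fatalExceptions`, `classify`, `DbCorruption`, `PeerProtocolError`) are hypothetical stand-ins and not the actual `ErrorPolicy` API in ouroboros-network.

```haskell
{-# LANGUAGE ScopedTypeVariables #-}
module ErrorPolicySketch where

import Control.Exception (Exception, SomeException, fromException, toException)
import Data.Maybe (isJust)
import Data.Proxy (Proxy (..))

-- | Hypothetical outcome of classifying an exception.
data Verdict = ShutdownNode | DisconnectPeer
  deriving (Eq, Show)

-- | Hypothetical exception standing in for "bug or disk corruption".
newtype DbCorruption = DbCorruption String
  deriving (Show)

instance Exception DbCorruption

-- | Hypothetical exception standing in for a misbehaving peer.
newtype PeerProtocolError = PeerProtocolError String
  deriving (Show)

instance Exception PeerProtocolError

-- | One entry in the list of fatal exceptions: a predicate that
-- recognises a single exception type.
newtype Fatal = Fatal (SomeException -> Bool)

-- | Build an entry for the exception type named by the proxy.
fatal :: forall e. Exception e => Proxy e -> Fatal
fatal _ = Fatal (\se -> isJust (fromException se :: Maybe e))

-- | The list that has to be kept exhaustive by hand.
fatalExceptions :: [Fatal]
fatalExceptions =
  [ fatal (Proxy :: Proxy DbCorruption)
  ]

-- | Listed exceptions shut down the node; anything else falls
-- through to the default of disconnecting from the peer.
classify :: SomeException -> Verdict
classify se
  | any (\(Fatal p) -> p se) fatalExceptions = ShutdownNode
  | otherwise                                = DisconnectPeer

-- | Convenience wrapper for concrete exception values.
classifyE :: Exception e => e -> Verdict
classifyE = classify . toException
```

In this sketch, an exception type that is missing from `fatalExceptions` silently gets the `DisconnectPeer` default, which is why the list needs to be audited whenever a new exception such as `DropEncodedSizeException` is introduced.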