Exhaustively list fatal exceptions in the error policy #1553

edsko · 2020-01-31T12:38:59Z

There is no point listing all exceptions that should shut down the node; we have zero guarantees that this list is exhaustive, and so it is inevitable that some PR at some point is going to introduce an exception not in the list. The default should simply be that the node shuts down and is restarted.

EDIT:
The default case in the error policy is not shutdown, but log + disconnect from peer.

If a particular exception is not handled by any policy, a default kicks in, which currently means logging the exception and disconnecting from the peer (in both directions), but allowing an immediate reconnect. This is fine for exceptions that only affect that peer.

So the new goal of this ticket is to exhaustively list fatal exceptions in the consensus error policy.

1554: Detect clock changes r=edsko a=edsko Closes #759. Code is tested using mock time; result of the test labelling: ``` # cabal run test-consensus -- -p delayClockShift --quickcheck-replay=680184 --quickcheck-tests 10000 Up to date ouroboros-consensus WallClock delayClockShift: OK (18.51s) +++ OK, passed 10000 tests. schedule goes back (10000 in total): 63.07% False 36.93% True schedule length (10000 in total): 81.03% R_Gt 20 9.26% R_Btwn (10,20) 4.08% R_Btwn (5,10) 1.04% R_Btwn (4,5) 1.03% R_Eq 1 0.96% R_Eq 2 0.94% R_Eq 4 0.87% R_Eq 0 0.79% R_Eq 3 schedule skips (10000 in total): 38.66% R_Btwn (10,20) 24.96% R_Btwn (5,10) 5.59% R_Eq 0 5.43% R_Eq 2 5.27% R_Btwn (4,5) 5.12% R_Eq 1 5.08% R_Eq 3 4.99% R_Eq 4 4.90% R_Gt 20 All 1 tests passed (18.52s) ``` I also verified that the exception bubbles up to the node. Ran the latest node with this PR, set my clock back an hour, and got ``` cardano-node: ExceptionInLinkedThread "ThreadId 20" (SystemClockMovedBack 2020-01-31 11:15:31.00184141 UTC (SlotNo {unSlotNo = 3713312}) (SlotNo {unSlotNo = 3713132})) ``` There is no need to define a custom error policy for this due to #1553 , nor a custom exit failure due to #1551 (comment) . Co-authored-by: Edsko de Vries <[email protected]>

mrBliss · 2020-02-28T14:29:33Z

The following comment in https://github.com/input-output-hk/ouroboros-network/blob/master/ouroboros-consensus/src/Ouroboros/Consensus/Node/ErrorPolicy.hs#L38-L43 actually states the opposite:

If a particular exception is not handled by any policy, a default
kicks in, which currently means logging the exception and disconnecting
from the peer (in both directions), but allowing an immediate
reconnect. This is fine for exceptions that only affect that peer.
It is however essential that we handle exceptions here that /must/
shut down the node (mainly storage layer errors).

@coot Is that comment correct? If so, would it make sense to make "shut down" the default?

coot · 2020-03-03T09:13:21Z

Yes that's correct, and the default was chosen because we don't want to accidentally shut down the node, e.g. one might mount an attack in which triggers error exception somewhere in the code. If the default is to shut down, one could shutdown possibly a large portion of network. We want to avoid this, that's why the default is to persist.

@edsko if we miss a chance to shutdown because of an error what can go wrong? The node will keep running, maybe it will self corrupt its db.

Another note: all exceptions are logged (I hope with high severity - I will double check in cardano-node). So there is still information in the logs. Maybe the ones that should kill the node, should be descriptive enough (or contain some keyword that could be parsed by tools).

edsko · 2020-03-03T09:35:11Z

@mrBliss so that means this ticket turns into "Let's make sure we really do exhaustively list all exceptions that should cause a DB revalidation" right?

mrBliss · 2020-03-03T09:37:40Z

@mrBliss so that means this ticket turns into "Let's make sure we really do exhaustively list all exceptions that should cause a DB revalidation" right?

Indeed, I'll update the ticket

edsko · 2020-03-03T09:49:51Z

One exception that is missing:

-- | Failed to strip off the envelope from an encoded header
--
-- This indicates either a bug or disk corruption.
data DropEncodedSizeException =
     DropEncodedSizeError CBOR.DeserialiseFailure
   | DropEncodedSizeTrailingBytes Lazy.ByteString
   deriving (Show)

edsko added consensus issues related to ouroboros-consensus technical debt labels Jan 31, 2020

edsko mentioned this issue Jan 31, 2020

Detect clock changes #1554

Merged

edsko added this to the S8 2020-03-12 milestone Feb 19, 2020

edsko assigned mrBliss Feb 19, 2020

mrBliss changed the title ~~Remove "shutdown node" cases from error policy~~ Exhaustively list fatal exceptions in the error policy Mar 3, 2020

mrBliss added priority high and removed technical debt labels Mar 3, 2020

mrBliss mentioned this issue Mar 3, 2020

ErrorPolicy: include block-specific exceptions #1738

Merged

iohk-bors bot closed this as completed in 83cf065 Mar 3, 2020

iohk-bors bot closed this as completed in #1738 Mar 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exhaustively list fatal exceptions in the error policy #1553

Exhaustively list fatal exceptions in the error policy #1553

edsko commented Jan 31, 2020 •

edited by mrBliss

Loading

mrBliss commented Feb 28, 2020

coot commented Mar 3, 2020

edsko commented Mar 3, 2020

mrBliss commented Mar 3, 2020

edsko commented Mar 3, 2020

Exhaustively list fatal exceptions in the error policy #1553

Exhaustively list fatal exceptions in the error policy #1553

Comments

edsko commented Jan 31, 2020 • edited by mrBliss Loading

mrBliss commented Feb 28, 2020

coot commented Mar 3, 2020

edsko commented Mar 3, 2020

mrBliss commented Mar 3, 2020

edsko commented Mar 3, 2020

edsko commented Jan 31, 2020 •

edited by mrBliss

Loading