Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exhaustively list fatal exceptions in the error policy #1553

Closed
edsko opened this issue Jan 31, 2020 · 5 comments · Fixed by #1738
Closed

Exhaustively list fatal exceptions in the error policy #1553

edsko opened this issue Jan 31, 2020 · 5 comments · Fixed by #1738
Assignees
Labels
consensus issues related to ouroboros-consensus
Milestone

Comments

@edsko
Copy link
Contributor

edsko commented Jan 31, 2020

There is no point listing all exceptions that should shut down the node; we have zero guarantees that this list is exhaustive, and so it is inevitable that some PR at some point is going to introduce an exception not in the list. The default should simply be that the node shuts down and is restarted.

EDIT:
The default case in the error policy is not shutdown, but log + disconnect from peer.

If a particular exception is not handled by any policy, a default kicks in, which currently means logging the exception and disconnecting from the peer (in both directions), but allowing an immediate reconnect. This is fine for exceptions that only affect that peer.

So the new goal of this ticket is to exhaustively list fatal exceptions in the consensus error policy.

@edsko edsko added consensus issues related to ouroboros-consensus technical debt labels Jan 31, 2020
iohk-bors bot added a commit that referenced this issue Feb 7, 2020
1554: Detect clock changes r=edsko a=edsko

Closes #759.

Code is tested using mock time; result of the test labelling:

```
# cabal run test-consensus -- -p delayClockShift --quickcheck-replay=680184 --quickcheck-tests 10000
Up to date
ouroboros-consensus
  WallClock
    delayClockShift: OK (18.51s)
      +++ OK, passed 10000 tests.
      
      schedule goes back (10000 in total):
      63.07% False
      36.93% True
      
      schedule length (10000 in total):
      81.03% R_Gt 20
       9.26% R_Btwn (10,20)
       4.08% R_Btwn (5,10)
       1.04% R_Btwn (4,5)
       1.03% R_Eq 1
       0.96% R_Eq 2
       0.94% R_Eq 4
       0.87% R_Eq 0
       0.79% R_Eq 3
      
      schedule skips (10000 in total):
      38.66% R_Btwn (10,20)
      24.96% R_Btwn (5,10)
       5.59% R_Eq 0
       5.43% R_Eq 2
       5.27% R_Btwn (4,5)
       5.12% R_Eq 1
       5.08% R_Eq 3
       4.99% R_Eq 4
       4.90% R_Gt 20

All 1 tests passed (18.52s)
```

I also verified that the exception bubbles up to the node. Ran the latest node with this PR, set my clock back an hour, and got

```
cardano-node: ExceptionInLinkedThread "ThreadId 20" (SystemClockMovedBack 2020-01-31 11:15:31.00184141 UTC (SlotNo {unSlotNo = 3713312}) (SlotNo {unSlotNo = 3713132}))
```

There is no need to define a custom error policy for this due to #1553 , nor a custom exit failure due to #1551 (comment) .

Co-authored-by: Edsko de Vries <[email protected]>
@edsko edsko added this to the S8 2020-03-12 milestone Feb 19, 2020
@mrBliss
Copy link
Contributor

mrBliss commented Feb 28, 2020

The following comment in https://github.com/input-output-hk/ouroboros-network/blob/master/ouroboros-consensus/src/Ouroboros/Consensus/Node/ErrorPolicy.hs#L38-L43 actually states the opposite:

If a particular exception is not handled by any policy, a default
kicks in, which currently means logging the exception and disconnecting
from the peer (in both directions), but allowing an immediate
reconnect. This is fine for exceptions that only affect that peer.
It is however essential that we handle exceptions here that /must/
shut down the node (mainly storage layer errors).

@coot Is that comment correct? If so, would it make sense to make "shut down" the default?

@coot
Copy link
Contributor

coot commented Mar 3, 2020

Yes that's correct, and the default was chosen because we don't want to accidentally shut down the node, e.g. one might mount an attack in which triggers error exception somewhere in the code. If the default is to shut down, one could shutdown possibly a large portion of network. We want to avoid this, that's why the default is to persist.

@edsko if we miss a chance to shutdown because of an error what can go wrong? The node will keep running, maybe it will self corrupt its db.

Another note: all exceptions are logged (I hope with high severity - I will double check in cardano-node). So there is still information in the logs. Maybe the ones that should kill the node, should be descriptive enough (or contain some keyword that could be parsed by tools).

@edsko
Copy link
Contributor Author

edsko commented Mar 3, 2020

@mrBliss so that means this ticket turns into "Let's make sure we really do exhaustively list all exceptions that should cause a DB revalidation" right?

@mrBliss
Copy link
Contributor

mrBliss commented Mar 3, 2020

@mrBliss so that means this ticket turns into "Let's make sure we really do exhaustively list all exceptions that should cause a DB revalidation" right?

Indeed, I'll update the ticket

@mrBliss mrBliss changed the title Remove "shutdown node" cases from error policy Exhaustively list fatal exceptions in the error policy Mar 3, 2020
@edsko
Copy link
Contributor Author

edsko commented Mar 3, 2020

One exception that is missing:

-- | Failed to strip off the envelope from an encoded header
--
-- This indicates either a bug or disk corruption.
data DropEncodedSizeException =
     DropEncodedSizeError CBOR.DeserialiseFailure
   | DropEncodedSizeTrailingBytes Lazy.ByteString
   deriving (Show)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consensus issues related to ouroboros-consensus
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants