Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workbench | macOS: node fails with divide-by-zero #4380

Closed
deepfire opened this issue Aug 24, 2022 · 8 comments
Closed

workbench | macOS: node fails with divide-by-zero #4380

deepfire opened this issue Aug 24, 2022 · 8 comments
Assignees

Comments

@deepfire
Copy link
Contributor

deepfire commented Aug 24, 2022

@mgmeier reported that running the default profile (make default-autostay) fails on macOS, with nodes bombing out with a divide by zero exception:

workbench:  run:  allocated 2022-08-24-08.20.9b5fe.default-bage @ /Users/michael/IOG/cardano-node/run/2022-08-24-08.20.9b5fe.default-bage
  - run dir:         /Users/michael/IOG/cardano-node/run/2022-08-24-08.20.9b5fe.default-bage
  - profile JSON:    /Users/michael/IOG/cardano-node/run/2022-08-24-08.20.9b5fe.default-bage/profile.json
  - node specs:      /Users/michael/IOG/cardano-node/run/2022-08-24-08.20.9b5fe.default-bage/node-specs.json
  - topology:        /Users/michael/IOG/cardano-node/run/2022-08-24-08.20.9b5fe.default-bage/topology/topology.pdf
  - node base port:  30000
  - RTView URL:              http://localhost:30300
  - EKG URL (node-0):        http://localhost:30100/
  - Prometheus URL (node-0): http://localhost:30200/metrics
  - era:                babbage
  - epoch slots:        600
  - slot duration:      1
  - k:                  3
  - active slots coeff: 0.05
  - hosts:              6
  - pools:              6
    - normal:             6
    - dense:              0
  - UTxO:               0, of which:
    - delegated:          6
    - generated:          43200
    - stuffed:            0
  - delegators:         6
  - generator duration: 1800s
    - requested epochs:   3ep
    - effective epochs:   3ep
    - transaction count:  22kTx
    - full blocks:        119


workbench:  run:  starting 2022-08-24-08.20.9b5fe.default-bage
workbench:  run | scenario:  starting fixed-loaded
tracer: started
workbench:  supervisor:  waiting for cardano-tracer to create socket:OK
node-0: ERROR (spawn error)
node-1: ERROR (spawn error)
node-2: ERROR (spawn error)
node-3: ERROR (spawn error)
node-4: ERROR (spawn error)
node-5: ERROR (spawn error)

workbench:  scenario:  exit trap triggered
tracer: stopped
@deepfire deepfire self-assigned this Aug 24, 2022
@deepfire
Copy link
Contributor Author

Michael also provided these node log files:

2022-08-24-07.50.9b5fe.default-bage_logs_node0.tar.gz

@deepfire
Copy link
Contributor Author

Ok, so the node fails prior to the tracing system initialisation -- this is the last two lines in stdout, after the config dump:

New tracer configuration: TraceConfig {tcOptions = fromList [([],[ConfSeverity {severity = Notice},ConfBackend {backends = [Stdout MachineFormat,EKGBackend,Forwarder]}]),(["BlockFetch","Client"],[ConfSeverity {severity = Info},ConfDetail {detail = DMinimal}]),(["BlockFetch","Client","CompletedBlockFetch"],[ConfLimiter {maxFrequency = 2.0}]),(["BlockFetch","Server"],[ConfSeverity {severity = Info}]),(["ChainDB"],[ConfSeverity {severity = Info}]),(["ChainDB","AddBlockEvent","AddBlockValidation","ValidCandidate"],[ConfLimiter {maxFrequency = 2.0}]),(["ChainDB","AddBlockEvent","AddedBlockToQueue"],[ConfLimiter {maxFrequency = 2.0}]),(["ChainDB","AddBlockEvent","AddedBlockToVolatileDB"],[ConfLimiter {maxFrequency = 2.0}]),(["ChainDB","CopyToImmutableDBEvent","CopiedBlockToImmutableDB"],[ConfLimiter {maxFrequency = 2.0}]),(["ChainSync","Client"],[ConfSeverity {severity = Info},ConfDetail {detail = DMinimal}]),(["ChainSync","ServerBlock"],[ConfSeverity {severity = Info}]),(["ChainSync","ServerHeader"],[ConfSeverity {severity = Info}]),(["Forge"],[ConfSeverity {severity = Info}]),(["Mempool"],[ConfSeverity {severity = Info}]),(["Net","AcceptPolicy"],[ConfSeverity {severity = Info}]),(["Net","DNSResolver"],[ConfSeverity {severity = Info}]),(["Net","ErrorPolicy"],[ConfSeverity {severity = Info}]),(["Net","Subscription"],[ConfSeverity {severity = Info}]),(["Resources"],[ConfSeverity {severity = Info}]),(["Startup","DiffusionInit"],[ConfSeverity {severity = Info}]),(["TxSubmission","Remote"],[ConfDetail {detail = DMinimal}])], tcForwarder = TraceOptionForwarder {tofConnQueueSize = 2000, tofDisconnQueueSize = 200000, tofVerbosity = Minimum}, tcNodeName = Nothing, tcPeerFrequency = Nothing, tcResourceFrequency = Nothing}
"connection with \"../tracer/tracer.socket\" failed: divide by zero"

@mgmeier
Copy link
Contributor

mgmeier commented Aug 24, 2022

  • All nodes of a local cluster fail the exact same way
  • smoking gun from a node's stderr: cardano-node: ExceptionInLinkedThread (ThreadId 16) divide by zero
  • smoking gun from a node's stdout: "connection with \"../tracer/tracer.socket\" failed: divide by zero" after New tracer configuration: TraceConfig {..}

@deepfire
Copy link
Contributor Author

@deepfire
Copy link
Contributor Author

deepfire commented Aug 24, 2022

Another puzzling bit is that runInLoop is expected to gobble up this exception and continue. Yet the node dies (which, in this case is a good outcome, since we have a correctness failure, not a correctable runtime failure) -- but only after the trace IO action succeeds?!:

https://github.com/input-output-hk/cardano-node/blob/bench-master2/trace-dispatcher/src/Cardano/Logging/Utils.hs#L18-L23

@coot
Copy link
Contributor

coot commented Aug 24, 2022

The divide by zero exception is thrown in some thread, it could be that this exception is propagated to the main thread (with something like link which wrapps exceptions in ExceptionInLinkedThread), which you're tryJust does not handle (as it ignore all async exceptions), and thus the runInLoop errors.

iohk-bors bot added a commit that referenced this issue Aug 24, 2022
4377: README:  document package localisation r=deepfire a=deepfire

In the README:

1. Add a ToC.
2. Fix the structure, so the ToC actually makes sense.
3. Fix some minor RST syntax issues.
4. Some very minor fixes.
5. Expand the Development section with the workbench usage info.
6. Add `cabal-plan` and `graphmod` to the workbench shell.
7. Makes the node start on Darwin, when metric collection is enabled, by fixing #4380

Co-authored-by: Kosyrev Serge <[email protected]>
Co-authored-by: Michael Karg <[email protected]>
@deepfire deepfire moved this from In Progress to In Review in Performance & Tracing Aug 24, 2022
@mgmeier
Copy link
Contributor

mgmeier commented Aug 24, 2022

The node's resource tracer started a linked async thread in startResourceTracer which in the end, on Darwin, queried readResourceStatsInternal from Darwin.hsc. This faulty calculation caused a division-by-zero: https://github.com/input-output-hk/cardano-node/blob/bench-master2/trace-resources/src/Cardano/Logging/Resources/Darwin.hsc#L108-L109

The remaining issue is why Darwin reports user_time and system_time as zero at all, which will be handled seperately.

@mgmeier mgmeier closed this as completed Aug 24, 2022
Repository owner moved this from In Review to Done in Performance & Tracing Aug 24, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants