Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security: Remove the ErrorSlot from the heartbeat and connection tasks #3263

Closed
5 of 6 tasks
Tracked by #3322 ...
teor2345 opened this issue Dec 20, 2021 · 9 comments
Closed
5 of 6 tasks
Tracked by #3322 ...
Assignees
Labels
A-network Area: Network protocol updates or fixes C-security Category: Security issues C-tracking-issue Category: This is a tracking issue for other tasks I-panic Zebra panics with an internal error message

Comments

@teor2345
Copy link
Contributor

teor2345 commented Dec 20, 2021

Motivation

The error handling for the Connection and heartbeat tasks is really messy. It has been really buggy in the past, and it's hard to understand or modify. The way the code is written can cause hangs or panics.

At the moment, it's causing a "failed servers must set their error slot" panic when Zebra is overloaded.

Scheduling

This is a denial of service risk which might be triggerable remotely.

This is high-risk hang code with multiple mutexes, so it should be prioritised over other potential hang or panic bugs.

Sub-Tasks

High-priority ongoing connection panics:

Other panics that are very rare:

Cleanups:

This will give us a much stronger guarantee that the connection task will exit when there's an error.

Related Work

@teor2345 teor2345 added C-cleanup Category: This is a cleanup S-needs-triage Status: A bug report needs triage C-security Category: Security issues I-panic Zebra panics with an internal error message I-hang A Zebra component stops responding to requests I-usability Zebra is hard to understand or use A-network Area: Network protocol updates or fixes A-diagnostics Area: Diagnosing issues or monitoring performance labels Dec 20, 2021
@teor2345 teor2345 removed the C-cleanup Category: This is a cleanup label Dec 20, 2021
@teor2345 teor2345 changed the title Return errors from the heartbeat and connection tasks Security: Return errors correctly from the heartbeat and connection tasks Dec 20, 2021
@teor2345 teor2345 added P-Low and removed P-Low labels Jan 6, 2022
@ftm1000
Copy link

ftm1000 commented Jan 26, 2022

redistributing issues that can be separately worked from Epic #2311

@mpguerra mpguerra mentioned this issue Jan 27, 2022
40 tasks
@ftm1000
Copy link

ftm1000 commented Jan 27, 2022

@teor2345
Copy link
Contributor Author

teor2345 commented Mar 1, 2022

This cleanup isn't needed any time soon.

@teor2345
Copy link
Contributor Author

teor2345 commented Jun 27, 2022

I got a panic in this code, so I'm re-opening this ticket.

I'll add the panic info to the ticket, and the details in a comment.

@teor2345 teor2345 reopened this Jun 27, 2022
@teor2345 teor2345 added P-Medium ⚡ and removed I-usability Zebra is hard to understand or use S-incomplete labels Jun 27, 2022
@teor2345
Copy link
Contributor Author

teor2345 commented Jun 27, 2022

Here is the panic I saw recently:

Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

It looks like it might have been triggered by switching to another chain fork.

Here are the detailed logs:

2022-06-24T14:24:44.018156Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("3fa349d6b60b6a70e23af69513faceafc52ce13061a0538de364691cac2a8252"), auth_digest: AuthDigest("9a830fe29ab5db38201f44e662aa85edd146f943f687b467ad5bfd270888c5e5") })})
2022-06-24T14:24:46.446497Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool::gossip: sending mempool transaction broadcast request=AdvertiseTransactionIds({Witnessed(WtxId { id: transaction::Hash("590e22047e713accbbb12062e178a78948da9ca20d05fa881fcd730333010609"), auth_digest: AuthDigest("ec7dd1511d4c99b4d6907880b3097de0d04be8ac22a93c84d110e5449beb6b03") })})   
2022-06-24T14:24:56.394244Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.34.40.41:8233")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:24:59.349912Z  INFO {zebrad="135aedc" net="Main"}: zebrad::commands::start: finished initial sync to chain tip, using gossiped blocks sync_percent=100.000 % current_height=Height(1714438) remaining_sync_blocks=0 time_since_last_state_block=PT0S                                                                                                                            
2022-06-24T14:25:24.130776Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("157.90.88.178:9058")}:msg_as_req{msg="inv"}:inbound:download_and_verify{hash=0000000000b41efb77d297d134f441cc5e18ef2605f694a90f45d4d89e0dcab6}: zebrad::components::inbound::downloads: downloaded and verified gossiped block height=Height(1714438)
2022-06-24T14:25:24.131484Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("8.209.65.101:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                                                                                                                                                                     
2022-06-24T14:25:24.131481Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.28.203.21:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131285Z  INFO {zebrad="135aedc" net="Main"}: zebrad::components::mempool: resetting mempool: switched best chain, skipped blocks, or activated network upgrade tip_height=Height(1714438)                               
2022-06-24T14:25:24.131403Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("209.141.47.197:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
(repeated approximately 75 times)

The application panicked (crashed).
Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

Metadata:
version: 1.0.0-beta.11+48.g135aedc
Zcash network: Mainnet
state version: 24
branch: min-lookahead-limit
git commit: 135aedc
commit timestamp: 2022-06-22T03:34:32Z
target triple: x86_64-unknown-linux-gnu
build profile: release

   0: zebrad::components::inbound::downloads::download_and_verify with hash=0000000000b8a01c82a9df285251995b67492dc301e14685b362dbf04df41736                                                   
      at zebrad/src/components/inbound/downloads.rs:1792022-06-24T14:25:24.131597Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("3.249.165.235:5001")}:msg_as_req{msg="inv"}: zebra_network::p
eer::connection: inbound service is overloaded, closing connection
2022-06-24T14:25:24.131603Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("84.75.28.247:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                     
2022-06-24T14:25:24.131605Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("147.135.11.134:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
2022-06-24T14:25:24.131611Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("45.43.30.2:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                       
2022-06-24T14:25:24.131612Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("52.211.176.240:5001")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                   
2022-06-24T14:25:24.131613Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("213.252.244.214:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection                                  
2022-06-24T14:25:24.131637Z  INFO {zebrad="135aedc" net="Main"}:{peer=Out("176.31.239.109:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection 

   1: zebrad::components::inbound::inbound
      at zebrad/src/components/inbound.rs:289

   2: zebra_network::peer::connection::msg_as_req with msg="inv"
      at zebra-network/src/peer/connection.rs:979

   3: zebra_network::peer::handshake:: with peer=Out("203.96.179.202:8233")
      at zebra-network/src/peer/handshake.rs:791

   4: zebrad::application:: with zebrad="135aedc" net="Main"
      at zebrad/src/application.rs:376

@teor2345 teor2345 added the S-needs-investigation Status: Needs further investigation label Jun 27, 2022
@ftm1000 ftm1000 removed the S-needs-triage Status: A bug report needs triage label Jun 27, 2022
@ftm1000
Copy link

ftm1000 commented Jun 27, 2022

ticket to be split and re-estimated.

@teor2345
Copy link
Contributor Author

I think I might have found another instance of this bug, but it results in a syncer hang, not a panic.

The logs look like:

2022-06-30T10:06:24.137697Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection

Here are the full logs from the start of the hang:

2022-06-30T09:24:24.090749Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719622) network_upgrade=Nu5 remaining_sync_blocks=1467 time_since_last_state_block=0s
2022-06-30T09:24:42.602806Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=9 extra_hashes=463 lookahead_limit=5
2022-06-30T09:25:24.092622Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=0s
2022-06-30T09:26:24.092922Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=1m
2022-06-30T09:27:24.094536Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719627) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=2m
2022-06-30T09:27:54.159075Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=8 extra_hashes=458 lookahead_limit=5
2022-06-30T09:28:24.095393Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1458 time_since_last_state_block=0s
2022-06-30T09:29:24.096079Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1459 time_since_last_state_block=1m
2022-06-30T09:30:24.097127Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=2m
2022-06-30T09:31:24.098606Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1460 time_since_last_state_block=3m
2022-06-30T09:32:24.099312Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1461 time_since_last_state_block=4m
2022-06-30T09:32:37.021862Z  INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:33:24.100697Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1462 time_since_last_state_block=5m
2022-06-30T09:33:54.462655Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=Invalid { error: Block(Transaction(InternalDowncastError("downcast to known transaction error type failed, original error: Elapsed(())"))), height: Height(1719636), hash: block::Hash("000000000116e5c27c040e82740852a1eb60365360063203161c1ae8e5689ebf") }
2022-06-30T09:33:54.463531Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T09:34:24.102078Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1463 time_since_last_state_block=6m
2022-06-30T09:34:39.021738Z  INFO {zebrad="8ff5146" net="Main"}:crawl_and_dial{new_peer_interval=61s}: zebra_network::peer_set::candidate_set: timeout waiting for peer service readiness or peer responses
2022-06-30T09:35:01.464711Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T09:35:03.020050Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=5 extra_hashes=494 lookahead_limit=5
2022-06-30T09:35:24.103447Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=7m
2022-06-30T09:36:24.104184Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1464 time_since_last_state_block=8m
2022-06-30T09:37:24.105729Z  INFO {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: estimated progress to chain tip sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 remaining_sync_blocks=1465 time_since_last_state_block=9m
2022-06-30T09:38:24.107319Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 10 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=10m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:39:24.108535Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 11 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=11m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T09:40:24.109445Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 12 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.915% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=12m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
...(similar logs for 25 minutes)...
2022-06-30T10:06:24.137697Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 38 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.914% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=38m target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:07:18.412883Z  WARN {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: error downloading and verifying block e=ValidationRequestError { error: Elapsed(()), height: Height(1719631), hash: block::Hash("00000000000db5ffa65d7581bfe54a9aa39e7d257d1920ae76eae2e34974636e") }
2022-06-30T10:07:18.413029Z  INFO {zebrad="8ff5146" net="Main"}:sync: zebrad::components::sync: waiting to restart sync timeout=67s state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266078Z  INFO {zebrad="8ff5146" net="Main"}:sync:try_to_sync: zebrad::components::sync: starting sync, obtaining new tips state_tip=Some(Height(1719629))
2022-06-30T10:09:59.266146Z  WARN {zebrad="8ff5146" net="Main"}: zebrad::components::sync::progress: chain updates have stalled, state height has not increased for 41 minutes. Hint: check your network connection, and your computer clock and time zone sync_percent=99.913% current_height=Height(1719629) network_upgrade=Nu5 time_since_last_state_block=41m 35s target_block_spacing=PT75S max_block_spacing=None is_syncer_stopped=false
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("161.53.201.188:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266574Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("135.181.210.43:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266579Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("104.59.147.15:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266603Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("65.108.79.158:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
2022-06-30T10:09:59.266613Z  INFO {zebrad="8ff5146" net="Main"}:{peer=Out("185.189.149.189:8233")}:msg_as_req{msg="inv"}: zebra_network::peer::connection: inbound service is overloaded, closing connection
... (repeated about 75 times) ...

@teor2345
Copy link
Contributor Author

teor2345 commented Jul 4, 2022

Here is the panic I saw recently:

Message:  failed servers must set their error slot
Location: zebra-network/src/peer/client.rs:496

I saw another instance of the same panic today.

There was no chain fork, so it seems like it is just caused by inbound service overload.

@teor2345 teor2345 added C-tracking-issue Category: This is a tracking issue for other tasks and removed I-hang A Zebra component stops responding to requests S-needs-investigation Status: Needs further investigation A-diagnostics Area: Diagnosing issues or monitoring performance labels Jul 4, 2022
@teor2345 teor2345 changed the title Security: Return errors correctly from the heartbeat and connection tasks Security: Stop using the ErrorSlot in the heartbeat and connection tasks Jul 4, 2022
@teor2345 teor2345 changed the title Security: Stop using the ErrorSlot in the heartbeat and connection tasks Security: Remove the ErrorSlot from the heartbeat and connection tasks Jul 4, 2022
@teor2345
Copy link
Contributor Author

I'm seeing panics in this code every few days now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-network Area: Network protocol updates or fixes C-security Category: Security issues C-tracking-issue Category: This is a tracking issue for other tasks I-panic Zebra panics with an internal error message
Projects
None yet
Development

No branches or pull requests

2 participants