Describe the bug
The neard process has been unstable since upgrading to 1.37.0-rc.2 on multiple nodes (active validators and backup). The issue seems to have started after the resharding. The process crashes at seemingly random times. So far I have checked 5 different nodes in multiple datacenters.
To Reproduce
It seems to be somewhat random, but on all 5 nodes that have been checked the service has crashed within the last 2-3 hours as of the time of writing. It seems to happen multiple times a day on all nodes.
Expected behavior
Process runs without crashing
Screenshots
Logs before the crash (all the WARNs in the testnet release make it hard to catch)
(Active Node)
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: thread 'actix-rt|system:0|arbiter:831' has overflowed its stack
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: fatal runtime error: stack overflow
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 28.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 neard[182471]: 2024-02-13T00:54:09.911073Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64
(Backup Node)
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: thread 'actix-rt|system:0|arbiter:149' has overflowed its stack
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: fatal runtime error: stack overflow
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 35.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 neard[154742]: 2024-02-13T00:25:50.291390Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64
Memory stack from one of the nodes.
Version (please complete the following information):
nearcore 1.37.0-rc.2 (Locally compiled)
rust version (if local): rustc 1.72.0
network: testnet
VMs: VMware VMs
OS: Ubuntu 22.04.3 LTS
Cores: 8
RAM: 32GB
Storage: 1TB (NVME)
Additional context
Seems to have started only since the resharding in this release.
Seems to occur on both active and backup nodes.
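Since the WARN lines make the crashes hard to spot, a small filter over the journal output can help when comparing nodes. The following is a minimal sketch, not part of nearcore: the function name `summarize_crashes` and the regular expressions are assumptions written against the exact log format shown above, and may need adjusting for other journald configurations.

```python
import re
from collections import defaultdict

# Patterns are assumptions based on the log lines quoted in this report.
OVERFLOW_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) neard\[(?P<pid>\d+)\]: "
    r"thread '(?P<thread>[^']+)' has overflowed its stack"
)
RESTART_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) systemd\[1\]: "
    r"neard\.service: Scheduled restart job, restart counter is at (?P<count>\d+)\."
)


def summarize_crashes(lines):
    """Group stack-overflow events and the latest restart counter by host."""
    summary = defaultdict(lambda: {"overflows": [], "restart_counter": None})
    for line in lines:
        m = OVERFLOW_RE.match(line)
        if m:
            # Record when the overflow happened and which thread reported it.
            summary[m["host"]]["overflows"].append((m["ts"], m["thread"]))
            continue
        m = RESTART_RE.match(line)
        if m:
            # Keep only the most recent restart counter seen for this host.
            summary[m["host"]]["restart_counter"] = int(m["count"])
    return dict(summary)
```

One way to use it would be to pass the lines of `journalctl -u neard.service --no-pager` (for example read from `sys.stdin`) to `summarize_crashes` and print the result per host; all other lines, including the WARNs, are ignored.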