Neard 1.37.0-rc.2 Process Crash - Stack overflow #10604

Nobody-Inc · 2024-02-13T02:37:01Z

Describe the bug
The neard process seems to be unstable since upgrading to 1.37.0-rc.2 on multiple nodes (active validators and backup). The issue seems to have started after the resharding. The process crashes at seemingly random times. So far I have checked 5 different nodes in multiple datacenters.

To Reproduce
Seems to be somewhat random, but on all 5 nodes that have been checked the service has crashed within the last 2-3 hours as of the time of writing. It seems to happen multiple times a day on all nodes.

Expected behavior
Process runs without crashing

Screenshots

Logs before the crash (all the WARNs in the testnet release make it hard to catch)

(Active Node)
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: thread 'actix-rt|system:0|arbiter:831' has overflowed its stack
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: fatal runtime error: stack overflow
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 28.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 neard[182471]: 2024-02-13T00:54:09.911073Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64

(Backup Node)
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: thread 'actix-rt|system:0|arbiter:149' has overflowed its stack
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: fatal runtime error: stack overflow
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 35.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 neard[154742]: 2024-02-13T00:25:50.291390Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64
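
Since the WARN noise makes these crashes hard to spot, a small filter can pull out only the stack-overflow lines. The following is a minimal sketch, not part of nearcore: the function names `find_overflows` and `count_by_host` are hypothetical, and the regular expression assumes the syslog-style line format shown in the logs above.

```python
import re
from collections import Counter

# Assumed line format, taken from the logs in this report:
# "<Mon> <day> <HH:MM:SS> <host> neard[<pid>]: thread '<name>' has overflowed its stack"
OVERFLOW_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+neard\[(?P<pid>\d+)\]:\s+"
    r"thread '(?P<thread>[^']+)' has overflowed its stack"
)


def find_overflows(lines):
    """Return one dict (ts, host, pid, thread) per stack-overflow line."""
    events = []
    for line in lines:
        match = OVERFLOW_RE.match(line.strip())
        if match:
            events.append(match.groupdict())
    return events


def count_by_host(events):
    """Count stack-overflow events per host name."""
    return Counter(event["host"] for event in events)


# Sample input copied from the log excerpts in this report.
SAMPLE = [
    "Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: thread 'actix-rt|system:0|arbiter:831' has overflowed its stack",
    "Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT",
    "Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: thread 'actix-rt|system:0|arbiter:149' has overflowed its stack",
]

if __name__ == "__main__":
    for event in find_overflows(SAMPLE):
        print(event["ts"], event["host"], event["pid"], event["thread"])
```

The same functions could be fed lines saved from the node's journal (for example the output of `journalctl -u neard.service`, assuming the default short output format) to see how often each node crashes.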

Memory stack from one of the nodes.

Version (please complete the following information):

nearcore 1.37.0-rc.2 (Locally compiled)
rust version (if local): rustc 1.72.0
network: testnet
VMs: VMWare VMs
OS: Ubuntu 22.04.3 LTS
Cores: 8
RAM: 32GB
Storage: 1TB (NVME)

Additional context
Seems to have started only since the resharding in this release.
Seems to occur on both active and backup nodes.

Nobody-Inc · 2024-02-13T02:42:59Z

@posvyatokum I am not able to assign this, but was instructed to send this your way.

posvyatokum · 2024-03-11T14:22:48Z

Closed in favour of #10749

posvyatokum self-assigned this Feb 13, 2024

github-actions bot mentioned this issue Mar 1, 2024

Monthly issue metrics report utnet-org/utility-readonly#10

Closed

posvyatokum mentioned this issue Mar 11, 2024

[1.37.0] Stack overflow #10749

Open

posvyatokum closed this as completed Mar 11, 2024
