Neard 1.37.0-rc.2 Process Crash - Stack overflow #10604

Closed
Nobody-Inc opened this issue Feb 13, 2024 · 2 comments

Nobody-Inc commented Feb 13, 2024

Describe the bug
The neard process has been unstable since upgrading to 1.37.0-rc.2 on multiple nodes (active validators and backups). The issue seems to have started after the resharding. The process crashes at seemingly random times. So far I have checked 5 different nodes in multiple datacenters.

To Reproduce
The crashes seem to be somewhat random, but on all 5 nodes that have been checked the service has crashed within the last 2-3 hours as of the time of writing. It seems to happen multiple times a day on all nodes.

Expected behavior
Process runs without crashing

Screenshots

Logs before the crash (all the WARNs in the testnet release make it hard to catch)

(Active Node)
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: thread 'actix-rt|system:0|arbiter:831' has overflowed its stack
Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: fatal runtime error: stack overflow
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:53:39 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 28.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: neard.service: Consumed 45min 46.653s CPU time.
Feb 13 00:54:09 ni-atl-tnvla01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:54:09 ni-atl-tnvla01 neard[182471]: 2024-02-13T00:54:09.911073Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64

(Backup Node)
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: thread 'actix-rt|system:0|arbiter:149' has overflowed its stack
Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: fatal runtime error: stack overflow
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Main process exited, code=killed, status=6/ABRT
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Failed with result 'signal'.
Feb 13 00:25:20 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Scheduled restart job, restart counter is at 35.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Stopped NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: neard.service: Consumed 9min 24.296s CPU time.
Feb 13 00:25:50 ni-dtw-tnval01 systemd[1]: Started NEARd Daemon Service.
Feb 13 00:25:50 ni-dtw-tnval01 neard[154742]: 2024-02-13T00:25:50.291390Z INFO neard: version="1.37.0-rc.2" build="1.37.0-rc.2" latest_protocol=64
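
Because the WARN output makes the crash lines hard to spot, a small filter can pull out just the stack-overflow events. The following is a minimal sketch, not part of nearcore: the function name `find_overflows` and the regular expression are illustrative assumptions, written against the journal line format shown in the logs above.

```python
import re

# Assumed line shape (taken from the logs above):
#   "<Mon> <day> <HH:MM:SS> <host> neard[<pid>]: thread '<name>' has overflowed its stack"
OVERFLOW_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+neard\[(?P<pid>\d+)\]:\s+"
    r"thread '(?P<thread>[^']+)' has overflowed its stack"
)


def find_overflows(lines):
    """Return one dict (ts, host, pid, thread) per stack-overflow line."""
    hits = []
    for line in lines:
        match = OVERFLOW_RE.match(line.strip())
        if match:
            hits.append(match.groupdict())
    return hits


# Sample lines copied from the logs in this report.
SAMPLE = [
    "Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: thread 'actix-rt|system:0|arbiter:831' has overflowed its stack",
    "Feb 13 00:53:39 ni-atl-tnvla01 neard[180282]: fatal runtime error: stack overflow",
    "Feb 13 00:25:19 ni-dtw-tnval01 neard[154140]: thread 'actix-rt|system:0|arbiter:149' has overflowed its stack",
]

if __name__ == "__main__":
    for hit in find_overflows(SAMPLE):
        print(hit["ts"], hit["host"], hit["pid"], hit["thread"])
```

In practice the input would be the lines of `journalctl -u neard.service` output rather than the hard-coded sample; counting the hits per host gives a rough crash frequency to compare against the systemd restart counter.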

Memory stack from one of the nodes:
[image not preserved]

Version (please complete the following information):

  • nearcore 1.37.0-rc.2 (Locally compiled)
  • rust version (if local): rustc 1.72.0
  • network: testnet
  • VMs: VMWare VMs
  • OS: Ubuntu 22.04.3 LTS
  • Cores: 8
  • RAM: 32GB
  • Storage: 1TB (NVME)

Additional context
Seems to have started only after the resharding in this release.
Seems to occur on both active and backup nodes.

@Nobody-Inc
Author

@posvyatokum I am not able to assign this, but was instructed to send this your way.

@posvyatokum
Member

Closed in favour of #10749
