fatal error: unexpected signal during runtime execution #8558

Open
fanatl opened this issue Aug 25, 2020 · 11 comments
Labels
type/crash The issue description contains a golang panic and stack trace

Comments

@fanatl

fanatl commented Aug 25, 2020

Overview of the Issue

After upgrading from version 1.4.1 to 1.7.2, the Consul agent periodically restarts or hangs.

Reproduction Steps

Consul v1.7.2
3 servers
254 agents

Consul info for both Client and Server

Client info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 2
        services = 2
build:
        prerelease = 
        revision = 9ea1a204
        version = 1.7.2
consul:
        acl = disabled
        known_servers = 3
        server = false
runtime:
        arch = amd64
        cpu_count = 72
        goroutines = 96
        max_procs = 72
        os = linux
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 295
        failed = 2
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 914365
        members = 256
        query_queue = 0
        query_time = 607
Server info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 6
        services = 6
build:
        prerelease = 
        revision = 9ea1a204
        version = 1.7.2
consul:
        acl = disabled
        bootstrap = false
        known_datacenters = 6
        leader = false
        leader_addr = 172.16.200.32:8300
        server = true
raft:
        applied_index = 364750949
        commit_index = 364750949
        fsm_pending = 0
        last_contact = 15.676487ms
        last_log_index = 364750949
        last_log_term = 16
        last_snapshot_index = 364740740
        last_snapshot_term = 16
        latest_configuration = [{Suffrage:Voter ID:19c90ce8-ed90-ec59-bcb5-f3c2373fe6d2 Address:172.16.200.53:8300} {Suffrage:Voter ID:609cd8f2-b630-1b49-dc2f-db5889c72d42 Address:172.16.200.32:8300} {Suffrage:Voter ID:1b8a5854-e5e9-5072-e855-90c0758973aa Address:172.16.200.11:8300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 16
runtime:
        arch = amd64
        cpu_count = 48
        goroutines = 784
        max_procs = 48
        os = linux
        version = go1.13.7
serf_lan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 295
        failed = 3
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 914366
        members = 257
        query_queue = 0
        query_time = 607
serf_wan:
        coordinate_resets = 0
        encrypted = true
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 3418
        members = 20
        query_queue = 0
        query_time = 34

Operating system and Environment details

OS:
Oracle Linux Server release 7.6

Architecture:
x86_64

Procinfo
processor       : 71
vendor_id       : GenuineIntel
cpu family      : 6
model           : 85
model name      : Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
stepping        : 4
microcode       : 0x2000065
cpu MHz         : 2999.876
cache size      : 25344 KB
physical id     : 1
siblings        : 36
core id         : 27
cpu cores       : 18
apicid          : 119
initial apicid  : 119
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti ssbd mba ibrs ibpb stibp tpr_shadow flexpriority ept fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm mpx rdt_a avx512f avx512dq rdseed adx clflushopt clwb intel_pt avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips        : 4617.46
clflush size    : 64
cache_alignment : 64
address sizes   : 47 bits physical, 48 bits virtual
power management:

Meminfo
MemTotal:       1053580972 kB
MemFree:        10921816 kB
MemAvailable:   994938308 kB
Buffers:           67404 kB
Cached:         990592488 kB
SwapCached:            0 kB
Active:         338375648 kB
Inactive:       671238784 kB
Active(anon):   20888036 kB
Inactive(anon):  2275220 kB
Active(file):   317487612 kB
Inactive(file): 668963564 kB
Unevictable:       13740 kB
Mlocked:           13740 kB
SwapTotal:      16777212 kB
SwapFree:       16777212 kB
Dirty:           2711660 kB
Writeback:             0 kB
AnonPages:      18685476 kB
Mapped:          1926308 kB
Shmem:           4209344 kB
Slab:           30215320 kB
SReclaimable:   29884528 kB
SUnreclaim:       330792 kB
KernelStack:       24944 kB
PageTables:       100816 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    543567696 kB
Committed_AS:   25208224 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
HardwareCorrupted:     0 kB
AnonHugePages:  17924096 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:     7847596 kB
DirectMap2M:    584421376 kB
DirectMap1G:    480247808 kB

Log Fragments

goroutine 19 [running]:
runtime.throw(0x30c01b6, 0x2a)
	/usr/local/go/src/runtime/panic.go:774 +0x72 fp=0xc0001baf30 sp=0xc0001baf00 pc=0x42f482
runtime.sigpanic()
	/usr/local/go/src/runtime/signal_unix.go:378 +0x47c fp=0xc0001baf60 sp=0xc0001baf30 pc=0x444f6c
runtime.timerproc(0x50bd2c0)
	/usr/local/go/src/runtime/time.go:260 +0xa2 fp=0xc0001bafd8 sp=0xc0001baf60 pc=0x44e172
runtime.goexit()
	/usr/local/go/src/runtime/asm_amd64.s:1357 +0x1 fp=0xc0001bafe0 sp=0xc0001bafd8 pc=0x45f3f1
created by runtime.(*timersBucket).addtimerLocked
	/usr/local/go/src/runtime/time.go:169 +0x10e

goroutine 1 [select, 1482 minutes]:
github.com/hashicorp/consul/command/agent.(*cmd).run(0xc0001cf500, 0xc000174140, 0x4, 0x4, 0x0)
	/home/circleci/project/consul/command/agent/agent.go:331 +0x13eb
github.com/hashicorp/consul/command/agent.(*cmd).Run(0xc0001cf500, 0xc000174140, 0x4, 0x4, 0xc00000cec0)
	/home/circleci/project/consul/command/agent/agent.go:78 +0x4d
github.com/mitchellh/cli.(*CLI).Run(0xc00019a780, 0xc00019a780, 0x80, 0xc00000d200)
	/go/pkg/mod/github.com/mitchellh/[email protected]/cli.go:255 +0x1da

@stepanovmm1992

Yes, I have the same problem.

@dnephin dnephin added the type/crash The issue description contains a golang panic and stack trace label Aug 26, 2020
@dnephin
Contributor

dnephin commented Aug 26, 2020

Thank you for the report! This sounds like it may be an issue with the Go runtime.

For anyone who has hit this problem, which Linux kernel version are you using (uname -a) ?

Release v1.7.2 was built with go1.13.7. This Go issue seems like it might be related: golang/go#35777

I believe this was fixed in go1.14, which we use to build the v1.8.x releases. Upgrading to 1.8.x may resolve the problem.

Later v1.7.x releases (e.g. 1.7.7) were also built with a newer version of Go, which may also include the fix.
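
To confirm which Go toolchain a particular binary was built with, something like the following may help. It is only a sketch, not part of Consul: it assumes a local Go 1.18 or newer toolchain (for the standard `debug/buildinfo` package), the helper names `goVersionOf`, `majorMinor` and `atLeast` are made up for this example, and the go1.14 threshold simply reflects the guess above about where the fix landed.

```go
package main

import (
	"debug/buildinfo"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// goVersionOf returns the Go toolchain version recorded in the binary at
// path, for example "go1.13.7". It relies on debug/buildinfo (Go 1.18+).
func goVersionOf(path string) (string, error) {
	info, err := buildinfo.ReadFile(path)
	if err != nil {
		return "", err
	}
	return info.GoVersion, nil
}

// majorMinor extracts the major and minor numbers from a release version
// string such as "go1.13.7". Hypothetical helper: development builds and
// release candidates (e.g. "go1.15rc1") are rejected with an error.
func majorMinor(v string) (int, int, error) {
	parts := strings.Split(strings.TrimPrefix(v, "go"), ".")
	if len(parts) < 2 {
		return 0, 0, fmt.Errorf("unrecognized Go version %q", v)
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("unrecognized Go version %q", v)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("unrecognized Go version %q", v)
	}
	return major, minor, nil
}

// atLeast reports whether version v is at least wantMajor.wantMinor.
func atLeast(v string, wantMajor, wantMinor int) (bool, error) {
	major, minor, err := majorMinor(v)
	if err != nil {
		return false, err
	}
	return major > wantMajor || (major == wantMajor && minor >= wantMinor), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Fprintln(os.Stderr, "usage: goversion <path-to-binary>")
		os.Exit(2)
	}
	v, err := goVersionOf(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, "reading build info:", err)
		os.Exit(1)
	}
	ok, err := atLeast(v, 1, 14)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%s was built with %s (go1.14 or newer: %t)\n", os.Args[1], v, ok)
}
```

Run it with the path to the installed `consul` binary as the only argument. The `runtime: version` field in the `consul info` output reports the same information for a running agent.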

@fanatl
Author

fanatl commented Aug 28, 2020

uname -a

Linux 4.14.35-1818.3.3.el7uek.x86_64 #2 SMP Mon Sep 24 14:45:01 PDT 2018 x86_64 x86_64 x86_64 GNU/Linux

Version 1.7.7 is now installed.
The restarts are still happening.

Detailed logs are attached.

consul-server-2_2020.08.28.log
consul-agent-kn-0033_2020.08.28.log
consul-agent-kn_0030_2020.08.28.log

@dnephin
Contributor

dnephin commented Aug 28, 2020

Thank you for the report and the logs! I'm not sure what is happening here, but from what I can tell it is an issue with the Go runtime. I've opened an issue on the Go issue tracker (golang/go#41099) to see if they can help.

If you are able to test with the latest 1.8.x release (which was built with go1.14.x) that might help as well.

@dnephin
Contributor

dnephin commented Aug 28, 2020

It sounds like we will need to try to reproduce with go1.14.x or go1.15, since go1.13.x is no longer supported with the release of go1.15.

I built a version of Consul 1.7.7 using go1.14.7. You can find those binaries built in CI here: https://app.circleci.com/pipelines/github/hashicorp/consul/12178/workflows/c0691c42-089a-4e26-b966-8d9ae1dcd8c9/jobs/229429/artifacts

Note that these are not official release binaries, but the only change from the official release is the change in Go version.

@fanatl
Author

fanatl commented Aug 31, 2020

Thanks for the help.

I installed the Consul build from your link, and we are monitoring how it behaves.

@fanatl
Author

fanatl commented Sep 1, 2020

Unfortunately, the restarts are still happening.

We found a correlation: the service restarts occur only on hosts with Intel Optane connected in RAM mode.

This is probably a Go runtime issue.

@dnephin
Contributor

dnephin commented Sep 1, 2020

Ah, good find!

If you can provide logs from the binary built with go1.14.7 I will update the issue I opened on the golang issue tracker (golang/go#41099). They may be able to help find the problem.

@fanatl
Author

fanatl commented Sep 2, 2020

Sure, log attached.

consul-agent-kn-0030_2020.09.02_trace.log

@fanatl
Author

fanatl commented Sep 8, 2020

I attached the logs in my previous comment.
Could you please update the Go issue (golang/go#41099)?

@fanatl
Author

fanatl commented Sep 28, 2020

@dnephin
Any news on this issue?
