Segmentation fault while freeing memory #4

Closed
puetzp opened this issue Jan 31, 2019 · 20 comments

@puetzp

puetzp commented Jan 31, 2019

Hey,

I am encountering a problem using the program to query a device. The program terminates with a seg fault.
Maybe the following debugging output helps you to track down the issue:

root@<snip>:/var/log # gdb --args /usr/lib/nagios/plugins/contrib/check_interfaces '-h' '<snip>' '--user' '<snip>' '--auth-proto' '<snip>' '--auth-phrase' '<snip>' '--priv-proto' '<snip>' '--priv-phrase' '<snip>' '-x' '27' '-r' ''
<...>
Reading symbols from /usr/lib/nagios/plugins/contrib/check_interfaces...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/lib/nagios/plugins/contrib/check_interfaces -h <snip> --user <snip> --auth-proto <snip> --auth-phrase <snip> --priv-proto <snip> --priv-phrase <snip> -x 27 -r <snip>
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6ebeabf in malloc_consolidate (av=av@entry=0x7ffff71eb620 <main_arena>) at malloc.c:4165
4165    malloc.c: No such file or directory.
(gdb) bt
#0  0x00007ffff6ebeabf in malloc_consolidate (av=av@entry=0x7ffff71eb620 <main_arena>) at malloc.c:4165
#1  0x00007ffff6ebf641 in _int_free (av=0x7ffff71eb620 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4057
#2  0x00007ffff792b168 in snmp_free_varbind () from /usr/lib/x86_64-linux-gnu/libnetsnmp.so.30
#3  0x00007ffff792b1b4 in snmp_free_pdu () from /usr/lib/x86_64-linux-gnu/libnetsnmp.so.30
#4  0x000000000040217a in ?? ()
#5  0x00007ffff6e67b45 in __libc_start_main (main=0x401700, argc=17, argv=0x7fffffffe558, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe548)
    at libc-start.c:287
#6  0x0000000000403ae5 in ?? ()
@dnsmichi
Contributor

Which platform, version and plugin binary version is that? Can you please add a bt full output as well? This sounds more like an issue with net-snmp itself on this platform. Which net-snmp version is used?

@puetzp
Author

puetzp commented Feb 1, 2019

Here is the info:

(gdb) bt full
#0  0x00007ffff6ec2634 in __GI___libc_free (mem=0x611620) at malloc.c:2945
        ar_ptr = <optimized out>
        p = 0x611610
        hook = <optimized out>
#1  0x00007ffff792b1fd in snmp_free_pdu () from /usr/lib/x86_64-linux-gnu/libnetsnmp.so.30
No symbol table info available.
#2  0x000000000040217a in ?? ()
No symbol table info available.
#3  0x00007ffff6e67b45 in __libc_start_main (main=0x401700, argc=17, argv=0x7fffffffe548, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe538)
    at libc-start.c:287
        result = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {0, 4445349801132744119, 4209340, 140737488348480, 0, 0, -4445349800241603145, -4445365250230526537}, mask_was_saved = 0}}, priv = {
            pad = {0x0, 0x0, 0x7fffffffe5d8, 0x7ffff7ffe1b0}, data = {prev = 0x0, cleanup = 0x0, canceltype = -6696}}}
        not_first_call = <optimized out>
#4  0x0000000000403ae5 in ?? ()
No symbol table info available.

Dist info:

Distributor ID: Debian
Description:    Debian GNU/Linux 8.11 (jessie)
Release:        8.11
Codename:       jessie

libs:

<snip>:~ # apt-cache policy libsnmp-base
libsnmp-base:
  Installed:             5.7.2.1+dfsg-1+deb8u2
<...>
remote@<snip>:~ # apt-cache policy libsnmp-dev
libsnmp-dev:
  Installed:             5.7.2.1+dfsg-1+deb8u2
<...>
remote@<snip>:~ # apt-cache policy libsnmp30:amd64
libsnmp30:
  Installed:             5.7.2.1+dfsg-1+deb8u2
<...>

I'm using the latest version of the plugin (1.4).

@dnsmichi
Contributor

dnsmichi commented Feb 6, 2019

@mxhash can you take a deeper look please? I haven't been involved in development here, nor do I understand the specific code paths where this could be triggered. I've just updated the README to make compiling easier.

@puetzp
Author

puetzp commented Feb 6, 2019

I agree that net-snmp seems to be the issue. Executing the plugin in a Debian Stretch environment with libsnmp v5.6.3 does not fix the problem. So it might not have been fixed in a newer version yet.
I should probably open an issue in net-snmp.

@phil-or

phil-or commented Apr 3, 2019

I think I have the same problem. While executing the check, a segmentation fault occurred.
But this error occurs only on a few hosts.

[root@]# gdb --args '/usr/lib64/nagios/plugins/check_interfaces' '--aliases' '--auth-phrase' 'xx' '--auth-proto' 'xx' '--hostname' 'xx' '--if-names' '--priv-phrase' 'xx' '--priv-proto' 'xx' '--regex' '' '--timeout' '30000' '--user' 'xx'
(gdb) run
Starting program: /usr/lib64/nagios/plugins/check_interfaces --aliases --auth-phrase xx --auth-proto xx --hostname xx --if-names --priv-phrase xx --priv-proto xx --regex xx --timeout 30000 --user xx
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff6bb011e in _int_free () from /lib64/libc.so.6

Version:
check_interfaces 1.4
CentOS 7

Sometimes the error occurs, sometimes not...

@dnsmichi dnsmichi added the bug label Apr 3, 2019
@mxhash mxhash assigned dnsmichi and unassigned mxhash Apr 8, 2019
@mxhash
Member

mxhash commented Apr 8, 2019

I'll give some tests a try. Are there any steps to reproduce the issue?

@phil-or

phil-or commented Apr 8, 2019

We only have problems with checks on the Nexus 7700. But sometimes it works, sometimes it doesn't.

@mxhash
Member

mxhash commented Apr 8, 2019

Ah ok ;-) That's hard for me to test... Does this problem also occur when querying with the SNMP utils?

@puetzp
Author

puetzp commented Apr 8, 2019

Thank you for following up on this issue!
I queried all OIDs manually using a mix of snmpwalk/get/bulkget and the error did not occur, whereas the plugin produces a seg fault.
Tested on a Cisco ASA.

@phil-or

phil-or commented Apr 8, 2019

I also did an snmpwalk over all OIDs of this device and didn't get an error.

@mxhash
Member

mxhash commented Apr 8, 2019

Thank you guys. I'll give it a try.

mxhash added a commit that referenced this issue Apr 12, 2019
mxhash added a commit that referenced this issue Apr 12, 2019
@mxhash
Member

mxhash commented Apr 12, 2019

All right, I did some testing and had no chance to reproduce the issue you had. But we do not have this kind of hardware. I tried with 3com and Juniper devices in default mode.

The stack trace also looks like an error in the libsnmp functions. Maybe you can get more info out of it (a different version, debugging with symbols, etc.).

At the moment I have no clue ;-)

Cheers,
Marius

@Bjoern-10101

Hi,

I possibly have the same problem with Dell S4048-ON switches.
The error message reads:
Segmentation fault (translated from the German "Speicherzugriffsfehler")

If I run the check with sudo the message changes to:
Error in packet
Reason: wrongLength (The set value has an illegal length from what the agent expects)

Debug with gdb and symbols:

(gdb) run
Starting program: /home/username/check_interfaces-1.4a/check_interfaces -h "HOST"
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[Starting benchmark] Start SNMP session
[Finished benchmark after 1.396214 ms] Start SNMP session
[Starting benchmark] Send SNMP request for OIDs: .1.3.6.1.2.1.1.3, .1.3.6.1.2.1.2.1, .1.3.6.1.2.1.2.2.1.2
[Finished benchmark after 182.805305 ms] Send SNMP request for OIDs: .1.3.6.1.2.1.1.3, .1.3.6.1.2.1.2.1, .1.3.6.1.2.1.2.2.1.2
got 68 interfaces
reached end of interfaces
Device says it has 68 but really has 69 interfaces
69 interfaces found

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7925105 in _int_malloc (av=av@entry=0x7ffff7a5dc40 <main_arena>, bytes=bytes@entry=1032) at malloc.c:4033
4033    malloc.c: No such file or directory.

The output from gdb with elevated rights reads:

[Starting benchmark] Start SNMP session
[Finished benchmark after 1.106840 ms] Start SNMP session
[Starting benchmark] Send SNMP request for OIDs: .1.3.6.1.2.1.1.3, .1.3.6.1.2.1.2.1, .1.3.6.1.2.1.2.2.1.2
[Finished benchmark after 157.621356 ms] Send SNMP request for OIDs: .1.3.6.1.2.1.1.3, .1.3.6.1.2.1.2.1, .1.3.6.1.2.1.2.2.1.2
got 68 interfaces
reached end of interfaces
Device says it has 68 but really has 69 interfaces
69 interfaces found
[Starting benchmark] Send SNMP request for OIDs: .1.3.6.1.2.1.2.2.1.7, .1.3.6.1.2.1.2.2.1.8, .1.3.6.1.2.1.2.2.1.10, .1.3.6.1.2.1.2.2.1.13, .1.3.6.1.2.1.2.2.1.14, .1.3.6.1.2.1.2.2.1.16, .1.3.6.1.2.1.2.2.1.19, .1.3.6.1.2.1.2.2.1.20
[Finished benchmark after 0.973542 ms] Send SNMP request for OIDs: .1.3.6.1.2.1.2.2.1.7, .1.3.6.1.2.1.2.2.1.8, .1.3.6.1.2.1.2.2.1.10, .1.3.6.1.2.1.2.2.1.13, .1.3.6.1.2.1.2.2.1.14, .1.3.6.1.2.1.2.2.1.16, .1.3.6.1.2.1.2.2.1.19, .1.3.6.1.2.1.2.2.1.20
Error in packet
Reason: wrongLength (The set value has an illegal length from what the agent expects)

I tried to find a fix for this and ended up at lines 1488, 1499 and 1500 in snmp_bulkget.c.
Line 1488 creates an SNMP_MSG_GET PDU. Lines 1499 and 1500 assign the values for "non_repeaters" and "max_repetitions". My knowledge of C and Net-SNMP is limited, so I am not 100% sure I am right here, but both values are meant for GETBULK operations.
Here they appear to be set on an SNMP_MSG_GET PDU. I am not sure whether this is intended or correct.
In this case "non_repeaters" is assigned the value "i", which is the number of OIDs to be requested. In my case this value is 8.
What could be problematic is that both "variables" are macros defined in Net-SNMP's "types.h":

typedef struct snmp_pdu {

#define non_repeaters	errstat
#define max_repetitions errindex
.

I think this would be fine for a GETBULK request. Here, however, it seems to be a GET request, so the fields "errstat" and "errindex" end up populated with "8" and "0".
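
The aliasing described above can be illustrated without Net-SNMP at all. The following is a minimal sketch, assuming a hypothetical, stripped-down `demo_pdu` struct and a hypothetical helper `demo_set_bulk_params`; only the two `#define` lines are taken from the post. It shows that writing the GETBULK parameters through the macros lands in the `errstat` and `errindex` fields.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the real PDU struct; only the fields that
 * matter for this discussion are modelled. */
struct demo_pdu {
    int  command;   /* PDU type, e.g. GET or GETBULK */
    long errstat;   /* error status of a response */
    long errindex;  /* error index of a response */
};

/* The two aliases quoted in the post: no separate storage exists. */
#define non_repeaters   errstat
#define max_repetitions errindex

/* Hypothetical helper: stores the GETBULK parameters through the aliases.
 * The preprocessor rewrites both member names, so the values end up in
 * errstat and errindex regardless of the PDU type. */
void demo_set_bulk_params(struct demo_pdu *pdu, long nonrep, long maxrep)
{
    assert(pdu != NULL);
    pdu->non_repeaters = nonrep;
    pdu->max_repetitions = maxrep;
}
```

Calling `demo_set_bulk_params(&pdu, 8, 0)` leaves `pdu.errstat` at 8 and `pdu.errindex` at 0, which matches the values observed in the post.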

The error code "wrongLength (The set value has an illegal length from what the agent expects)" can be found in the Net-SNMP file snmp_client.c, and the number 8 seems to match the error message at line 1204.

If I change the value of "non_repeaters" to something between 1 and 19, the returned error message changes accordingly.

If I change it to 0, the check runs fine and no error messages are shown.
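
For reference, the SNMP protocol itself (RFC 3416) numbers its error-status values from 0 to 18, and 8 is wrongLength. The lookup below is a sketch of that numbering with hypothetical helper names; it is not a copy of the table in snmp_client.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* SNMP error-status names from RFC 3416, indexed by their numeric value. */
static const char *const demo_error_status_names[] = {
    "noError", "tooBig", "noSuchName", "badValue", "readOnly", "genErr",
    "noAccess", "wrongType", "wrongLength", "wrongEncoding", "wrongValue",
    "noCreation", "inconsistentValue", "resourceUnavailable", "commitFailed",
    "undoFailed", "authorizationError", "notWritable", "inconsistentName"
};

#define DEMO_ERROR_STATUS_COUNT \
    (sizeof demo_error_status_names / sizeof demo_error_status_names[0])

/* Hypothetical helper: numeric error-status -> name, "unknown" if out of range. */
const char *demo_error_status_name(long errstat)
{
    if (errstat < 0 || (size_t)errstat >= DEMO_ERROR_STATUS_COUNT)
        return "unknown";
    return demo_error_status_names[errstat];
}

/* Hypothetical helper: name -> numeric error-status, -1 if not found. */
long demo_error_status_value(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < DEMO_ERROR_STATUS_COUNT; i++) {
        if (strcmp(demo_error_status_names[i], name) == 0)
            return (long)i;
    }
    return -1;
}
```

With this table, an `errstat` of 8 maps to "wrongLength" and 0 maps to "noError", consistent with the behaviour described above.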

My biggest question here is: why do requests to other devices not show this problem?
I've compiled it with the Net-SNMP debug symbols and tried to spot the place where the errstat is set in the response. But this is a little bit time-consuming ;-)

Please let me know if this was helpful or if I am wrong.

Cheers,
Bjoern

@fragfutter

Same issue on Red Hat Enterprise Linux 8; the backtrace looks similar.

#0  0x00007ffff69a299f in raise () from /lib64/libc.so.6
#1  0x00007ffff698ccf5 in abort () from /lib64/libc.so.6
#2  0x00007ffff69e5dd7 in __libc_message () from /lib64/libc.so.6
#3  0x00007ffff69ec70c in malloc_printerr () from /lib64/libc.so.6
#4  0x00007ffff69ee257 in _int_free () from /lib64/libc.so.6
#5  0x00007ffff7866848 in snmp_free_varbind () from /lib64/libnetsnmp.so.35
#6  0x00007ffff786afa8 in snmp_free_pdu () from /lib64/libnetsnmp.so.35
#7  0x00000000004029ab in ?? ()
#8  0x00007ffff698e873 in __libc_start_main () from /lib64/libc.so.6
#9  0x000000000040148e in ?? ()

@0xliam

0xliam commented Sep 24, 2020

Sorry to bump an old thread - we're seeing this happen across a number of devices at different sites, from lots of different vendors.

I've noticed if I add --mode nonbulk the check works correctly.

@dnsmichi dnsmichi removed their assignment Sep 25, 2020
@stevie-sy

Sorry to bring this issue back to the top. As my colleague @phil-or wrote in 2019, we sometimes had trouble with the Nexus 7700 series. We have several of them installed. We are using VDCs on every Nexus, which means that more than one virtual router is "installed" on every physical device. They are configured the same way on every piece of hardware. Including the VLAN interfaces, the Nexus has 74 ports to monitor.

Interestingly, some time after my colleague wrote here, everything worked fine. But a few weeks ago the problem suddenly came back, and only with one virtual router!
I took some time to dig a little deeper into this topic, tried some debugging tools, and found the following results:

First I tried "strace". This is the anonymized output after loading net-snmp:
It looks like the crash happens while or after the program break is changed with "brk".

...
27126 15:52:07.975888 socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 3
27126 15:52:07.976349 uname({sysname="Linux", nodename="___Icinga-Satellite", release="3.10.0-1160.15.2.el7.x86_64", version="#1 SMP Wed Feb 3 15:06:38 UTC 2021", machine="x86_64", domainname="(none)"}) = 0
27126 15:52:07.976410 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
27126 15:52:07.976469 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
27126 15:52:07.976615 sendmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="\_____________________"..., iov_len=64}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 64
27126 15:52:07.976752 select(4, [3], NULL, NULL, {tv_sec=59, tv_usec=999995}) = 1 (in [3], left {tv_sec=59, tv_usec=999116})
27126 15:52:07.977716 recvmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="_____________________"..., iov_len=65536}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 105
27126 15:52:07.977788 getsockname(3, {sa_family=AF_INET, sin_port=htons(58139), sin_addr=inet_addr("0.0.0.0")}, [28->16]) = 0
27126 15:52:07.977932 sendmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="_____________________"..., iov_len=157}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 157
27126 15:52:07.978034 select(4, [3], NULL, NULL, {tv_sec=59, tv_usec=999999}) = 1 (in [3], left {tv_sec=59, tv_usec=999206})
27126 15:52:07.978901 recvmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="_____________________"..., iov_len=65536}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 123
27126 15:52:07.978979 getsockname(3, {sa_family=AF_INET, sin_port=htons(58139), sin_addr=inet_addr("0.0.0.0")}, [28->16]) = 0
27126 15:52:07.979097 sendmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="_____________________"..., iov_len=160}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 160
27126 15:52:07.979190 select(4, [3], NULL, NULL, {tv_sec=59, tv_usec=999998}) = 1 (in [3], left {tv_sec=59, tv_usec=955910})
27126 15:52:08.023399 recvmsg(3, {msg_name={sa_family=AF_INET, sin_port=htons(161), sin_addr=inet_addr("__IP-Nexus__")}, msg_namelen=16, msg_iov=[{iov_base="_____________________"..., iov_len=65536}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, MSG_DONTWAIT) = 2850
27126 15:52:08.023480 getsockname(3, {sa_family=AF_INET, sin_port=htons(58139), sin_addr=inet_addr("0.0.0.0")}, [28->16]) = 0
27126 15:52:08.023615 brk(NULL)         = 0x726000
27126 15:52:08.023658 brk(0x747000)     = 0x747000
27126 15:52:08.023790 brk(NULL)         = 0x747000
27126 15:52:08.023830 brk(0x768000)     = 0x768000
27126 15:52:08.023959 brk(NULL)         = 0x768000
27126 15:52:08.024000 brk(NULL)         = 0x768000
27126 15:52:08.024047 brk(0x752000)     = 0x752000
27126 15:52:08.024111 brk(NULL)         = 0x752000
27126 15:52:08.024154 --- SIGSEGV {si_signo=SIGSEGV, si_code=SI_KERNEL, si_addr=NULL} ---

Next I tried "ltrace". This is the shortened and anonymized output, starting with the function "snmp_pdu_create":
It looks like the SNMP request works, but the crash happens after processing the last interface, when "snmp_free_pdu" is called:


...
snmp_pdu_create(165, 0x6075a0, 0x7ffc1da9a240, 0x7ffc1da9a250)                                                                  = 0x191f7b0
snmp_parse_oid(0x405429, 0x191eb90, 0x191ef90, 0)                                                                               = 0x191eb90
snmp_add_null_var(0x191f7b0, 0x191eb90, 8, 0)                                                                                   = 0x191f8d0
snmp_parse_oid(0x40543a, 0x191ef98, 0x191f398, 1)                                                                               = 0x191ef98
snmp_add_null_var(0x191f7b0, 0x191ef98, 8, 0)                                                                                   = 0x191fd50
snmp_parse_oid(0x40544b, 0x191f3a0, 0x191f7a0, 2)                                                                               = 0x191f3a0
snmp_add_null_var(0x191f7b0, 0x191f3a0, 10, 0)                                                                                  = 0x19201d0
snmp_synch_response(0x191e8e0, 0x191f7b0, 0x7ffc1da9a248, 0x191fd50)                                                            = 0
memcmp(0x191eb90, 0x1931030, 64, 1024)                                                                                          = 0
memcmp(0x191eb90, 0x19314b0, 64, 0)                                                                                             = 0xffffffff
memcmp(0x191ef98, 0x19314b0, 64, 2)                                                                                             = 0
calloc(73, 864)                                                                                                                 = 0x1920d50
calloc(73, 864)                                                                                                                 = 0x1931c10
__memcpy_chk(0x7ffc1da9a340, 0x194e760, 88, 1032)                                                                               = 0x7ffc1da9a340
memcmp(0x191f3a0, 0x194e760, 80, 1032)                                                                                          = 0
memcpy(0x1920d68, "mgmt0", 5)                                                                                                   = 0x1920d68

... process all interfaces (shortened)

__memcpy_chk(0x7ffc1da9a340, 0x1962fe0, 88, 1032)                                                                               = 0x7ffc1da9a340
memcmp(0x191f3a0, 0x1962fe0, 80, 1032)                                                                                          = 0
memcpy(0x19303c8, "Ethernet1/7-mpls layer", 22)                                                                                 = 0x19303c8
__memcpy_chk(0x7ffc1da9a340, 0x1963460, 88, 1032)                                                                               = 0x7ffc1da9a340
memcmp(0x191f3a0, 0x1963460, 80, 1032)                                                                                          = 0xffffffff
snmp_free_pdu(0x1930e70, 0x19634b0, 0, 3 <no return ...>
--- SIGSEGV (Segmentation fault) ---
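
The first argument to snmp_pdu_create in the trace above is the PDU type, 165 (0xA5). A minimal sketch, assuming the standard SNMPv2 PDU tags from RFC 3416 (context-specific class, constructed encoding, tag number in the low five bits) and hypothetical function names, decodes such a value:

```c
#include <assert.h>
#include <stddef.h>

/* Returns the tag number of a context-specific, constructed BER tag byte,
 * or -1 if the value does not have that form. */
int demo_pdu_tag_number(int command)
{
    int number;

    if (command < 0 || command > 0xFF)
        return -1;
    if ((command & 0xE0) != 0xA0)   /* 0x80 context-specific | 0x20 constructed */
        return -1;
    number = command & 0x1F;
    return number == 0x1F ? -1 : number;  /* 0x1F marks the multi-byte tag form */
}

/* Maps a PDU type byte to the PDU name used in RFC 3416. */
const char *demo_pdu_type_name(int command)
{
    switch (demo_pdu_tag_number(command)) {
    case 0: return "GetRequest";
    case 1: return "GetNextRequest";
    case 2: return "Response";
    case 3: return "SetRequest";
    case 5: return "GetBulkRequest";
    case 6: return "InformRequest";
    case 7: return "SNMPv2-Trap";
    case 8: return "Report";
    default: return "unknown";
    }
}

/* Usage example: the value seen in the ltrace output. */
void demo_pdu_example(void)
{
    assert(demo_pdu_tag_number(165) == 5);   /* 0xA5 */
    assert(demo_pdu_tag_number(160) == 0);   /* 0xA0 */
}
```

For 165 this yields tag number 5, i.e. GetBulkRequest; a plain GET would be 160 (0xA0).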

The next debugging tool I used was "valgrind". This is the anonymized output:
What we see here is that the problem may be caused by the library "/usr/lib64/libc-2.17.so".

==12455== Memcheck, a memory error detector
==12455== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==12455== Using Valgrind-3.15.0-608cb11914-20190413 and LibVEX; rerun with -h for copyright info
==12455== Command: /usr/lib64/nagios/plugins/check_interfaces --aliases --auth-phrase '__AUTH_PHRASE__' --auth-proto SHA --hostname __IP-Nexus__ --if-names --priv-phrase '__PRIV-PHRASE__' --priv-proto AES --regex '__REGEX__' --retries 3 --timeout 60000 --user '__SNMPv3-USER__'
==12455== 
--12455-- Valgrind options:
--12455--    -v
--12455-- Contents of /proc/version:
--12455--   Linux version 3.10.0-1160.15.2.el7.x86_64 ([email protected]) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) ) #1 SMP Wed Feb 3 15:06:38 UTC 2021
--12455-- 
--12455-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-ssse3-avx-avx2-bmi-f16c-rdrand
--12455-- Page sizes: currently 4096, max supported 4096
--12455-- Valgrind library directory: /usr/libexec/valgrind
--12455-- Reading syms from /usr/lib64/nagios/plugins/check_interfaces
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/ld-2.17.so
--12455-- Reading syms from /usr/libexec/valgrind/memcheck-amd64-linux
--12455--    object doesn't have a symbol table
--12455--    object doesn't have a dynamic symbol table
--12455-- Scheduler: using generic scheduler lock implementation.
--12455-- Reading suppressions file: /usr/libexec/valgrind/default.supp
==12455== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-12455-by-root-on-lxli03-t38.ooe.gv.at
==12455== embedded gdbserver: writing to   /tmp/vgdb-pipe-to-vgdb-from-12455-by-root-on-lxli03-t38.ooe.gv.at
==12455== embedded gdbserver: shared mem   /tmp/vgdb-pipe-shared-mem-vgdb-12455-by-root-on-lxli03-t38.ooe.gv.at
==12455== 
==12455== TO CONTROL THIS PROCESS USING vgdb (which you probably
==12455== don't want to do, unless you know exactly what you're doing,
==12455== or are doing some strange experiment):
==12455==   /usr/libexec/valgrind/../../bin/vgdb --pid=12455 ...command...
==12455== 
==12455== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==12455==   /path/to/gdb /usr/lib64/nagios/plugins/check_interfaces
==12455== and then give GDB the following command
==12455==   target remote | /usr/libexec/valgrind/../../bin/vgdb --pid=12455
==12455== --pid is optional if only one valgrind process is running
==12455== 
--12455-- REDIR: 0x4019e40 (ld-linux-x86-64.so.2:strlen) redirected to 0x580c7ed5 (???)
--12455-- REDIR: 0x4019c10 (ld-linux-x86-64.so.2:index) redirected to 0x580c7eef (???)
--12455-- Reading syms from /usr/libexec/valgrind/vgpreload_core-amd64-linux.so
--12455-- Reading syms from /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
==12455== WARNING: new redirection conflicts with existing -- ignoring it
--12455--     old: 0x04019e40 (strlen              ) R-> (0000.0) 0x580c7ed5 ???
--12455--     new: 0x04019e40 (strlen              ) R-> (2007.0) 0x04c2d1b0 strlen
--12455-- REDIR: 0x4019dc0 (ld-linux-x86-64.so.2:strcmp) redirected to 0x4c2e300 (strcmp)
--12455-- REDIR: 0x401aa80 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4c31f90 (mempcpy)
--12455-- Reading syms from /usr/lib64/librt-2.17.so
--12455-- Reading syms from /usr/lib64/libnetsnmp.so.31.0.2
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libssl.so.1.0.2k
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libcrypto.so.1.0.2k
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libm-2.17.so
--12455-- Reading syms from /usr/lib64/libc-2.17.so
--12455-- Reading syms from /usr/lib64/libpthread-2.17.so
--12455-- Reading syms from /usr/lib64/libgssapi_krb5.so.2.2
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libkrb5.so.3.3
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libcom_err.so.2.1
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libk5crypto.so.3.1
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libdl-2.17.so
--12455-- Reading syms from /usr/lib64/libz.so.1.2.7
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libkrb5support.so.0.1
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libkeyutils.so.1.5
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libresolv-2.17.so
--12455-- Reading syms from /usr/lib64/libselinux.so.1
--12455--    object doesn't have a symbol table
--12455-- Reading syms from /usr/lib64/libpcre.so.1.2.0
--12455--    object doesn't have a symbol table
--12455-- REDIR: 0x5da91c0 (libc.so.6:strcasecmp) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da5f40 (libc.so.6:strnlen) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5dab490 (libc.so.6:strncasecmp) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da89a0 (libc.so.6:memset) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da8950 (libc.so.6:memcpy@GLIBC_2.2.5) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da5e10 (libc.so.6:strlen) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da43c0 (libc.so.6:strcmp) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5dadb60 (libc.so.6:memcpy@@GLIBC_2.14) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da8b00 (libc.so.6:mempcpy) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da5850 (libc.so.6:strcpy) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da78b0 (libc.so.6:strncpy) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da4300 (libc.so.6:index) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da8380 (libc.so.6:bcmp) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5daf210 (libc.so.6:rawmemchr) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da4100 (libc.so.6:strcat) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da9020 (libc.so.6:stpcpy) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da6020 (libc.so.6:strncmp) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5e2e310 (libc.so.6:__memcpy_chk) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da78f0 (libc.so.6:rindex) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5dbef70 (libc.so.6:strstr) redirected to 0x4a247a0 (_vgnU_ifunc_wrapper)
--12455-- REDIR: 0x5da5e60 (libc.so.6:__GI_strlen) redirected to 0x4c2d110 (__GI_strlen)
--12455-- REDIR: 0x5da7930 (libc.so.6:__GI_strrchr) redirected to 0x4c2cb70 (__GI_strrchr)
--12455-- REDIR: 0x5d9e740 (libc.so.6:malloc) redirected to 0x4c29eec (malloc)
--12455-- REDIR: 0x5dadbd0 (libc.so.6:__GI_memcpy) redirected to 0x4c2efb0 (__GI_memcpy)
--12455-- REDIR: 0x5d9eb60 (libc.so.6:free) redirected to 0x4c2afe6 (free)
--12455-- REDIR: 0x5da8a00 (libc.so.6:__GI_memset) redirected to 0x4c30ee0 (memset)
--12455-- REDIR: 0x5e59c40 (libc.so.6:__strrchr_sse42) redirected to 0x4c2cc00 (__strrchr_sse42)
--12455-- REDIR: 0x5da6060 (libc.so.6:__GI_strncmp) redirected to 0x4c2d940 (__GI_strncmp)
--12455-- REDIR: 0xffffffffff600400 (???:???) redirected to 0x580c7ec1 (???)
--12455-- REDIR: 0x5da4400 (libc.so.6:__GI_strcmp) redirected to 0x4c2e210 (__GI_strcmp)
--12455-- REDIR: 0xffffffffff600000 (???:???) redirected to 0x580c7eb7 (???)
--12455-- REDIR: 0x5d9ec40 (libc.so.6:realloc) redirected to 0x4c2c1c5 (realloc)
--12455-- REDIR: 0x5d9f160 (libc.so.6:calloc) redirected to 0x4c2bff3 (calloc)
--12455-- REDIR: 0x5d52940 (libc.so.6:setenv) redirected to 0x4c32a20 (setenv)
--12455-- REDIR: 0x5da4340 (libc.so.6:__GI_strchr) redirected to 0x4c2cca0 (__GI_strchr)
--12455-- REDIR: 0x5da8b70 (libc.so.6:__GI_mempcpy) redirected to 0x4c31cc0 (__GI_mempcpy)
--12455-- REDIR: 0x5e57eb0 (libc.so.6:__strcmp_sse42) redirected to 0x4c2e2b0 (__strcmp_sse42)
--12455-- REDIR: 0x5e887a0 (libc.so.6:__strlen_sse2_pminub) redirected to 0x4c2d0f0 (strlen)
--12455-- REDIR: 0x5e618c0 (libc.so.6:__strncasecmp_avx) redirected to 0x4c2db70 (strncasecmp)
--12455-- REDIR: 0x5da9060 (libc.so.6:__GI_stpcpy) redirected to 0x4c30850 (__GI_stpcpy)
--12455-- REDIR: 0x5db3da0 (libc.so.6:__strncpy_sse2_unaligned) redirected to 0x4c2d780 (__strncpy_sse2_unaligned)
--12455-- REDIR: 0x5e57e00 (libc.so.6:__strchr_sse42) redirected to 0x4c2cd60 (index)
--12455-- REDIR: 0x5da8030 (libc.so.6:memchr) redirected to 0x4c2e3a0 (memchr)
--12455-- REDIR: 0x5daf450 (libc.so.6:strchrnul) redirected to 0x4c31ab0 (strchrnul)
--12455-- REDIR: 0x5e6db60 (libc.so.6:__memcpy_ssse3_back) redirected to 0x4c2e7b0 (memcpy@@GLIBC_2.14)
--12455-- REDIR: 0x5e58c60 (libc.so.6:__strncmp_sse42) redirected to 0x4c2da20 (__strncmp_sse42)
--12455-- REDIR: 0x5dbe9b0 (libc.so.6:__GI_strstr) redirected to 0x4c32220 (__strstr_sse2)
--12455-- REDIR: 0x5e60250 (libc.so.6:__strcasecmp_avx) redirected to 0x4c2da90 (strcasecmp)
--12455-- REDIR: 0x5da87c0 (libc.so.6:__GI_memmove) redirected to 0x4c31320 (__GI_memmove)
--12455-- REDIR: 0x5e830d0 (libc.so.6:__memcmp_sse4_1) redirected to 0x4c30590 (__memcmp_sse4_1)
--12455-- REDIR: 0x5e59de0 (libc.so.6:__strstr_sse42) redirected to 0x4c322b0 (__strstr_sse42)
--12455-- REDIR: 0x5e73100 (libc.so.6:__memmove_ssse3_back) redirected to 0x4c2e460 (memcpy@GLIBC_2.2.5)
--12455-- REDIR: 0x5db3770 (libc.so.6:__strcpy_sse2_unaligned) redirected to 0x4c2d1d0 (strcpy)
--12455-- REDIR: 0x5e6db50 (libc.so.6:__memcpy_chk_ssse3_back) redirected to 0x4c32080 (__memcpy_chk)
==12455== Invalid write of size 4
==12455==    at 0x401FCF: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455==  Address 0x7c50ebc is 12 bytes after a block of size 63,072 alloc'd
==12455==    at 0x4C2C089: calloc (vg_replace_malloc.c:762)
==12455==    by 0x401ED7: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455== 
==12455== Invalid write of size 8
==12455==    at 0x4C2E8F3: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
==12455==    by 0x401FE8: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455==  Address 0x7c50ec8 is 24 bytes after a block of size 63,072 in arena "client"
==12455== 
==12455== Invalid write of size 2
==12455==    at 0x4C2E943: memcpy@@GLIBC_2.14 (vg_replace_strmem.c:1035)
==12455==    by 0x401FE8: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455==  Address 0x7c50ed8 is 24 bytes before a block of size 63,072 alloc'd
==12455==    at 0x4C2C089: calloc (vg_replace_malloc.c:762)
==12455==    by 0x401EE9: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455== 
==12455== Invalid write of size 1
==12455==    at 0x401FFB: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455==  Address 0x7c50ede is 18 bytes before a block of size 63,072 alloc'd
==12455==    at 0x4C2C089: calloc (vg_replace_malloc.c:762)
==12455==    by 0x401EE9: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
==12455== 

valgrind: m_mallocfree.c:307 (get_bszB_as_is): Assertion 'bszB_lo == bszB_hi' failed.
valgrind: Heap block lo/hi size mismatch: lo = 63136, hi = 8387231318654088261.
This is probably caused by your program erroneously writing past the
end of a heap block and corrupting heap metadata.  If you fix any
invalid writes reported by Memcheck, this assertion failure will
probably go away.  Please try that before reporting this as a bug.


host stacktrace:
==12455==    at 0x5804C3A3: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x5804C4B7: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x5804C651: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x58055BD3: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x58043813: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x58042796: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x58047462: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x58041D0B: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x5801848B: ??? (in /usr/libexec/valgrind/memcheck-amd64-linux)
==12455==    by 0x1002EEBA74: ???
==12455==    by 0x1002BA9F2F: ???
==12455==    by 0x1C0F: ???
==12455==    by 0x100200842F: ???
==12455==    by 0x100200842F: ???
==12455==    by 0x1002BA9F3F: ???
==12455==    by 0x10023C3FAF: ???

sched status:
  running_tid=1

Thread 1: status = VgTs_Runnable (lwpid 12455)
==12455==    at 0x4049DA: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x40218A: ??? (in /usr/lib64/nagios/plugins/check_interfaces)
==12455==    by 0x5D3B554: (below main) (in /usr/lib64/libc-2.17.so)
client stack range: [0x1FFEFDC000 0x1FFF000FFF] client SP: 0x1FFEFDFB28
valgrind stack range: [0x1002AAA000 0x1002BA9FFF] top usage: 7360 of 1048576


Note: see also the FAQ in the source distribution.
It contains workarounds to several common problems.
In particular, if Valgrind aborted or crashed after
identifying problems in your program, there's a good chance
that fixing those problems will prevent Valgrind aborting or
crashing, especially if it happened in m_mallocfree.c.

If that doesn't help, please report this bug to: www.valgrind.org

In the bug report, send all the above text, the valgrind
version, and what OS and version you are using.  Thanks.
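
Two numbers in the traces above line up: ltrace shows `calloc(73, 864)`, and 73 × 864 = 63,072, exactly the size of the block that Valgrind reports writes landing just past. That pattern is what storing one entry more than was allocated would look like. The following is a minimal sketch of a bounds-checked table, using hypothetical names (`demo_iface`, `demo_table_add`) and a made-up entry layout; it is not the plugin's code and not necessarily how the issue was eventually fixed.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_NAME_LEN 64

/* Hypothetical per-interface record (the real entry layout is not shown here). */
struct demo_iface {
    unsigned int index;
    char name[DEMO_NAME_LEN];
};

struct demo_table {
    struct demo_iface *entries;
    size_t capacity;   /* entries allocated */
    size_t used;       /* entries filled */
};

/* Allocates room for `capacity` entries; returns 0 on success, -1 on failure. */
int demo_table_init(struct demo_table *t, size_t capacity)
{
    assert(t != NULL && capacity > 0);
    t->entries = calloc(capacity, sizeof *t->entries);
    t->capacity = t->entries ? capacity : 0;
    t->used = 0;
    return t->entries ? 0 : -1;
}

/* Appends one entry, growing the table first if it is full, so a device that
 * reports more interfaces than expected cannot cause a write past the block. */
int demo_table_add(struct demo_table *t, unsigned int index, const char *name)
{
    struct demo_iface *e;

    assert(t != NULL && name != NULL);
    if (t->used == t->capacity) {
        size_t new_cap = t->capacity ? t->capacity * 2 : 8;
        struct demo_iface *grown = realloc(t->entries, new_cap * sizeof *grown);
        if (grown == NULL)
            return -1;
        /* Zero the new tail so it behaves like calloc'd memory. */
        memset(grown + t->capacity, 0, (new_cap - t->capacity) * sizeof *grown);
        t->entries = grown;
        t->capacity = new_cap;
    }
    e = &t->entries[t->used++];
    e->index = index;
    strncpy(e->name, name, DEMO_NAME_LEN - 1);
    e->name[DEMO_NAME_LEN - 1] = '\0';   /* always terminate, even when truncated */
    return 0;
}

void demo_table_free(struct demo_table *t)
{
    assert(t != NULL);
    free(t->entries);
    t->entries = NULL;
    t->capacity = 0;
    t->used = 0;
}
```

With a table initialized for 73 entries, adding a 74th entry grows the allocation instead of writing beyond it.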

The last tool I used was "dmesg". Its output seems to confirm what "valgrind" prints:


traps: check_interface[11583] general protection ip:7f2d2a692fbe sp:7ffc844ffc30 error:0 in libc-2.17.so[7f2d2a612000+1c4000]

I also ran other tests with different parameters, such as "--mode nonbulk", which @0xliam mentioned. With this one the segfault still happens on the problematic virtual router on the Nexus 7700.
I also tried the value "cisco" for this parameter, with the same result. But I saw that there is a "--max-repetitions" parameter, which by the way is still not mentioned in the docs on GitHub or exchange.icinga.com. According to the net-snmp docs, the default value for this parameter is 10, so I tried different values. In the end, with the value "1" the check works even with the problematic virtual router. Every greater number produces the segfault.
To me it looks like check_interfaces receives more or different (corrupted??) information in the SNMP response than it does from the other, similarly configured routers, for whatever reason.
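
Why a smaller max-repetitions value changes how much data comes back can be seen from the GETBULK rules in RFC 3416: with N non-repeaters, M max-repetitions and R repeating variables, a response carries at most N + M × R variable bindings. The following is a small sketch with a hypothetical helper name; clamping negative inputs to zero is this sketch's own choice.

```c
#include <assert.h>

/* Upper bound on the number of variable bindings in a GETBULK response:
 *   N = min(non-repeaters, variable bindings in the request)
 *   R = variable bindings in the request - N
 *   bound = N + max-repetitions * R
 * The agent may return fewer, e.g. when the response would get too large. */
long demo_getbulk_max_varbinds(long request_varbinds, long non_rep, long max_rep)
{
    long n, r;

    if (request_varbinds < 0) request_varbinds = 0;
    if (non_rep < 0) non_rep = 0;
    if (max_rep < 0) max_rep = 0;

    n = non_rep < request_varbinds ? non_rep : request_varbinds;
    r = request_varbinds - n;   /* never negative because n <= request_varbinds */
    return n + max_rep * r;
}

/* Usage example: 8 repeating OIDs with two different max-repetitions values. */
void demo_getbulk_example(void)
{
    assert(demo_getbulk_max_varbinds(8, 0, 10) == 80);
    assert(demo_getbulk_max_varbinds(8, 0, 1) == 8);
}
```

For a request with 8 repeating OIDs, a max-repetitions of 10 allows up to 80 variable bindings in one response, while a value of 1 allows at most 8. Whether that difference is what triggers the crash here is not established by the thread.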

I hope this information helps to debug this better.

@phil-or

phil-or commented Jan 25, 2022

Any news on this issue? It seems there is no solution for this problem yet?

@tbauriedel
Member

ref/NC/761567

@RincewindsHat
Member

@phil-or I just merged #39. Care to try it?

@RincewindsHat
Member

I will now call this fixed by #39.
Honorable mention to @distahl, who fixed this 3 years ago(!!), which I did not see (mea culpa).
