
Error in `node': corrupted double-linked list (TLS / SSL crashes) #933

Closed
theogravity opened this issue Jul 3, 2019 · 14 comments

@theogravity
Contributor

theogravity commented Jul 3, 2019

Problem description

App crashes with a core trace. Unsure what the cause is. I've attached the trace log.

dump.txt

Excerpt from it:

*** Error in `node': corrupted double-linked list: 0x000000002b02cca0 ***
======= Backtrace: =========
/lib/x86_64-linux-gnu/libc.so.6(+0x70bfb)[0x7f894cfb4bfb]
/lib/x86_64-linux-gnu/libc.so.6(+0x76fc6)[0x7f894cfbafc6]
/lib/x86_64-linux-gnu/libc.so.6(+0x7733d)[0x7f894cfbb33d]
/lib/x86_64-linux-gnu/libc.so.6(+0x78dfa)[0x7f894cfbcdfa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x54)[0x7f894cfbef64]
node(CRYPTO_zalloc+0x61)[0x14460b1]
node(SSL_new+0x39)[0x13519f9]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x6c845)[0x7f894c495845]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x51076)[0x7f894c47a076]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x2b357)[0x7f894c454357]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x6d7d5)[0x7f894c4967d5]
/home/node/node_modules/grpc/src/node/extension_binary/node-v64-linux-x64-glibc/grpc_node.node(+0x7c2f5)[0x7f894c4a52f5]
node[0xa55559]
node[0xa5a5c8]
node(uv_run+0x14b)[0xa4a21b]
node(_ZN4node5StartEPN2v87IsolateEPNS_11IsolateDataERKSt6vectorISsSaISsEES9_+0x565)[0x8e6f45]
node(_ZN4node5StartEiPPc+0x469)[0x8e5239]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f894cf642e1]
node[0x89ed85]

...
  • This may have started happening with a client that implemented managed channels on its side, but we're unsure how that would lead to crashes on the server.
    • The client may hold a channel open for 5-30s before closing it entirely (which I understand would contribute to memory consumption while the socket stays open)
  • Our grpc servers get millions of grpc connections a day, and we generally don't have control over how the client implementations (there are multiple clients/devices) connect to the server
  • We use a cluster of docker containers with 3GB of memory on multiple host boxes - horizontal / vertical scaling only delays the problem
  • Our grpc connection requires TLS
  • We're unable to reproduce the problem locally at the moment (when I do memory profiling, I see gc happening properly, but I'm only launching 5-10 of our own test clients at once to exercise our various flows); it only seems to happen in our high-traffic production environments

Questions:

  • (might not be one for you to answer) - If node runs out of memory, we wouldn't see a corrupted double-linked list type error, but an error specific to running out of memory, correct?
  • Any potential ideas of where / how to troubleshoot based on looking at the trace log?

Environment

grpc-node versions we've tried:

  • 1.20.3
  • 1.16.1

Node Version: 10.15.2 (we've also upgraded node along the way in the 10.x series)

OS: Debian-based (using the docker node:10.5.2 image); we do not do any package updates (e.g. apt update, installing the latest devtools, etc.)

Additional info

We've noticed that we get a lot of these entries in our logs - it could be misconfigured clients; unsure if it eventually leads to crashing:

E0702 22:22:08.625087997       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:1417A0C1:SSL routines:tls_post_process_client_hello:no shared cipher.
E0702 22:23:23.517829514       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106203.517802318","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:23:54.952124948       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106234.952105465","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:13.495415849       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown.
E0702 22:27:20.835176028       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106440.835163957","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:26.764072732       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106446.764040019","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0702 22:27:40.745594638       7 ssl_transport_security.cc:1229] Handshake failed with fatal error SSL_ERROR_SSL: error:14094416:SSL routines:ssl3_read_bytes:sslv3 alert certificate unknown.
E0702 22:28:19.820591172       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562106499.820576136","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}

We also get these in our logs:

E0703 00:14:17.961441781       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112857.961430955","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:14:51.348432608       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112891.348418529","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:15:32.933723158       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112932.933711032","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:16:14.605935643       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112974.605925035","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
E0703 00:16:27.814974144       7 tcp_server_custom.cc:220]   getpeername error: {"created":"@1562112987.814961950","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":73,"grpc_status":14,"os_error":"socket is not connected"}
@murgatroid99
Member

First, I think these are separate errors. I think the first "Handshake failed" error indicates that a client tried to connect with a TLS cipher suite that the server does not support. I'm less sure about the second error, but my guess is that you're using mutual TLS and the client does not have a recognized certificate. The getpeername error is probably a relatively minor bug; I would bet that the error messages you're seeing are the main impact.

The "corrupted double-linked list" error is one I have also seen very occasionally but I also do not know how to reproduce it. I'm pretty sure it's actually not a memory usage problem; it looks like memory corruption. The first thing that would be useful to know is how frequently you see that error on your servers, in error rate over time and/or related to the number of calls handled. That will help figure out how to proceed.

One solution may be to use the @grpc/grpc-js library. The most recent release of that library added the server APIs. It's still in beta and may not have all of the functionality you need, but it definitely won't have this bug. And if you do try it out it would be very helpful to know what if anything is missing for your use case.
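
For reference, a minimal server with @grpc/grpc-js looks roughly like the sketch below (the proto path, service/method names, port, and certificate files are placeholders, not taken from this issue):

const fs = require('fs');
const grpc = require('@grpc/grpc-js');
const protoLoader = require('@grpc/proto-loader');

// Hypothetical ./protos/echo.proto defining "package echo" with "service Echo".
const packageDefinition = protoLoader.loadSync('./protos/echo.proto');
const echoProto = grpc.loadPackageDefinition(packageDefinition).echo;

const server = new grpc.Server();
server.addService(echoProto.Echo.service, {
  // Unary handler: echo the request back to the caller.
  echo: (call, callback) => callback(null, call.request),
});

// Unlike the native grpc package's synchronous bind(), grpc-js binds asynchronously.
server.bindAsync(
  '0.0.0.0:50051',
  grpc.ServerCredentials.createSsl(null, [{
    private_key: fs.readFileSync('server.key'),
    cert_chain: fs.readFileSync('server.crt'),
  }]),
  (err, port) => {
    if (err) throw err;
    server.start();
    console.log(`server listening on ${port}`);
  }
);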

@theogravity
Contributor Author

Thanks for the response. Unfortunately, we do not have control over the implementation of clients in many cases. We do not use mTLS for our handshaking.

I am interested in the grpc-js library, but we're unsure how much we'd have to change in our own codebase to get it working. We also do a lot of bi-directional streaming.

Given that the core dump references SSL, we're going to put a proxy in front of our service and have it perform TLS termination instead of grpc, to remove TLS as a factor.

Will update when I'm able to.

@theogravity
Contributor Author

theogravity commented Aug 27, 2019

An update - putting a proxy in front definitely eliminated our SSL error messages and this specific issue (as in, we no longer get the corrupted double-linked list crashes). Our setup is:

Proxy does TLS termination -> grpc service running without TLS enabled (e.g. ServerCredentials.createInsecure() instead of ServerCredentials.createSsl()).
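
A rough sketch of that change with the native grpc package (port and file paths are placeholders, not our actual config):

const fs = require('fs');
const grpc = require('grpc');

const server = new grpc.Server();
// ... addService(...) calls unchanged ...

// Before: grpc terminated TLS itself
// server.bind('0.0.0.0:50051', grpc.ServerCredentials.createSsl(null, [{
//   private_key: fs.readFileSync('server.key'),
//   cert_chain: fs.readFileSync('server.crt'),
// }]));

// After: plaintext on the internal network, TLS terminated at the proxy
server.bind('0.0.0.0:50051', grpc.ServerCredentials.createInsecure());
server.start();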

Given that, there has to be something in the TLS/SSL implementation here that's causing it.

Sorry the update took so long; the change had to go through our lengthy internal vetting/QA process before it was staged against production-level traffic.

@theogravity theogravity changed the title Error in `node': corrupted double-linked list Error in `node': corrupted double-linked list (TLS / SSL crashes) Apr 7, 2021
@Matt-Esch

The "corrupted double-linked list" error is one I have also seen very occasionally but I also do not know how to reproduce it.

This is a reproducible bug in the gRPC server. It relates to grpc/grpc#19430 and I have been able to reproduce it myself. Simply spamming a grpc server with nmap -p $port causes this to occur, both with and without TLS, and whether or not the server is listening on 0.0.0.0.

The getpeername failure is related, but I would imagine that the issue occurred before this failure. It feels like a race condition in the socket lifecycle, but that is an unsubstantiated guess.
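
For anyone else wanting to reproduce, a minimal target can be as simple as the sketch below (not the exact script used here; the port is arbitrary, and no services need to be registered since the failure appears to be in the accept/getpeername path):

const grpc = require('grpc');

const server = new grpc.Server();
server.bind('0.0.0.0:50051', grpc.ServerCredentials.createInsecure());
server.start();

// Then, from another shell, spam it with open/close connection cycles, e.g.:
//   while true; do nmap -p 50051 127.0.0.1; done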

@Matt-Esch

Running this repro with valgrind:

E0113 14:50:31.035581800   24820 tcp_server_custom.cc:220]   getpeername error: {"created":"@1642085431.026475500","description":"getpeername failed","file":"../deps/grpc/src/core/lib/iomgr/tcp_uv.cc","file_line":72,"grpc_status":14,"os_error":"socket is not connected"}
==24820== Invalid read of size 8
==24820==    at 0xBC3ED57: gpr_unref (sync.cc:104)
==24820==    by 0xBB51644: grpc_error_do_unref(grpc_error*) (error.cc:186)
==24820==    by 0xBBC9110: grpc_error_unref (error.h:196)
==24820==    by 0xBBC9110: finish_accept (tcp_server_custom.cc:221)
==24820==    by 0xBBC9110: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==  Address 0x91c2e60 is 0 bytes inside a block of size 232 free'd
==24820==    at 0x4D42D3B: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24820==    by 0xBB52DA4: grpc_error_unref (error.h:196)
==24820==    by 0xBB52DA4: grpc_log_error(char const*, grpc_error*, char const*, int) (error.cc:813)
==24820==    by 0xBBC8F9F: grpc_log_if_error (error.h:271)
==24820==    by 0xBBC8F9F: finish_accept (tcp_server_custom.cc:220)
==24820==    by 0xBBC8F9F: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==  Block was alloc'd at
==24820==    at 0x4D43D2F: realloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==24820==    by 0xBC3CE53: gpr_realloc (alloc.cc:83)
==24820==    by 0xBB52477: get_placement (error.cc:205)
==24820==    by 0xBB52477: internal_set_str (error.cc:238)
==24820==    by 0xBB52477: grpc_error_set_str(grpc_error*, grpc_error_strs, grpc_slice const&) (error.cc:484)
==24820==    by 0xBB55559: tcp_error_create (tcp_uv.cc:78)
==24820==    by 0xBB55559: uv_socket_getpeername(grpc_custom_socket*, sockaddr const*, int*) (tcp_uv.cc:232)
==24820==    by 0xBBC8F71: finish_accept (tcp_server_custom.cc:216)
==24820==    by 0xBBC8F71: custom_accept_callback(grpc_custom_socket*, grpc_custom_socket*, grpc_error*) (tcp_server_custom.cc:259)
==24820==    by 0xBB56187: accept_new_connection (tcp_uv.cc:266)
==24820==    by 0xBB56187: uv_on_connect(uv_stream_s*, int) (tcp_uv.cc:283)
==24820==    by 0x1BEC6C7: uv__server_io (stream.c:570)
==24820==    by 0x1BF5688: uv__io_poll (linux-core.c:462)
==24820==    by 0x1BDC914: uv_run (core.c:385)
==24820==    by 0xE28C6F: node::NodeMainInstance::Run() (node_main_instance.cc:130)
==24820==    by 0xD5D684: node::Start(int, char**) (node.cc:1030)
==24820==    by 0x24EBBBF: main (node_main.cc:126)
==24820==
E0113 14:50:31.052687900   24820 sync.cc:105]                assertion failed: prior > 0

@Matt-Esch

Appears to be a double-unref bug in tcp_server_custom.cc: https://github.com/grpc/grpc/blob/master/src/core/lib/iomgr/tcp_server_custom.cc#L210

The call to GRPC_LOG_IF_ERROR takes ownership of the error and unrefs it, as per the guidelines (https://grpc.github.io/grpc/core/md_doc_core_grpc-error.html), so the extra unref call is unnecessary and causes a double-free bug.

@murgatroid99
Member

If that is the problem, it's fixed in the latest version of the library. The grpc package is on version 1.24 of the core, in which that bug was fixed by grpc/grpc#26750. That update was picked up in #1861 and published in version 1.24.11 of the library.
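
For anyone still on the deprecated grpc package, pinning the dependency to at least that version should pick up the fix, e.g. in package.json:

{
  "dependencies": {
    "grpc": "^1.24.11"
  }
}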

@theogravity
Contributor Author

theogravity commented Jan 13, 2022

I no longer work at the company in question, but thank you for eventually fixing it / integrating the fix!

@Matt-Esch

Upgrading and replaying the repro, I can confirm that this is indeed fixed in 1.24.11. There may be some scope for improving the peer read error handling (should an endpoint be created at all if we know the socket is already closed?), but I'm satisfied that this resolves the issue.

@murgatroid99
Member

This library has been deprecated for almost 9 months. We will still make security fixes like this one, but not broader improvements like that.

@theogravity
Contributor Author

This library has been deprecated for almost 9 months. We will still make security fixes like this one, but not broader improvements like that.

I would put a notice on the project README saying that it's in security-fix-only mode and that users should use the JavaScript-based client instead.

@murgatroid99
Member

The notice is there right at the top of the library's README, and it is marked as deprecated in npm. If you're talking about this repository as a whole, both packages live here so that would not be the appropriate place for that notice.

@theogravity
Contributor Author

The notice is there right at the top of the library's README, and it is marked as deprecated in npm. If you're talking about this repository as a whole, both packages live here so that would not be the appropriate place for that notice.

Got it. Yeah, I was referring to the main repo for this lib. Following the link does go to the readme with the notice.

@Matt-Esch

We attempted an upgrade to grpc-js, but we had performance issues, so we're stuck with this for now. I suspect it's not that grpc-js is particularly slow on its own; rather, it's just an issue of piling extra work onto the main thread.
