Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix deadlock in sighandler #231

Merged
merged 4 commits into from
Jun 28, 2014
Merged

fix deadlock in sighandler #231

merged 4 commits into from
Jun 28, 2014

Conversation

idning
Copy link
Contributor

@idning idning commented Jun 14, 2014

localtime() used in _log() is not signal-safe.

see: http://man7.org/linux/man-pages/man7/signal.7.html

if nutcracker run with a lot of log(-v 5 and heavy load) and use kill -HUP to reopen the logfile, nutcracker will dead lock with this:

Loaded symbols for /lib64/ld-linux-x86-64.so.2
__lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
97      ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S: No such file or directory.
        in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) bt
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:97
#1  0x00007f0393918b2d in _L_lock_1934 () at tzset.c:138
#2  0x00007f03939188e7 in __tz_convert (timer=0x7f0393bfd9f4, use_localtime=1, tp=0x7f0393c00360) at tzset.c:619
#3  0x0000000000410fba in _log (file=0x431036 "nc_signal.c", line=122, panic=0, fmt=0x43101c "signal %d (%s) received%s") at nc_log.c:140
#4  0x000000000041016c in signal_handler (signo=<value optimized out>) at nc_signal.c:122
#5  <signal handler called>

#6  0x00007f03939552d5 in *__GI___xstat (vers=<value optimized out>, name=<value optimized out>, buf=0x7fff2a266fc0) at ../sysdeps/unix/sysv/linux/wordsize-64/xstat.c:38
#7  0x00007f03939193f0 in __tzfile_read (file=0x7f03939c5b7c "/etc/localtime", extra=0, extrap=<value optimized out>) at tzfile.c:173
#8  0x00007f03939187a4 in tzset_internal (always=<value optimized out>, explicit=<value optimized out>) at tzset.c:439
#9  0x00007f0393918909 in __tz_convert (timer=0x7fff2a267158, use_localtime=1, tp=0x7f0393c00360) at tzset.c:624
#10 0x0000000000410fba in _log (file=0x42ed0c "nc_server.c", line=645, panic=0, fmt=0x42ece0 "key '%.*s' on dist %d maps to server '%.*s'") at nc_log.c:140

#11 0x000000000040887e in server_pool_server (ctx=0x1c0e090, pool=0x1c0e360, key=0x1c150c2 "key:__rand_int__\r\n", keylen=16) at nc_server.c:644
#12 server_pool_conn (ctx=0x1c0e090, pool=0x1c0e360, key=0x1c150c2 "key:__rand_int__\r\n", keylen=16) at nc_server.c:668
#13 0x000000000040ae04 in req_forward (ctx=0x1c0e090, conn=0x1c14e70, msg=0x1c14f70, nmsg=<value optimized out>) at nc_request.c:482
#14 req_recv_done (ctx=0x1c0e090, conn=0x1c14e70, msg=0x1c14f70, nmsg=<value optimized out>) at nc_request.c:524
#15 0x0000000000409f00 in msg_parsed (ctx=0x1c0e090, conn=0x1c14e70) at nc_message.c:451
#16 msg_parse (ctx=0x1c0e090, conn=0x1c14e70) at nc_message.c:581
#17 msg_recv_chain (ctx=0x1c0e090, conn=0x1c14e70) at nc_message.c:642
#18 msg_recv (ctx=0x1c0e090, conn=0x1c14e70) at nc_message.c:681
#19 0x0000000000406179 in core_recv (arg=0x1c14e70, events=<value optimized out>) at nc_core.c:158
#20 core_core (arg=0x1c14e70, events=<value optimized out>) at nc_core.c:293
#21 0x000000000041bbb8 in event_wait (evb=0x1c0e570, timeout=<value optimized out>) at nc_epoll.c:269
#22 0x0000000000405f89 in core_loop (ctx=0x1c0e090) at nc_core.c:316
#23 0x0000000000412a38 in nc_run (argc=<value optimized out>, argv=0x7fff2a267690) at nc.c:531
#24 main (argc=<value optimized out>, argv=0x7fff2a267690) at nc.c:580
(gdb)

this can reproduction with::

ning@ning-laptop /tmp/r/nutcracker-4000$ cat reopen.sh 
pkill -HUP -f "^bin/nutcracker -c /tmp/r/nutcracker-4000/conf/nutcracker.conf -o /tmp/r/nutcracker-4000/log/nutcracker.log -p /tmp/r/nutcracker-4000/log/nutcracker.pid -s 5000" && ls -l log
ning@ning-laptop /tmp/r/nutcracker-4000$ for i in `seq 1 100` ; do sh reopen.sh && sleep 1 ; done ;

we should use signal-safe function in signal_handler.

redis has the same problem long time ago: redis/redis#213

the new logline of this patch:

signal SIGHUP received, reopening log file

@ideal
Copy link

ideal commented Jun 16, 2014

why not use nc_snprintf instead of multi strcat() ?

@idning
Copy link
Contributor Author

idning commented Jun 16, 2014

@ideal: snprintf is good, but no document said it's signal-safe, strcat is signal-safe according to this:

http://austingroupbugs.net/view.php?id=692#c1609

@buaazp
Copy link

buaazp commented Jun 16, 2014

There are loga() still in log_level_up() and log_level_down() when SIGTTIN or SIGTTOU is received.
Maybe loga() should also be remove in these two actions.

@idning
Copy link
Contributor Author

idning commented Jun 17, 2014

@buaazp thanks for review. here is a new commit to fix this.

log_reopen use log_stderr, which use snprintf, it's not correct, but not a common case, so I leave it untouched.

@allenlz
Copy link
Contributor

allenlz commented Jun 18, 2014

Sadly, nc_stacktrace will call loga and free too.

@manjuraj
Copy link
Collaborator

Apologies for the delay. Things have been busy!

Good catch @idning.

Here is a list of async-safe-signals -- https://www.securecoding.cert.org/confluence/display/seccode/SIG30-C.+Call+only+asynchronous-safe+functions+within+signal+handlers. I believe nc_snprintf() should be signal safe, as it eventually calls write() which is signal safe. Agree?

If the problem is with localtime() - https://github.com/twitter/twemproxy/blob/master/src%2Fnc_log.c#L139-L141, the right thing to do in my opinon is to not call this piece of code in signal handler context. How about introducing a new function log_safe() that is exactly like _log() except that it doesn't call localtime(). So instead of printing time in [] here - https://github.com/twitter/twemproxy/blob/master/src%2Fnc_log.c#L143 you could just leave it blank

So, a log line like this

[Thu Aug  2 00:03:09 2012] nc_proxy.c:336 accepted c 7 on p 6 from '127.0.0.1:54009']

in signal context would look like this:

[........................] nc_proxy.c:336 accepted c 7 on p 6 from '127.0.0.1:54009']

Does that sound good?

@idning
Copy link
Contributor Author

idning commented Jun 18, 2014

@manjuraj xxprintf is not signal-safe because they use malloc/free, see glibc manual:

http://www.gnu.org/software/libc/manual/html_node/Formatted-Output-Functions.html#Formatted-Output-Functions

all these functions is AS-Unsafe

it looks that ngx implement a new printf function to avoid this:

u_char *
ngx_vslprintf(u_char *buf, u_char *last, const char *fmt, va_list args)
{
    u_char                *p, zero;
    ngx_uint_t             width, sign, hex, max_width, frac_width, n;
    ...
    while (*fmt && buf < last) {

        /*
         * "buf < last" means that we could copy at least one character:
         * the plain character, "%%", "%c", and minus without the checking
         */

        if (*fmt == '%') {

            i64 = 0;
            ui64 = 0;

@manjuraj
Copy link
Collaborator

sigh :(

@manjuraj
Copy link
Collaborator

@idning how about repurposing/borrowing code from ngx and calling it log_vslprintf() and calling it from log_safe()

@idning
Copy link
Contributor Author

idning commented Jun 18, 2014

ngx use a cached time to be used in signal handler, it's sophisticated.
let me have a try on log_safe() and use ........... instead of time str.

@idning
Copy link
Contributor Author

idning commented Jun 19, 2014

I have add log_stderr_safe/log_safe/log_stacktrace and use these functions in sighandler.

test case:
https://github.com/idning/test-twemproxy/blob/master/test_redis/test_basic.py#L48-L62

I create a safe_snprintf lib for this:
https://github.com/idning/safe_snprintf

format:

[Thu Jun 19 16:22:02 2014] nc_core.c:201 close c 11 '127.0.0.5:51121' on event 00FF eof 1 done 1 rb 10557913 sb 72  
[........................] signal 1 (SIGHUP) received, reopening log file
[........................] signal 1 (SIGHUP) received, reopening log file
[........................] signal 21 (SIGTTIN) received, up logging level
[........................] up log level to 6
[........................] signal 22 (SIGTTOU) received, down logging level
[........................] down log level to 5
[........................] signal 11 (SIGSEGV) received, core dumping

manjuraj added a commit that referenced this pull request Jun 28, 2014
fix deadlock in sighandler
@manjuraj manjuraj merged commit 54e4fe7 into twitter:master Jun 28, 2014
@manjuraj
Copy link
Collaborator

Great stuff @idning

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants