Debugging Excessive and Contended Mutex Locks

The purpose of this document is to show tools that help find unexpected lock contention which is slowing down a workload.

When a thread needs a mutex that is already locked, it must wait, which causes a context switch that is far slower than an uncontended lock/unlock: ball-park figures are about 10 times slower on a single CPU, and about 50 times slower when the lock is held by another CPU. Lock contention may also increase idle time, if there are no additional threads ready to run.
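
To get a feel for these ball-park numbers on your own hardware, a small host-side microbenchmark can help. The sketch below is not OSv's lockfree::mutex - it uses the standard std::mutex and std::thread, and the iteration count is arbitrary - but it shows the same effect: an uncontended lock/unlock pair costs tens of nanoseconds, while a second thread fighting for the same mutex makes each pair far more expensive, typically by an order of magnitude or more.

    // Host-side sketch (not OSv's lockfree::mutex): time an uncontended
    // lock/unlock pair, then the same loop while a second thread fights
    // for the same mutex.  Build with e.g.: g++ -O2 -pthread mutex_cost.cc
    #include <chrono>
    #include <cstdio>
    #include <mutex>
    #include <thread>

    static constexpr long kIterations = 10'000'000;

    static double ns_per_pair(std::mutex& m)
    {
        auto start = std::chrono::steady_clock::now();
        for (long i = 0; i < kIterations; i++) {
            m.lock();
            m.unlock();
        }
        auto end = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::nano>(end - start).count()
               / kIterations;
    }

    int main()
    {
        std::mutex m;

        // Uncontended: only this thread ever touches the mutex.
        std::printf("uncontended: %.1f ns per lock/unlock pair\n", ns_per_pair(m));

        // Contended: another thread hammers the same mutex, so some lock()
        // calls must block and later be woken up - the expensive case.
        std::thread contender([&m] { ns_per_pair(m); });
        std::printf("contended:   %.1f ns per lock/unlock pair\n", ns_per_pair(m));
        contender.join();
        return 0;
    }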

In some workloads, we can discover that one mutex causes much of this lock contention, and that it can perhaps be replaced by a lock-free algorithm or by RCU (e.g., when the mutex protects a read-mostly data structure), avoiding the contention. In other cases, we discover that while the mutex is not contended, we lock and unlock it excessively, which also takes time. This document explains how to find these problematic mutexes.
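
For the read-mostly case, the core idea behind replacing the mutex with RCU is that readers stop taking any lock and instead just dereference an atomically published pointer, while the rare writer installs a fresh copy. The sketch below is only a portable illustration of this pattern, not OSv's actual <osv/rcu.hh> API, and it deliberately omits the hard part - knowing when the old copy may be freed - which is exactly what RCU's grace periods provide. The routing_table type and its fields are made up for the example.

    // Illustration only: a read-mostly table where readers take no lock and
    // just follow an atomically published pointer; the rare writer copies,
    // modifies and publishes a new version.  Real RCU (e.g. OSv's
    // <osv/rcu.hh>) additionally tells the writer when the old copy may be
    // freed - that reclamation step is deliberately omitted here.
    #include <atomic>
    #include <map>
    #include <mutex>
    #include <string>

    struct routing_table {            // made-up example data structure
        std::map<std::string, int> routes;
    };

    static std::atomic<const routing_table*> current{new routing_table{}};
    static std::mutex writers;        // serializes the rare updates only

    // Read path: no mutex, no contention - a single atomic pointer load.
    int lookup(const std::string& key)
    {
        const routing_table* t = current.load(std::memory_order_acquire);
        auto it = t->routes.find(key);
        return it == t->routes.end() ? -1 : it->second;
    }

    // Write path: copy-on-write under a writer-only lock.
    void add_route(const std::string& key, int value)
    {
        std::lock_guard<std::mutex> guard(writers);
        auto* next = new routing_table(*current.load(std::memory_order_relaxed));
        next->routes[key] = value;
        const routing_table* old = current.exchange(next, std::memory_order_acq_rel);
        (void)old;  // leaked here; RCU's grace period is what makes freeing it safe
    }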

The text below follows a real example of running "memcached" with UDP, using an image built with:

make image=memcached,httpserver

For more information, check out [Debugging OSv](Debugging OSv) and [Trace analysis using trace.py](Trace analysis using trace.py).

Counting mutex locks and waits

The first step is to verify that we do have a problem with excessive mutex-related context switches, by counting the frequency of context switches, mutex-caused context switches, and mutex use in general. One easy way to do this counting is with the "scripts/freq.py" tool which connects to a running OSv image via the REST API.

Our image includes both the REST API ("httpserver") and memcached, so let's just run it:

sudo scripts/run.py --api -m4G -c1 -V

The "--api" option is needed so that the REST API port in the guest (port 8000) is forwarded to port 8000 on the host, where the "scripts/freq.py" script will be able to connect to it. Alternatively, you can use the "-nv" options to enable bridged networking for the guest, so that the guest gets an IP address fully visible to the host. You'll now see output like:

OSv v0.24-78-g43be0b8
eth0: 192.168.122.15
Memcached for OSV, based on Memcached 1.4.15

In the shell prompt on the host, the scripts/freq.py script can be used to periodically show the frequency of the given tracepoints. In the following example, we look at sched_switch (the number of context switches), sched_wait (the number of times a thread yielded because it started to wait), mutex_lock_wait (the number of times such a wait was caused by mutex::lock()), and mutex_lock (the total number of mutex::lock() calls).

$ scripts/freq.py 127.0.0.1 sched_switch sched_wait mutex_lock_wait mutex_lock
Showing number of events per second. Refreshing every 2 seconds
 sched_switch sched_wait mutex_lock_wait mutex_lock
           42         26               0       6443
          157         77               0        366
          145         79               0        299
          137         75               0        350
          140         76               0        257
          149         82               1        790
        89186      45628           12122    4740993
       104427      53004           14461    4714123
        91227      46281           12417    4544367
        97062      49180           13143    4736282
       107685      54459           14739    4651545
        94802      48192           12786    4721227
...
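
Reading the steady-state rows above: only about 0.3% of the roughly 4.7 million mutex_lock calls per second actually block (roughly 13,000 mutex_lock_wait events per second), but those blocks account for about a quarter of the roughly 50,000 sched_wait events per second - enough to be worth chasing.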

TODO: this mutex_lock_wait number is lower than I saw previously... Or below.

Finding where the mutex waits happen

First, we need to run the workload with the mutex_lock_wait and mutex_lock_wake tracepoints enabled, together with backtraces:

sudo scripts/run.py --trace=mutex_lock_wake,mutex_lock_wait --trace-backtrace -m4G -c1 -Vnv -e "memcached.so -u root -t 1 -m 3000"

Then run in the background the "memaslap" benchmark, which loads this server. Now, use scripts/trace.py extract to extract these tracepoint pairs. We can use scripts/trace.py summary to show the time period these tracepoints span, and their number in this period:

$ scripts/trace.py summary
Collected 30701 samples spanning 535.26 ms

Tracepoint statistics:

  name               count
  ----               -----
  mutex_lock_wait    15350
  mutex_lock_wake    15351

Now, scripts/trace.py prof-lock -FL can extract and summarize these tracepoint pairs.

The first block in the summary shows that 76% of the contended locks happened in in_pcblookup_hash():

38.66 ms (100.00%, #15350) All
 log /data/nyh/osv/include/osv/trace.hh:363
 lockfree::mutex::lock() /data/nyh/osv/core/lfmutex.cc:87
 |-- 29.38 ms (76.01%, #10598) mutex_lock /data/nyh/osv/include/osv/mutex.h:33
 |    |-- 29.04 ms (75.12%, #10409) in_pcblookup_hash /data/nyh/osv/bsd/sys/netinet/in_pcb.cc:1647
 |    |    udp_input(mbuf*, int) /data/nyh/osv/bsd/sys/netinet/udp_usrreq.cc:546
 |    |    ip_input /data/nyh/osv/bsd/sys/netinet/ip_input.cc:780
 |    |    netisr_dispatch_src /data/nyh/osv/bsd/sys/net/netisr.cc:769
 |    |    ether_input_internal /data/nyh/osv/bsd/sys/net/if_ethersubr.cc:733
 |    |    ether_nh_input /data/nyh/osv/bsd/sys/net/if_ethersubr.cc:746
 |    |    netisr_dispatch_src /data/nyh/osv/bsd/sys/net/netisr.cc:769
 |    |    virtio::net::receiver() /data/nyh/osv/drivers/virtio-net.cc:452
 |    |    sched::thread::main() /data/nyh/osv/core/sched.cc:813
 |    |   
 |    \-- 344.25 us (0.89%, #189) udp_output /data/nyh/osv/bsd/sys/netinet/udp_usrreq.cc:940
 |         udp_send /data/nyh/osv/bsd/sys/netinet/udp_usrreq.cc:1533
 |         sosend_dgram /data/nyh/osv/bsd/sys/kern/uipc_socket.cc:910
 |         kern_sendit /data/nyh/osv/bsd/sys/kern/uipc_syscalls.cc:506
 |         linux_sendit /data/nyh/osv/bsd/sys/compat/linux/linux_socket.cc:499
 |         sendmsg /data/nyh/osv/bsd/sys/kern/uipc_syscalls_wrap.cc:239
 |         0x1000000136f4
 |         0x100000014585
 |         0x10000001491d
 |         0x100000410a43
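
To put the numbers in perspective: the 38.66 ms at the root of the tree is the waiting time accumulated over all 15350 wait/wake pairs, out of the 535.26 ms these samples span - roughly 7% of the traced period was spent blocked on contended mutexes, and about three quarters of that on this one inpcb lock.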

We see here two paths which do INP_LOCK(inpcb): The first is the input path, run from the virtio receiver thread. The second is the output path, where the application thread uses sendmsg() to send a reply.

The receiver thread runs udp_input(), which calls:

		inp = in_pcblookup_mbuf(&V_udbinfo, ip->ip_src, uh->uh_sport,
		    ip->ip_dst, uh->uh_dport, INPLOOKUP_WILDCARD |
		    INPLOOKUP_LOCKPCB, ifp, m);

The INPLOOKUP_LOCKPCB flag is the one which causes the inpcb's locking, and the contention.

TODO: say that netchannels would solve this

Finding where any waits happen

Run with the sched_wait tracepoint enabled, and with backtraces:

sudo scripts/run.py --trace=sched_wait --trace-backtrace -m4G -c1 -Vnv -e "memcached.so -u root -t 1 -m 3000"

Run the memaslap benchmark on the host, and then run scripts/trace.py extract.

$ scripts/trace.py summary
Collected 32755 samples spanning 323.73 ms

Tracepoint statistics:

  name          count
  ----          -----
  sched_wait    32755

Then run scripts/trace.py prof -FL to understand where these 32755 waits (in 1/3rd of a second) come from. It turns out that:

  1. 59% of the waits, 19314 in 1/3rd of a second, happen where net::receiver (virtio_net.cc) calls virtio_driver::wait_for_queue(vq, &vring::used_ring_not_empty);
  2. About 35% of the waits happen in mutex locks. As above, about 75% of those are the inpcb lock on input, and most of the rest is in the poll() implementation. The remainder is negligible.

Finding where the mutex locks happen

While mutex waits are much slower than ordinary lock/unlock operations, when the number of lock/unlock pairs is huge - almost 5 million times a second in the example above - even uncontended locking adds up. When the mutex is only used from a single CPU, a lock/unlock pair takes roughly 25 nanoseconds on my machine, so 5 million of them take 0.125 seconds - i.e., 12.5% of the run time is spent locking and unlocking mutexes!

$ sudo scripts/run.py --trace=mutex_lock --trace-backtrace -m4G -c1 -Vnv -e "memcached.so -u root -t 1 -m 3000"
$ scripts/trace.py extract
$ scripts/trace.py summary
Collected 30696 samples spanning 14.71 ms

Tracepoint statistics:

  name          count
  ----          -----
  mutex_lock    30696
$ scripts/trace.py prof -FL |less

The biggest culprits are:

28% in rtalloc1_fib(), called from rtalloc_ign_fib() in ip_output() or from in_pcbladdr() in udp_output():
     14%: rwlock::runlock() and rlock() in rtalloc1_fib()
     7%: similarly, the RT_LOCK(newrt) lock in rtalloc1_fib()
     7%: also RT_LOCK(rt) in in_matroute() (rn = rnh->rnh_matchaddr(dst, rnh);), called from rtalloc1_fib()

14%: rwlock::runlock() and rlock() in two places in arpresolve() in ether_output()

11.3% from _GLOBAL__sub_I_gdb_trace_function_entry. Not a fluke - apparently real! What is this?

10.5% in in_pcblookup_hash() in udp_input():
      7%: rwlock::runlock() and rlock() in in_pcblookup_hash()
      3.5%: mutex_lock() in in_pcblookup_hash()

10.5% poll-related:
      7%: mutex_lock() in pipe_buffer::write_events() and socket_file::poll(), called from poll_scan(), called from epoll_wait()
      3.5%: mutex_lock() in sopoll_generic()

7%: rwlock::runlock() and rlock() in udp_output() (INP_HASH_RUNLOCK(&V_udbinfo) and the matching RLOCK)
3.5%: RTFREE() in in_pcbladdr() and ip_output()
3.5%: mutex_lock() in if_transmit()
3.5%: mutex_lock() in sosend_dgram()
3.5%: mutex_lock() in soreceive_dgram()