Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core dump in scylla-server with DHCP in DPDK mode #184

Closed
kthommandra opened this issue Aug 24, 2016 · 2 comments
Closed

Core dump in scylla-server with DHCP in DPDK mode #184

kthommandra opened this issue Aug 24, 2016 · 2 comments

Comments

@kthommandra
Copy link

kthommandra commented Aug 24, 2016

Avi Kivity ..
Thanks for the detailed report. Copying Calle who wrote our dhcp server, some analysis inline.

On 08/23/2016 07:10 AM, kthommandra via ScyllaDB users wrote:
I’m running 1.2.3-20160731.e9b7352.el7.centos.x86_64

When I use the default DHCP mode of DPDK then as soon as the DHCP address is obtained scylla-server dumps core and it is restarted by systemd.
Initially I had failed to notice this due to the automatic restart from systemd, and “looks like” the DHCP based startup is successful sometimes.
Since then , for my tests, I have disabled the automatic restart in systemd.
The core during startup is very easily recreateable.

The core happens when following messages appear in the logs

Aug 22 20:05:18 XYZ scylla[43446]: DHCP sending discover
Aug 22 20:05:40 XYZ scylla[43446]: DHCP Got offer for 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP sending request for 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP Got ack on request
Aug 22 20:05:40 XYZ scylla[43446]: DHCP  ip: 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP  nm: 255.255.248.0
Aug 22 20:05:40 XYZ scylla[43446]: DHCP  gw: 172.30.0.1

The next issue I ran into was that stock gdb on Centos 7.2 crashes when I attempt to examine the core file and load symbols from scylla.debug
I had to do a gdb on gdb :) to see gdb fault signature
Luckily, this is a reported bug in gdb https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68159 [show] and the patch in that bug report fixed it for me.

I did some basic checks using elfutils to verify that the actual binary and the debuginfo file match properly. So I’m not sure why not all symbols are decoded properly (I’m assuming the compilation was done to retain the symbols and then extracted into debuginfo) and why gdb wasn’t able to automatically find and load the scylla.debug file

Assuming the gdb session gave me reliable data, following is the stack of the offending thread

(gdb) bt full
#0  0x0000000000000000 in ?? ()
No symbol table info available.
#1  0x00000000005bfa35 in net::ipv4::handle_received_packet (this=0x62d000253aa0, p=..., from=...) at net/ip.cc:151
        handled = 213
        r = {
          _promise = 0x5c43bc <std::__detail::_Map_base<net::ipv4_address, std::pair<net::ipv4_address const, net::ethernet_address>, std::allocator<std::pair<net::ipv4_address const, net::ethernet_address> >, std::__detail::_Select1st, std::equal_to<net::ipv4_address>, std::hash<net::ipv4_address>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](net::ipv4_address&&)+44>, _local_state = {static copy_noexcept = true, _u = {st = 108645493837840, ex = {
                _M_exception_object = 0x62d000111010}}}, static copy_noexcept = <optimized out>}
        iph = <optimized out>
        h = {ihl = 0 '\000', ver = 8 '\b', dscp = 47 '/', ecn = 3 '\003', len = {raw = 21}, id = {raw = 25296}, frag = {
            raw = 0}, ttl = 128 '\200', ip_proto = 249 '\371', csum = {raw = 59838}, src_ip = {ip = {raw = 32632}},
          dst_ip = {ip = {raw = 3921607008}}, options = 0x7f78e9bef25c "x\177"}
        ip_len = <optimized out>
        ip_hdr_len = 20
        offset = <optimized out>
        mf = <optimized out>
        l4 = <optimized out>
#2  0x000062d0002e6e05 in ?? ()
No symbol table info available.
#3  0x00000000005f63c5 in smp_message_queue::async_work_item<net::dhcp::impl::handle(net::packet&, net::ip_hdr*, net::ethernet_address, bool&)::{lambda()#1}>::~async_work_item() (this=0x7f78e9bef510, __in_chrg=<optimized out>) at core/reactor.hh:356
No locals.
#4  0x8000000000000000 in ?? ()
No symbol table info available.
#5  0x0000000000004000 in ?? ()
No symbol table info available.
#6  0x00007f78e9bef350 in ?? ()
No symbol table info available.
#7  0x00007f4c9da84c44 in ?? ()
No symbol table info available.
#8  0x00007f78e9bef338 in ?? ()
No symbol table info available.
#9  0xfffffff87ffbfcff in ?? ()
No symbol table info available.
#10 0x000062d0002e8f60 in ?? ()
No symbol table info available.
#11 0x000062d0002e8f60 in ?? ()
No symbol table info available.
#12 0x000062d00015efb0 in ?? ()
No symbol table info available.
#13 0x000000000161eb00 in vtable for blockdev_file_impl ()
No symbol table info available.
#14 0x000062d00015ef00 in ?? ()
No symbol table info available.
#15 0x0000000000000000 in ?? ()
No symbol table info available.

Looks like we crash in the call to ->handle():

if (_packet_filter) {
    bool handled = false;
    auto r = _packet_filter->handle(p, &h, from, handled);
    if (handled) {
        return std::move(r);
    }
}

so _packet_filter is set, but *_packet_filter has been destroyed.

Calle, isn't this the culprit?

future<> native_network_stack::run_dhcp(bool is_renew, const dhcp::lease& res) {
lw_shared_ptr d = make_lw_shared(_inet);

// Hijack the ip-stack.
for (unsigned i = 0; i < smp::count; i++) {
    smp::submit_to(i, [d] {
        auto & ns = static_cast<native_network_stack&>(engine().net());
        ns.set_ipv4_packet_filter(d->get_ipv4_filter());
    });
}

net::dhcp::result_type fut = is_renew ? d->renew(res) : d->discover();

return fut.then([this, d, is_renew](bool success, const dhcp::lease & res) {
    for (unsigned i = 0; i < smp::count; i++) {
        smp::submit_to(i, [] {
            auto & ns = static_cast<native_network_stack&>(engine().net());
            ns.set_ipv4_packet_filter(nullptr);
        });
    }
    on_dhcp(success, res, is_renew);
});

}

d will be destroyed once the last continuation is executed, which could be before the set_pv4_packet_filter(nullptr).

There's at least one other fire-and-forget future in the dhcp code, in dhcp::impl::handle().

The complete gdb session output is attached to this email

Finally I decided to avoid the DHCP mode and used the static IP address by doing the following in scylla-server.service

ExecStart=/usr/bin/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET --host-ipv4-addr=ABCD --gw-ipv4-addr=WXYZ --netmask-ipv4-addr=EFGH --dhcp=0

I saw lot of ARP timeout errors of the following format during startup with static IP, but I guess after some time, eventually things settle down and I was able to do RW tests successfully.

Aug 20 09:58:24 XYZ scylla[2403]: ARP timeoutnet::arp_timeout_error' of type ': ARP timeout
Aug 20 09:58:24 XYZ scylla[2403]: WARNING: exceptional future ignored of type 'net::arp_timeout_error'

-krishna

@avikivity
Copy link
Member

@elcallio

@elcallio
Copy link
Contributor

Thanks for the very detailed analysis @kthommandra. I think you are right, and in fact I think both the register and unregister calls are somewhat botched, as we don't wait for either. There should be a clear continuation chain "register_listener->call_dhcp->unregister_listener", and a nice finally to keep the dhcp object alive. I've sent a patch to the list.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants