You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Avi Kivity ..
Thanks for the detailed report. Copying Calle who wrote our dhcp server, some analysis inline.
On 08/23/2016 07:10 AM, kthommandra via ScyllaDB users wrote:
I’m running 1.2.3-20160731.e9b7352.el7.centos.x86_64
When I use the default DHCP mode of DPDK then as soon as the DHCP address is obtained scylla-server dumps core and it is restarted by systemd.
Initially I had failed to notice this due to the automatic restart from systemd, and “looks like” the DHCP based startup is successful sometimes.
Since then , for my tests, I have disabled the automatic restart in systemd.
The core during startup is very easily recreateable.
The core happens when following messages appear in the logs
Aug 22 20:05:18 XYZ scylla[43446]: DHCP sending discover
Aug 22 20:05:40 XYZ scylla[43446]: DHCP Got offer for 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP sending request for 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP Got ack on request
Aug 22 20:05:40 XYZ scylla[43446]: DHCP ip: 172.30.0.64
Aug 22 20:05:40 XYZ scylla[43446]: DHCP nm: 255.255.248.0
Aug 22 20:05:40 XYZ scylla[43446]: DHCP gw: 172.30.0.1
The next issue I ran into was that stock gdb on Centos 7.2 crashes when I attempt to examine the core file and load symbols from scylla.debug
I had to do a gdb on gdb :) to see gdb fault signature
Luckily, this is a reported bug in gdb https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68159 [show] and the patch in that bug report fixed it for me.
I did some basic checks using elfutils to verify that the actual binary and the debuginfo file match properly. So I’m not sure why not all symbols are decoded properly (I’m assuming the compilation was done to retain the symbols and then extracted into debuginfo) and why gdb wasn’t able to automatically find and load the scylla.debug file
Assuming the gdb session gave me reliable data, following is the stack of the offending thread
(gdb) bt full
#0 0x0000000000000000 in ?? ()
No symbol table info available.
#1 0x00000000005bfa35 in net::ipv4::handle_received_packet (this=0x62d000253aa0, p=..., from=...) at net/ip.cc:151
handled = 213
r = {
_promise = 0x5c43bc <std::__detail::_Map_base<net::ipv4_address, std::pair<net::ipv4_address const, net::ethernet_address>, std::allocator<std::pair<net::ipv4_address const, net::ethernet_address> >, std::__detail::_Select1st, std::equal_to<net::ipv4_address>, std::hash<net::ipv4_address>, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[](net::ipv4_address&&)+44>, _local_state = {static copy_noexcept = true, _u = {st = 108645493837840, ex = {
_M_exception_object = 0x62d000111010}}}, static copy_noexcept = <optimized out>}
iph = <optimized out>
h = {ihl = 0 '\000', ver = 8 '\b', dscp = 47 '/', ecn = 3 '\003', len = {raw = 21}, id = {raw = 25296}, frag = {
raw = 0}, ttl = 128 '\200', ip_proto = 249 '\371', csum = {raw = 59838}, src_ip = {ip = {raw = 32632}},
dst_ip = {ip = {raw = 3921607008}}, options = 0x7f78e9bef25c "x\177"}
ip_len = <optimized out>
ip_hdr_len = 20
offset = <optimized out>
mf = <optimized out>
l4 = <optimized out>
#2 0x000062d0002e6e05 in ?? ()
No symbol table info available.
#3 0x00000000005f63c5 in smp_message_queue::async_work_item<net::dhcp::impl::handle(net::packet&, net::ip_hdr*, net::ethernet_address, bool&)::{lambda()#1}>::~async_work_item() (this=0x7f78e9bef510, __in_chrg=<optimized out>) at core/reactor.hh:356
No locals.
#4 0x8000000000000000 in ?? ()
No symbol table info available.
#5 0x0000000000004000 in ?? ()
No symbol table info available.
#6 0x00007f78e9bef350 in ?? ()
No symbol table info available.
#7 0x00007f4c9da84c44 in ?? ()
No symbol table info available.
#8 0x00007f78e9bef338 in ?? ()
No symbol table info available.
#9 0xfffffff87ffbfcff in ?? ()
No symbol table info available.
#10 0x000062d0002e8f60 in ?? ()
No symbol table info available.
#11 0x000062d0002e8f60 in ?? ()
No symbol table info available.
#12 0x000062d00015efb0 in ?? ()
No symbol table info available.
#13 0x000000000161eb00 in vtable for blockdev_file_impl ()
No symbol table info available.
#14 0x000062d00015ef00 in ?? ()
No symbol table info available.
#15 0x0000000000000000 in ?? ()
No symbol table info available.
Looks like we crash in the call to ->handle():
if (_packet_filter) {
bool handled = false;
auto r = _packet_filter->handle(p, &h, from, handled);
if (handled) {
return std::move(r);
}
}
so _packet_filter is set, but *_packet_filter has been destroyed.
I saw lot of ARP timeout errors of the following format during startup with static IP, but I guess after some time, eventually things settle down and I was able to do RW tests successfully.
Aug 20 09:58:24 XYZ scylla[2403]: ARP timeoutnet::arp_timeout_error' of type ': ARP timeout
Aug 20 09:58:24 XYZ scylla[2403]: WARNING: exceptional future ignored of type 'net::arp_timeout_error'
-krishna
The text was updated successfully, but these errors were encountered:
Thanks for the very detailed analysis @kthommandra. I think you are right, and in fact I think both the register and unregister calls are somewhat botched, as we don't wait for either. There should be a clear continuation chain "register_listener->call_dhcp->unregister_listener", and a nice finally to keep the dhcp object alive. I've sent a patch to the list.
Avi Kivity ..
Thanks for the detailed report. Copying Calle who wrote our dhcp server, some analysis inline.
On 08/23/2016 07:10 AM, kthommandra via ScyllaDB users wrote:
I’m running 1.2.3-20160731.e9b7352.el7.centos.x86_64
When I use the default DHCP mode of DPDK then as soon as the DHCP address is obtained scylla-server dumps core and it is restarted by systemd.
Initially I had failed to notice this due to the automatic restart from systemd, and “looks like” the DHCP based startup is successful sometimes.
Since then , for my tests, I have disabled the automatic restart in systemd.
The core during startup is very easily recreateable.
The core happens when following messages appear in the logs
The next issue I ran into was that stock gdb on Centos 7.2 crashes when I attempt to examine the core file and load symbols from scylla.debug
I had to do a gdb on gdb :) to see gdb fault signature
Luckily, this is a reported bug in gdb https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68159 [show] and the patch in that bug report fixed it for me.
I did some basic checks using elfutils to verify that the actual binary and the debuginfo file match properly. So I’m not sure why not all symbols are decoded properly (I’m assuming the compilation was done to retain the symbols and then extracted into debuginfo) and why gdb wasn’t able to automatically find and load the scylla.debug file
Assuming the gdb session gave me reliable data, following is the stack of the offending thread
Looks like we crash in the call to ->handle():
so _packet_filter is set, but *_packet_filter has been destroyed.
Calle, isn't this the culprit?
future<> native_network_stack::run_dhcp(bool is_renew, const dhcp::lease& res) {
lw_shared_ptr d = make_lw_shared(_inet);
}
d will be destroyed once the last continuation is executed, which could be before the set_pv4_packet_filter(nullptr).
There's at least one other fire-and-forget future in the dhcp code, in dhcp::impl::handle().
The complete gdb session output is attached to this email
Finally I decided to avoid the DHCP mode and used the static IP address by doing the following in scylla-server.service
I saw lot of ARP timeout errors of the following format during startup with static IP, but I guess after some time, eventually things settle down and I was able to do RW tests successfully.
-krishna
The text was updated successfully, but these errors were encountered: