Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Coredump on abnormal multiget #431

Closed
neverchanje opened this issue Nov 28, 2019 · 6 comments
Closed

Coredump on abnormal multiget #431

neverchanje opened this issue Nov 28, 2019 · 6 comments
Assignees
Labels
type/bug This issue reports a bug.

Comments

@neverchanje
Copy link
Contributor

neverchanje commented Nov 28, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    If possible, provide a recipe for reproducing the error.

Normal bootstrap.

  1. What did you expect to see?

No coredump.

  1. What did you see instead?

Coredump:

#0  0x00007f85a80475c1 in vfprintf () from /lib64/libc.so.6
#1  0x00007f85ac03a5a6 in dsn::tools::simple_logger::dsn_logv (this=0x17f4a80, file=<optimized out>, function=<optimized out>, line=<optimized out>, log_level=LOG_LEVEL_WARNING, 
    fmt=0x52bad38 "[] rocksdb abnormal multi_get from : hash_key = bikan:5c6ba3b20bc5cb2bb241aba4e3e87567, start_sort_key = \\x00\\x00\\x01n/%nm (inclusive), stop_sort_key =  (exclu"..., args=0x7f85705b1628) at /home/wutao1/pegasus-release/rdsn/src/core/tools/common/simple_logger.cpp:235
#2  0x00007f85abfb260f in dsn_logv (file=0x8b5b92 "pegasus_server_impl.cpp", 
    function=0x8b967b <pegasus::server::pegasus_server_impl::on_multi_get(dsn::apps::multi_get_request const&, dsn::rpc_replier<dsn::apps::multi_get_response>&)::__FUNCTION__> "on_multi_get", line=955, 
    log_level=LOG_LEVEL_WARNING, 
    fmt=0x52bad38 "[] rocksdb abnormal multi_get from: hash_key = bikan:5c6ba3b20bc5cb2bb241aba4e3e87567, start_sort_key = \\x00\\x00\\x01n/%nm (inclusive), stop_sort_key =  (exclu"..., args=0x7f85705b1628) at /home/wutao1/pegasus-release/rdsn/src/core/core/logging.cpp:124
#3  0x00007f85abfb273f in dsn_logf (file=file@entry=0x8b5b92 "pegasus_server_impl.cpp", 
    function=function@entry=0x8b967b <pegasus::server::pegasus_server_impl::on_multi_get(dsn::apps::multi_get_request const&, dsn::rpc_replier<dsn::apps::multi_get_response>&)::__FUNCTION__> "on_multi_get", 
    line=line@entry=955, log_level=log_level@entry=LOG_LEVEL_WARNING, fmt=<optimized out>) at /home/wutao1/pegasus-release/rdsn/src/core/core/logging.cpp:141
#4  0x00000000005f4aa2 in pegasus::server::pegasus_server_impl::on_multi_get (this=0x14449c800, request=..., reply=...) at /home/wutao1/pegasus-release/src/server/pegasus_server_impl.cpp:932
#5  0x00000000005daf5d in bool dsn::replication::storage_serverlet<dsn::apps::rrdb_service>::register_async_rpc_handler<dsn::apps::multi_get_request, dsn::apps::multi_get_response>(dsn::task_code, char const*, void (*)(dsn::apps::rrdb_service*, dsn::apps::multi_get_request const&, dsn::rpc_replier<dsn::apps::multi_get_response>&))::{lambda(dsn::apps::rrdb_service*, dsn::message_ex*)#1}::operator()(dsn::apps::rrdb_service*, dsn::message_ex*) const () at /home/wutao1/pegasus-release/DSN_ROOT/include/dsn/dist/replication/storage_serverlet.h:29
#6  0x00000000005f92cc in operator() (__args#1=0x985c5d3c8, __args#0=0x14449c800, this=<optimized out>) at /home/wutao1/app/include/c++/4.8.2/functional:2464
#7  handle_request (request=0x985c5d3c8, this=0x14449c800) at /home/wutao1/pegasus-release/DSN_ROOT/include/dsn/dist/replication/storage_serverlet.h:80
#8  dsn::apps::rrdb_service::on_request (this=0x14449c800, request=0x985c5d3c8) at /home/wutao1/pegasus-release/src/include/rrdb/rrdb.server.h:17
#9  0x00007f85abe71e75 in dsn::replication::replica::on_client_read (this=0xe1f10f00, request=request@entry=0x985c5d3c8) at /home/wutao1/pegasus-release/rdsn/src/dist/replication/lib/replica.cpp:171
#10 0x00007f85abed036f in dsn::replication::replica_stub::on_client_read (this=0x1c12580, id=..., request=0x985c5d3c8) at /home/wutao1/pegasus-release/rdsn/src/dist/replication/lib/replica_stub.cpp:770
#11 0x00007f85abfd20e9 in dsn::task::exec_internal (this=this@entry=0x985c5d560) at /home/wutao1/pegasus-release/rdsn/src/core/core/task.cpp:180
#12 0x00007f85abfe62ad in dsn::task_worker::loop (this=0x20b2a50) at /home/wutao1/pegasus-release/rdsn/src/core/core/task_worker.cpp:211
#13 0x00007f85abfe6479 in dsn::task_worker::run_internal (this=0x20b2a50) at /home/wutao1/pegasus-release/rdsn/src/core/core/task_worker.cpp:191
#14 0x00007f85a898a600 in std::(anonymous namespace)::execute_native_thread_routine (__p=<optimized out>) at /home/qinzuoyan/git.xiaomi/pegasus/toolchain/objdir/../gcc-4.8.2/libstdc++-v3/src/c++11/thread.cc:84
#15 0x00007f85a94efdc5 in start_thread () from /lib64/libpthread.so.0
#16 0x00007f85a80f473d in clone () from /lib64/libc.so.6

Related code point: pegasus/src/server/pegasus_server_impl.cpp:932

    uint64_t time_used = dsn_now_ns() - start_time;
    if (is_multi_get_abnormal(time_used, size, iterate_count)) {
        dwarn_replica(
            "rocksdb abnormal multi_get from {}: hash_key = {}, "
            "start_sort_key = {} ({}), stop_sort_key = {} ({}), "
            "sort_key_filter_type = {}, sort_key_filter_pattern = {}, "
            "max_kv_count = {}, max_kv_size = {}, reverse = {}, "
            "result_count = {}, result_size = {}, iterate_count = {}, "
            "expire_count = {}, filter_count = {}, time_used = {} ns",
            reply.to_address().to_string(),
            ::pegasus::utils::c_escape_string(request.hash_key),
            ::pegasus::utils::c_escape_string(request.start_sortkey),
            request.start_inclusive ? "inclusive" : "exclusive",
            ::pegasus::utils::c_escape_string(request.stop_sortkey),
            request.stop_inclusive ? "inclusive" : "exclusive",
            ::dsn::apps::_filter_type_VALUES_TO_NAMES.find(request.sort_key_filter_type)->second,
            ::pegasus::utils::c_escape_string(request.sort_key_filter_pattern),
            request.max_kv_count,
            request.max_kv_size,
            request.reverse ? "true" : "false",
            count,
            size,
            iterate_count,
            expire_count,
            filter_count,
            time_used);
        _pfc_recent_abnormal_count->increment();
    }
  1. What version of Pegasus are you using?

1.12.0 Pegasus Server 1.12.0 (44a5293) release

@neverchanje neverchanje added the type/bug This issue reports a bug. label Nov 28, 2019
@levy5307
Copy link
Contributor

levy5307 commented Nov 29, 2019

1.通过gdb查看core文件,是在打印rocksdb abnormal multi_get from: hash_key = bikan:5c6ba3b20bc5cb2bb241aba4e3e87567, " "start_sort_key = \x00\x00\x01%n/nm (inclusive), stop_sort_key = (exclusive), sort_key_filter_type = FT_NO_FILTER, " "sort_key_filter_pattern = , max_kv_count = 8000, max_kv_size = 0, reverse = true, result_count = 2736, result_size = 395854, " "iterate_count = 2736, expire_count = 0, filter_count = 0, time_used = 6033235 ns的时候coredump的。

2.在onebox上尝试打印该log内容,同样coredump,并提示*** %n in writable segment detected ***。将%n去掉就不再coredump。

3.由于1.11.6是使用dwarn打印,1.12.0使用dwarn_replica打印,但是1.11.6一直没有问题,所以尝试使用
std::string sort_key = "\\x00\\x00\\x01%n/nm";
dwarn("sort_key = %s", sort_key.c_str())
发现是正常的,但是用dwarn_replcia和dwarn_f就会coredump

4.由于dwarn_replcia和dwarn_f相比dwarn使用了fmt进行了一次format操作,所以怀疑是fmt的问题,然后升级fmt到6.0.0,发现仍然存在同样的问题

5.分别将使用dwarn和dwarn_f/dwarn_replcia最终执行到simple_logger::dsn_logv接口里的fmt变量获取到,发现两者不同:
dwarn: sort_key = %s
dwarn_f/warn_replica: sort_key = \x00\x00\x01%n/nm

6.最后通过查看代码发现,由于dwarn_f/dwarn_replcia最开始做了一次format,将fmt从sort_key = %s转换成了sort_key = \x00\x00\x01%n/nm,然后底层的simple_logger写的时候,使用的vfprintf(_log, fmt, args),由于字符串里包含了%n,会尝试再做一次格式化,然而此时的可变参数列表是空的,所以出现coredump。所以最终确定是dwarn_f和dwarn_replica的接口实现问题

解决方案:重新实现dwan_f和dwarn_replica的实现,不要依赖于dwarn实现。重新实现底层的日志接口供dwan_f和dwarn_replica调用,该底层日志接口不要使用vfprintf(_log, fmt, args),而是使用fprintf(_log, "%s", fmt)(已经试用过,改成fpintf没问题)

@vagetablechicken
Copy link
Contributor

解决方案:重新实现dwan_f和dwarn_replica的实现,不要依赖于dwarn实现。

不依赖与dwarn实现是什么意思?

@levy5307
Copy link
Contributor

解决方案:重新实现dwan_f和dwarn_replica的实现,不要依赖于dwarn实现。

不依赖与dwarn实现是什么意思?

dwarn_f和dwarn_replica最后通过调用dwarn来实现的:#define dwarn_f(...) dwarn(fmt::format(__VA_ARGS__).c_str()),然后如上面所说出现了coredump,后面想不依赖这个接口实现了,重新实现一个。我再看看可不可以通过修改底层接口方式,改动小一点

@vagetablechicken
Copy link
Contributor

解决方案:重新实现dwan_f和dwarn_replica的实现,不要依赖于dwarn实现。

不依赖与dwarn实现是什么意思?

dwarn_f和dwarn_replica最后通过调用dwarn来实现的:#define dwarn_f(...) dwarn(fmt::format(__VA_ARGS__).c_str()),然后如上面所说出现了coredump,后面想不依赖这个接口实现了,重新实现一个。我再看看可不可以通过修改底层接口方式,改动小一点

dwarn_replica和dwarn_f都只是固定格式的dwarn。
简单来看的话,避免它们转义是不是就可以了?
如果考虑更仔细的话,dwarn是不是也有可能被转义字符搞挂?

@neverchanje
Copy link
Contributor Author

@vagetablechicken 是不是立伟没讲清楚,是这样:

dwarn("%s", "%n") -> vfprintf("%s", "%n");
dwarn_f("{}", "%n") -> dwarn("%n") -> vfprintf("%n")

上面那个跑的是没问题的,下面这个是有问题的。这个问题怎么改呢,一种改的方式是下面这个:

dwarn_f("{}", "%n") -> vfprintf("%s", fmt::format("{}", "%n"))

@levy5307
Copy link
Contributor

levy5307 commented Nov 29, 2019

解决方案:重新实现dwan_f和dwarn_replica的实现,不要依赖于dwarn实现。

不依赖与dwarn实现是什么意思?

dwarn_f和dwarn_replica最后通过调用dwarn来实现的:#define dwarn_f(...) dwarn(fmt::format(__VA_ARGS__).c_str()),然后如上面所说出现了coredump,后面想不依赖这个接口实现了,重新实现一个。我再看看可不可以通过修改底层接口方式,改动小一点

dwarn_replica和dwarn_f都只是固定格式的dwarn。
简单来看的话,避免它们转义是不是就可以了?
如果考虑更仔细的话,dwarn是不是也有可能被转义字符搞挂?

dwarn确实是有可能的,我试过了,如果dwarn("\x00\x00\x01%n/nm")的话,一样会coredump。printf的时候如果参数列表和fmt不匹配的话会提示,但是dwarn没有

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug This issue reports a bug.
Projects
None yet
Development

No branches or pull requests

3 participants