[Improvement](Meta) add a fault check for create_rowset to prevent meta corrupt #11355

Closed
wants to merge 1 commit into from

@BiteTheDDDDt BiteTheDDDDt commented Jul 30, 2022

Proposed changes

Sometimes the tablet meta becomes corrupt, and the backend then core dumps on startup.
This PR adds a check in create_rowset to prevent that core dump.

 0# doris::signal::(anonymous namespace)::FailureSignalHandler(int, siginfo_t*, void*) at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/common/signal_handler.h:420
 1# 0x000000318AE32920 in /lib64/libc.so.6
 2# std::__shared_ptr<doris::RowsetMeta, (__gnu_cxx::_Lock_policy)2>::__shared_ptr(std::__shared_ptr<doris::RowsetMeta, (__gnu_cxx::_Lock_policy)2> const&) at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr_base.h:1147
 3# std::shared_ptr<doris::RowsetMeta>::shared_ptr(std::shared_ptr<doris::RowsetMeta> const&) at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/shared_ptr.h:150
 4# doris::Tablet::rowset_meta_with_max_schema_version(std::vector<std::shared_ptr<doris::RowsetMeta>, std::allocator<std::shared_ptr<doris::RowsetMeta> > > const&) at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/olap/tablet.cpp:406
 5# doris::Tablet::tablet_schema() const at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/olap/tablet.cpp:1873
 6# doris::Tablet::create_rowset(std::shared_ptr<doris::RowsetMeta>, std::shared_ptr<doris::Rowset>*) at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/olap/tablet.cpp:1692
 7# doris::DataDir::load() at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/olap/data_dir.cpp:468
 8# doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3::operator()() const at /home/disk2/pxl/dev/baidu/bdg/doris/core/be/src/olap/storage_engine.cpp:172
 9# void std::__invoke_impl<void, doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3>(std::__invoke_other, doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3&&) at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:61
10# std::__invoke_result<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3>::type std::__invoke<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3>(doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3&&) at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/invoke.h:96
11# void std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:253
12# std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3> >::operator()() at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:260
13# std::thread::_State_impl<std::thread::_Invoker<std::tuple<doris::StorageEngine::load_data_dirs(std::vector<doris::DataDir*, std::allocator<doris::DataDir*> > const&)::$_3> > >::_M_run() at /home/disk2/pxl/dev/ldb_toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_thread.h:211
14# execute_native_thread_routine in /home/disk2/pxl/dev/baidu/bdg/doris/core/output/be/lib/doris_be
15# start_thread in /lib64/libpthread.so.0
16# clone in /lib64/libc.so.6
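
Frames 2 to 4 of the trace are consistent with copying a `shared_ptr` through an invalid iterator, which is what happens when the maximum element is taken from an empty list of rowset metas. The following is a minimal sketch of that failure mode and a guard against it. `RowsetMetaSketch` and `max_schema_version_meta` are hypothetical stand-ins for illustration, not the actual Doris code.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for doris::RowsetMeta; only the field needed here.
struct RowsetMetaSketch {
    int32_t schema_version = 0;
};
using RowsetMetaSketchPtr = std::shared_ptr<RowsetMetaSketch>;

// Returns the meta with the highest schema version, or nullptr when the
// list is empty. Without the empty() check, std::max_element returns end(),
// and dereferencing end() is undefined behavior.
RowsetMetaSketchPtr max_schema_version_meta(
        const std::vector<RowsetMetaSketchPtr>& metas) {
    if (metas.empty()) {
        return nullptr;
    }
    return *std::max_element(
            metas.begin(), metas.end(),
            [](const RowsetMetaSketchPtr& a, const RowsetMetaSketchPtr& b) {
                assert(a && b);  // assumption: the list holds no null entries
                return a->schema_version < b->schema_version;
            });
}
```

A caller would then need to treat a `nullptr` result as "no rowset meta available" instead of using it directly.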

Problem Summary:

Checklist(Required)

  1. Does it affect the original behavior:
    • Yes
    • No
    • I don't know
  2. Has unit tests been added:
    • Yes
    • No
    • No Need
  3. Has document been added or modified:
    • Yes
    • No
    • No Need
  4. Does it need to update dependencies:
    • Yes
    • No
  5. Are there any changes that cannot be rolled back:
    • Yes
    • No

@yiguolei

Hi, is there an environment where this can be reproduced? I want to check whether the bug still occurs with #11131 merged.

@yiguolei

#11131

@BiteTheDDDDt

I have not been able to find a way to reproduce it.
I ran a series of operations such as create table / create MV / drop table. During the process there were some core dumps caused by bugs (#11334), and afterwards I found that the BE could no longer be started.
I think we can add a check to skip tablets that have this kind of unknown corruption.

@Lchangliang

I ran into this problem too. The cause is that tablet.all_rs_meta() is empty, but I don't know why a tablet would sometimes have no rowset meta. In #11131 I fixed it by returning the tablet's original schema when tablet.all_rs_meta() is empty. Do you think a check is still needed?
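
The fallback described for #11131 can be sketched roughly as follows. This is only an illustration under simplifying assumptions: `SchemaSketch`, `RsMetaSketch`, and `TabletSketch` are hypothetical types, and the real `Tablet::tablet_schema()` may differ in detail.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-ins for the Doris schema and meta types.
struct SchemaSketch {
    int32_t version;
};
using SchemaSketchPtr = std::shared_ptr<const SchemaSketch>;

struct RsMetaSketch {
    SchemaSketchPtr schema;
};

struct TabletSketch {
    SchemaSketchPtr original_schema;                    // schema stored in the tablet meta
    std::vector<std::shared_ptr<RsMetaSketch>> rs_metas;  // plays the role of all_rs_meta()

    // Returns the newest rowset schema, or the tablet's original schema
    // when there is no rowset meta at all.
    SchemaSketchPtr tablet_schema() const {
        if (rs_metas.empty()) {
            assert(original_schema != nullptr);  // assumption: always set
            return original_schema;
        }
        auto it = std::max_element(
                rs_metas.begin(), rs_metas.end(),
                [](const std::shared_ptr<RsMetaSketch>& a,
                   const std::shared_ptr<RsMetaSketch>& b) {
                    return a->schema->version < b->schema->version;
                });
        return (*it)->schema;
    }
};
```

With this shape, an empty rowset-meta list no longer reaches the `max_element` dereference, so the lookup cannot crash on that path.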

@BiteTheDDDDt

I feel that when a tablet meta has an unknown error we should skip that tablet, though I'm not so sure.
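
The "skip it" alternative could look something like the sketch below: validate each tablet meta while loading and skip the ones that fail, instead of crashing the whole backend. All names here (`TabletMetaSketch`, `meta_looks_valid`, `load_tablets`) are hypothetical, and the validity test is reduced to a single flag purely for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// Hypothetical summary of a tablet meta as seen at load time.
struct TabletMetaSketch {
    int64_t tablet_id;
    bool has_rowset_meta;  // stand-in for "all_rs_meta() is not empty"
};

// Placeholder validity check; a real check would inspect the actual meta.
bool meta_looks_valid(const TabletMetaSketch& meta) {
    return meta.has_rowset_meta;
}

// Loads every tablet whose meta passes the check and records the ids of
// the ones that were skipped, so startup can continue past a corrupt meta.
std::vector<int64_t> load_tablets(const std::vector<TabletMetaSketch>& metas,
                                  std::vector<int64_t>* skipped) {
    assert(skipped != nullptr);
    std::vector<int64_t> loaded;
    for (const TabletMetaSketch& meta : metas) {
        if (!meta_looks_valid(meta)) {
            std::cerr << "skipping tablet " << meta.tablet_id
                      << ": meta failed validation\n";
            skipped->push_back(meta.tablet_id);
            continue;
        }
        loaded.push_back(meta.tablet_id);
    }
    return loaded;
}
```

The trade-off is that a skipped tablet is unavailable until it is repaired or re-cloned, whereas the schema fallback keeps it loaded; which behavior is preferable is the open question in this thread.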
