
TiFlash crash in AggregateFunctionSumAddImpl::add #2607

Closed
LittleFall opened this issue Aug 6, 2021 · 7 comments · Fixed by #2617
Labels: severity/major, type/bug (The issue is confirmed as a bug.)

LittleFall (Contributor) commented Aug 6, 2021

Log:

178484:[2021/08/06 16:26:09.693 +08:00] [ERROR] [<unknown>] ["BaseDaemon: ########################################"] [thread_id=71]
178485:[2021/08/06 16:26:09.693 +08:00] [ERROR] [<unknown>] ["BaseDaemon: (from thread 69) Received signal Segmentation fault: 11 (11)."] [thread_id=71]
178486:[2021/08/06 16:26:09.693 +08:00] [ERROR] [<unknown>] ["BaseDaemon: Address: NULL pointer."] [thread_id=71]
178487:[2021/08/06 16:26:09.693 +08:00] [ERROR] [<unknown>] ["BaseDaemon: Address not mapped to object."] [thread_id=71]
178488:[2021/08/06 16:26:09.706 +08:00] [ERROR] [<unknown>] ["BaseDaemon: 0. 0   tiflash                             0x0000000111694b90 _ZN2DB24AggregateFunctionSumDataINS_7DecimalInEEE5mergeERKS3_ + 32"] [thread_id=71]

To reproduce:

tiup playground nightly
tiup bench tpch prepare --tiflash --analyze
mysql -h127.0.0.1 -P4000 -Dtest -uroot -e"source ~/tmp/q18.sql"
LittleFall (Contributor, Author) commented Aug 6, 2021

In HashMap.h:200, this->size is 0, yet the loop still iterates:

    template <typename Func>
    void ALWAYS_INLINE mergeToViaEmplace(Self & that, Func && func)
    {
        for (auto it = this->begin(), end = this->end(); it != end; ++it)
        {
            typename Self::LookupResult res_it;
            bool inserted;
            that.emplace(Cell::getKey(it->getValue()), res_it, inserted, it.getHash());
            func(res_it->getMapped(), it->getMapped(), inserted);
        }
    }

LittleFall (Contributor, Author) commented:

The root cause is a race in scheduleThreadForNextBucket: the member variable max_scheduled_bucket_num is not atomic, but it is modified and read from multiple threads:

    void scheduleThreadForNextBucket()
    {
        ++max_scheduled_bucket_num;
        if (max_scheduled_bucket_num >= NUM_BUCKETS)
            return;

        parallel_merge_data->pool.schedule(
            ThreadFactory(true, "MergingAggregtd").newJob([this]{ thread(max_scheduled_bucket_num); }));
    }

As a result, ThreadFactory may create several jobs that observe the same max_scheduled_bucket_num, so they operate on the same HashMapTable object and race in mergeToViaEmplace above.

LittleFall (Contributor, Author) commented Aug 6, 2021

This bug was introduced in #2496, which added ThreadFactory and changed the call from std::bind to a lambda; the lambda reads max_scheduled_bucket_num asynchronously, at the time the job runs rather than when it is scheduled.

ClickHouse made a similar fix in ClickHouse/ClickHouse#21818.

LittleFall self-assigned this Aug 6, 2021
fuzhe1989 (Contributor) commented:
> the root cause is a race in function scheduleThreadForNextBucket … they will have the same HashMapTable object, and race in above function mergeToViaEmplace.

amazing finding ...

LittleFall (Contributor, Author) commented:

However, after applying #2617, another error appears: ERROR 1105 (HY000) at line 1 in file: '/Users/lf/tmp/q18.sql': CHECK failed: (index) < (current_size_):.

The deeper cause is in MPPTask::prepare: the exchangeSender has a types vector of size 0 but a partition_keys vector of size 5.

Needs further investigation.

lilinghai added the type/bug and severity/major labels Aug 9, 2021
LittleFall (Contributor, Author) commented:

> however, after I apply #2617, there is another problem … need furthur investigation.

After updating the TiDB version to be compatible with TiFlash, this error disappeared.

LittleFall (Contributor, Author) commented:

Full stack trace:

2021.08.13 16:58:58.474323 [ 17019 ] <Error> BaseDaemon: (from thread 17079) Received signal Segmentation fault (11).
2021.08.13 16:58:58.474332 [ 17019 ] <Error> BaseDaemon: Address: NULL pointer.
2021.08.13 16:58:58.474339 [ 17019 ] <Error> BaseDaemon: Access: read.
2021.08.13 16:58:58.474347 [ 17019 ] <Error> BaseDaemon: Address not mapped to object.
2021.08.13 16:58:58.482853 [ 17019 ] <Error> BaseDaemon: 0. bin/tiflash/tiflash(DB::AggregateFunctionCount::merge(char*, char const*, DB::Arena*) const+0x1) [0x6884921]
2021.08.13 16:58:58.482894 [ 17019 ] <Error> BaseDaemon: 1. bin/tiflash/tiflash(void DB::Aggregator::mergeDataImpl<DB::AggregationMethodSerialized<TwoLevelHashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, char*, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, TwoLevelHashTableGrower<8ul>, Allocator<true>, HashMapTable> >, HashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, char*, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, TwoLevelHashTableGrower<8ul>, Allocator<true> > >(HashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, char*, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, TwoLevelHashTableGrower<8ul>, Allocator<true> >&, HashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, char*, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, TwoLevelHashTableGrower<8ul>, Allocator<true> >&, DB::Arena*) const+0x270) [0x7f65180]
2021.08.13 16:58:58.482914 [ 17019 ] <Error> BaseDaemon: 2. bin/tiflash/tiflash(void DB::Aggregator::mergeBucketImpl<DB::AggregationMethodSerialized<TwoLevelHashMapTable<StringRef, HashMapCellWithSavedHash<StringRef, char*, DefaultHash<StringRef, void>, HashTableNoState>, DefaultHash<StringRef, void>, TwoLevelHashTableGrower<8ul>, Allocator<true>, HashMapTable> > >(std::vector<std::shared_ptr<DB::AggregatedDataVariants>, std::allocator<std::shared_ptr<DB::AggregatedDataVariants> > >&, int, DB::Arena*) const+0x9c) [0x7f653ac]
2021.08.13 16:58:58.482946 [ 17019 ] <Error> BaseDaemon: 3. bin/tiflash/tiflash(DB::MergingAndConvertingBlockInputStream::thread(int)+0xc9e) [0x7f91b6e]
2021.08.13 16:58:58.482957 [ 17019 ] <Error> BaseDaemon: 4. bin/tiflash/tiflash(ThreadPool::worker()+0x167) [0x8192297]
2021.08.13 16:58:58.482966 [ 17019 ] <Error> BaseDaemon: 5. bin/tiflash/tiflash() [0x8e371bf]
2021.08.13 16:58:58.482983 [ 17019 ] <Error> BaseDaemon: 6. /lib64/libpthread.so.0(+0x7dd5) [0x7fd7fd995dd5]
