Possible deadlock during fetching part (only when I use a non-production feature "zero-copy replication") #37423

Closed
metahys opened this issue May 22, 2022 · 1 comment · Fixed by #37424
Labels
experimental feature Bug in the feature that should not be used in production

Comments

@metahys
Contributor

metahys commented May 22, 2022

There is a possible deadlock in Fetcher::fetchPart when a zero-copy fetch fails.

...

Thread 1612 (Thread 0x7f4437dff700 (LWP 692837)):
#0  0x00007f448499a42d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4484995dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f4484995c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000009ea4491 in pthread_mutex_lock ()
#4  0x0000000015c0fa86 in std::__1::mutex::lock() ()
#5  0x0000000009f7bd1d in DB::makePooledHTTPSession(Poco::URI const&, Poco::URI const&, DB::ConnectionTimeouts const&, unsigned long, bool) ()
#6  0x0000000009f7bc9c in DB::makePooledHTTPSession(Poco::URI const&, DB::ConnectionTimeouts const&, unsigned long, bool) ()
#7  0x0000000012bd9611 in DB::UpdatablePooledSession::UpdatablePooledSession(Poco::URI, DB::ConnectionTimeouts const&, unsigned long, unsigned long) ()
#8  0x0000000012bd9422 in std::__1::shared_ptr<DB::UpdatablePooledSession> std::__1::allocate_shared<DB::UpdatablePooledSession, std::__1::allocator<DB::UpdatablePooledSession>, Poco::URI&, DB::ConnectionTimeouts const&, unsigned long const&, unsigned long&, void>(std::__1::allocator<DB::UpdatablePooledSession> const&, Poco::URI&, DB::ConnectionTimeouts const&, unsigned long const&, unsigned long&) ()
#9  0x0000000012bd6b85 in DB::PooledReadWriteBufferFromHTTP::PooledReadWriteBufferFromHTTP(Poco::URI, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void (std::__1::basic_ostream<char, std::__1::char_traits<char> >&)>, DB::ConnectionTimeouts const&, Poco::Net::HTTPBasicCredentials const&, unsigned long, unsigned long, unsigned long) ()
#10 0x0000000012bce80d in DB::DataPartsExchange::Fetcher::fetchPart(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context const>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, DB::ConnectionTimeouts const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Throttler>, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::__1::shared_ptr<DB::IDisk>) ()
#11 0x0000000012a9d7f5 in std::__1::shared_ptr<DB::IMergeTreeDataPart> std::__1::__function::__policy_invoker<std::__1::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl<std::__1::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, std::__1::shared_ptr<zkutil::ZooKeeper>)::$_19, std::__1::shared_ptr<DB::IMergeTreeDataPart> ()> >(std::__1::__function::__policy_storage const*) ()
#12 0x0000000012a2796f in DB::StorageReplicatedMergeTree::fetchPart(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, std::__1::shared_ptr<zkutil::ZooKeeper>) ()
#13 0x0000000012a1c23d in DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&) ()
#14 0x0000000012da989d in DB::ReplicatedMergeMutateTaskBase::executeImpl() ()
#15 0x0000000012da8be2 in DB::ReplicatedMergeMutateTaskBase::executeStep() ()
#16 0x0000000012c3f20b in DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::routine(std::__1::shared_ptr<DB::TaskRuntimeData>) ()
#17 0x0000000012c3ff7b in DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction() ()
#18 0x0000000009ea768d in ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) ()
#19 0x0000000009ea8f50 in ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}>(void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}&&)::{lambda()#1}::operator()() ()
#20 0x0000000009ea5c0e in ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) ()
#21 0x0000000009ea80ae in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}> >(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}>) ()
#22 0x00007f4484993e25 in start_thread () from /lib64/libpthread.so.0
#23 0x00007f44846c035d in clone () from /lib64/libc.so.6

...

Thread 1607 (Thread 0x7f44355fa700 (LWP 692842)):
#0  0x00007f448499a42d in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007f4484995dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f4484995c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000000009ea4491 in pthread_mutex_lock ()
#4  0x0000000015c0fa86 in std::__1::mutex::lock() ()
#5  0x0000000009f7bd1d in DB::makePooledHTTPSession(Poco::URI const&, Poco::URI const&, DB::ConnectionTimeouts const&, unsigned long, bool) ()
#6  0x0000000009f7bc9c in DB::makePooledHTTPSession(Poco::URI const&, DB::ConnectionTimeouts const&, unsigned long, bool) ()
#7  0x0000000012bd9611 in DB::UpdatablePooledSession::UpdatablePooledSession(Poco::URI, DB::ConnectionTimeouts const&, unsigned long, unsigned long) ()
#8  0x0000000012bd9422 in std::__1::shared_ptr<DB::UpdatablePooledSession> std::__1::allocate_shared<DB::UpdatablePooledSession, std::__1::allocator<DB::UpdatablePooledSession>, Poco::URI&, DB::ConnectionTimeouts const&, unsigned long const&, unsigned long&, void>(std::__1::allocator<DB::UpdatablePooledSession> const&, Poco::URI&, DB::ConnectionTimeouts const&, unsigned long const&, unsigned long&) ()
#9  0x0000000012bd6b85 in DB::PooledReadWriteBufferFromHTTP::PooledReadWriteBufferFromHTTP(Poco::URI, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::function<void (std::__1::basic_ostream<char, std::__1::char_traits<char> >&)>, DB::ConnectionTimeouts const&, Poco::Net::HTTPBasicCredentials const&, unsigned long, unsigned long, unsigned long) ()
#10 0x0000000012bce80d in DB::DataPartsExchange::Fetcher::fetchPart(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context const>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, DB::ConnectionTimeouts const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Throttler>, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::__1::shared_ptr<DB::IDisk>) ()
#11 0x0000000012bd06d9 in DB::DataPartsExchange::Fetcher::fetchPart(std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::shared_ptr<DB::Context const>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, DB::ConnectionTimeouts const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::Throttler>, bool, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::optional<DB::CurrentlySubmergingEmergingTagger>*, bool, std::__1::shared_ptr<DB::IDisk>) ()
#12 0x0000000012a9d7f5 in std::__1::shared_ptr<DB::IMergeTreeDataPart> std::__1::__function::__policy_invoker<std::__1::shared_ptr<DB::IMergeTreeDataPart> ()>::__call_impl<std::__1::__function::__default_alloc_func<DB::StorageReplicatedMergeTree::fetchPart(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, std::__1::shared_ptr<zkutil::ZooKeeper>)::$_19, std::__1::shared_ptr<DB::IMergeTreeDataPart> ()> >(std::__1::__function::__policy_storage const*) ()
#13 0x0000000012a2796f in DB::StorageReplicatedMergeTree::fetchPart(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::shared_ptr<DB::StorageInMemoryMetadata const> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, std::__1::shared_ptr<zkutil::ZooKeeper>) ()
#14 0x0000000012a1c23d in DB::StorageReplicatedMergeTree::executeFetch(DB::ReplicatedMergeTreeLogEntry&) ()
#15 0x0000000012da989d in DB::ReplicatedMergeMutateTaskBase::executeImpl() ()
#16 0x0000000012da8be2 in DB::ReplicatedMergeMutateTaskBase::executeStep() ()
#17 0x0000000012c3f20b in DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::routine(std::__1::shared_ptr<DB::TaskRuntimeData>) ()
#18 0x0000000012c3ff7b in DB::MergeTreeBackgroundExecutor<DB::MergeMutateRuntimeQueue>::threadFunction() ()
#19 0x0000000009ea768d in ThreadPoolImpl<ThreadFromGlobalPool>::worker(std::__1::__list_iterator<ThreadFromGlobalPool, void*>) ()
#20 0x0000000009ea8f50 in ThreadFromGlobalPool::ThreadFromGlobalPool<void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}>(void ThreadPoolImpl<ThreadFromGlobalPool>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}&&)::{lambda()#1}::operator()() ()
#21 0x0000000009ea5c0e in ThreadPoolImpl<std::__1::thread>::worker(std::__1::__list_iterator<std::__1::thread, void*>) ()
#22 0x0000000009ea80ae in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}> >(std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void ThreadPoolImpl<std::__1::thread>::scheduleImpl<void>(std::__1::function<void ()>, int, std::__1::optional<unsigned long>)::{lambda()#2}>) ()
#23 0x00007f4484993e25 in start_thread () from /lib64/libpthread.so.0
#24 0x00007f44846c035d in clone () from /lib64/libc.so.6

...

As the stack traces show, fetchPart calls itself recursively when a zero-copy fetch fails (see frames #10 and #11 of thread 1607). Each call must obtain an HTTP session from a session pool that has a maximum size. If many threads call fetchPart simultaneously, the first-level calls may exhaust the pool. Every nested call then blocks waiting for a session that can only be released by a thread that is itself blocked, which eventually leads to a deadlock.
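The pattern described above can be reproduced in miniature. The following is a minimal C++ sketch and not ClickHouse code: `BoundedPool`, `simulate`, the pool size and the 200 ms timeout are all hypothetical names and values chosen for illustration. The timeout exists only so that the example terminates; the real code waits on a mutex without one. Releasing the first session before retrying is shown as one possible way to break the cycle, and it is not a claim about what the actual fix changed.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for a size-limited HTTP session pool.
class BoundedPool {
public:
    explicit BoundedPool(std::size_t capacity) : capacity_(capacity), free_(capacity) {}

    // Waits up to `timeout` for a free session; returns false on timeout.
    bool acquire(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!cv_.wait_for(lock, timeout, [this] { return free_ > 0; }))
            return false;
        --free_;
        return true;
    }

    void release() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            assert(free_ < capacity_);
            ++free_;
        }
        cv_.notify_one();
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    const std::size_t capacity_;
    std::size_t free_;
};

// Runs n workers against a pool of n sessions. Each worker takes one session
// (the outer fetch attempt) and then needs a second one (the nested fallback).
// Returns how many workers obtained the second session.
std::size_t simulate(std::size_t n, bool release_before_retry) {
    BoundedPool pool(n);
    std::atomic<std::size_t> holding_first{0};
    std::atomic<std::size_t> attempted{0};
    std::atomic<std::size_t> completed{0};
    const std::chrono::milliseconds timeout(200);  // assumption: keeps the demo finite

    // Simple spin barrier: forces the worst-case interleaving deterministically.
    auto wait_until_all = [n](const std::atomic<std::size_t> & counter) {
        while (counter.load() < n)
            std::this_thread::yield();
    };

    auto worker = [&] {
        const bool got_first = pool.acquire(timeout);  // outer call
        assert(got_first);
        (void)got_first;
        ++holding_first;
        wait_until_all(holding_first);  // the pool is now exhausted

        if (release_before_retry)
            pool.release();  // give the session back before the fallback

        if (pool.acquire(timeout)) {  // nested call
            ++completed;
            pool.release();
        }
        ++attempted;
        wait_until_all(attempted);  // nobody frees a session until all have tried

        if (!release_before_retry)
            pool.release();
    };

    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < n; ++i)
        threads.emplace_back(worker);
    for (auto & t : threads)
        t.join();
    return completed.load();
}
```

Under these assumptions, `simulate(4, false)` returns 0 because every nested acquire times out while all four sessions are held by the outer calls, whereas `simulate(4, true)` returns 4.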

@metahys metahys added the potential bug To be reviewed by developers and confirmed/rejected. label May 22, 2022
@alexey-milovidov alexey-milovidov added experimental feature Bug in the feature that should not be used in production and removed potential bug To be reviewed by developers and confirmed/rejected. labels May 22, 2022
@alexey-milovidov
Member

Changing to "bug experimental" because "zero-copy replication" is still considered not production-ready.

@alexey-milovidov alexey-milovidov changed the title Possible deadlock during fetching part Possible deadlock during fetching part (only when I use a non-production feature "zero-copy replication") May 24, 2022