[fix](scanner) Fix deadlock when scanner submit failed (apache#40495)
We hit a deadlock when submitting a scanner to the scheduler fails. pstack looks like:

```txt
Thread 2012 (Thread 0x7f87363fb700 (LWP 4179707) "Pipe_normal [wo"):
#0 0x00007f8b8f3dc82d in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007f8b8f3d5ad9 in pthread_mutex_lock () from /lib64/libpthread.so.0
#2 0x000055b20f333e7a in __gthread_mutex_lock (__mutex=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/x86_64-linux-gnu/c++/11/bits/gthr-default.h:749
#3 std::mutex::lock (this=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:100
#4 std::lock_guard<std::mutex>::lock_guard (__m=..., this=<optimized out>) at /mnt/disk1/hezhiqiang/toolchains/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/std_mutex.h:229
#5 doris::vectorized::ScannerContext::append_block_to_queue (this=<optimized out>, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:234
#6 0x000055b20f32c0f9 in doris::vectorized::ScannerScheduler::submit (this=<optimized out>, ctx=..., scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_scheduler.cpp:209
#7 0x000055b20f3338fc in doris::vectorized::ScannerContext::submit_scan_task (this=this@entry=0x7f8733d96010, scan_task=...) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:217
#8 0x000055b20f3346cd in doris::vectorized::ScannerContext::get_block_from_queue (this=0x7f8733d96010, state=<optimized out>, block=0x7f871f728de0, eos=0x7f871abce470, id=<optimized out>) at /mnt/disk1/hezhiqiang/doris/be/src/vec/exec/scan/scanner_context.cpp:290
#9 0x000055b214cb4f13 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block (this=<optimized out>, state=0x7f872f0eb400, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f871abce470) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.cpp:1292
#10 0x000055b2142b5772 in doris::pipeline::ScanOperatorX<doris::pipeline::OlapScanLocalState>::get_block_after_projects (this=0x80, state=0x0, block=0x7f8b8f3dc82d <__lll_lock_wait+29>, eos=0x7f8733d960a8) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/scan_operator.h:363
#11 0x000055b2142e7880 in doris::pipeline::StatefulOperatorX<doris::pipeline::StreamingAggLocalState>::get_block (this=0x7f871f9bee00, state=0x7f872f0eb400, block=0x7f8716d49060, eos=0x7f87363f4937) at /mnt/disk1/hezhiqiang/doris/be/src/pipeline/exec/operator.cpp:587
```

The deadlock happens along the following call path:

```cpp
Status ScannerContext::get_block_from_queue(...) {
    // _transfer_lock is taken here and held for the rest of the call.
    std::unique_lock l(_transfer_lock);
    ...
    if (scan_task->is_eos()) {
        ...
    } else {
        // resubmit current running scanner to read the next block
        submit_scan_task(scan_task);
    }
}

ScannerContext::submit_scan_task(std::shared_ptr<ScanTask> scan_task) {
    _scanner_scheduler->submit(shared_from_this(), scan_task);
}

void ScannerScheduler::submit(std::shared_ptr<ScannerContext> ctx,
                              std::shared_ptr<ScanTask> scan_task) {
    ...
    if (auto ret = sumbit_task(); !ret) {
        scan_task->set_status(Status::InternalError(
                "Failed to submit scanner to scanner pool reason:" + std::string(ret.msg()) +
                "|type:" + std::to_string(type)));
        // Calls back into the context while the caller still holds _transfer_lock.
        ctx->append_block_to_queue(scan_task);
        return;
    }
}

void ScannerContext::append_block_to_queue(std::shared_ptr<ScanTask> scan_task) {
    ...
    // Second lock of _transfer_lock on the same thread: blocks forever.
    std::lock_guard<std::mutex> l(_transfer_lock);
    ...
}
```

Since `std::mutex` in C++ is not reentrant, the scanner thread deadlocks with itself.

This PR fixes the problem by making `ScannerScheduler::submit` return a `Status` instead of appending the failed task to the `ScannerContext`. The caller decides whether to resubmit the scanner or abort the execution of the query.
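The shape of the fix can be illustrated with a minimal, self-contained sketch. It is not the Doris code: `Status`, `Scheduler`, `Context`, `accept_tasks`, and the integer task ids are hypothetical stand-ins chosen for illustration. The point it tries to show is that once `submit` only reports failure through its return value and never calls back into the context, the caller can handle the error while still holding its own lock.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>

// Hypothetical stand-in for doris::Status.
struct Status {
    bool is_ok;
    std::string msg;
    static Status OK() { return Status{true, ""}; }
    static Status InternalError(const std::string& m) { return Status{false, m}; }
    bool ok() const { return is_ok; }
};

// Hypothetical stand-in for ScannerScheduler.
struct Scheduler {
    bool accept_tasks = true;  // flip to false to simulate a full or shut-down pool

    // Reports failure via the return value; it never touches the caller's context,
    // so it cannot try to take a lock the caller already holds.
    Status submit(int task_id) {
        if (!accept_tasks) {
            return Status::InternalError("Failed to submit scanner " +
                                         std::to_string(task_id));
        }
        return Status::OK();
    }
};

// Hypothetical stand-in for ScannerContext.
class Context {
public:
    explicit Context(Scheduler* scheduler) : _scheduler(scheduler) {}

    Status get_block_from_queue(int task_id) {
        std::unique_lock<std::mutex> l(_transfer_lock);
        // Resubmit while holding the lock. The caller decides what to do on
        // failure; here we simply propagate the error and abort.
        Status st = _scheduler->submit(task_id);
        if (!st.ok()) {
            return st;
        }
        return Status::OK();
    }

    void append_block_to_queue(int task_id) {
        std::lock_guard<std::mutex> l(_transfer_lock);
        _queue.push(task_id);
    }

    std::size_t queued() {
        std::lock_guard<std::mutex> l(_transfer_lock);
        return _queue.size();
    }

private:
    Scheduler* _scheduler;
    std::mutex _transfer_lock;  // non-reentrant, as in the original report
    std::queue<int> _queue;
};
```

With `accept_tasks` set to `false`, `get_block_from_queue` returns a failed `Status` rather than blocking, and the lock is released when the function returns, so a later `append_block_to_queue` call proceeds normally.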