[DocDB] Some client connections hang after master leader is restarted #13659
Labels: 2.14 Backport Required, area/docdb, kind/bug, priority/medium
Comments
rthallamko3 added the area/docdb and status/awaiting-triage labels on Aug 17, 2022.
yugabyte-ci added the kind/bug and priority/medium labels on Aug 17, 2022.
rthallamko3 changed the title from "[DocDB] Client connections hang after master leader is restarted" to "[DocDB] Some client connections hang after master leader is restarted" on Aug 17, 2022.
samiahmedsiddiqui pushed a commit to samiahmedsiddiqui/yugabyte-db that referenced this issue on Aug 19, 2022:
Summary: YBTransaction starts by looking up the status tablet. If this lookup fails, the transaction is marked as failed. Prepare lacks a failure-state check, so calling Prepare after the transaction has failed adds the callback to the waiters list, but no further action is taken and the callback is never notified. The side effect is that master leader restarts can cause prepared-statement connections to hang. This diff fixes the issue by checking the failure status in Prepare.

Test Plan: Launch an AWS cluster and run the following workload against it:
java -jar yb-sample-apps.jar --workload SqlDataLoad --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8 --nodes $NODES
Wait 10 minutes, then stop the node that runs the master leader using the AWS console. Without the fix, there will be stuck queries.

Reviewers: esheng, rthallam
Reviewed By: esheng, rthallam
Subscribers: rthallam, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D18994
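To illustrate the failure pattern described in the summary, here is a minimal, hypothetical C++ sketch (not the actual YBTransaction code; Status, TransactionSketch, and MarkFailed are stand-ins invented for this example). Without the failure check, a callback passed to Prepare after the status-tablet lookup has failed is parked in the waiters list and never notified; with the check, it is notified immediately with the recorded error.

```cpp
// Hedged sketch, NOT the actual YugabyteDB implementation: it only shows the
// general pattern of checking a recorded failure before parking a callback.
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for yb::Status.
struct Status {
  bool ok;
  std::string message;
};

class TransactionSketch {
 public:
  // Invoked when the status-tablet lookup fails: record the failure and
  // notify callbacks that were already parked.
  void MarkFailed(Status status) {
    std::vector<std::function<void(const Status&)>> waiters;
    Status failure_copy;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      failure_ = std::move(status);
      failure_copy = failure_;
      waiters.swap(waiters_);
    }
    for (auto& waiter : waiters) {
      waiter(failure_copy);
    }
  }

  // Prepare registers a callback to run once the transaction is ready.
  void Prepare(std::function<void(const Status&)> callback) {
    Status failure_copy;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      if (failure_.ok) {
        // No failure recorded: park the callback until the transaction
        // becomes ready (readiness handling is omitted in this sketch).
        waiters_.push_back(std::move(callback));
        return;
      }
      failure_copy = failure_;
    }
    // The added check: a transaction whose lookup already failed notifies
    // the callback right away instead of leaving it in waiters_ forever.
    callback(failure_copy);
  }

 private:
  std::mutex mutex_;
  Status failure_{true, ""};  // ok == true means no failure recorded yet.
  std::vector<std::function<void(const Status&)>> waiters_;
};

int main() {
  TransactionSketch txn;
  txn.MarkFailed({false, "status tablet lookup failed"});
  // With the check in Prepare, this callback fires immediately; without it,
  // the caller would wait forever, which is the reported hang.
  txn.Prepare([](const Status& s) {
    std::cout << "prepare callback notified: " << s.message << std::endl;
  });
  return 0;
}
```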
spolitov added a commit that referenced this issue on Aug 19, 2022:
Summary and Test Plan: identical to the commit above. Original commit: 11cdcc6/D18994.

Reviewers: esheng, rthallam
Reviewed By: rthallam
Subscribers: rthallam, ybase
Differential Revision: https://phabricator.dev.yugabyte.com/D19043
spolitov added a commit that referenced this issue on Aug 21, 2022:
Summary and Test Plan: identical to the commit above. Original commit: 11cdcc6/D18994.

Reviewers: esheng, rthallam
Reviewed By: rthallam
Subscribers: ybase, rthallam
Differential Revision: https://phabricator.dev.yugabyte.com/D19064
Seems like the 2.12 base backport also landed 0a9521a, so good to close this.
Jira Link: DB-3224
Description
Steps to repro (from Stepan):
1. Run a 20-node cluster with RF=5 (this can likely be reproduced with a smaller RF and fewer nodes).
2. Start the sample-apps load: plain inserts into a SQL table with 20 columns, batch_size=700:
java -jar yb-sample-apps-1.8.15-b3.jar --workload SqlDataLoad --num_writes -1 --num_threads_write 5 --num_threads_read 0 --num_reads 0 --num_unique_keys 100000000000 --batch_size 774 --num_value_columns 8
3. After 10 minutes, stop the AWS instance manually in the AWS console.
4. Wait several minutes, then start it back up.
5. Go to the Live Queries tab on the portal and observe the stuck query.
Here is a running universe with CDC configured (the connector is killed right now, load is still running): http://10.9.66.35/universes/45a79116-0925-4e80-ac2a-f43c9ef254a0
Here is a running universe without CDC: http://10.9.66.35/universes/38479ee9-3a1f-45fb-af1f-432be47fdf75
Creds: [email protected] Password#123
The backtrace of the hang:
Thread 1 (LWP 23845):
#0 syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1 0x00007fc045ac6748 in std::__atomic_futex_unsigned_base::_M_futex_wait_until(unsigned int*, unsigned int, bool, std::chrono::duration<long, std::ratio<1l, 1l> >, std::chrono::duration<long, std::ratio<1l, 1000000000l> >) ()
from /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/linuxbrew/lib/libstdc++.so.6
#2 0x00007fc046265d42 in _M_load_and_test_until (__ns=..., __s=..., __has_timeout=, __mo=, __equal=, __operand=, __assumed=, this=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:104
#3 _M_load_and_test (__mo=, __equal=, __operand=, __assumed=, this=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:122
#4 _M_load_when_equal (__mo=std::memory_order_acquire, __val=1, this=0x2bf3c10) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/bits/atomic_futex.h:162
#5 wait (this=0x2bf3c00) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:322
#6 _M_get_result (this=0x7fff096b71d0) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:681
#7 get (this=0x7fff096b71d0) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/future:760
#8 yb::pggate::PgSession::FlushOperations (this=0x2422c40, ops=..., transactional=...) at ../../src/yb/yql/pggate/pg_session.cc:987
#9 0x00007fc046266209 in operator()<std::vector<yb::pggate::BufferableOperation>, yb::StronglyTypedBool<yb::pggate::IsTransactionalSession_Tag> > (txn=...,
ops=, __closure=) at ../../src/yb/yql/pggate/pg_session.cc:870
#10 std::_Function_handler<yb::Status(std::vector<yb::pggate::BufferableOperation, std::allocator<yb::pggate::BufferableOperation> >, yb::StronglyTypedBool<yb::pggate::IsTransactionalSession_Tag>), yb::pggate::PgSession::FlushBufferedOperations()::<lambda(auto:6, auto:7)> >::_M_invoke(const std::_Any_data &, <unknown type in /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/lib/yb/libyb_pggate.so, CU 0x39f473, DIE 0x47889b>, <unknown type in /home/yugabyte/yb-software/yugabyte-2.12.9.1-b1-centos-x86_64/lib/yb/libyb_pggate.so, CU 0x39f473, DIE 0x4788a0>) (__functor=..., __args#0=, __args#1=)
at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:1857
#11 0x00007fc0462621ba in operator() (__args#1=..., __args#0=..., this=0x7fff096b7370) at /tmp/core/yugabyte-2.12.9.1/linuxbrew-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/Cellar/gcc/5.5.0_4/include/c++/5.5.0/functional:2267
#12 yb::pggate::PgSession::FlushBufferedOperationsImpl(std::function<yb::Status (std::vector<yb::pggate::BufferableOperation, std::allocator<yb::pggate::BufferableOperation> >, yb::StronglyTypedBool<yb::pggate::IsTransactionalSession_Tag>)> const&) (
this=0x2422c40, flusher=...) at ../../src/yb/yql/pggate/pg_session.cc:898
#13 0x00007fc046262454 in yb::pggate::PgSession::FlushBufferedOperations (this=) at ../../src/yb/yql/pggate/pg_session.cc:870
#14 0x00007fc0462624c1 in yb::pggate::PgSession::StopOperationsBuffering (this=) at ../../src/yb/yql/pggate/pg_session.cc:860
#15 0x00007fc04624b218 in yb::pggate::PgApiImpl::StopOperationsBuffering (this=) at ../../src/yb/yql/pggate/pggate.cc:1078
#16 0x00007fc04623e63c in yb::pggate::YBCPgStopOperationsBuffering () at ../../src/yb/yql/pggate/ybc_pggate.cc:652
#17 0x0000000000a3b7d9 in YBEndOperationsBuffering () at ../../../../../../../src/postgres/src/backend/utils/misc/pg_yb_utils.c:1409
#18 0x00000000006d03c9 in standard_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../../src/postgres/src/backend/executor/execMain.c:447
#19 0x00007fc03a589875 in pgss_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../src/postgres/contrib/pg_stat_statements/pg_stat_statements.c:970
#20 0x00007fc03a3822b3 in ybpgm_ExecutorFinish (queryDesc=0x34fdc38) at ../../../../../src/postgres/contrib/yb_pg_metrics/yb_pg_metrics.c:519
#21 0x0000000000888550 in ProcessQuery (isSingleRowModifyTxn=, completionTag=, dest=, queryEnv=, params=, sourceText=, plan=)
at ../../../../../../src/postgres/src/backend/tcop/pquery.c:217
#22 PortalRunMulti (portal=portal@entry=0x26e4118, isTopLevel=isTopLevel@entry=true, setHoldSnapshot=setHoldSnapshot@entry=false, dest=0xed2760 , dest@entry=0x25dc558, altdest=0xed2760 , altdest@entry=0x25dc558,
completionTag=completionTag@entry=0x7fff096b78c0 "INSERT 0 1") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:1331
#23 0x00000000008893f2 in PortalRun (portal=portal@entry=0x26e4118, count=count@entry=1, isTopLevel=isTopLevel@entry=true, run_once=, dest=dest@entry=0x25dc558, altdest=altdest@entry=0x25dc558,
completionTag=0x7fff096b78c0 "INSERT 0 1") at ../../../../../../src/postgres/src/backend/tcop/pquery.c:811
#24 0x0000000000882a3c in exec_execute_message (max_rows=1, portal_name=0x25dc118 "") at ../../../../../../src/postgres/src/backend/tcop/postgres.c:2066
#25 yb_exec_execute_message_impl (raw_ctx=raw_ctx@entry=0x7fff096b7c10) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4564
#26 0x0000000000881875 in yb_exec_query_wrapper (exec_context=exec_context@entry=0x25dc000, restart_data=restart_data@entry=0x25dc528, functor=functor@entry=0x882760 <yb_exec_execute_message_impl>,
functor_context=functor_context@entry=0x7fff096b7c10) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4517
#27 0x0000000000886536 in yb_exec_execute_message (exec_context=0x25dc000, restart_data=0x25dc528, max_rows=, portal_name=) at ../../../../../../src/postgres/src/backend/tcop/postgres.c:4581
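Frames #0 through #8 of the backtrace show PgSession::FlushOperations blocking in std::future::get(). Below is a tiny sketch of that wait pattern, assuming nothing about the real pggate code beyond what the backtrace shows (the names flush_done and result are invented for this example): when the completion callback that would fulfill the promise is never invoked, which is effectively what the unnotified Prepare callback amounts to, the get() call blocks indefinitely and the query appears stuck.

```cpp
// Hedged sketch, not the actual pggate code: why a dropped completion
// callback turns into a hung connection.
#include <chrono>
#include <future>
#include <iostream>

int main() {
  std::promise<void> flush_done;                    // Fulfilled by a callback.
  std::future<void> result = flush_done.get_future();

  // In the bug scenario the callback is never scheduled, so nothing ever
  // calls flush_done.set_value(). A plain result.get() would block forever;
  // wait_for() is used here only so the sketch terminates.
  auto state = result.wait_for(std::chrono::milliseconds(100));
  std::cout << (state == std::future_status::timeout
                    ? "still waiting (this is the hang)"
                    : "completed")
            << std::endl;
  return 0;
}
```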