roachtest: admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed #96543
Comments
These are likely related to the overload going on in this test. I see them in the node logs too:
It's curious that connections aren't stable under high CPU load. I wish there were more information in the logs. But, it looks like this is not a new failure mode, as the test addresses it: cockroach/pkg/cmd/roachtest/tests/admission_control_tpcc_overload.go Lines 96 to 111 in c68e94a
The retrying seems to not have been sufficient in this case. Ideally we would look into making gRPC more resilient here or at least produce better errors. The error we're getting is from here Lines 1813 to 1845 in 79e3bd7
meaning that we dialed successfully, but then the connection broke and we're preventing gRPC from re-dialing.
While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally redialing a connection internally. This improved logging and the test makes it more obvious that things are working as intended. Touches cockroachdb#96543. Epic: none Release note: None
I ran the test[^1] and it passed, so hopefully this isn't obviously breaking anything. See cockroachdb#96543. [^1]: `GCE_PROJECT=andrei-jepsen ./pkg/cmd/roachtest/roachstress.sh -c 1 -u admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 -- tag:weekly` Epic: none Release note: None
96768: server: remove stale comment r=andreimatei a=andreimatei
Release note: None
Epic: None

96781: rpc: improve detection of onlyOnceDialer redials r=erikgrinaker a=tbg
While looking into cockroachdb#96543, I wasn't 100% sure we weren't accidentally redialing a connection internally. This improved logging and the test makes it more obvious that things are working as intended.
Touches cockroachdb#96543.
Epic: none
Release note: None

96782: roachtest: verbose gRPC logging for admission-control/tpcc-olap r=irfansharif a=tbg
Closes cockroachdb#96543. Next time we'll know more.
Epic: none
Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
roachtest.admission-control/tpcc-olap/nodes=3/cpu=8/w=50/c=96 failed with artifacts on master @ 5fbcd8a8deac0205c7df38e340c1eb9692854383:
Parameters:
ROACHTEST_cloud=gce
ROACHTEST_cpu=8
ROACHTEST_encrypted=true
ROACHTEST_fs=ext4
ROACHTEST_localSSD=true
ROACHTEST_ssd=0
Help
See: roachtest README
See: How To Investigate (internal)
This test on roachdash | Improve this report!
Jira issue: CRDB-24178