server: TestDatabasesTablesV2 is flakey under testrace #87074

adityamaru · 2022-08-29T21:37:57Z

Recently TestDatabasesTablesV2 has been flakey under testrace - https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Ci_Tests&testNameId=-5148906532967248544&tab=testDetails. All the failures seem to be off the form:

Error:      	Received unexpected error:
        	            	Get "https://127.0.0.1:33103/api/v2/databases/testdb/tables/public.testtable/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestDatabasesTablesV2

Jira issue: CRDB-19138

The text was updated successfully, but these errors were encountered:

Refs: cockroachdb#87074 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

rafiss · 2022-08-29T21:47:07Z

This is not a SQL experience test. Moving to SQL Observability.

87075: server: skip TestDatabasesTablesV2 under race r=srosenberg a=adityamaru Refs: #87074 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None Co-authored-by: adityamaru <[email protected]>

matthewtodd · 2022-09-08T21:32:40Z

Before I understand anything here, on a gceworker, I can get ./dev test pkg/server --filter=TestDatabasesTablesV2 --stress --race --timeout=10m to pass by bumping the client timeout from 10 to 30 seconds:

--- a/pkg/rpc/tls.go
+++ b/pkg/rpc/tls.go
@@ -231,7 +231,7 @@ func (ctx *SecurityContext) GetUIServerTLSConfig() (*tls.Config, error) {
 // if needed. It uses the client TLS config.
 func (ctx *SecurityContext) GetHTTPClient() (http.Client, error) {
        ctx.lazy.httpClient.Do(func() {
-               ctx.lazy.httpClient.httpClient.Timeout = 10 * time.Second
+               ctx.lazy.httpClient.httpClient.Timeout = 30 * time.Second
                var transport http.Transport
                ctx.lazy.httpClient.httpClient.Transport = &transport
                transport.TLSClientConfig, ctx.lazy.httpClient.err = ctx.getUIClientTLSConfig()

I also notice that the failures, both historically and that I've observed in my own testing, have all been on the same endpoint, the databases/testdb/tables/public.testtable one.

Both of which lead me to believe that this endpoint could just be slow?

I'll dial down the timeout to see if I can find a reliable threshold (trying 20s now) and also start looking into the implementation to see if there's anything obvious there.

If not, options seem to be bumping the timeout, keeping the test skipped, or commenting out just the slow part.

matthewtodd · 2022-09-08T21:48:07Z

And, a clue, the first failures here happened in the development of #82617, which would corroborate the hypothesis that the endpoint just got slower. I'll look for the base commit there and try a git bisect to confirm.

matthewtodd · 2022-10-26T15:56:38Z

It's indeed the tableDetails endpoint that's now too slow for this test client's 10 second timeout.

While we could bump the timeout globally (unnecessarily lowering our signal for other slow endpoints), refactor to handle overriding the timeout on a request basis (less-than-straightforward with the http client library), or make a copy of the client with a higher timeout for this one request (current code would require dubious restructuring), I'm inclined to keep this test skipped until the underlying work in #84105 helps speed up the endpoint for real.

See also #90196.

106072: server: close net.Conn unconditionally after accepting r=rafiss,knz a=ecwall Informs #105448 Previously after the server accepted a connection, it was closed in multiple paths and in multiple layers after it was done being used. Now it is always closed in the same layer that accepted it after the serveConn callback returns. Release note: None 106238: server: fix crdb_internal.cluster_inflight_traces in shared process mode r=yuzefovich a=yuzefovich This commit fixes a panic that would previously occur when querying `crdb_internal.cluster_inflight_traces` virtual table when running in shared-process multi-tenant mode. In particular, the problem was that we tried to access node liveness which isn't available, and now we will fall back to the multi-tenant way of doing things (using the instances reader). Additionally, this commit extends the existing test to also run in shared-process multi-tenant config which serves as a regression test for this bug. There is no release note since it's not a user-visible bug. Fixes: #106182. Epic: CRDB-26691 Release note: None 106320: sql: unskip TestSchemaChangeGCJob r=chengxiong-ruan a=chengxiong-ruan Informs: #85876 fixes #60664 A few assumptions has been changed since the test was skipped. For example, the first user defined table descriptor ID and expected error messages. There was actually also bugs in the error assertion in the test for some reason, that is fixed by this PR too. Release note: None 106407: sql: add tests for CTAS, CMVAS with every SHOW statement r=chengxiong-ruan a=ecwall Fixes #105895 Adds tests for CREATE TABLE AS, CREATE MATERIALIZED VIEW AS sourcing from all SHOW statements. Release note: None 106420: server: unskip TestDatabasesTablesV2 r=THardy98 a=THardy98 Resolves: #87074 This test no longer seems flaky after testing (via `./dev test pkg/server --filter=TestDatabasesTablesV2 --stress --stress-args="-p=4" --race --timeout=30m` on gceworker). Release note: None Co-authored-by: Evan Wall <[email protected]> Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Chengxiong Ruan <[email protected]> Co-authored-by: Thomas Hardy <[email protected]>

Resolves: #87074 This test no longer seems flaky after testing (via `./dev test pkg/server --filter=TestDatabasesTablesV2 --stress --stress-args="-p=4" --race --timeout=30m` on gceworker). Release note: None

adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Aug 29, 2022

adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 29, 2022

server: skip TestDatabasesTablesV2 under race

5f76336

Refs: cockroachdb#87074 Reason: flaky test Generated by bin/skip-test. Release justification: non-production code changes Release note: None

adityamaru mentioned this issue Aug 29, 2022

server: skip TestDatabasesTablesV2 under race #87075

Merged

adityamaru added the skipped-test label Aug 29, 2022

rafiss added T-sql-observability and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Aug 29, 2022

maryliag assigned matthewtodd Sep 8, 2022

maryliag unassigned matthewtodd Feb 27, 2023

THardy98 self-assigned this Mar 3, 2023

maryliag added T-cluster-observability and removed T-sql-observability labels Jul 5, 2023

THardy98 mentioned this issue Jul 7, 2023

server: unskip TestDatabasesTablesV2 #106420

Merged

craig bot closed this as completed in 84e8dc7 Jul 7, 2023

blathers-crl bot mentioned this issue Jul 7, 2023

release-23.1: server: unskip TestDatabasesTablesV2 #106440

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: TestDatabasesTablesV2 is flakey under testrace #87074

server: TestDatabasesTablesV2 is flakey under testrace #87074

adityamaru commented Aug 29, 2022 •

edited by cockroach-jira-scripts

Loading

rafiss commented Aug 29, 2022

matthewtodd commented Sep 8, 2022

matthewtodd commented Sep 8, 2022

matthewtodd commented Oct 26, 2022

server: TestDatabasesTablesV2 is flakey under testrace #87074

server: TestDatabasesTablesV2 is flakey under testrace #87074

Comments

adityamaru commented Aug 29, 2022 • edited by cockroach-jira-scripts Loading

rafiss commented Aug 29, 2022

matthewtodd commented Sep 8, 2022

matthewtodd commented Sep 8, 2022

matthewtodd commented Oct 26, 2022

adityamaru commented Aug 29, 2022 •

edited by cockroach-jira-scripts

Loading