Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: TestDatabasesTablesV2 is flakey under testrace #87074

Closed
adityamaru opened this issue Aug 29, 2022 · 4 comments · Fixed by #106420
Closed

server: TestDatabasesTablesV2 is flakey under testrace #87074

adityamaru opened this issue Aug 29, 2022 · 4 comments · Fixed by #106420
Assignees
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test

Comments

@adityamaru
Copy link
Contributor

adityamaru commented Aug 29, 2022

Recently TestDatabasesTablesV2 has been flakey under testrace - https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Ci_Tests&testNameId=-5148906532967248544&tab=testDetails. All the failures seem to be off the form:

Error:      	Received unexpected error:
        	            	Get "https://127.0.0.1:33103/api/v2/databases/testdb/tables/public.testtable/": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
        	Test:       	TestDatabasesTablesV2

Jira issue: CRDB-19138

@adityamaru adityamaru added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Aug 29, 2022
adityamaru added a commit to adityamaru/cockroach that referenced this issue Aug 29, 2022
Refs: cockroachdb#87074

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None
@rafiss
Copy link
Collaborator

rafiss commented Aug 29, 2022

This is not a SQL experience test. Moving to SQL Observability.

@rafiss rafiss added T-sql-observability and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Aug 29, 2022
craig bot pushed a commit that referenced this issue Aug 30, 2022
87075: server: skip TestDatabasesTablesV2 under race r=srosenberg a=adityamaru

Refs: #87074

Reason: flaky test

Generated by bin/skip-test.

Release justification: non-production code changes

Release note: None

Co-authored-by: adityamaru <[email protected]>
@matthewtodd
Copy link
Contributor

Before I understand anything here, on a gceworker, I can get ./dev test pkg/server --filter=TestDatabasesTablesV2 --stress --race --timeout=10m to pass by bumping the client timeout from 10 to 30 seconds:

--- a/pkg/rpc/tls.go
+++ b/pkg/rpc/tls.go
@@ -231,7 +231,7 @@ func (ctx *SecurityContext) GetUIServerTLSConfig() (*tls.Config, error) {
 // if needed. It uses the client TLS config.
 func (ctx *SecurityContext) GetHTTPClient() (http.Client, error) {
        ctx.lazy.httpClient.Do(func() {
-               ctx.lazy.httpClient.httpClient.Timeout = 10 * time.Second
+               ctx.lazy.httpClient.httpClient.Timeout = 30 * time.Second
                var transport http.Transport
                ctx.lazy.httpClient.httpClient.Transport = &transport
                transport.TLSClientConfig, ctx.lazy.httpClient.err = ctx.getUIClientTLSConfig()

I also notice that the failures, both historically and that I've observed in my own testing, have all been on the same endpoint, the databases/testdb/tables/public.testtable one.

Both of which lead me to believe that this endpoint could just be slow?

I'll dial down the timeout to see if I can find a reliable threshold (trying 20s now) and also start looking into the implementation to see if there's anything obvious there.

If not, options seem to be bumping the timeout, keeping the test skipped, or commenting out just the slow part.

@matthewtodd
Copy link
Contributor

And, a clue, the first failures here happened in the development of #82617, which would corroborate the hypothesis that the endpoint just got slower. I'll look for the base commit there and try a git bisect to confirm.

@matthewtodd
Copy link
Contributor

It's indeed the tableDetails endpoint that's now too slow for this test client's 10 second timeout.

While we could bump the timeout globally (unnecessarily lowering our signal for other slow endpoints), refactor to handle overriding the timeout on a request basis (less-than-straightforward with the http client library), or make a copy of the client with a higher timeout for this one request (current code would require dubious restructuring), I'm inclined to keep this test skipped until the underlying work in #84105 helps speed up the endpoint for real.

See also #90196.

@THardy98 THardy98 self-assigned this Mar 3, 2023
craig bot pushed a commit that referenced this issue Jul 7, 2023
106072: server: close net.Conn unconditionally after accepting r=rafiss,knz a=ecwall

Informs #105448

Previously after the server accepted a connection, it was closed in multiple paths and in multiple layers after it was done being used. Now it is always closed in the same layer that accepted it after the serveConn callback returns.

Release note: None

106238: server: fix crdb_internal.cluster_inflight_traces in shared process mode r=yuzefovich a=yuzefovich

This commit fixes a panic that would previously occur when querying `crdb_internal.cluster_inflight_traces` virtual table when running in shared-process multi-tenant mode. In particular, the problem was that we tried to access node liveness which isn't available, and now we will fall back to the multi-tenant way of doing things (using the instances reader). Additionally, this commit extends the existing test to also run in shared-process multi-tenant config which serves as a regression test for this bug. There is no release note since it's not a user-visible bug.

Fixes: #106182.
Epic: CRDB-26691

Release note: None

106320: sql: unskip TestSchemaChangeGCJob r=chengxiong-ruan a=chengxiong-ruan

Informs: #85876
fixes #60664

A few assumptions has been changed since the test was skipped. For example, the first user defined table descriptor ID and expected error messages. There was actually also bugs in the error assertion in the test for some reason, that is fixed by this PR too.

Release note: None

106407: sql: add tests for CTAS, CMVAS with every SHOW statement r=chengxiong-ruan a=ecwall

Fixes #105895

Adds tests for CREATE TABLE AS, CREATE MATERIALIZED VIEW AS sourcing from
all SHOW statements.

Release note: None


106420: server: unskip TestDatabasesTablesV2 r=THardy98 a=THardy98

Resolves: #87074

This test no longer seems flaky after testing (via `./dev test pkg/server --filter=TestDatabasesTablesV2 --stress --stress-args="-p=4" --race --timeout=30m` on gceworker).

Release note: None

Co-authored-by: Evan Wall <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Chengxiong Ruan <[email protected]>
Co-authored-by: Thomas Hardy <[email protected]>
@craig craig bot closed this as completed in 84e8dc7 Jul 7, 2023
blathers-crl bot pushed a commit that referenced this issue Jul 7, 2023
Resolves: #87074

This test no longer seems flaky after testing (via `./dev test
pkg/server --filter=TestDatabasesTablesV2 --stress --stress-args="-p=4"
--race --timeout=30m` on gceworker).

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. skipped-test
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants