-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: set 23.1.8 as MinSupportedVersion for sql-stats/mixed-version #123321
Conversation
@@ -203,7 +203,9 @@ func (s *sqlStatsRequestHelper) requestSQLStats( | |||
for _, addr := range adminUIAddrs { | |||
url := getCombinedStatementStatsURL(addr, statsType, requestedRange) | |||
statsResponse := &serverpb.StatementsResponse{} | |||
if err := s.client.GetJSON(ctx, url, statsResponse, httputil.IgnoreUnknownFields()); err != nil { | |||
if err := retry.ForDuration(10*time.Second, func() error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need the retry logic? What kinds of spurious failures do we expect here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly motivated from seeing it applied here:
return client.GetJSON(ctx, url, &details) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrm, but that change was done in 2018 (#29492), does it still apply today? Have we seen this test flake because the status server was not ready to handle requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay that seems pretty outdated. So far this test has failed twice in the nightly run at this step, but with different error messages:
-
Get "https://3.147.89.154:26258/_status/combinedstmts?fetch_mode.stats_type=0&start=0&end=100&limit=10": EOF
[ref] -
roachtestutil.addCookie: unable to extract sessionID: roachtestutil.GetSessionID: failed to authenticate: COMMAND_PROBLEM: exit status 1
[ref]
Do these point to an issue with the mixed version test or an issue within roachtest? Is there guidance on restricting the test to certain cloud providers? Running the test locally passed so these failures seemed sporadic to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For failure 2, I see: ERROR: column "user_id" does not exist
when trying to auth on 23.1
. Since I removed the version gate for that, I think the fix is for the mvt framework to also drop support for 23.1 on master?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fix is for the mvt framework to also drop support for 23.1 on master?
Ah my bad, I didn't realize this is why the PR had MaxUpgrades(2)
originally. I forgot the details on when the auth stuff works. I suggest passing the MinimumSupportedVersion("v23.1.0")
option to the test, which will make sure user functions will only run once the cluster is at least that version.
Do these point to an issue with the mixed version test or an issue within roachtest?
I think neither in this case? Failure 2 is a limitation in cockroach with authenticating in mixed-version state, as I understand it. It can be worked around by specifying the minimum version as I mentioned above.
Issue 1 actually seems to be a bug in the stats code -- n4 crashed with this error (only a few frames here for clarity):
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 a panic has occurred!
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 +runtime error: invalid memory address or nil pointer dereference
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 +(1) attached stack trace
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + -- stack trace:
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | runtime.gopanic
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | GOROOT/src/runtime/panic.go:884
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | runtime.panicmem
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | GOROOT/src/runtime/panic.go:260
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | runtime.sigpanic
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | GOROOT/src/runtime/signal_unix.go:835
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server.activityTablesHaveFullData
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server/combined_statement_stats.go:232
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server.getCombinedStatementStats
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server/combined_statement_stats.go:100
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server.(*statusServer).CombinedStatementStats
E240430 07:54:10.992587 21115 1@util/log/logcrash/crash_reporting.go:188 ⋮ [-] 506 + | github.com/cockroachdb/cockroach/pkg/server/combined_statement_stats.go:65
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Granted, the error message could be more explicit (and it could be if the monitor had noticed the process death soon enough, but the client saw the EOF first, and that was the error reported). A client seeing an EOF from a connection often means that the server crashed, FWIW.
Running the test locally passed so these failures seemed sporadic to me.
As a bit of context, there's a lot of randomization happening in these tests so passing locally doesn't mean that test failures are flakes or spurious necessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah gotcha, thanks, all of that was super helpful. I ended up setting the min supported version as 23.1.8 - the node crash is from a bug prior to that version (my bad for not checking n4 logs).
There is a bug in the sql stats api prior to 23.1.8, where passing a requested start time of 0 or nil results in a nil pointer dereference. Setting this min supported version also resolves some mixed version authentication limitaitons. Epic: none Fixes: cockroachdb#123278 Release note: None
TFTR! |
There is a bug in the sql stats api prior to 23.1.8, where
passing a requested start time of 0 or nil results in a
nil pointer dereference.
Setting this min supported version also resolves some mixed
version authentication limitations.
Epic: none
Fixes: #123278
Release note: None