Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: follower-reads/mixed-version/single-region failed #119064

Closed
cockroach-teamcity opened this issue Feb 10, 2024 · 9 comments · Fixed by #119143
Closed

roachtest: follower-reads/mixed-version/single-region failed #119064

cockroach-teamcity opened this issue Feb 10, 2024 · 9 comments · Fixed by #119143
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team
Milestone

Comments

@cockroach-teamcity
Copy link
Member

cockroach-teamcity commented Feb 10, 2024

roachtest.follower-reads/mixed-version/single-region failed with artifacts on master @ b2e31876366324c2ebe5c2ad8bbd644997e90864:

(follower_reads.go:312).runFollowerReadsTest: failed to get follower read counts: roachtestutil.addCookie: unable to extract sessionID: roachtestutil.GetSessionID: failed to authenticate: COMMAND_PROBLEM: exit status 1
(mixedversion.go:586).Run: panic (stack trace above): t.Fatal() was called
test artifacts and logs in: /artifacts/follower-reads/mixed-version/single-region/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/kv-triage

This test on roachdash | Improve this report!

Jira issue: CRDB-36015

@cockroach-teamcity cockroach-teamcity added branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-kv KV Team labels Feb 10, 2024
@cockroach-teamcity cockroach-teamcity added this to the 24.1 milestone Feb 10, 2024
@DarrylWong
Copy link
Contributor

Looks like it fails trying to retrieve the sessionID, which was just added in #118504. Weird that it can't connect to any of the four nodes in the cluster. Let me dig into this more, it's most likely related to those changes.

@cockroach-teamcity
Copy link
Member Author

roachtest.follower-reads/mixed-version/single-region failed with artifacts on master @ b2e31876366324c2ebe5c2ad8bbd644997e90864:

(follower_reads.go:312).runFollowerReadsTest: failed to get follower read counts: roachtestutil.addCookie: unable to extract sessionID: roachtestutil.GetSessionID: failed to authenticate: COMMAND_PROBLEM: exit status 1
(mixedversion.go:586).Run: panic (stack trace above): t.Fatal() was called
test artifacts and logs in: /artifacts/follower-reads/mixed-version/single-region/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@erikgrinaker erikgrinaker added T-testeng TestEng Team and removed T-kv KV Team labels Feb 12, 2024
Copy link

blathers-crl bot commented Feb 12, 2024

cc @cockroachdb/test-eng

@DarrylWong
Copy link
Contributor

After some more digging, it looks like:

  • This failure only happens when we are upgrading cluster versions from 22.2 to 23.1 and we don't wait for all nodes to reach cluster version 23.1 first. Waiting for the cluster version to finalize works, and upgrading from any other two versions works fine.
  • The actual error is an EOF error from calling rows.Next(row) on the system.web_sessions table.
  • I notice that sql: add user_id column to system.web_sessions table #96896 was added in 23.1 which adds a user_id column.

Based on the above, my educated guess is that the cluster upgrades to 23.1 while we are in the middle of authenticating. The initial query for cluster version is 22.2 and makes auth-session believe there is no user_id column. But then when it goes to insert the values, the cluster version upgrades and the query fails as we didn't give it a user_id.

@renatolabs, if my understanding of the failure is correct, what is the preferred way to fix it? Do we just make the minimum supported version 23.1 for this test? Or maybe we think the loss in coverage isn't worth it and we should run this test on insecure mode?

Or maybe this is just a bug with auth-session login and it should be able to support this state?

@DarrylWong
Copy link
Contributor

DarrylWong commented Feb 12, 2024

Not sure if this proves my theory, but I just noticed the following happen after adding a 5 second sleep before authenticating:

  1. DefaultHTTPClient attempts to retrieve the session ID on node 1. auth-session login says the cluster version is 22.2 and sends an INSERT without user_id.
  2. The above fails but the DefaultHTTPClient logic retries it on node 2.
  3. This time, auth-session login says the cluster version is 23.1 and sends an INSERT with user_id.
  4. It works and the test passes.

@renatolabs
Copy link
Contributor

Ah, thanks for looking into this. I don't see a clean solution here 😞

Making the minimum supported version 23.1 would fix the issue in the short term, but we would have to face it again if: 1) the secure clusters work gets backported or 2) we add some other migration that affects this system table (or any other tables that this operation might use).

Brainstorming:

  • we could pipe mixed-version state all the way down to the function that creates the HTTP client, allowing us to write code to deal with this specific issue/upgrade path. That said, a lot of these functions seem to be pretty unaware they are running in mixed-version state because they are shared with many other non-upgrade roachtests.
  • DefaultHTTPClient itself could "recognize" the shape of this error (I assume the error says something about "missing user_id column") and retry. Or maybe the client could take a "retry function" that determines if an error is retryable (so we would scope this change just to tests that need it).

Neither are very elegant, so maybe making this test run in insecure mode is the most convenient / acceptable compromise. We should be bumping the min supported version on master to 23.1 soon anyway (as 22.2 will be EOLed in a few months1), at which point we can make the cluster secure again.

Footnotes

  1. https://www.cockroachlabs.com/docs/releases/release-support-policy

@cockroach-teamcity
Copy link
Member Author

roachtest.follower-reads/mixed-version/single-region failed with artifacts on master @ 814a375d4c0e79d875c42452725f05f6c27294e3:

(follower_reads.go:312).runFollowerReadsTest: failed to get follower read counts: roachtestutil.addCookie: unable to extract sessionID: roachtestutil.GetSessionID: failed to authenticate: COMMAND_PROBLEM: exit status 1
(mixedversion.go:592).Run: panic (stack trace above): t.Fatal() was called
test artifacts and logs in: /artifacts/follower-reads/mixed-version/single-region/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

@DarrylWong
Copy link
Contributor

I assume the error says something about "missing user_id column

It doesn't 😞 The error returned is connection lost and EOF, which occurs when auth-session login tries to read the non existent session ID.

cockroach/pkg/cli/auth.go

Lines 151 to 155 in cb332a8

insertSessionStmt = `
INSERT INTO system.web_sessions ("hashedSecret", username, "expiresAt", user_id)
VALUES ($1, $2, $3, (SELECT user_id FROM system.users WHERE username = $2))
RETURNING id
`

cockroach/pkg/cli/auth.go

Lines 166 to 168 in cb332a8

if err := rows.Next(row); err != nil {
return err
}

Not very helpful and I'm not sure why that INSERT doesn't return a error if it failed? To further confirm, I removed the username from the INSERT statement above, which makes all auth-session login calls fail with the same unhelpful connection lost and EOF error.

maybe making this test run in insecure mode is the most convenient / acceptable compromise

Sounds good to me. I'll put out a fix. A little unfortunate that this test has failed 3/4 times on master but never failed in my own testing 😅

@cockroach-teamcity
Copy link
Member Author

roachtest.follower-reads/mixed-version/single-region failed with artifacts on master @ 254dbd247fb8ed352a11439063b29f23a0767f28:

(follower_reads.go:312).runFollowerReadsTest: failed to get follower read counts: roachtestutil.addCookie: unable to extract sessionID: roachtestutil.GetSessionID: failed to authenticate: COMMAND_PROBLEM: exit status 1
(mixedversion.go:592).Run: panic (stack trace above): t.Fatal() was called
test artifacts and logs in: /artifacts/follower-reads/mixed-version/single-region/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=2
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=true
  • ROACHTEST_metamorphicBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

This test on roachdash | Improve this report!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
branch-master Failures and bugs on the master branch. C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-testeng TestEng Team
Projects
No open projects
Status: Closed
Status: Done
Development

Successfully merging a pull request may close this issue.

4 participants