release: 20.1.0-beta.3 #45994

Closed
23 of 24 tasks
nathanstilwell opened this issue Mar 11, 2020 · 17 comments
@nathanstilwell
Contributor

nathanstilwell commented Mar 11, 2020

Candidate SHA: 62783b33de905d516f72146fcc4234a27dc8638e
fcd74cd6ebdc64980b13e36b5be1736ed6938a18
Deployment status: v20.1.0-beta.2 / v20.1.0-beta.3
Nightly Suite: Release Qualification / Nightly Suite

Release process checklist

Prep date: 2020-03-18

Release Qualification

One day after prep date:

Release date: 2020-03-25

@nathanstilwell nathanstilwell self-assigned this Mar 11, 2020
@danhhz
Contributor

danhhz commented Mar 16, 2020

As mentioned in my email to releases@, we're going to wait until Wednesday and then cut a new sha from HEAD for this 20.1-beta.3 release. See that email for more details.

@danhhz
Contributor

danhhz commented Mar 20, 2020

New sha is 62783b3.

Checked in on the roachprod clusters and there's slightly elevated p99 and cpu usage. The number of Batches is also still up at 9.2k vs 11k. Nathan has signed off on this as close enough that it should no longer block the beta. (Looks like the beta.3 cluster is slightly more imbalanced so the p99 may be a result of that.)

@danhhz
Contributor

danhhz commented Mar 20, 2020

Roachtest failures

Note: This is currently a partial list. The build got kicked off late yesterday so it didn't quite finish overnight. I'll update this list with the final set of failures once it's done.

Edit (release triage meeting): All but two were determined to be not beta-blockers. Those two have been fixed and qualification will continue with those fixes cherrypicked.

CDC

  • cdc/tpcc-1000/rangefeed=true (beta-blocker. fixed.)

SQL Exec

  • tpchvec/disk
  • alterpk-tpcc (release-triage meeting: already fixed but not a beta blocker, leaving as-is).

Bulk IO

  • backupTPCC (beta-blocker. fixed.)

KV

  • scaledata/distributed_semaphore/nodes=3
  • scaledata/distributed_semaphore/nodes=6
  • scaledata/filesystem_simulator/nodes=3
  • scaledata/filesystem_simulator/nodes=6
  • scaledata/jobcoordinator/nodes=3
  • scaledata/jobcoordinator/nodes=6

App Dev

  • gopg
  • hibernate
  • pgjdbc
  • sqlalchemy

@dt
Member

dt commented Mar 20, 2020

backupTPCC looks bad, certainly release-blocker bad and I think probably a beta release blocker too. If we can verify it is TBI, and the other issues get sign-off, I'd be fine cherrypicking a revert of the TBI flip on top of the existing SHA, in which case it would not need to restart qualification (since that code isn't exercised in qualification yet). Tracking over on #46350 (comment)

@danhhz
Contributor

danhhz commented Mar 20, 2020

In the cdc/tpcc-1000/rangefeed=true test a node failed with the following. @nvanbenschoten is this you?

F200320 08:16:35.719895 257 kv/kvserver/concurrency/concurrency_manager.go:308  [n1,s1,r481/1:/Table/59/1/34{7/8/-…-9/6/-…}] caller violated contract
goroutine 257 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.getStacks(0x74b0401, 0xed60672e3, 0x0, 0x774589)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/get_stacks.go:25 +0xb8
github.com/cockroachdb/cockroach/pkg/util/log.(*loggerT).outputLogEntry(0x74ad0a0, 0xc000000004, 0x6a1285c, 0x2e, 0x134, 0xc02f46a360, 0x49)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog.go:212 +0xa0c
github.com/cockroachdb/cockroach/pkg/util/log.addStructured(0x4b83040, 0xc029bced20, 0x4, 0x2, 0x0, 0x0, 0xc0098eaa30, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/structured.go:66 +0x2c9
github.com/cockroachdb/cockroach/pkg/util/log.logDepth(0x4b83040, 0xc029bced20, 0x1, 0x4, 0x0, 0x0, 0xc0098eaa30, 0x1, 0x1)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:44 +0x8c
github.com/cockroachdb/cockroach/pkg/util/log.Fatal(...)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/log/log.go:164
github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency.(*managerImpl).OnLockAcquired(0xc004bd2380, 0x4b83040, 0xc029bced20, 0xc058668a50)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/concurrency/concurrency_manager.go:308 +0x116
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleReadWriteLocalEvalResult(0xc005f55500, 0x4b83040, 0xc029bced20, 0x0, 0x0, 0x0, 0x0, 0xc058668000, 0x3d, 0x45, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_proposal.go:610 +0xeb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*replicaStateMachine).ApplySideEffects(0xc005f555b8, 0x4bbbfa0, 0xc03348e008, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_application_state_machine.go:1025 +0x659
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.mapCheckedCmdIter(0x7f4996ec29f0, 0xc005f557c0, 0xc0098eb300, 0x0, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/cmd.go:184 +0x161
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).applyOneBatch(0xc0098eb830, 0x4b83040, 0xc0299639b0, 0x4bbc060, 0xc005f55760, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:281 +0x259
github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply.(*Task).ApplyCommittedEntries(0xc0098eb830, 0x4b83040, 0xc0299639b0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/apply/task.go:247 +0xfb
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReadyRaftMuLocked(0xc005f55500, 0x4b83040, 0xc0299639b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:732 +0xdf9
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Replica).handleRaftReady(0xc005f55500, 0x4b83040, 0xc0299639b0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/replica_raft.go:392 +0x186
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*Store).processReady(0xc00096ce00, 0x4b83040, 0xc0006c5290, 0x1e1)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/store_raft.go:499 +0x136
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).worker(0xc000810880, 0x4b83040, 0xc0006c5290)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:226 +0x29f
github.com/cockroachdb/cockroach/pkg/kv/kvserver.(*raftScheduler).Start.func2(0x4b83040, 0xc0006c5290)
	/go/src/github.com/cockroachdb/cockroach/pkg/kv/kvserver/scheduler.go:166 +0x3e
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker.func1(0xc0002ffb10, 0xc0006b4120, 0xc0002ffaf0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:198 +0x13e
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunWorker
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:191 +0xa8

@rohany
Contributor

rohany commented Mar 20, 2020

The alterpk-tpcc failure here is because it is missing a fix (#46162); without it, the node that died hit an OOM.

@yuzefovich
Member

tpchvec/disk failure is concerning, but I'm not sure whether it is a beta blocker (I would think of it as a release blocker though). It needs more investigation (#46256). cc @asubiotto @jordanlewis

@nvanbenschoten
Member

> In the cdc/tpcc-1000/rangefeed=true test a node failed with the following. @nvanbenschoten is this you?

Yes, this is being tracked in #46290 and was added to the release blocker list yesterday. I'll have a fix out later today, which should be small. I'll defer to you on whether it needs to hold up the beta though, as we only seem to see it on that test. I think that's because the rangefeed processor pushes transaction timestamps with high priority (see rangefeedTxnPusher), so it is effective in creating the series of events enumerated in #46290 (comment). We don't really push with high priority anywhere else in the system.

@irfansharif
Contributor

irfansharif commented Mar 20, 2020

Signing off on all the scaledata tests. #45827 broke https://github.com/scaledata/rksql/blob/a313e43fdf6846bb86f2f9e02458442f646cf196/src/go/src/rubrik/util/crdbutil/connection.go#L221, which is expecting 1 column, not 3. We're pulling the filesystem_simulator, distributed_semaphore and jobcoordinator binaries from edge-binaries, built from whatever SHA we last picked off of https://github.com/scaledata/rksql, instead of pinning the dependency somehow (the same kind of problem as #42083, #28069; +cc @apantel).

Informs recent failures on #44066, #43284, #45645, #43273, #43839, #41528.
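
For readers unfamiliar with this class of breakage, here is a minimal sketch of why a caller that assumes exactly one result column fails once the server starts returning three. It does not reproduce the rksql code or the actual statement changed by #45827; the `result` type, the two helper functions, and the column names `value`, `extra_a` and `extra_b` are all hypothetical and exist only for illustration.

```go
package main

import "fmt"

// result is a hypothetical stand-in for a query result:
// the column names plus a single row of values.
type result struct {
	columns []string
	row     []string
}

// scanSingle mimics a client that hard-codes "this query returns one column".
// It fails as soon as the result shape grows.
func scanSingle(r result) (string, error) {
	if len(r.columns) != 1 {
		return "", fmt.Errorf("expected 1 column, got %d", len(r.columns))
	}
	return r.row[0], nil
}

// scanByName is a more tolerant alternative: it looks the value up by
// column name, so added columns do not break it.
func scanByName(r result, name string) (string, error) {
	for i, c := range r.columns {
		if c == name && i < len(r.row) {
			return r.row[i], nil
		}
	}
	return "", fmt.Errorf("column %q not found", name)
}

func main() {
	// Hypothetical shapes before and after a server-side change.
	before := result{columns: []string{"value"}, row: []string{"on"}}
	after := result{
		columns: []string{"value", "extra_a", "extra_b"},
		row:     []string{"on", "x", "y"},
	}

	fmt.Println(scanSingle(before))        // succeeds: one column
	fmt.Println(scanSingle(after))         // fails: three columns
	fmt.Println(scanByName(after, "value")) // still succeeds
}
```

Reading by column name tolerates added columns, but it does not replace pinning the client binaries to a known SHA, which is the underlying problem described above.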

@nvanbenschoten
Member

> I'll have a fix out later today, which should be small.

Here's the fix for this: #46391.

@dt
Member

dt commented Mar 21, 2020

#46390 fixes the backup bug. It also includes a revert of the revert in #46390 that I merged while I was debugging it. If we're cherrypicking, we could either cherrypick all three (revert, fix, revert-revert), or just the fix commit: 7e288ca

@dt
Member

dt commented Mar 23, 2020

Plan is to restart the qualification cluster with cherrypicks of 7e288ca and #46391 but not reset the clock on the QA run as the former is not exercised and the latter is considered low-risk/high confidence, so absent anything else coming up, this should be ready to ship Wednesday 3/25.

@danhhz danhhz assigned dt and unassigned danhhz Mar 23, 2020
@dt
Member

dt commented Mar 23, 2020

tag provisional_202003231705_v20.1.0-beta.3 is pushed and publish provisional is underway.

g checkout provisional_202003200044_v20.1.0-beta.3
g cherry-pick 7e288ca4c0ffc57ae059dffb616aa5cefbeff78b
g cherry-pick 94e7b96bb857c067ad4909bbf17cdf0ac72427d6~..3069773267ee1aaa7d28d300dea9ab8fc361d315
g tag provisional_202003231705_v20.1.0-beta.3 

@yuzefovich
Member

Signing off on tpchvec/disk as we do not consider it a beta blocker.

@jseldess
Contributor

Docs published and external comms sent: https://www.cockroachlabs.com/docs/releases/v20.1.0-beta.3.html

@jseldess
Contributor

@dt, anything else to do here?

@dt
Member

dt commented Mar 26, 2020

Nothing on eng side.


8 participants