release: 19.2.8 #50465

Closed
22 tasks done
asubiotto opened this issue Jun 22, 2020 · 13 comments

asubiotto commented Jun 22, 2020

Candidate SHA: 0421678
Deployment status: Qualifying
Qualification Suite: https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_ReleaseQualification&tab=buildTypeStatusDiv&branch_Cockroach=provisional_202006230817_v19.2.8
Nightly Suite: https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_NightlySuite&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=provisional_202006230817_v19.2.8

Admin UI for Qualification Clusters:

Release process checklist

Prep date: Monday 6/22/2020

  • Pick a SHA
    • fill in Candidate SHA above
    • email thread on releases@
  • Tag the provisional SHA
  • Publish provisional binaries
  • Ack security@ on the generated Stackdriver Alert to confirm these writes were part of a planned release (just reply to the received alert email, acknowledging that this was part of the release process)

Release Qualification

One day after prep date:

Release date: Monday 6/29/2020

cockroachdb deleted a comment from blathers-crl bot Jun 22, 2020
asubiotto self-assigned this Jun 22, 2020
@asubiotto

Restarting with a SHA that includes the security fix: 0421678

asubiotto commented Jun 23, 2020

Looks like Roachtest GCE nightly could not be started due to exceeding a CPU quota. Can we restart just that part of the suite? (cc @jlinder) We also have the option to pass different roachprod zones to the build, but I'm slightly confused since I only see the option to do so at the Nightly Suite level, not the Roachtest GCE level (although I haven't looked very hard).

Update: re-running Roachtest GCE nightly independently. We'll see if we still run into quota issues. If we do, I'll change zones.

asubiotto commented Jun 23, 2020

Looks like it still ran into an issue; we're running close to the 4k CPU limit in us-central. I'll use us-east (2,400 unused CPUs) instead of us-central and see what happens: https://teamcity.cockroachdb.com/viewLog.html?buildId=2032238&buildTypeId=Cockroach_Nightlies_WorkloadNightly

asubiotto commented Jun 23, 2020

Starting the test failure checkoff process while the build is still running to save time.

Test Failures List

Roachtest GCE

Failures: https://teamcity.cockroachdb.com/viewLog.html?buildId=2032238&buildTypeId=Cockroach_Nightlies_WorkloadNightly

[kv]

  • kv/contention/nodes=4
  • tpccbench/nodes=9/cpu=4/chaos/partition

[appdev]

  • django
  • lib/pq
  • pgx
  • psycopg

[sql-schema]

  • schemachange/during/tpcc

tbg commented Jun 23, 2020

ts_util.go:130,kv.go:259,cluster.go:2460,errgroup.go:57: spent 47.368421% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@nvanbenschoten looks like your wheelhouse
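
For anyone skimming: that assertion is the kv/contention test's throughput check. As a rough illustration only (this is not the actual ts_util.go/kv.go code, and the sample numbers are made up), the check amounts to computing the fraction of per-second throughput samples that fall below the target rate and failing if it exceeds the allowed 5%:

```go
// Rough illustration of a "fraction of time below target" check; not the
// actual roachtest helper in ts_util.go/kv.go. Sample values are made up.
package main

import "fmt"

// fractionBelowTarget returns the fraction of per-second throughput
// samples that fall below the target rate.
func fractionBelowTarget(samples []float64, target float64) float64 {
	below := 0
	for _, s := range samples {
		if s < target {
			below++
		}
	}
	return float64(below) / float64(len(samples))
}

func main() {
	samples := []float64{120, 80, 130, 95, 110, 70, 140, 105, 60, 125} // hypothetical txn/s samples
	const target, maxFrac = 100.0, 0.05

	if frac := fractionBelowTarget(samples, target); frac > maxFrac {
		fmt.Printf("spent %f%% of time below target of %f txn/s, wanted no more than %f%%\n",
			frac*100, target, maxFrac*100)
	}
}
```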

asubiotto mentioned this issue Jun 23, 2020
rafiss commented Jun 23, 2020

Signed off on the AppDev tests.

jlinder commented Jun 23, 2020

I suspect the quota issue was hit because we are running three releases at the same time on top of the normal nightlies. I've put in a request to raise the quotas for CPUs, in-use IPs, and local SSDs, since those numbers were close enough to their limits that they might also be hit when running all three.

jlinder commented Jun 23, 2020

The quota limit increases were approved.

@nvanbenschoten

kv/contention/nodes=4 has always been flaky on v19.2. See #40786 and https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&buildTypeId=&tab=testDetails&testNameId=-9215103075698950051&order=START_DATE_DESC&branch_Cockroach_Nightlies=release-19.2&itemsCount=100. Last time @irfansharif looked, he diagnosed that it had something to do with fairness issues around the contentionQueue. This was part of the reason we redesigned this in v20.1.

I think we should reduce the aggressiveness of the test to avoid some of the starvation that results from these fairness issues under such severe contention. In the meantime, signing off.

tbg commented Jun 24, 2020

tpccbench timed out. It did run some of the workloads, but look at the last one's logs:

Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200623 22:47:28.365328 1 workload/cli/run.go:362  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200623 22:48:30.563745 1 workload/cli/run.go:362  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
I200623 22:49:34.501519 1 workload/cli/run.go:362  retrying after error while creating load: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...

This basically just goes on and on; we never manage to prepare all the statements before a node gets chaos-killed. @nvanbenschoten, is this something we're aware of? I feel like I've seen this test time out this way a few times. What do we do to fix it? Is the chaos timing too aggressive? I assume it isn't feasible to prepare the statements before starting the workload+chaos.
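
To make the failure mode concrete, here is a heavily simplified sketch of the retry pattern visible in that log; it is not the actual workload/cli/run.go code, and the connection string is hypothetical. The point is that connecting and preparing are retried as one unit, so an EOF from a chaos-killed node partway through the prepare phase throws away all the work and starts over:

```go
// Heavily simplified sketch of the retry loop seen in the log above; not
// the actual workload/cli/run.go implementation. Connecting and preparing
// are one unit, so an EOF from a chaos-killed node mid-prepare restarts
// the whole phase from scratch.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // driver choice is arbitrary for this sketch
)

func createLoad(dsn string, numConns int, stmts []string) error {
	log.Printf("Initializing %d connections...", numConns)
	conns := make([]*sql.DB, 0, numConns)
	for i := 0; i < numConns; i++ {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			return err
		}
		if err := db.Ping(); err != nil { // force an actual connection
			return err
		}
		conns = append(conns, db)
	}

	log.Printf("Initializing workers and preparing statements...")
	for _, db := range conns {
		for _, s := range stmts {
			if _, err := db.Prepare(s); err != nil {
				return err // e.g. EOF when a node is killed mid-prepare
			}
		}
	}
	return nil
}

func main() {
	// Hypothetical DSN; the real workload targets the test cluster's nodes.
	const dsn = "postgres://root@localhost:26257/tpcc?sslmode=disable"
	stmts := []string{
		`UPDATE district
		 SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2
		 RETURNING d_tax, d_next_o_id`,
	}
	for {
		if err := createLoad(dsn, 2100, stmts); err != nil {
			log.Printf("retrying after error while creating load: %v", err)
			time.Sleep(time.Second)
			continue
		}
		break
	}
}
```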

@ajwerner

The schema change one seems potentially bad:

schemachange.go:476,schemachange.go:439,cluster.go:2460,errgroup.go:57: pq: foreign key violation: "district" row d_w_id=288, d_id=1 has no match in "warehouse"

I cannot imagine how that's true.

@ajwerner

Chalking the schema change failure up to #44301, which has been newly prioritized.

@nvanbenschoten

This basically just goes on and on; we never manage to prepare all the statements before a node gets chaos-killed. @nvanbenschoten, is this something we're aware of? I feel like I've seen this test time out this way a few times. What do we do to fix it? Is the chaos timing too aggressive? I assume it isn't feasible to prepare the statements before starting the workload+chaos.

I think there's something more going on. We've dropped the chaos aggressiveness in the past and it didn't seem to help. These prepared statements are usually very quick, so it's surprising to see them stall for over a minute. It makes me wonder if they're getting stuck in some backoff loop if they start the process at the wrong time.

I'll confirm the part about prepared statements being quick though. We are initializing 2100 connections and 10500 workers, so maybe I'm misremembering how long that scale takes. There was also some movement in this area when we moved tpcc to pgx. That could be related.
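
For what it's worth, a back-of-envelope way to check that (purely a sketch, using database/sql rather than the workload's pgx path, with a hypothetical local connection string and a stand-in statement) is to time the prepare phase at a smaller connection count and extrapolate to the test's 2100 connections:

```go
// Back-of-envelope timing sketch for the prepare phase; uses database/sql
// rather than the workload's pgx path, a hypothetical local DSN, and a
// stand-in statement. Scale the measured time up to 2100 connections.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	const (
		dsn      = "postgres://root@localhost:26257/tpcc?sslmode=disable"
		numConns = 100 // scaled down from the test's 2100 connections
	)
	stmt := `SELECT d_tax, d_next_o_id FROM district WHERE d_w_id = $1 AND d_id = $2`

	start := time.Now()
	for i := 0; i < numConns; i++ {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := db.Prepare(stmt); err != nil {
			log.Fatal(err)
		}
		db.Close()
	}
	elapsed := time.Since(start)
	fmt.Printf("%d connections prepared in %s (~%s extrapolated to 2100)\n",
		numConns, elapsed, elapsed*time.Duration(2100/numConns))
}
```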
