release: 19.2.8 #50465

Closed
22 tasks done
asubiotto opened this issue Jun 22, 2020 · 13 comments

asubiotto commented Jun 22, 2020

Candidate SHA: 0421678
Deployment status: Qualifying
Qualification Suite: https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_ReleaseQualification&tab=buildTypeStatusDiv&branch_Cockroach=provisional_202006230817_v19.2.8
Nightly Suite: https://teamcity.cockroachdb.com/viewType.html?buildTypeId=Cockroach_Nightlies_NightlySuite&tab=buildTypeStatusDiv&branch_Cockroach_Nightlies=provisional_202006230817_v19.2.8

Admin UI for Qualification Clusters:

Release process checklist

Prep date: Monday 6/22/2020

  • Pick a SHA
    • fill in Candidate SHA above
    • email thread on releases@
  • Tag the provisional SHA
  • Publish provisional binaries
  • Ack security@ on the generated Stackdriver Alert to confirm these writes were part of a planned release (just reply to the received alert email, acknowledging that this was part of the release process)

Release Qualification

One day after prep date:

Release date: Monday 6/29/2020

cockroachdb deleted a comment from blathers-crl bot Jun 22, 2020
asubiotto self-assigned this Jun 22, 2020
@asubiotto

Restarting with a SHA that includes the security fix: 0421678

asubiotto commented Jun 23, 2020

Looks like Roachtest GCE nightly could not be started due to exceeding a CPU quota. Can we restart just that part of the suite? (cc @jlinder) We also have the option to pass different roachprod zones to the build, but I'm slightly confused since I only see the option to do so at the Nightly Suite level, not the Roachtest GCE level (although I haven't looked very hard).

Update: re-running Roachtest GCE nightly independently. We'll see if we still run into quota issues. If we do, I'll change zones.

asubiotto commented Jun 23, 2020

Looks like it still ran into an issue; we're running close to the 4k CPU limit in us-central. I'll use us-east (2,400 unused CPUs) instead of us-central and see what happens: https://teamcity.cockroachdb.com/viewLog.html?buildId=2032238&buildTypeId=Cockroach_Nightlies_WorkloadNightly

asubiotto commented Jun 23, 2020

Starting the test failure checkoff process while the build is still running to save time.

Test Failures List

Roachtest GCE

Failures: https://teamcity.cockroachdb.com/viewLog.html?buildId=2032238&buildTypeId=Cockroach_Nightlies_WorkloadNightly

[kv]

  • kv/contention/nodes=4
  • tpccbench/nodes=9/cpu=4/chaos/partition

[appdev]

  • django
  • lib/pq
  • pgx
  • psycopg

[sql-schema]

  • schemachange/during/tpcc

tbg commented Jun 23, 2020

ts_util.go:130,kv.go:259,cluster.go:2460,errgroup.go:57: spent 47.368421% of time below target of 100.000000 txn/s, wanted no more than 5.000000%

@nvanbenschoten looks like your wheelhouse
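
For anyone skimming: that assertion is the kv/contention test's throughput check. As a rough illustration only (this is not the actual ts_util.go/kv.go code, and the sample numbers are made up), the check amounts to computing the fraction of per-second throughput samples that fall below the target rate and failing if it exceeds the allowed 5%:

```go
// Rough illustration of a "fraction of time below target" check; not the
// actual roachtest helper in ts_util.go/kv.go. Sample values are made up.
package main

import "fmt"

// fractionBelowTarget returns the fraction of per-second throughput
// samples that fall below the target rate.
func fractionBelowTarget(samples []float64, target float64) float64 {
	below := 0
	for _, s := range samples {
		if s < target {
			below++
		}
	}
	return float64(below) / float64(len(samples))
}

func main() {
	samples := []float64{120, 80, 130, 95, 110, 70, 140, 105, 60, 125} // hypothetical txn/s samples
	const target, maxFrac = 100.0, 0.05

	if frac := fractionBelowTarget(samples, target); frac > maxFrac {
		fmt.Printf("spent %f%% of time below target of %f txn/s, wanted no more than %f%%\n",
			frac*100, target, maxFrac*100)
	}
}
```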

asubiotto mentioned this issue Jun 23, 2020
rafiss commented Jun 23, 2020

Signed off on the AppDev tests.

jlinder commented Jun 23, 2020

I suspect the quota issue was hit because we are running three releases at the same time on top of the normal nightlies. I've put in a request to raise the quotas for CPUs, in-use IPs, and local SSDs, since those numbers were close enough to their limits that they might also be hit when running all three.

jlinder commented Jun 23, 2020

The quota limit increases were approved.

@nvanbenschoten

kv/contention/nodes=4 has always been flaky on v19.2. See #40786 and https://teamcity.cockroachdb.com/project.html?projectId=Cockroach_Nightlies&buildTypeId=&tab=testDetails&testNameId=-9215103075698950051&order=START_DATE_DESC&branch_Cockroach_Nightlies=release-19.2&itemsCount=100. Last time @irfansharif looked, he diagnosed that it had something to do with fairness issues around the contentionQueue. This was part of the reason we redesigned this in v20.1.

I think we should reduce the aggressiveness of the test to avoid some of the starvation that results from these fairness issues under such severe contention. In the meantime, signing off.

tbg commented Jun 24, 2020

tpccbench timed out. It did run some of the workloads, but look at the last one's logs:

Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200623 22:47:28.365328 1 workload/cli/run.go:362  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...
I200623 22:48:30.563745 1 workload/cli/run.go:362  retrying after error while creating load: preparing 
		UPDATE district
		SET d_next_o_id = d_next_o_id + 1
		WHERE d_w_id = $1 AND d_id = $2
		RETURNING d_tax, d_next_o_id: EOF
Initializing 2100 connections...
I200623 22:49:34.501519 1 workload/cli/run.go:362  retrying after error while creating load: EOF
Initializing 2100 connections...
Initializing 10500 workers and preparing statements...

This basically just goes on and on; we never manage to prepare all the statements before a node gets chaos-killed. @nvanbenschoten, is this something we're aware of? I feel like I've seen this test time out this way a few times. What do we do to fix it? Is the chaos timing too aggressive? I assume it isn't feasible to prepare the statements before starting the workload+chaos.
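
To make the failure mode concrete, here is a heavily simplified sketch of the retry pattern visible in that log; it is not the actual workload/cli/run.go code, and the connection string is hypothetical. The point is that connecting and preparing are retried as one unit, so an EOF from a chaos-killed node partway through the prepare phase throws away all the work and starts over:

```go
// Heavily simplified sketch of the retry loop seen in the log above; not
// the actual workload/cli/run.go implementation. Connecting and preparing
// are one unit, so an EOF from a chaos-killed node mid-prepare restarts
// the whole phase from scratch.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/lib/pq" // driver choice is arbitrary for this sketch
)

func createLoad(dsn string, numConns int, stmts []string) error {
	log.Printf("Initializing %d connections...", numConns)
	conns := make([]*sql.DB, 0, numConns)
	for i := 0; i < numConns; i++ {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			return err
		}
		if err := db.Ping(); err != nil { // force an actual connection
			return err
		}
		conns = append(conns, db)
	}

	log.Printf("Initializing workers and preparing statements...")
	for _, db := range conns {
		for _, s := range stmts {
			if _, err := db.Prepare(s); err != nil {
				return err // e.g. EOF when a node is killed mid-prepare
			}
		}
	}
	return nil
}

func main() {
	// Hypothetical DSN; the real workload targets the test cluster's nodes.
	const dsn = "postgres://root@localhost:26257/tpcc?sslmode=disable"
	stmts := []string{
		`UPDATE district
		 SET d_next_o_id = d_next_o_id + 1
		 WHERE d_w_id = $1 AND d_id = $2
		 RETURNING d_tax, d_next_o_id`,
	}
	for {
		if err := createLoad(dsn, 2100, stmts); err != nil {
			log.Printf("retrying after error while creating load: %v", err)
			time.Sleep(time.Second)
			continue
		}
		break
	}
}
```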

@ajwerner

The schema change one seems potentially bad:

schemachange.go:476,schemachange.go:439,cluster.go:2460,errgroup.go:57: pq: foreign key violation: "district" row d_w_id=288, d_id=1 has no match in "warehouse"

I cannot imagine how that's true.

@ajwerner

Chalking the schema change failure up to #44301, which has been newly prioritized.

@nvanbenschoten

This basically just goes on and on; we never manage to prepare all the statements before a node gets chaos-killed. @nvanbenschoten, is this something we're aware of? I feel like I've seen this test time out this way a few times. What do we do to fix it? Is the chaos timing too aggressive? I assume it isn't feasible to prepare the statements before starting the workload+chaos.

I think there's something more going on. We've dropped the chaos aggressiveness in the past and it didn't seem to help. These prepared statements are usually very quick, so it's surprising to see them stall for over a minute. It makes me wonder if they're getting stuck in some backoff loop if they start the process at the wrong time.

I'll confirm the part about prepared statements being quick though. We are initializing 2100 connections and 10500 workers, so maybe I'm misremembering how long that scale takes. There was also some movement in this area when we moved tpcc to pgx. That could be related.
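
For what it's worth, a back-of-envelope way to check that (purely a sketch, using database/sql rather than the workload's pgx path, with a hypothetical local connection string and a stand-in statement) is to time the prepare phase at a smaller connection count and extrapolate to the test's 2100 connections:

```go
// Back-of-envelope timing sketch for the prepare phase; uses database/sql
// rather than the workload's pgx path, a hypothetical local DSN, and a
// stand-in statement. Scale the measured time up to 2100 connections.
package main

import (
	"database/sql"
	"fmt"
	"log"
	"time"

	_ "github.com/lib/pq"
)

func main() {
	const (
		dsn      = "postgres://root@localhost:26257/tpcc?sslmode=disable"
		numConns = 100 // scaled down from the test's 2100 connections
	)
	stmt := `SELECT d_tax, d_next_o_id FROM district WHERE d_w_id = $1 AND d_id = $2`

	start := time.Now()
	for i := 0; i < numConns; i++ {
		db, err := sql.Open("postgres", dsn)
		if err != nil {
			log.Fatal(err)
		}
		if _, err := db.Prepare(stmt); err != nil {
			log.Fatal(err)
		}
		db.Close()
	}
	elapsed := time.Since(start)
	fmt.Printf("%d connections prepared in %s (~%s extrapolated to 2100)\n",
		numConns, elapsed, elapsed*time.Duration(2100/numConns))
}
```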
