Currently the performance regression detection workflow executed via GitHub Actions is unreliable due to high variance in the results. We suspect we can drive down this variance by executing the regression detection workflows on hardware that we can guarantee is reserved for these workflows, and that executes the workflows serially.
Below are some options that have been discussed:
1. Configure GitHub Actions to submit the jobs to reserved AWS hardware that we control. This is the more complicated of the two options, but it has the benefit of not shifting any burden onto the PR requester.
2. Change the workflow so that it verifies reports uploaded by the PR requester, adding a build phase that executes the regression detection workflow (before/after runs) on the requester's hardware. This is the simpler option, but it has the drawback that submitting a PR becomes a bit more onerous.
In #746, I was having a hard time reproducing my results consistently even though I was always running the tests serially on the same hardware. I had other processes running at the time (which would also be the case for option 2), but I was running the tests in a single JVM on a single core of my 8-core M1 Pro CPU. It's unclear to me whether HotSpot optimizations are applied deterministically (i.e., whether two runs of the same program with the same inputs result in the same HotSpot optimizations), and that may be a confounding factor here.
TL;DR: dedicated hardware certainly can't hurt the test reliability, but it might not improve it either.
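For what it's worth, one way to dampen the JIT nondeterminism is to average over several fresh JVMs rather than measure a single process. Here's a minimal JMH-style sketch (assuming the benchmarks could be expressed as JMH benchmarks; the class and workload below are hypothetical, not our actual code path):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

/**
 * Hypothetical JMH harness for one regression-detection benchmark.
 * Each fork starts a fresh JVM, so run-to-run differences in HotSpot's
 * compilation decisions are averaged across forks instead of being
 * baked into a single measurement.
 */
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(5)                            // 5 fresh JVMs per benchmark
@Warmup(iterations = 10, time = 1)  // let the JIT settle before measuring
@Measurement(iterations = 10, time = 1)
public class RegressionBenchmark {

    @Benchmark
    public Object workloadUnderTest() {
        // placeholder for the code path being checked for regressions;
        // returning the result keeps dead-code elimination from removing it
        return doWorkload();
    }

    private Object doWorkload() {
        // hypothetical workload
        return new Object();
    }
}
```

That doesn't make HotSpot deterministic, but it turns its variability into something the statistics can see.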
The regression detector came up after a point release caused a substantial performance regression, right?
How substantial was it, and are we over-tuned here? Are we trying to detect any regression at all, or only to prevent disastrous regressions?
Have we considered an approach like JProffa, which measures bytecodes executed instead of wall-clock time?
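I haven't used JProffa, so this isn't its mechanism, but here's a minimal sketch of the same "count work rather than time" principle using the per-thread allocation counter HotSpot already exposes. It isn't bytecode counting, and escape analysis can still eliminate allocations, but it illustrates the shape of a metric that's far less sensitive to contention than wall-clock time (the workload in main is hypothetical):

```java
import java.lang.management.ManagementFactory;

/**
 * Sketch of measuring "work done" via a deterministic-ish counter (bytes
 * allocated by the measuring thread) instead of wall-clock time.
 */
public final class AllocationMeter {

    // On HotSpot, the platform ThreadMXBean also implements
    // com.sun.management.ThreadMXBean, which adds per-thread allocation counters.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Returns the bytes allocated by the current thread while running the workload. */
    public static long allocatedBytes(Runnable workload) {
        long threadId = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(threadId);
        workload.run();
        return THREADS.getThreadAllocatedBytes(threadId) - before;
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for the benchmarked code path.
        long bytes = allocatedBytes(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("allocated bytes: " + bytes);
    }
}
```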
If contention is the problem, could we try to control for it by making both halves of the comparison run concurrently? That would make contention even worse, but it ought to affect both sides of the split evenly.
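Rough sketch of what that could look like (the workloads and iteration count are hypothetical): both halves are released from a barrier at the same moment on separate threads, so whatever contention exists hits both sides at once, and we compare the pair rather than two independent runs.

```java
import java.util.concurrent.CyclicBarrier;

/**
 * Sketch of running the "before" and "after" halves of the comparison
 * concurrently so background contention affects both sides equally.
 */
public final class PairedConcurrentRun {

    static long timeOnThread(CyclicBarrier start, Runnable workload, int iterations)
            throws Exception {
        start.await();                       // line up with the other variant
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            workload.run();
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) throws Exception {
        Runnable before = () -> { /* baseline code path (hypothetical) */ };
        Runnable after  = () -> { /* candidate code path (hypothetical) */ };
        int iterations = 1_000;

        CyclicBarrier start = new CyclicBarrier(2);
        long[] results = new long[2];

        Thread a = new Thread(() -> {
            try { results[0] = timeOnThread(start, before, iterations); }
            catch (Exception e) { throw new RuntimeException(e); }
        });
        Thread b = new Thread(() -> {
            try { results[1] = timeOnThread(start, after, iterations); }
            catch (Exception e) { throw new RuntimeException(e); }
        });

        a.start(); b.start();
        a.join();  b.join();

        System.out.printf("before: %d ns, after: %d ns, ratio: %.3f%n",
                results[0], results[1], (double) results[1] / results[0]);
    }
}
```

The obvious cost is that the two variants now also contend with each other (and share one JVM's JIT profile), so this trades absolute accuracy for a fairer relative comparison.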