Currently the performance regression detection workflow executed via GitHub Actions is unreliable due to high variance in the results. We suspect we can drive down this variance by executing the regression detection workflows on hardware that we can guarantee is reserved for these workflows, and that executes the workflows serially.
Below are some options that have been discussed:
1. Configure GitHub Actions to submit the jobs to reserved AWS hardware that we control. This is the more complicated of the two options, but it has the benefit of not shifting any burden onto the PR requester.
2. Change the workflow so that it verifies reports uploaded by the PR requester, adding a build phase that executes the regression detection workflow (before/after runs) on the requester's hardware. This is the simpler option, but it has the drawback that submitting a PR becomes a bit more onerous.
In #746, I was having a hard time reproducing my results consistently even though I was always running the tests serially on the same hardware. I had other processes running at the time (which would also be the case for option 2), but I was running the tests in a single JVM on a single core of my 8-core M1 Pro CPU. It's unclear to me whether HotSpot optimizations are applied deterministically (i.e., whether two runs of the same program with the same inputs result in the same HotSpot optimizations), and that may be a confounding factor here.
TL;DR: dedicated hardware certainly can't hurt the test reliability, but it might not improve it either.
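For what it's worth, one way to dampen the JIT nondeterminism is to average over several fresh JVMs rather than measure a single process. Here's a minimal JMH-style sketch (assuming the benchmarks could be expressed as JMH benchmarks; the class and workload below are hypothetical, not our actual code path):

```java
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.annotations.Warmup;

/**
 * Hypothetical JMH harness for one regression-detection benchmark.
 * Each fork starts a fresh JVM, so run-to-run differences in HotSpot's
 * compilation decisions are averaged across forks instead of being
 * baked into a single measurement.
 */
@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(5)                            // 5 fresh JVMs per benchmark
@Warmup(iterations = 10, time = 1)  // let the JIT settle before measuring
@Measurement(iterations = 10, time = 1)
public class RegressionBenchmark {

    @Benchmark
    public Object workloadUnderTest() {
        // placeholder for the code path being checked for regressions;
        // returning the result keeps dead-code elimination from removing it
        return doWorkload();
    }

    private Object doWorkload() {
        // hypothetical workload
        return new Object();
    }
}
```

That doesn't make HotSpot deterministic, but it turns its variability into something the statistics can see.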
The regression detector came up after a point release caused a substantial performance regression, right?
How substantial was it, and are we over-tuned here? Are we trying to detect any regression at all, or only to prevent disastrous regressions?
Have we considered an approach like JProffa, which measures bytecodes executed instead of wall-clock time?
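I haven't used JProffa, so this isn't its mechanism, but here's a minimal sketch of the same "count work rather than time" principle using the per-thread allocation counter HotSpot already exposes. It isn't bytecode counting, and escape analysis can still eliminate allocations, but it illustrates the shape of a metric that's far less sensitive to contention than wall-clock time (the workload in main is hypothetical):

```java
import java.lang.management.ManagementFactory;

/**
 * Sketch of measuring "work done" via a deterministic-ish counter (bytes
 * allocated by the measuring thread) instead of wall-clock time.
 */
public final class AllocationMeter {

    // On HotSpot, the platform ThreadMXBean also implements
    // com.sun.management.ThreadMXBean, which adds per-thread allocation counters.
    private static final com.sun.management.ThreadMXBean THREADS =
            (com.sun.management.ThreadMXBean) ManagementFactory.getThreadMXBean();

    /** Returns the bytes allocated by the current thread while running the workload. */
    public static long allocatedBytes(Runnable workload) {
        long threadId = Thread.currentThread().getId();
        long before = THREADS.getThreadAllocatedBytes(threadId);
        workload.run();
        return THREADS.getThreadAllocatedBytes(threadId) - before;
    }

    public static void main(String[] args) {
        // Hypothetical workload standing in for the benchmarked code path.
        long bytes = allocatedBytes(() -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 10_000; i++) {
                sb.append(i);
            }
        });
        System.out.println("allocated bytes: " + bytes);
    }
}
```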
If contention is the problem, could we try to control for it by making both halves of the comparison run concurrently? That would make contention even worse, but it ought to affect both sides of the split evenly.
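Rough sketch of what that could look like (the workloads and iteration count are hypothetical): both halves are released from a barrier at the same moment on separate threads, so whatever contention exists hits both sides at once, and we compare the pair rather than two independent runs.

```java
import java.util.concurrent.CyclicBarrier;

/**
 * Sketch of running the "before" and "after" halves of the comparison
 * concurrently so background contention affects both sides equally.
 */
public final class PairedConcurrentRun {

    static long timeOnThread(CyclicBarrier start, Runnable workload, int iterations)
            throws Exception {
        start.await();                       // line up with the other variant
        long begin = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            workload.run();
        }
        return System.nanoTime() - begin;
    }

    public static void main(String[] args) throws Exception {
        Runnable before = () -> { /* baseline code path (hypothetical) */ };
        Runnable after  = () -> { /* candidate code path (hypothetical) */ };
        int iterations = 1_000;

        CyclicBarrier start = new CyclicBarrier(2);
        long[] results = new long[2];

        Thread a = new Thread(() -> {
            try { results[0] = timeOnThread(start, before, iterations); }
            catch (Exception e) { throw new RuntimeException(e); }
        });
        Thread b = new Thread(() -> {
            try { results[1] = timeOnThread(start, after, iterations); }
            catch (Exception e) { throw new RuntimeException(e); }
        });

        a.start(); b.start();
        a.join();  b.join();

        System.out.printf("before: %d ns, after: %d ns, ratio: %.3f%n",
                results[0], results[1], (double) results[1] / results[0]);
    }
}
```

The obvious cost is that the two variants now also contend with each other (and share one JVM's JIT profile), so this trades absolute accuracy for a fairer relative comparison.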