Test load balancing (migrating changes from TypeScript-Demo-Lib) #394
Conversation
Some additional JUnit reporter options that I've now added on top of the ones brought in from TypeScript-Demo-Lib:
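The options being referred to appear to be jest-junit reporter options (the same ones named in the task list at the bottom of this PR). A rough sketch of how they sit in the Jest config; the template strings here are illustrative placeholders, not the values actually committed:

// jest.config.ts (sketch) - jest-junit reporter options mentioned in this PR;
// the template strings are placeholders, not the committed values
const config = {
  reporters: [
    'default',
    [
      'jest-junit',
      {
        classNameTemplate: '{classname}',
        titleTemplate: '{title}',
        addFileAttribute: 'true',
        ancestorSeparator: ' > ',
        reportTestSuiteErrors: 'true',
      },
    ],
  ],
};
export default config;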
|
Updating jest to 28.* seems to mess with our usage of the mocked helper. We use it in a few places, e.g.
mockedPrompts.mockImplementation(async (_opts: any) => {
return { password };
});
which throws a type error:
|
Changing the mock from:
jest.mock('prompts');
const mockedPrompts = mocked(prompts.prompt); |
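For context, the replacement that ships with Jest 28 itself is jest.mocked (the ts-jest mocked helper was removed). A minimal sketch of the same mock migrated to it, assuming prompts is the module in question as in the snippet above, with a placeholder password fixture:

import prompts from 'prompts';

jest.mock('prompts');

// jest.mocked is provided by Jest itself, so no ts-jest/utils import is needed
const mockedPrompts = jest.mocked(prompts.prompt);

const password = 'some password'; // placeholder for the password fixture used in the real tests
mockedPrompts.mockImplementation(async (_opts: any) => {
  return { password };
});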
I'm running into a problem with choosing a shard count for the mac/windows runners. The main issue is that the windows runners start to fail if I set the shard count too high with the error
With 12 shards they all failed with this error, and with 20 shards 8 of them failed. The problem is that, even with 12 shards, a lot of the tests are timing out (on the mac runners, not sure about windows since they won't run). But I can't increase the shard count due to the issue with the windows runners. There's the possibility that bringing in a very simple algorithm to try to optimise the shards (e.g. https://github.com/kamilkisiela/split-tests) would allow us to reduce the shard count without the tests timing out, since I ran some of the shards locally and there are a few minutes' difference between shard run times even in the few that I ran. |
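For reference, the "very simple algorithm" mentioned above is essentially greedy bin packing over recorded per-file durations. A self-contained sketch; the file names and timings are made up, and in practice the durations would come from previous JUnit reports:

type Shard = { files: string[]; totalSeconds: number };

// Longest-processing-time-first greedy assignment of test files to shards
function packShards(durations: Record<string, number>, shardCount: number): Shard[] {
  const shards: Shard[] = Array.from({ length: shardCount }, () => ({
    files: [],
    totalSeconds: 0,
  }));
  const files = Object.entries(durations).sort((a, b) => b[1] - a[1]);
  for (const [file, seconds] of files) {
    // Give the next-longest file to whichever shard is currently lightest
    const lightest = shards.reduce((min, s) => (s.totalSeconds < min.totalSeconds ? s : min));
    lightest.files.push(file);
    lightest.totalSeconds += seconds;
  }
  return shards;
}

// Example: 80+30 vs 50+40 gives 110s/90s instead of an arbitrary 130s/70s split
const shards = packShards(
  { 'nodes.test.ts': 80, 'vaults.test.ts': 50, 'nat.test.ts': 40, 'keys.test.ts': 30 },
  2,
);
console.log(shards);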
429 means there's a rate limit, either on GitLab's side or on chocolatey's side. Do you know which side? For now stick with just a shard count of 2. Just because the shard count is low doesn't mean tests should time out: the timeout clock only starts when the test function starts execution. There are only 2 cores on the windows runners. If they are timing out, we need to fix the bug. We may have introduced this bug with beforeAll using global agent locking. But you need to provide more diagnostic information in this PR about which tests are failing and timing out. Try with Mac first, it's closer to Unix. That uses homebrew, which I think doesn't have as much rate limiting going on as chocolatey. |
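To illustrate the point about the timeout clock (a minimal sketch, not taken from our test suite): Jest only counts time spent inside the test callback against that test's timeout, and beforeAll gets its own timeout as a separate argument, so slow shared setup does not by itself trip a test timeout.

const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));

describe('timeout accounting', () => {
  // The 3s spent here counts against beforeAll's own 10s timeout, not the tests'
  beforeAll(async () => {
    await sleep(3000);
  }, 10000);

  // Only time inside this callback counts against the 5s timeout in the third argument
  test('fast test is unaffected by the slow beforeAll', async () => {
    await sleep(100);
  }, 5000);
});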
Chocolatey caching should be set up. Refer to:
We should be able to solve the 429 and even speed this up. |
@emmacasolin if you connect a PR to issues on zenhub, they are all moved together. This PR only addresses #392, so the other issues are related issues, but are not connected issues. |
I have a job running on the CI/CD now with just two shards for the mac runner to see how it goes, but in the meantime I ran the two shards locally as well. All of the failures are timeouts or timeout-related and fall into the general categories of nodes/notifications/vaults/NAT tests. To summarise:
Full failure output: Shard 1/2
Shard 2/2
The pipeline just finished so here are the results of that as well: https://gitlab.com/MatrixAI/open-source/js-polykey/-/pipelines/577509931/test_report. More failures on mac than locally and the tests took significantly longer, but that might be including setup time. |
This is one of the tests that sometimes times out. While most of the KeyManager tests mock the key generation, this one uses the actual implementation. However, does this test really need to use the real implementation?
test(
'creates a recovery code and can recover from the same code',
async () => {
// Use the real generateDeterministicKeyPair
mockedGenerateDeterministicKeyPair.mockRestore();
const keysPath = `${dataDir}/keys`;
// Minimum key pair size is 1024
// Key pair generation can take 4 to 15 seconds
const keyManager = await KeyManager.createKeyManager({
password,
keysPath,
rootKeyPairBits: 1024,
logger,
});
const nodeId = keyManager.getNodeId();
// Acquire the recovery code
const recoveryCode = keyManager.getRecoveryCode()!;
expect(recoveryCode).toBeDefined();
await keyManager.stop();
// Oops forgot the password
// Use the recovery code to recover and set the new password
await keyManager.start({
password: 'newpassword',
recoveryCode,
});
expect(await keyManager.checkPassword('newpassword')).toBe(true);
expect(keyManager.getNodeId()).toStrictEqual(nodeId);
await keyManager.stop();
},
global.defaultTimeout * 2,
); |
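A side note on the mockRestore() call above (assuming the mock was created with jest.spyOn, which is what mockRestore implies): once restored, the real implementation stays in place for the rest of the file, so re-installing the mock around each test keeps the expensive key generation confined to this one test. A rough sketch of that pattern, with hypothetical import paths and fixtures:

import * as keysUtils from '@/keys/utils'; // hypothetical path to the keys utilities

let globalKeyPair: any;
let mockedGenerateDeterministicKeyPair: jest.SpyInstance;

beforeAll(async () => {
  // Pay for one real key pair up front and reuse it for every mocked call
  // (generateKeyPair here is a stand-in for whatever real generator exists)
  globalKeyPair = await keysUtils.generateKeyPair(1024);
});

beforeEach(() => {
  mockedGenerateDeterministicKeyPair = jest
    .spyOn(keysUtils, 'generateDeterministicKeyPair')
    .mockImplementation(async () => globalKeyPair);
});

afterEach(() => {
  // Even if a test restored the real implementation itself, the next test starts mocked again
  mockedGenerateDeterministicKeyPair.mockRestore();
});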
For native addons, additional work must be done for caching the build. I'm going to have that applied to TypeScript-Demo-Lib-Native and js-db, since js-polykey doesn't have native addons in this repository. |
For windows and macos runners, we have to solve the caching problem of their dependencies before we can do sharding. For windows see: #394 (comment). For macos, remember that the mac machines on the CI/CD are x86 and the Mac in the office is different. So we can debug how to get homebrew to cache properly on our Mac in the office, but we must make it portable to the CI/CD. You can try first by caching both the cache and the formula directory and checking if that's enough. There are a few other env variables to set to prevent further updates... but I hadn't investigated further how homebrew works. |
For shard timeout debugging, make sure to use |
The --maxWorkers option is supposed to specify the maximum number of workers that will be spawned for running tests, and while I can see that only two test files are running at a time when I set --maxWorkers=2, htop shows way more jest child processes than 2 running at a time - it's closer to 10 on average. When setting --runInBand these child processes don't appear, so they're definitely child processes from the worker pool since --runInBand doesn't create a worker pool.
But if jest is creating more than two workers even when running only two test files, this might be why we get timeouts. If just a single test file spreads across multiple cores then having too many at the same time will start to block each other. Even when using only one worker (or running just one test file) there are 11 of these in htop:
https://user-images.githubusercontent.com/86550035/176838817-73905787-cb20-4a39-a457-dc6b2adedc90.png |
Even if the number of processes is greater than 2, use perf hooks to see
if those tests are really running concurrently.
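A minimal sketch of that check (using Node's built-in perf_hooks and plain console output; wiring it in via a setupFilesAfterEnv file is just one option): record the wall-clock interval and pid for each test, then see whether intervals from supposedly-serial test files actually overlap.

import { performance } from 'perf_hooks';

let testStart: number;

beforeEach(() => {
  testStart = performance.now();
});

afterEach(() => {
  const testEnd = performance.now();
  // expect.getState().currentTestName gives the full name of the test that just ran
  const name = expect.getState().currentTestName;
  console.log(
    `${name}: ${testStart.toFixed(0)}ms -> ${testEnd.toFixed(0)}ms (pid ${process.pid})`,
  );
});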
|
I'm going to add an additional task here relating to chocolatey package security. Caching of chocolatey packages will be essential for CI/CD usage and for shards on the windows runner. A new issue will be linked to this PR. |
Added #397 to address chocolatey usage. |
I think the issues we're seeing with timeouts when we run multiple tests are just a problem with Jest consuming resources, as this and many other articles point out: https://medium.com/stuart-engineering/a-practical-guide-to-removing-frustration-with-jest-and-jenkins-cc73dd332152 Jest will use all of the resources available, and most people seem to recommend setting --maxWorkers to around 50%. |
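If we adopt that, the cap can go either on the CLI (--maxWorkers=50%) or into the Jest config so local and CI runs behave the same way. A minimal sketch; the 50% figure is just the value being trialled in this thread:

// jest.config.ts (sketch)
const config = {
  // Leave roughly half the cores free for the work the tests themselves do
  // (key generation, networking, child processes); '50%' is a trial value
  maxWorkers: '50%',
};
export default config;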
Would that help on the cicd? We only have 1 core on the gitlab Linux runners. Perhaps it will help for the windows and Mac runners. |
I read the post, why don't you try with 50% and see if it helps free up some CPU to do CPU intensive work and reduce timeouts? Do it on your local system to check first. |
Yeah setting it to 50% locally reduces the timeouts significantly, and 40% removes them entirely. I'm running the mac CI/CD tests right now with 50% and they still have timeouts, but I'll need to wait to see the test output at the end to know if the number of timeouts was reduced. As for Linux, we don't really see all that many timeouts, but we also aren't using shards there yet. |
Setting |
I think the cache is working correctly. It gets downloaded at the beginning with no issues:
And while downloading dependencies it won't redownload all of them, e.g.:
I noticed this message after node gets installed though, which made me realise that there was some updating being done, which is why the setup was taking so long:
Setting
There are actually still some dependencies that need to be installed each time, even with the above env variable set
There are quite a few options/variables you can set that affect |
This option might be useful here:
|
Even when setting
I think the first theory is most likely, and it has the most evidence pointing towards it. The question is why would the cache not be working? |
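One way to test that first theory (a sketch, not the current configuration): pin Jest's cache to a directory inside the project that the CI cache definition covers, then check whether that directory actually survives between pipeline runs.

// jest.config.ts (sketch) - 'tmp/jest-cache' is an arbitrary example location,
// not the repo's current setting
const config = {
  cacheDirectory: '<rootDir>/tmp/jest-cache',
};
export default config;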
Just a note that the files |
Interestingly, the Jest cache (on CI/CD runs) seems to be working fine on Windows - it's just Mac that isn't using the results of past runs to optimise future ones. I'll have to see if this is the case on |
There seem to be fewer timeouts on Windows compared to Mac; however, there are quite a lot of test failures. A lot of them seem to be simple things like expecting a path to be broken up with |
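For the path-separator class of failure, the usual fix is to build expected values with Node's path module instead of hard-coding '/'. A small illustrative sketch; the helper and paths are made up, not code from this repo:

import path from 'path';

// Hypothetical helper standing in for whatever produces the real path under test
const vaultStatePath = (nodePath: string, vaultId: string) =>
  path.join(nodePath, 'vaults', vaultId);

test('vault state lives under the node path', () => {
  const p = vaultStatePath('polykey', 'vault1');
  // Compare against path.join / split on path.sep so the same assertion passes on Windows
  expect(p).toBe(path.join('polykey', 'vaults', 'vault1'));
  expect(p.split(path.sep)).toEqual(['polykey', 'vaults', 'vault1']);
});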
This will need to rebase on top of #396 when that merges. I'm on the final leg to merge it. |
Needs rebase on staging. In particular take note of the new
It would be good to have those in your child pipelines too. Once the above is addressed, we merge into staging for further testing. I'm guessing a lot of tests will need to be fixed before merging into master. At the very least, Linux tests should be passing. |
- Includes test suites that failed to run
- Adds separator between names of nested describes
Introduced sharding for macOS and Windows runners and test directory pipelines for Linux runner
Stopped Mac CI jobs from updating dependencies since this was increasing setup time
Optimised Windows CI/CD setup by internalising and caching chocolatey packages. Fixes #397
Ok if all are resolved, proceed to merge. (Tick off everything). |
Description
Test load balancing allows us to run our tests on the CI/CD more efficiently. We can apply it to our macOS and Windows runners while continuing to use the existing test pipelines script for Linux tests (for easier debugging during development until we have better JUnit reporting on GitLab; see MatrixAI/TypeScript-Demo-Lib#66).
We need to migrate the changes made in TS-Demo-Lib, as well as determine the most efficient number of shards for macOS/Windows parallel test runners. This doesn't need to be calculated dynamically, just a static value from trial and error is fine for now (a proper algorithm can be implemented later in #393)
Issues Fixed
- Include stdout/stderr in Junit test reports on Gitlab (MatrixAI/TypeScript-Demo-Lib#66)
Tasks
- Move the build:linux/windows/macos jobs into a build-generate.sh script
- Stop the build:linux/windows/macos jobs from uploading dists and move this to a separate job (to prevent multiple dists from being uploaded)
- Add the classNameTemplate, titleTemplate, and addFileAttribute options (plus ancestorSeparator and reportTestSuiteErrors)
- [ ] 7. Determine the most efficient number of shards for test load balancing on macOS/Windows (static value) - Leaving as 2 for now, we can optimise when doing "Apply bin packing algorithms to the setup of our parallel CI/CD test runners" #393
Final checklist