Speed up npm ci by caching node_modules #45932
Conversation
Size Change: +223 B (0%) Total Size: 1.34 MB
Force-pushed from 52e2887 to 38797f9
Changed the title from "…setup-node composite action for aggressive caching" to "Speed up npm ci by caching node_modules"
.github/setup-node/action.yml (Outdated)
# Comparing against strings is necessary because of a known bug in composite actions.
# See https://github.com/actions/runner/issues/1483#issuecomment-1279933184
if: ${{ inputs.should-rebuild == 'true' }}
run: npm rebuild
This is normally not needed because most of the packages don't have postinstall scripts. However, some react-native related packages do need them, so we make this an opt-in.

This is not future-proof, as we might include a package that needs postinstall in the future. We can run npx can-i-ignore-scripts to manually inspect these cases.
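For context, a workflow that needs the rebuild would opt in on the caller side roughly like this (a sketch: the should-rebuild input and the string comparison come from the diff above, while the step name is illustrative):

```yaml
# Hypothetical caller side: how a react-native workflow might opt in to the rebuild.
- name: Setup Node.js and install dependencies
  uses: ./.github/setup-node
  with:
    # Passed as a string because of the composite-action bug noted in the comment above.
    should-rebuild: 'true'
```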
can we not run npm ci above instead of npm ci --ignore-scripts?

also I wonder how much of the impact on performance is related to --ignore-scripts - I might measure it and find out.

this seems more of a danger on this change: is there a way we can make it more obvious to someone when their PRs are failing multiple test suites that the reason is that they need to supply should-rebuild: true?
I'm not sure how I would expect someone to start from the failure and work back up to this setting.
Yeah, as I said above, it's not entirely "safe", but I think it's good enough for most cases. We will invalidate the cache whenever OS/Node/package-lock.json changes, so it's more or less like what we have during local development.

> also I wonder how much of the impact on performance is related to --ignore-scripts

Probably not much, because npm ci is only run when there isn't a cache hit. Most runs in our measurements have a cache hit.

However, npm rebuild does contribute to some running time. We use it here to trigger any underlying postinstall script in the dependencies that gets ignored by --ignore-scripts. Tested with can-i-ignore-scripts, most dependencies that have postinstall scripts are react-native-related (and local-development related), hence I think it makes sense to make it an opt-in when we're building for react-native.

We can try to just use npm ci without the --ignore-scripts and always run npm rebuild though. I think that's unnecessary in most cases and probably will add 20 to 30 seconds of runtime, but it's technically "safer". Caching node_modules won't behave exactly like running npm ci every time, and I think that's okay.

> I'm not sure how I would expect someone to start from the failure and work back up to this setting.

I'd expect it to be pretty visible, as it would probably fail the CI and we can look into that. I think it'll only happen when a new dependency with a postinstall script gets added, which should be pretty uncommon.
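Sketched out, the alternative described above (drop --ignore-scripts and always rebuild) would look roughly like this; it is illustrative only, reusing the cache-hit output id that appears later in this PR:

```yaml
# Alternative considered above: no --ignore-scripts, and always run `npm rebuild`,
# trading an estimated 20-30 seconds per run for behavior closer to a plain `npm ci`.
- name: Install npm dependencies
  if: ${{ steps.cache-node_modules.outputs.cache-hit != 'true' }}
  shell: bash
  run: npm ci

- name: Rebuild dependencies
  shell: bash
  run: npm rebuild
```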
> Most runs in our measurements have a cache hit.

where did you measure this? and what was the ratio for pull_request events in PRs?
> probably will add 20 to 30 seconds of runtime but it's technically "safer"

it's usually faster to do the wrong thing 😉

I'm really nervous about introducing a very subtle behavior that could break things if you don't already happen to know it exists and remember that you need to apply a workaround when it's appropriate.

> it would probably fail the CI and we can look into that

I agree here, but how is it that you would propose someone moves from "my PR failed and I don't know why" to finding this flag?
> I think that's unnecessary in most cases and probably will add 20 to 30 seconds of runtime
Not that I have investigated why yet, but if I did the test right it doesn't seem to make much difference beyond a couple seconds. There is not enough of an effect with the data collected so far to conclude that there is any measurable impact. I could have done something wrong, or the flag might not be having the impact we expect it to.
Granted, for the unit tests and end-to-end tests there's still that issue of having the multi-modal distribution which could be throwing off the statistics. I haven't investigated the impact that has on the results.
My test branch is #46318
Unit Tests workflow
Performance tests workflow
> where did you measure this? and what was the ratio for pull_request events in PRs?

I directly looked into the details of each run to see if there's a cache hit. I only looked into this PR's runs, as this is where we're measuring.

> it's usually faster to do the wrong thing 😉

It depends on what you think is "wrong". I'm just saying that, based on my research, this is the trade-off that I'm willing to take. It's obviously subjective and it's totally "right" to think otherwise.

> but how is it that you would propose someone moves from "my PR failed and I don't know why" to finding this flag?

I think we should be able to find this out in a code review. Just like how people work on different OSes locally than in CI, things are gonna be different and weird things happen in CI. We can also write some guide or log about this if necessary. I agree we should minimize breaking changes whenever possible, but I believe this is something that rarely breaks. However, if the added time of removing --ignore-scripts is so little that it doesn't matter, then I'll immediately stand on the other side of the fence 😝.
I removed --ignore-scripts and npm rebuild in 1949d5f. Seems like it's still working nicely!

I updated the postinstall step to also run the postinstall scripts in the packages folder too. This technically doesn't match the behavior of npm ci entirely though, as it only runs in the packages folder but not recursively for all dependencies.

If the postinstall scripts only update the node_modules folder (which is what most dependencies are doing), then it should be safely cached. However, in react-native-editor, the i18n-cache files are generated both in node_modules and in the packages/ folder because of linking. During the tests, the i18n-cache files are read from the packages folder for some reason, hence we need to re-run the postinstall script to populate them again. I think this can be seen as an edge case, and possibly there are other ways to solve it too (for example, reading from node_modules as a fallback).
@dmsnell Pinging you, as it seems like you have worked on similar things before and might be interested in this 😉.
@kevin940726 I can monitor this PR and start collecting data on if it has the intended impact. is two minutes something we think is worth the added complexity of the cache and the predictable cache invalidation problems?

I haven't looked at the other workflows, but there was plenty of performance improvement available in the Performance workflow that didn't involve adding new complexity. for reference, while working on the performance test suite I was attempting to collect data on at least thirty test runs in the branch to compare against.
Impact on Performance Tests Workflow
Original comment: @kevin940726 after around 30 runs of the performance tests here is some initial data; I'm starting to sequence the E2E tests and compare there.

There's not enough evidence after 30 runs to conclude if this makes a difference or not. The results we're seeing are close enough that if we ran this experiment over and over, assuming there were no change in the PR and we weren't caching the […]

The data does seem like it's a bit more normal than the test runs in trunk, evidenced on the normality graphs. That suggests there could be an element at play whereby this change cuts down on some long-tail skew, but as mentioned above, there is not sufficient evidence to draw any conclusion from this work.

The graphs show that in the approximately thirty test runs, the mean test run was almost identical (a measured difference of 5.3 seconds), marginally shorter in this branch, while the median runtime was measured higher than in […]
Impact on End-to-End Tests Workflow
Right now we're jumbling all four end-to-end tests into one. This explains the divergence from normality. I'll see if I can separate out the four test jobs since they are independent tests. I don't believe this invalidates the results overall.

Original comment: With the E2E tests, at around 10 test runs the difference is looking more likely, at a measured decrease in mean test run time of 1 minute, 7 seconds.

Something looks a little fishy about this data, which may be some oversight on my part when using the scripts I wrote for the Performance workflow with the E2E tests (which have all four E2E jobs). I'll update of course as I can.
@dmsnell Thanks for testing it out! I'm curious how you tested this though. Is there code for the testing script used here?

I think you're missing some important variables in the performance test; we have to take caching into account. If we want to compare it against trunk, we will need to make sure they have the same baseline, do runs with both cache hit and cache miss, and compare the results.

In addition, the main motivation of this PR is to increase the performance for running […]
I'm sure I am!
What I'm doing is having GitHub re-run the test workflows once they complete. I'm not running the tests locally or in any way that's different than in any typical PR experience. That is, the results I'm measuring are looking at the wall-clock runtime of the CI workflow as reported by GitHub's API which gives the result for each run.
I'm not sure I agree with this. What I'm trying to measure is the real impact on how long the CI workflows run, and if we make the claim that the caching will help then I think we have to test it against real-world test runs. There's room to argue that if we merge this then more PRs will start with cached data, but I think we'd still have the possibility that that assumption doesn't hold, because among our 1700 branches there are plenty of PRs that won't be forked from the latest […]

There are plenty of optimizations that I've investigated in improving the CI workflows that looked promising in isolation but which don't have the same effect when considered inside the system. Some things seem unintuitive, which is why I stress measuring the real results so much. One possibility here is that GitHub's own […]

My script is quite tacky, as I was developing it while exploring the problem and not focusing on building maintainable code. Mostly it amounts to polling GitHub for CI workflow runs and parsing the data; some of what seems unnecessary in the script is the result of a bug in GitHub's API response that I reported and they are working on fixing (when requesting jobs for a given workflow run attempt, they only report a link to the jobs for the most-recent run). I accidentally stopped running the script so I'll pick it up again. Happy to take advice!
First of all, I don't think the comparison of performance tests makes sense in this context. They are not even doing the same job in trunk and in PRs. In PRs, the performance test is running "Compare performance with trunk" while in trunk it's running "Compare performance with base branch". Comparing them is not an apples-to-apples comparison, and we should at least control our variables to make them as close as possible.

However, I don't think we need to take the full performance test into our equation. We split the job into steps so that we can debug them individually. The steps "Use desired version of NodeJS" and "Npm install" in trunk are responsible for setting up Node.js and installing the dependencies. We replace them with "Setup Node and install dependencies" in this PR. That's the only change in the workflow; anything else doesn't matter. We should tweak our benchmark to only include them (and their cleanup steps, if any).
Yep! This is a valid concern but this is already how it currently works for npm caching in CI. This enhancement won't benefit every PR but it should be beneficial when the cache hits. In the end, that's how caching works, isn't it 😆? Worth noting that we'll only revalidate the cache when […]
Well, if you look at the installation time reported by GH Actions, you can see that the time reduces significantly when […]

There might be situations where cached […]

Thank you for sharing the script! I'm having trouble understanding some of the contexts though. Which […]
Maybe there's a misunderstanding. I'm measuring the runtimes of the "End-to-End Tests" and also the "Performance Tests" workflows in CI. I chose these because they are two of the workflows affected by this PR, so I assume if the PR changes them, we should expect an impact there. I'm not comparing End-to-End tests against Performance tests.
I'm still challenged by this assertion, as the goal seems to be "Speed up the CI pipeline by caching node_modules." My measurements are looking at the CI pipeline runtime for currently two of the workflows, the End-to-End tests (which take the longest) and the Performance tests (which used to take the longest). We can start measuring any or all of the workflows to see how this affects everything, but the perf tests were convenient since I've been monitoring those for the past month or so, and the end-to-end tests seem relevant since any speedup in the overall process will have to speed those up, or else the improvements won't be felt by anyone (as they will still be waiting for the slowest workflow to complete). So I don't want to limit ourselves by only looking at a piece in isolation when that piece sits inside a more complicated system where the effects may be different.
This appears to be accurate for the E2E tests, but inaccurate for the performance tests.
That's absolutely right in theory, and given that we're just sitting here telling GitHub to re-run the CI tests as soon as they finish, we should expect the best-case of cache hits when testing the impact of this PR. If the theory is right then it should be clear from the measurements that this accomplishes its goal.
According to my latest numbers with 70 test runs, it looks fairly conclusive that it is faster in the end-to-end tests, and the measured difference in average runtime is 1m 37s. After around 30 test runs on the performance tests though, it's not conclusive, but it's looking to move that way in showing that this is slower in the performance tests with a current measured slowdown of adding 47s to the average CI workflow runtime. I just started re-running the Playwright workflow (last completion took 26min) and the JS Unit Tests workflow (last completion took 4min). It'll be interesting to see what impact this has on those workflows.
All that is to say, given how many false assumptions I found while optimizing the performance tests workflow, I want us to be able to have a credible reason to claim that a change we are making is worth the additional complexity vs. arguing from opinions, guesses, or doubts 😄 It seems clear that this will cut off some time from the E2E workflow, but probably add time to the Performance Tests. When more data comes in it just gives us a better picture to say "I want this tradeoff" or "I don't want this tradeoff."
I'm using #44907 as a base against which to compare. It's two weeks behind […]
@kevin940726 it seems like many of the workflows are failing due to a […]
Ahh yes definitely. I thought that you were using […]
That's fair! My argument will be that the enhancement in this PR can be viewed in isolation while many of the improvements contributed to the perf test alone cannot be easily split out. I think this PR can still improve the perf test run time, just that we don't have enough data to prove that yet. The run time of the perf test varies a lot and the previous runs might just underperform because of the random factors. We might get a different result if we have more samples. That's just my guess though because the result is counter-intuitive to me and I don't have any explanation for now.
This is fair too! But I think we should not stop after we create a benchmark, we should try to figure out what happened there. The benchmark is just a tool to help verify that something works. If it creates unexpected results, it's our opportunity to debug that, either in our code (TN) or in the benchmark itself (FN).
Absolutely! I'm also curious about it and might spend some time creating a benchmark myself too. I'll share more info.
Yeah, that's a very interesting observation! I'll look into that! Thanks! 🙇
Force-pushed from d8363fc to c42f384
Oh, seems like we just need a rebase after #46093!
@kevin940726 not sure why things changed after the rebase, but now all three of the tests I've been measuring are reporting improvements in this branch. I'd like to try comparing it against a version without […]

Out of curiosity, what led you to add that flag to […]?
Force-pushed from bb38b16 to 1949d5f
Force-pushed from 947ce6a to 136bbd0
Force-pushed from 136bbd0 to 504da63
@kevin940726 I think one of the best things that can be done for this is simply to monitor it after deploy to measure if it has the expected impact. I expect the impact of the caching to drop once this is in; how much I expect it to drop I don't know, and I don't know a good way to test this before merging. the more important question is how much we want it to drop in order to balance the costs of increased maintenance, additional debugging when the caching fails us, and the frustration induced when someone is trying to understand why something isn't right.

looks like some recent React-native tests are completing in times similar to the E2E suites. this is good news because as long as those are taking upwards of 30-40 minutes to run, the benefit this caching brings for all the other suites would be locked, but if the longest workflow is taking 20 min right now, then an estimated speedup of around a minute and a half could translate into around an 8% speedup, which is pretty good.

we can probably test by iterating over all the open PRs where all workflows have completed, and looking at the one that took the longest to complete. in fact, we may even get this information from the check run or checks API endpoints from GitHub. I'm still unclear on the terminology even after absorbing myself in it for a while. we should have a reasonable guess within a day or two of data as long as it's mid-week, I would imagine. the determining factor in experiment length is probably how frequently any PR involves changes in the node modules or versions, as that's the thing on which all value in this PR depends. also I think old data sticks around for a month or so? so we're likely limited to a maximum two-week experiment period, and that starting immediately after deploy, since after that time we start losing data from the before times.

at some point the benefit this brings won't be worth the development costs, either in a raw reduction of check suite runtime duration or a percentage thereof. I'm curious to know what your target goal is and what you think a reasonable threshold should be for that tradeoff point.
We rarely change the dependencies, at least compared to the number of workflow runs we run. I did a quick search on runs on trunk; out of 642 runs (of the workflows that will be impacted by this PR), only 10 of them have changed the […]

What's the cost of maintenance? There's no debugging needed if the cache misses for regular contributors; it will simply fall back to what we have now: reinstall everything.

The goal is simple: reduce the time spent on CI, and this is only the first step of it. I still fail to understand why you think this is not worth it. If the cache hits, which I expect to be 99% of the cases, then there's going to be roughly 90 to 120 seconds of improvement. If the cache misses, then nothing is changed; it's just going to do whatever we are doing now. All the numbers collected in this PR show positive results. Unless you have a concrete case against it, I suggest we try this on […]
The cost is in understanding and modifying the run scripts.
There are two costs here: […]
this is definitely not what I said or meant. what I said and meant is that there is a point in all optimizations which come by adding complexity where the benefits and costs balance. I suggested not that this isn't worth it, but that we measure the impact and use that knowledge to confirm it did what we wanted, and if not, that is a clue we can revert it. why do I keep bringing this up? am I just a fuddy-duddy? 🙃 so the question is a good one, because if you say "90s is a success" and we measure 120s then you know your work made a positive impact. if we measure 10s what do we make of that? or if unintentionally for some quirky reason this slows things down, how do we know?
this was literally how I started my last comment, so we are in agreement 😄
this is still marked as a draft, but it's ready for merge.
would still be good to set a target performance impact for this so that after merge we can measure and see if it did what we expected it to.
Thank you @dmsnell !
I'd expect all the workflows affected by this PR to get a 1m30s to 2m performance boost if the cache hits. Let's see if that's the case then!

Yeah, I forgot to mark this ready for review 😅. I'll wait for a few other folks to chime in once they get time though!
Note that the "Report to GitHub" workflow currently fails because it depends on the workflow on […]
This is looking great! Thanks for working on this. I've been meaning to look at this for the Core repo because a few people have mentioned that they still cache their node_modules directory in their other projects and it does significantly improve the speed of the related commands.
A bit about caching that I thought would be good to detail. Each repository is allowed 10GB of total cache storage. Once that storage is exceeded, then GHA works to evict caches until bringing the total storage under the limit. When I looked yesterday, the Gutenberg repository was at ~25 GB and GHA had not cleaned up after itself. But looking today, we're at 8.67 GB. The oldest cache was last used 12 hours ago.
I think that we do need to spend some time holistically examining cache key practices to see if we are creating multiple cache entries for the same things anywhere across different workflows. I do think that avoiding any repetition will help us keep things consistent across workflows and potentially help with this.
I made some suggestions and posted a few questions. Would love to give it a second look after these are addressed!
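As an aside, the repository's total Actions cache usage can be checked against that 10GB limit with the cache-usage API; a hypothetical spot-check step (not part of this PR) could look like:

```yaml
# Hypothetical monitoring step: report how much Actions cache storage the repo is using.
- name: Report Actions cache usage
  shell: bash
  env:
    GH_TOKEN: ${{ github.token }}
  run: |
    gh api "repos/${{ github.repository }}/actions/cache/usage" \
      --jq '"\(.active_caches_count) caches, \(.active_caches_size_in_bytes / 1073741824) GB in use"'
```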
.github/setup-node/action.yml (Outdated)
uses: actions/cache@v3
with:
  path: '**/node_modules'
  key: node_modules-${{ runner.os }}-${{ steps.node-npm-version.outputs.NODE_VERSION }}/${{ steps.node-npm-version.outputs.NPM_VERSION }}-${{ hashFiles('package-lock.json') }}
I'm thinking that we don't need to include the npm version in the cache key here.
I can't think of a scenario where the version of npm would change independent of the Node.js version.
doesn't this happen regularly in Gutenberg since our npm version is still gated at <7?
I don't think we need to include the npm version here too. Including it will potentially bust the cache more often.

> doesn't this happen regularly in Gutenberg since our npm version is still gated at <7?

That's because we're using Node 14, which comes with npm v6 by default. npm versions could change independently from Node.js, but they should just be minor releases and backward-compatible.
When a new version of Node.js is released, a specific version of npm is bundled with that version. You'll find an all-inclusive manifest here. As long as the npm install -g npm command is not run, the version originally bundled with that version of Node.js will be the one installed.

I looked through the code of actions/setup-node. While I wasn't able to confirm this 100%, it seems that new versions are pulled and built for use within Action runners using this manifest through actions/node-versions and actions/versions-package-tools.

Because we only install Node 14.x, the version of npm will always be < 7, but the specific version of npm 6.x installed will be tied to the specific version of Node installed.
Right, but according to the project's package.json file we're good to use node@latest, which is I think what most people end up doing locally.

Lines 17 to 20 in 6434e0b:

"engines": {
  "node": ">=14.0.0",
  "npm": ">=6.9.0 <7"
},

so maybe by happenstance we're only using node@14 in the test suites, but even then, are we guaranteed to run the specific version of 14? or are we pinning it entirely at a specific version?
> so maybe by happenstance we're only using node@14 in the test suites

.nvmrc pins the Node.js version to 14 on CI. FWIW, there's an ongoing PR (#48950) that hopefully can be merged soon to upgrade Node.js and npm to latest on CI.

> but even then, are we guaranteed to run the specific version of 14

Nope, if there's a minor/patch release then we'll automatically use the newer version on CI (basically ^14). But that's okay since we include the full Node version in the cache key.

Note that this PR doesn't affect anyone's local setup whatsoever; they can continue using whatever supported Node.js/npm version they like locally. And the cache only exists on CI, so they won't be affected by it either. This PR only caches node_modules on CI, which is a controlled environment.
> if there's a minor/patch release then we'll automatically use the newer version on CI (basically ^14). But that's okay since we include the full node version in the cache key.

I think we're in agreement here. my comment was specifically commenting on how we do expect changes to node to occur that don't match npm. we're intentionally not using the bundled version of npm in the project, but we build the project with changing versions of node quite often, since we pin node at >= a major version.
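For reference, the node-npm-version step that the cache key above points at isn't shown in this excerpt; presumably it captures the two versions along these lines (a sketch, not the PR's actual code):

```yaml
# Presumed shape of the step that feeds NODE_VERSION / NPM_VERSION into the cache key.
- name: Get Node.js and npm versions
  id: node-npm-version
  shell: bash
  run: |
    echo "NODE_VERSION=$(node --version)" >> "$GITHUB_OUTPUT"
    echo "NPM_VERSION=$(npm --version)" >> "$GITHUB_OUTPUT"
```

With both versions in the key, a new npm release busts the node_modules cache even when the Node.js version itself is unchanged, which is exactly the trade-off being debated in this thread.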
@desrosj Thanks for the reviews! I addressed the feedback 🙇♂️
Yeah, the caching limits trade-off is mentioned in the PR description too. Note that there are going to be two separate caches while we're still experimenting with this PR though. Caching size will also increase when we're in between updates to the […]
Force-pushed from 2cbcac3 to 7446f9f
Force-pushed from 7446f9f to 1c9ca27
Looking good! Here's a few more things that popped out on re-review.
if: ${{ steps.cache-node_modules.outputs.cache-hit == 'true' }}
run: |
  npm run postinstall
  npx lerna run postinstall
What's the difference between running postinstall for npm and for npx lerna? Do we need both?

If so, I'm wondering if we should split this into two steps. That would make it clearer which of the two failed, should either encounter a problem.
I added some comments explaining this ❤️ .
I don't think we need to split it into two steps though, as they have the same goal.
@@ -20,20 +20,17 @@ jobs:
   strategy:
     fail-fast: false
     matrix:
-      node: [14]
+      node: ['14']
Curious why this is needed. Does this cause a failure in the composite workflow because the type is defined as string?
I think either works, but everywhere else uses string. Just want to keep it consistent.
-      - name: Npm install and build
+      - name: Npm build
Hmm, this seems to be running the Lerna build script.
Suggested change:
-      - name: Npm build
+      - name: Build Lerna
"Build Lerna" isn't really descriptive either though. I think this should be "Build packages" and just run npm run build:packages
. But it's already the way it is in trunk, so maybe we should change it in another PR?
run: |
    npm ci
    npm run build
- name: Npm build
We're running npm run build, but it's not just an npm build. Maybe "Run build scripts" would be more general and accurate. If we change this, it should get changed everywhere.
Yeah, let's do this in another PR though.
Thanks @kevin940726 I'm OK with merging this to try it out!
Thanks! Let's try it and revert it if anything goes wrong 😅!
@kevin940726 I know it's only been ~4 days, but wanted to circle back to see if you'd observed any impacts of this change, either positive or negative.
Sure! I did some basic analysis on the "Unit Tests" workflow before and after the commit on trunk. I took 30 runs each and here are the results:

Before: […]

After: […]

Will be interesting to do more analysis on other workflows as well, but I don't have the time rn 😅.

Yeah, turns out we do want the […]
* Try custom setup-node composite action for aggressive caching
* Add comment about GHA bug
* Try without rebuild and ignore-scripts
* Include npm version to the cache key
* Try reverting the change of graceful-fs
* Update step name
* Code review
* Add some comments
What?

Speed up the CI pipeline by caching node_modules.

Why?

Faster CI === Happier developers? ❤️

Slightly related to #33532, #44679, and other similar performance-related issues.

The official setup-node action by default only caches the global cache (~/.npm in *nix). However, it's still taking roughly 2 minutes to install all the packages in every workflow. By also caching node_modules, we can dramatically cut down the installation time by 75%.

Note that in the official document, it's not recommended to cache node_modules. However, that's mostly because of npm ci, since npm ci will delete node_modules before installing the modules. We can simply avoid running npm ci if there's a cache hit.

TBH, I'm not really sure why it's not more of an established pattern. Probably someone knows better?
How?

I'm abstracting the steps into a composite action called setup-node. It roughly does the following:

1. Cache the global npm cache (~/.npm) to speed up download time.
2. Cache node_modules to speed up installation time.
3. Run npm ci --ignore-scripts to download the modules.
4. Rebuild (npm rebuild) if necessary.
5. Run npm run postinstall.

Then, we just have to replace every actions/setup-node + npm ci duo with this composite action setup-node in each workflow. Note that this action doesn't help if the workflow doesn't need to run npm ci though. What's good is that such workflows can continue using the original actions/setup-node since they share the same global npm cache.
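For illustration, the composite action could be laid out roughly as below. This is a sketch reconstructed from the snippets quoted in the review (and it reflects the later revision that dropped --ignore-scripts and npm rebuild); step names, the node-version-file input, and the exact layout are assumptions rather than the real contents of .github/setup-node/action.yml:

```yaml
# Sketch of .github/setup-node/action.yml (not the exact file).
name: 'Setup Node.js and install npm dependencies'
description: 'Sets up Node.js, restores caches, and installs npm dependencies.'
runs:
  using: 'composite'
  steps:
    - name: Use desired version of Node.js
      uses: actions/setup-node@v3
      with:
        node-version-file: '.nvmrc' # assumed; .nvmrc pins Node on CI
        cache: npm                  # caches the global ~/.npm download cache

    - name: Get Node.js and npm versions for the cache key
      id: node-npm-version
      shell: bash
      run: |
        echo "NODE_VERSION=$(node --version)" >> "$GITHUB_OUTPUT"
        echo "NPM_VERSION=$(npm --version)" >> "$GITHUB_OUTPUT"

    - name: Cache node_modules
      id: cache-node_modules
      uses: actions/cache@v3
      with:
        path: '**/node_modules'
        key: node_modules-${{ runner.os }}-${{ steps.node-npm-version.outputs.NODE_VERSION }}/${{ steps.node-npm-version.outputs.NPM_VERSION }}-${{ hashFiles('package-lock.json') }}

    # Only install when the node_modules cache missed; npm ci would wipe the
    # restored node_modules otherwise.
    - name: Install npm dependencies
      if: ${{ steps.cache-node_modules.outputs.cache-hit != 'true' }}
      shell: bash
      run: npm ci

    # On a cache hit npm ci is skipped, so re-run the postinstall scripts that
    # would otherwise run as part of installation.
    - name: Run postinstall scripts
      if: ${{ steps.cache-node_modules.outputs.cache-hit == 'true' }}
      shell: bash
      run: |
        npm run postinstall
        npx lerna run postinstall
```

Workflows then swap their actions/setup-node + npm ci pair for a single uses: ./.github/setup-node step.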
Testing Instructions

The result can be verified by comparing the installation time in the workflows.

An example run in this PR took 1m 57s to run the composite action when the cache missed. Re-running the job again, it then only took 20s to run the same composite action when the cache hit.

Compare that to trunk. The equivalent actions duo (actions/setup-node + npm ci) took 2m 5s to run when the cache hit. We can draw this table below with limited source data:

- trunk: 138s, 125s
- node_modules (this PR): 117s, 120s

Even though sometimes the duration varies a lot between runs, the performance improvement is still undeniably outstanding.
Trade-offs

The improvement is obviously not free; there's always a cost. In this case, the main cost is the cache size. The original global npm cache has a size of (roughly) 270MB per key. On top of that, this PR adds an additional 270MB of node_modules cache per key. Because node_modules won't work across different OSes or Node versions, we will have to also cache them for Windows, macOS, etc. They are adding up and will soon exceed the 10GB threshold of the maximum cache size in a repository.

At the time of writing, we currently have 77 caches (counting these node_modules caches) and the least used cache is only 3 days old. If we're not careful, we could cause cache thrashing and defeat the whole purpose of caching.

One possible solution to this is to avoid caching node_modules, or to fall back to a common cache in some of the less-used situations. For instance, we could try grouping all OSes in the same key, but require Windows and macOS to always re-run npm rebuild after restoring the cache. The same could also apply to different Node versions. This technique is potentially unsafe though, and might cause confusion to contributors if the cache fails.

With that being said, I think the additional total of 1GB of cache is still worth the time saved. We can monitor the impact of these caches for a while to be sure.
Screenshots or screencast
Footnotes
There's a post action below that took 1m 29s. It's likely a race condition issue when multiple jobs were trying to save the same cache. ↩