split on preserved comments before running minification, fixes #222 #320
Okay, I'll admit up front that I'm not sure exactly what the underlying problem in #222 was, but this first commit seems to fix it for me on an affected server, and I have a reasonable intuition of why that makes sense.
I've included @timhunt's regex fix for #333 in this, since that regex is central to this change and is more reliably correct with his update. So this PR also has the improved output of that one (it no longer accidentally skips some minification opportunities).
I think this change should help in the rare, extreme cases where people are minifying giant files, and it shouldn't have any negative effect on more normal usage.
In the test file for #222, and I'd guess in many other cases where you end up with JavaScript files over 1MB (the size at which this seems to start mattering), it will be because you have pre-concatenated separate files, and many of those files will contain comments their authors want preserved.
Previously, preserved comments were extracted from the file, replaced with a counter, and injected back into the same places later in the process. To avoid the memory paging/swapping behaviour that I believe contributes to #222, I wanted to split the large file up and process each piece individually. Since the minifications shouldn't cross the boundary of an extracted comment, those comments seemed like a good place to split. And since they're just a special kind of CSS comment, they can also be added manually to break up large files, as the sketch below shows. (I did look at more brutally chopping the input into fixed-size byte chunks to see if there was a sweet spot, but I was less confident that wouldn't cause issues; as long as the chunks stay below about 850K you seem to get the benefit.)
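For concreteness, assuming the usual `/*!` marker for preserved comments (the exact syntax the library recognises may differ), a manual split point between concatenated sources would look like this:

```php
// Inserting a preserved comment between two concatenated sources gives
// the splitter a guaranteed boundary to break the big input on.
$big = file_get_contents('a.css')
     . "\n/*! manual chunk boundary */\n"
     . file_get_contents('b.css');
```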
Side note: the library itself already supports passing individual files in one at a time and minifying them in one go, which is as fast as this approach and a good workaround if you control the input going into the library. In some cases it might be easier, or make more sense, to pass in the individual files one at a time in the first place (sketched below) rather than concatenate them only for the minification process to split them up again and recombine them. It might be worth noting that in the documentation somewhere so people can avoid hitting this issue.
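To illustrate that workaround (`minify()` here is a hypothetical stand-in for the library's real entry point):

```php
// Minifying each source separately keeps every working string small:
$out = '';
foreach (['a.js', 'b.js', 'c.js'] as $file) {
    $out .= minify(file_get_contents($file)); // hypothetical entry point
}

// ...whereas concatenating first hands the minifier one huge string:
$sources = array_map('file_get_contents', ['a.js', 'b.js', 'c.js']);
$out = minify(implode('', $sources));
```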
As far as I can tell, when you pass the large file in as a single piece, the repeated substr calls on these long strings (and possibly the regexes too, though I suspect the C implementation of those is optimised enough that it doesn't matter) create a lot of memory churn, and this slows things down as files get bigger. There's no memory leak as such; it just burns through a lot of quickly disposable strings. This causes slowness on most servers as the file size increases, but can be a showstopper on weak or overloaded servers (and in the latter case adds a degree of unpredictability too).
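A rough illustration of the churn (this is not the library's actual loop, just the shape of the problem): each `substr()` call copies its result into a fresh allocation, so repeatedly chopping the front off a multi-megabyte string copies the still-huge remainder over and over.

```php
// Not the library's code; just the allocation pattern that hurts.
// Each iteration copies the entire remainder into a new string, so the
// total bytes copied grow roughly quadratically with input size.
$input = str_repeat('a{color:red}', 100000); // ~1.2MB of CSS-ish text
while ($input !== '') {
    $piece = substr($input, 0, 12); // take a small token off the front
    $input = substr($input, 12);    // copy everything that's left
    // ...process $piece...
}
```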
This patch tries to mimic that behaviour when the input is out of your control. Rather than extracting the preserved comments at the same time as the other search-and-replaces, it does that first with a preg_split, then treats the file as the resulting alternating chunks of code / comment / code / comment, minifying the code chunks before stitching everything back together.
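A minimal sketch of that flow, assuming the common `/*! ... */` marker for preserved comments; the actual pattern in the patch is more involved, and `minifyChunk()` is a hypothetical stand-in for the per-chunk minification:

```php
// Split on preserved comments, keeping the comments themselves as
// delimiters so nothing is lost from the output.
$chunks = preg_split(
    '~(/\*![\s\S]*?\*/)~', // illustrative preserved-comment pattern
    $input,
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);

// With DELIM_CAPTURE the result alternates code, comment, code, comment...
foreach ($chunks as $i => $chunk) {
    if ($i % 2 === 0) {
        $chunks[$i] = minifyChunk($chunk); // hypothetical per-chunk minifier
    }
    // Odd indexes are the preserved comments, passed through untouched.
}

$output = implode('', $chunks);
```

Keeping the chunks in an array and doing a single implode at the end is also what the array-of-strings commit mentioned below builds on.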
I'm hoping that keeping the output of the minification process as arrays of unmodified strings as much as possible will help further with memory usage and speed, though I've kept that in a separate commit, and it's hard to see or measure much benefit from it separately from the other change.
While running this with some diagnostic output (from Tim Hunt's patch for Moodle), I noticed that two very similar regexes were being run one after the other, the same number of times each. I believe the first is redundant if you're also running the second; removing it is the third commit.