split on preserved comments before running minification, fixes #222 #320
Okay, I'll admit up front that I'm not sure exactly what the underlying problem in #222 was, but this first commit seems to fix it for me on an affected server, and I have a reasonable intuition of why that makes sense.
I've included @timhunt's regex fix for #333 in this, since that regex is central to this change and is more reliably correct with his update. So this PR also has the improved output of that one (it no longer accidentally skips some minification opportunities).
I think this change should help in the rare, extreme cases where people are minifying giant files, and it shouldn't have any negative effect on more normal usage.
In the test file for #222, and I'd guess in many other cases where you end up with JavaScript files over 1MB (the size at which this seems to start mattering), it will be because you have pre-concatenated separate files, and many of those files will contain comments their authors want preserved.
Previously, preserved comments were extracted from the file, replaced with a counter, and injected back into the same places later in the process. To avoid the memory paging/swapping behaviour that I believe contributes to #222, I wanted to split the large file up and process each piece individually. Since the minifications shouldn't cross the boundary of an extracted comment, those comments seemed like a good place to split. And since they're just a special kind of CSS comment, they can also be added manually to break up large files, as the sketch below shows. (I did look at more brutally chopping the input into fixed-size byte chunks to see if there was a sweet spot, but I was less confident that wouldn't cause issues; as long as the chunks stay below about 850K you seem to get the benefit.)
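For concreteness, assuming the usual `/*!` marker for preserved comments (the exact syntax the library recognises may differ), a manual split point between concatenated sources would look like this:

```php
// Inserting a preserved comment between two concatenated sources gives
// the splitter a guaranteed boundary to break the big input on.
$big = file_get_contents('a.css')
     . "\n/*! manual chunk boundary */\n"
     . file_get_contents('b.css');
```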
Side note: the library itself already supports passing individual files in one at a time and minifying them in one go, which is as fast as this approach and a good workaround if you control the input going into the library. In some cases it might be easier, or make more sense, to pass in the individual files one at a time in the first place (sketched below) rather than concatenate them only for the minification process to split them up again and recombine them. It might be worth noting that in the documentation somewhere so people can avoid hitting this issue.
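To illustrate that workaround (`minify()` here is a hypothetical stand-in for the library's real entry point):

```php
// Minifying each source separately keeps every working string small:
$out = '';
foreach (['a.js', 'b.js', 'c.js'] as $file) {
    $out .= minify(file_get_contents($file)); // hypothetical entry point
}

// ...whereas concatenating first hands the minifier one huge string:
$sources = array_map('file_get_contents', ['a.js', 'b.js', 'c.js']);
$out = minify(implode('', $sources));
```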
As far as I can tell, when you pass the large file in as a single piece, the repeated substr calls on these long strings (and possibly the regexes too, though I suspect the C implementation of those is optimised enough that it doesn't matter) create a lot of memory churn, and this slows things down as files get bigger. There's no memory leak as such; it just burns through a lot of quickly disposable strings. This causes slowness on most servers as the file size increases, but can be a showstopper on weak or overloaded servers (and in the latter case adds a degree of unpredictability too).
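A rough illustration of the churn (this is not the library's actual loop, just the shape of the problem): each `substr()` call copies its result into a fresh allocation, so repeatedly chopping the front off a multi-megabyte string copies the still-huge remainder over and over.

```php
// Not the library's code; just the allocation pattern that hurts.
// Each iteration copies the entire remainder into a new string, so the
// total bytes copied grow roughly quadratically with input size.
$input = str_repeat('a{color:red}', 100000); // ~1.2MB of CSS-ish text
while ($input !== '') {
    $piece = substr($input, 0, 12); // take a small token off the front
    $input = substr($input, 12);    // copy everything that's left
    // ...process $piece...
}
```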
This patch tries to mimic that behaviour when the input is out of your control. Rather than extracting the preserved comments at the same time as the other search-and-replaces, it does that first with a preg_split, then treats the file as the resulting alternating chunks of code / comment / code / comment, minifying the code chunks before stitching everything back together.
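A minimal sketch of that flow, assuming the common `/*! ... */` marker for preserved comments; the actual pattern in the patch is more involved, and `minifyChunk()` is a hypothetical stand-in for the per-chunk minification:

```php
// Split on preserved comments, keeping the comments themselves as
// delimiters so nothing is lost from the output.
$chunks = preg_split(
    '~(/\*![\s\S]*?\*/)~', // illustrative preserved-comment pattern
    $input,
    -1,
    PREG_SPLIT_DELIM_CAPTURE
);

// With DELIM_CAPTURE the result alternates code, comment, code, comment...
foreach ($chunks as $i => $chunk) {
    if ($i % 2 === 0) {
        $chunks[$i] = minifyChunk($chunk); // hypothetical per-chunk minifier
    }
    // Odd indexes are the preserved comments, passed through untouched.
}

$output = implode('', $chunks);
```

Keeping the chunks in an array and doing a single implode at the end is also what the array-of-strings commit mentioned below builds on.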
I'm hoping that keeping the output of the minification process as arrays of unmodified strings as much as possible will help further with memory usage and speed, though I've kept that in a separate commit, and it's hard to see or measure much benefit from it separately from the other change.
While running this with some diagnostic output (from Tim Hunt's patch for Moodle), I noticed that two very similar regexes were being run one after the other, the same number of times each. I believe the first is redundant if you're also running the second; removing it is the third commit.