Refactoring on match_data may increase regression on peak memory usage #194
Comments
This is a nice example of how solving one problem introduces another. The change was made to reduce stack usage, for the benefit of environments with limited stack sizes (e.g. applications with many threads). If you are seeing 4MB heap usage, this means that some matches would, in the past, have had to use the heap (the stack allocation being too small), but they would have freed the heap afterwards. In the new arrangement, the heap is retained in order to reduce the number of malloc/free calls. The HEAD code, in the last few days, has been updated to add a new function called pcre2_get_match_data_heapframes_size, which returns the size of the heap that is being retained. This should allow you to do selective rather than aggressive pcre2_match_data_free, and is intended to help with your situation. Adding a flag to use the stack sometimes would be extremely messy, and I suspect it would be very little different from making use of the new pcre2_get_match_data_heapframes_size.
Actually it isn't always the case, I think. It depends on how we use pcre2. For example, here is how
So here, if the patterns are small, the previous version of pcre2 won't allocate any heap memory. Currently the heap will always be used, regardless of each pattern's size. The sum of such heap allocations may be bigger than expected.
Of course.
Then one question: should we always use |
You mean have a function that calls realloc()? Something like pcre2_match_data_realloc_heapframes()? I suppose that would save freeing and allocating the match data block itself. It could realloc() down to the minimum size, which is 28K. Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).
Or we could add a "realloc" to it, which will come in handy here and in a few other places. It will also help systems that have less memory (or address space) to keep going even when they need bigger memory blocks, since at present the old block has to be kept around until its contents have been copied to the new one, and only then can it be freed.
One important thing to consider is that the "heap" doesn't always need to be allocated from the system heap: PCRE2 provides a custom memory manager, so you could use that to allocate memory from a pool on the stack instead, up to a limit. For an example of how that works, take a look at the Apache httpd code.
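As a rough illustration of the pool idea (a sketch only, not taken from Apache httpd or from PCRE2): the two callbacks below have the same shape as PCRE2 custom memory functions, `(size, user data)` and `(block, user data)`. The `stack_pool` type and the names `pool_malloc`, `pool_free` and `in_pool` are hypothetical. Requests that fit are carved out of a caller-supplied buffer, which can be a local array; anything over the limit falls back to malloc().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical bump allocator over a caller-supplied buffer. */
typedef struct {
    unsigned char *base;  /* start of the buffer, e.g. an aligned local array */
    size_t capacity;      /* buffer size in bytes */
    size_t used;          /* bytes handed out so far */
    size_t live;          /* pool blocks not yet freed */
} stack_pool;

/* True if the block was carved out of the pool buffer. */
int in_pool(const stack_pool *p, const void *block) {
    uintptr_t b = (uintptr_t)block, lo = (uintptr_t)p->base;
    return b >= lo && b < lo + p->capacity;
}

/* Custom malloc: serve from the pool while it has room, else use the heap. */
void *pool_malloc(size_t size, void *data) {
    stack_pool *p = data;
    const size_t a = _Alignof(max_align_t);
    assert(p->used <= p->capacity);
    if (size == 0) size = 1;              /* every block gets a distinct address */
    if (size <= p->capacity - p->used) {
        size_t need = (size + a - 1) / a * a;   /* round up to keep alignment */
        if (need <= p->capacity - p->used) {
            void *block = p->base + p->used;
            p->used += need;
            p->live++;
            return block;
        }
    }
    return malloc(size);                  /* over the limit: heap fallback */
}

/* Custom free: heap blocks go to free(); the pool is reclaimed as a whole. */
void pool_free(void *block, void *data) {
    stack_pool *p = data;
    if (block == NULL) return;
    if (!in_pool(p, block)) { free(block); return; }
    if (--p->live == 0) p->used = 0;      /* last pool block gone: reset */
}
```

The pool is only reclaimed once every block taken from it has been freed, which keeps the sketch short but suits short-lived match data better than long-lived objects. The buffer passed in as `base` is assumed to be aligned for `max_align_t`.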
Adding a custom realloc() facility would need a new function such as pcre2_general_context_create_with_realloc, but I suppose it would not be too much upheaval. It would increase the size of a number of data blocks by a pointer (probably 8 bytes in most cases). I suppose the obvious specification is "If realloc() is not supplied, use free+malloc". Note, however, that in the case we are discussing, that is, reducing the heapframes vector that is remembered in match data, there is no need to remember the contents of the memory. So perhaps just using free+malloc would be good enough. On the other hand, having realloc() available when increasing the size of the heapframes vector could be useful.
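To make the "if realloc() is not supplied, use free+malloc" rule concrete, here is a minimal sketch. Everything in it is hypothetical (the `memctl` struct, `resize_block`, and the optional `realloc_fn` member); none of it is an existing PCRE2 interface. It distinguishes the two cases mentioned above: when the contents are disposable the old block is freed first, and when they must survive the old block stays alive until the copy is done.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical memory-control block; realloc_fn may be NULL. */
typedef struct {
    void *(*malloc_fn)(size_t, void *);
    void  (*free_fn)(void *, void *);
    void *(*realloc_fn)(void *, size_t, void *);  /* optional */
    void *data;                                   /* passed to every callback */
} memctl;

/* Plain wrappers, used when no custom memory manager is installed. */
void *default_malloc(size_t size, void *data) { (void)data; return malloc(size); }
void default_free(void *block, void *data) { (void)data; free(block); }

/* Resize a block. If keep_contents is 0 and the fallback allocation fails,
   the old block is already gone and NULL is returned. */
void *resize_block(const memctl *m, void *old, size_t old_size,
                   size_t new_size, int keep_contents) {
    assert(m->malloc_fn != NULL && m->free_fn != NULL);
    if (m->realloc_fn != NULL)
        return m->realloc_fn(old, new_size, m->data);
    if (!keep_contents) {
        /* Contents are disposable: free first so both blocks never coexist. */
        if (old != NULL) m->free_fn(old, m->data);
        return m->malloc_fn(new_size, m->data);
    }
    /* Contents must survive: the old block stays alive until the copy is done. */
    void *fresh = m->malloc_fn(new_size, m->data);
    if (fresh == NULL) return NULL;               /* old block left untouched */
    if (old != NULL) {
        memcpy(fresh, old, old_size < new_size ? old_size : new_size);
        m->free_fn(old, m->data);
    }
    return fresh;
}
```

Shrinking a retained frames vector would take the `keep_contents == 0` path, so peak usage never includes both the old and the new block; growing a vector that is in use would take the copying path unless a realloc callback is supplied.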
Yes, sorry to hijack the conversation; as you pointed out, it is not very relevant to the issue presented here, which IMHO could be better served by a different solution (like the one I proposed, which Apache uses) or by the selective freeing that the new API allows. I am not sure where the 10 x framesize or 20K minimum came from, but it seems that some users normally require much less, and for them VMM sizes will always be larger than with the pre-10.41 library. IMHO, one way to soften the blow would be to make the minimum smaller through configuration instead of shrinking it after the fact.
It would be easy, I think, to provide pcre2_match_data_reset_heapframes() which just frees them, and it could have an argument giving a size -- only free if larger. Also, we could provide a "threshold" setting (like heap limit) which would trigger automatic freeing after a match if the frames got too big. As for the 10 x framesize and 20K minimum, the latter was, I think, the size of the previous stack vector, and 10x came off the top of my head as probably enough for many straightforward matches (perhaps I looked at the tests).
The problem with doing that (which BTW I proposed and discarded in #190) was that there is no practical difference between doing a reset and recreating the match_data anyway. Also, it wouldn't help the case presented here, since the problem is not that the frames have grown too much, but that they are now kept allocated, whereas before the use of that memory was hidden and discarded when the stack was destroyed, except for the matches that were bigger than what could fit in the stack.
There is a (possibly small) difference: if you just free the heap frames, then only one malloc is needed to re-create them. If you free the whole match data that's two frees and two mallocs. However, I think I prefer my "threshold" suggestion above -- which amounts to "only keep the heapframes vector if it isn't too big". Setting the threshold to zero would be the same as a reset.
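A toy model of the "threshold" idea may help; it is a sketch under stated assumptions and not PCRE2 code. `match_data_model` and `model_match` are hypothetical names, and the `mallocs` counter exists only to show how often the frames vector has to be re-created.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the part of match data that remembers frames. */
typedef struct {
    void *heapframes;        /* retained frames vector, or NULL */
    size_t heapframes_size;  /* its size in bytes */
    size_t threshold;        /* keep the vector only if size <= threshold */
    unsigned mallocs;        /* allocation counter, for illustration only */
} match_data_model;

/* Simulate one match needing `needed` bytes of frame storage.
   Returns 0 on success, -1 if the allocation fails. */
int model_match(match_data_model *md, size_t needed) {
    if (md->heapframes_size < needed) {
        /* Old contents are not needed, so free+malloc is enough. */
        free(md->heapframes);
        md->heapframes = malloc(needed);
        md->heapframes_size = md->heapframes ? needed : 0;
        md->mallocs++;
        if (md->heapframes == NULL) return -1;
    }
    /* ... the match itself would use md->heapframes here ... */
    if (md->heapframes_size > md->threshold) {
        /* Too big to keep: release it automatically after the match. */
        free(md->heapframes);
        md->heapframes = NULL;
        md->heapframes_size = 0;
    }
    return 0;
}
```

With a generous threshold the second match reuses the vector, so only one malloc occurs; with a threshold of zero the vector is dropped after every match, which behaves like an explicit reset and costs one malloc per match.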
But that would just force another malloc for the next use of the match_data, making performance worse (compared with pcre < 10.41). The interesting thing about the new heapframes is that they actually improve performance and reduce memory usage if used correctly, which in this case means:
After these changes, the overall maximum memory usage will be bounded only by the concurrency rather than by the number of expressions evaluated, and the process will be faster and leaner.
My "threshold" suggestion is just an automatic way of implementing your last point - though no need to recreate because that will happen automatically at the next match.
FWIW went ahead and implemented a proposal to fix libselinux based on the first two points and the results look encouraging even if the code is not finished and my setup is not ideal. Running
There might be a bug somewhere (maybe in heaptrack, because ASAN doesn't see it) as it seems we leaked a whole heapframe, but otherwise the improvements are significant (using new pcre2):
Well, if there's nothing for me to do, I am very happy to do it.
I tried the proposal and the result looks promising!
Has this issue gone away?
@iskim517 and @carenas, what is the status of the SELinux change? Has it been accepted upstream, and do you have any actions which PCRE2 could take to improve the situation? I will close this issue soon if there are no further requests for changes to PCRE2. Thank you very much Inseob Kim for reporting this and discussing the problem in detail; and thank you Carlo for submitting a fix to libselinux!
It is shipped with Android's fork AFAIK, but not yet upstreamed (probably my fault for not being pushy/responsive enough). I don't think a change in PCRE2 was expected either way, but there are related issues, which might refer to this issue, that might be worth fixing in the serialization part long term.
Closing as discussed. No further changes needed in PCRE2.
Hi,
We observed that this refactoring commit (Refactor match_data() to always use the heap instead of having an initial frames vector on the stack) causes an increase of about 4 MB in peak memory usage on Android devices, mainly due to libselinux. So we tried aggressively calling pcre2_match_data_free(), but that of course massively impacted runtime on some allocators like jemalloc.
Can pcre2 add a flag, so we can choose to use stack as before?