Refactoring on match_data may increase regression on peak memory usage #194

iskim517 · 2023-01-20T02:33:34Z

Hi,

We observed that this refactoring commit (Refactor match_data() to always use the heap instead of having an initial frames vector on the stack) causes about 4 MB increase of peak memory usage on Android device, mainly due to libselinux. So we tried aggressively doing pcre2_match_data_free(), but that of course massively impacted runtime on some allocators like jemalloc.

Can pcre2 add a flag, so we can choose to use stack as before?

PhilipHazel · 2023-01-20T12:41:00Z

This is a nice example of how solving one problem introduces another. The change was made to reduce stack usage, for the benefit of environments with limited stack sizes (e.g. applications with many threads). If you are seeing 4MB heap usage, this means that some matches would, in the past, have had to use the heap (the stack allocation being too small), but they would have freed the heap afterwards. In the new arrangement, the heap is retained in order to reduce the number of malloc/free calls.

The HEAD code, in the last few days, has been updated to add a new function called pcre2_get_match_data_heapframes_size which returns the size of the heap that is being retained. This should allow you to do selective rather than aggressive pcre2_match_data_free, and is intended to help with your situation.

Adding a flag to use the stack sometimes would be extremely messy, and I suspect it would be very little different to making use of the new pcre2_get_match_data_heapframes_size.

iskim517 · 2023-01-20T14:06:06Z

this means that some matches would, in the past, have had to use the heap (the stack allocation being too small)

Actually it isn't always the case, I think. It depends on how we use pcre2. For example, here is how libselinux works.

libselinux has a list of pcre2_match_data.
Whenever libselinux searches for a matching pattern, libselinux loops through the list and calls pcre2_match.
Later, pcre2_match_data is freed at once.

So here, if the patterns are small, the previous version of pcre2 won't allocate any heaps. Currently the heap will always be used regardless of each pattern's size. The sum of such heaps may be bigger than expected.

Adding a flag to use the stack sometimes would be extremely messy

Of course.

I suspect it would be very little different to making use of the new pcre2_get_match_data_heapframes_size.

Then one question: should we always use pcre2_match_data_free which frees all data? Can't we just shrink the heap frames size? I think it can at least halve the heap alloc/free count.

PhilipHazel · 2023-01-20T14:51:20Z

You mean have a function that calls realloc()? Something like pcre2_match_data_realloc_heapframes()? I suppose that would save freeing and allocating the match data block itself. It could realloc() down to the minimum size, which is 28K. Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).

carenas · 2023-01-20T15:34:31Z

Unfortunately, the custom memory allocation API for PCRE2 allows only for malloc/free, not realloc, so it would have to be implemented as free() followed by malloc() (at least when custom memory management is in use).

Or we could add a "realloc" to it, which will come handy here and in a few other places. It will also help systems that have less memory (or address space) to keep going even when they need bigger memory blocks, as now we have to keep the old block around until memory is copied to the new one to free it.

carenas · 2023-01-20T15:38:30Z

So here, if the patterns are small, the previous version of pcre2 won't allocate any heaps. Currently the heap will always be used regardless of each pattern's size. The sum of such heaps may be bigger than expected.

One important thing to consider is that "heap" doesn't need to be always allocated, PCRE2 provides a custom memory manager, so you could use that to allocate memory from a pool in the stack instead up to a limit.

For an example of how that works, take a look at the Apache httpd code.

PhilipHazel · 2023-01-20T16:21:50Z

Adding a custom realloc() facility would need a new function such as pcre2_general_context_create_with_realloc, but I suppose it would not be too much upheaval. It would increase the size of a number of data blocks by a pointer (probably 8 bytes in most cases). I suppose the obvious specification is "If realloc() is not supplied, use free+malloc". Note, however, that in the case we are discussing, that is, reducing the heapframes vector that is remembered in match data, there is no need to remember the contents of the memory. So perhaps just using free+malloc would be good enough. On the other hand, having realloc() available when increasing the size of the heapframes vector could be useful.

carenas · 2023-01-20T19:29:05Z

Yes, sorry to hijack the conversation, as you pointed out is not very relevant to the issue presented here which IMHO could be better served with a different solution (like the one I proposed and used by Apache) or the selective freeing that is allowed by the new API.

Not sure where the 10 x framesize or 20K minimum came from, but it would seem that there are users that require normally much less, and for them there will always be increased VMM sizes compared with the pre 10.41 library. One way to soften the blow would be to make the minimum smaller through a configuration instead of shrinking it after the fact IMHO.

PhilipHazel · 2023-01-21T08:42:48Z

It would be easy, I think, to provide pcre2_match_data_reset_heapframes() which just frees them, and it could have an argument giving a size -- only free if larger. Also, we could provide a "threshold" setting (like heap limit) which would trigger automatic freeing after a match if the frames got too big. As for the 10 x framesize and 20K minimum, the latter was, I think the size of the previous stack vector, and 10x came off the top of my head as probably enough for many straightforward matches (perhaps I looked at the tests).

carenas · 2023-01-21T09:11:42Z

The problem with doing that (which BTW I proposed and discarded in #190) was that there is no practical difference between doing a reset or recreating the match_data anyway.

Also, it wouldn't help the case presented here, since the problem is not that the frames had grown too much, but that now they are being kept allocated while before the use of that memory was hidden and discarded when the stack was destroyed, except for the matches that were bigger than what could fit in the stack.

PhilipHazel · 2023-01-21T09:48:39Z

There is a (possibly small) difference: if you just free the heap frames, then only one malloc is needed to re-create them. If you free the whole match data that's two frees and two mallocs. However, I think I prefer my "threshold" suggestion above -- which amounts to "only keep the heapframes vector if it isn't too big". Setting the threshold to zero would be the same as a reset.

carenas · 2023-01-21T10:08:39Z

Setting the threshold to zero would be the same as a reset.

but that would just force another malloc for the next use of the match_data, making the performance worst (as compared with pcre < 10.41).

the interesting thing about the new heapframes is that it actually improves performance and reduces memory usage if used correctly, which in this case means:

allocate 1 match_data per thread and reuse it for all matches (libselinux doesn't care about the captures but is creating them from the expressions by mistake).
remove the lock since there is no need to protect pcre2_code against concurrent use.
maybe use the new API to detect a match_data that has grown too much and recreate it.

after these changes, the overall maximum memory usage will be only bounded to the concurrency instead to the number of expressions evaluated and the process will be faster and more lean.

PhilipHazel · 2023-01-21T13:43:55Z

My "threshold" suggestion is just an automatic way of implementing your last point - though no need to recreate because that will happen automatically at the next match.

carenas · 2023-01-21T21:30:53Z

FWIW went ahead and implemented a proposal to fix libselinux based on the first two points and the results look encouraging even if the code is not finished and my setup is not ideal.

Running heaptrack restorecon -vRn /usr with the different versions of the code in Debian 12 shows:

HEAD (includes workaround but not enabled) + system pcre2:

total runtime: 43.16s.
calls to allocation functions: 469370 (10876/s)
temporary memory allocations: 215204 (4986/s)
peak heap memory consumption: 10.11M
peak RSS (including heaptrack overhead): 22.62M
total memory leaked: 1.81K

proposal + system pcre2:

total runtime: 32.14s.
calls to allocation functions: 463711 (14428/s)
temporary memory allocations: 215203 (6696/s)
peak heap memory consumption: 9.27M
peak RSS (including heaptrack overhead): 21.77M
total memory leaked: 1.90K

proposal + new pcre2:

total runtime: 29.02s.
calls to allocation functions: 464810 (16015/s)
temporary memory allocations: 215191 (7414/s)
peak heap memory consumption: 2.47M
peak RSS (including heaptrack overhead): 7.36M
total memory leaked: 22.40K

There might be a bug somewhere (maybe in heaptrack, because ASAN doesn't see it) as it seems we leaked a whole heapframe, but otherwise the improvements are significant (using new pcre2):

HEAD

total runtime: 39.96s.
calls to allocation functions: 536789 (13433/s)
temporary memory allocations: 214446 (5366/s)
peak heap memory consumption: 59.57M
peak RSS (including heaptrack overhead): 20.59M
total memory leaked: 1.91K

HEAD (workaround enabled)

total runtime: 226.04s.
calls to allocation functions: 509162297 (2252521/s)
temporary memory allocations: 254421578 (1125555/s)
peak heap memory consumption: 2.63M
peak RSS (including heaptrack overhead): 7.79M
total memory leaked: 1.91K

PhilipHazel · 2023-01-22T14:34:27Z

Well, if there's nothing for me to do, I am very happy to do it.

iskim517 · 2023-01-30T01:36:28Z

I tried the proposal and the result looks promising!

PhilipHazel · 2024-07-24T16:21:41Z

Has this issue gone away?

NWilson · 2024-12-08T16:32:31Z

@iskim517 and @carenas, what is the status of the SELinux change? Has it been accepted upstream, and do you have any actions which PCRE2 could take to improve the situation?

I will close this issue soon if there are no further requests for changes to PCRE2.

Thank you very much Inseob Kim for reporting this and discussing the problem in detail; and thank you Carlo for submitting a fix to libselinux!

carenas · 2024-12-08T19:37:27Z

@iskim517 and @carenas, what is the status of the SELinux change?

it is shipped with Android's fork AFAIK, but not yet upstreamed (probably my fault for not being pushy/responsive enough)

I don't think a change in PCRE2 was expected either way, but there are related issues that might be worth fixing in the serialization part long term that might refer to this issue.

NWilson · 2025-01-08T14:36:32Z

Closing as discussed. No further changes needed in PCRE2.

SolitaryGrass mentioned this issue May 31, 2023

internal_dfa_match, a stack overflow occurred due to recursive calls. #258

Closed

NWilson closed this as completed Jan 8, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactoring on match_data may increase regression on peak memory usage #194

Refactoring on match_data may increase regression on peak memory usage #194

iskim517 commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

iskim517 commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

carenas commented Jan 20, 2023

carenas commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

carenas commented Jan 20, 2023 •

edited

Loading

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023 •

edited

Loading

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023 •

edited

Loading

PhilipHazel commented Jan 22, 2023

iskim517 commented Jan 30, 2023

PhilipHazel commented Jul 24, 2024

NWilson commented Dec 8, 2024

carenas commented Dec 8, 2024

NWilson commented Jan 8, 2025

Refactoring on match_data may increase regression on peak memory usage #194

Refactoring on match_data may increase regression on peak memory usage #194

Comments

iskim517 commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

iskim517 commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

carenas commented Jan 20, 2023

carenas commented Jan 20, 2023

PhilipHazel commented Jan 20, 2023

carenas commented Jan 20, 2023 • edited Loading

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023 • edited Loading

PhilipHazel commented Jan 21, 2023

carenas commented Jan 21, 2023 • edited Loading

PhilipHazel commented Jan 22, 2023

iskim517 commented Jan 30, 2023

PhilipHazel commented Jul 24, 2024

NWilson commented Dec 8, 2024

carenas commented Dec 8, 2024

NWilson commented Jan 8, 2025

carenas commented Jan 20, 2023 •

edited

Loading

carenas commented Jan 21, 2023 •

edited

Loading

carenas commented Jan 21, 2023 •

edited

Loading