-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Invalid floating point operation (or memory corruption) when reallocating memory #85
Comments
Thank you for the bug report. I have encountered this before, but luckily the worst case scenario (silent memory corruption) is not a scenario that arises frequently - an FPU stack leak while FPU exceptions are masked at the same time. That also only happens when something else has already gone awry. Therefore, whatever solution is chosen should not preferably not come with any performance cost. Calling ffree should have a minimal performance impact, but it could hide a FPU stack leak until much later, which is also not desirable. If a single FPU stack leak is not common, multiple register leaks should be even rarer. In most cases I have seen it is a single stack entry that has leaked. I am therefore leaning towards rather limiting use of the FPU stack in Move procedures to 7 registers (or even less) instead of all 8. What do you think of this suggestion? Another option would be to simply do away with the FPU move routines altogether and instead use SSE2 (if it is available). |
Fun fact of the day, the faulty Pow10() function called by val() has been fixed by Borland/Embarcadero between Delphi 6 and Delphi 10. In case of error, the newer code pops one item from the FPU stack before returning and the FastMM issue is never triggered. If FPU exception are not masked, which is the default behavior in Delphi, an exception will be thrown. In this case, it is unfortunate, and the unsuspecting programmer (or its customer) may not understand why the memory manager throws an FPU exception or be able to troubleshoot the issue properly (assuming that he can reproduce it in the first place), but at least no corruption occurs and the programmer has a fighting chance to fix it. The real issue is when FPU exceptions are masked because that's where corruption occurs silently. Honestly, I am not too sure what the best course of action is. There are multiple factors to consider. First, why are the FPU exceptions masked to begin with? It might be because the Delphi code is calling an external DLL. Since the default VC++ behavior (and possibly others) is to have exceptions masked, having them unmasked might have adverse effects when calling external code. And indeed, that's what we experienced when an external library was completely broken because it assumed FPU exceptions were masked. It is therefore good practice to temporarily mask them when you have control over the process. Second is, who is responsible for the FPU stack. I think it is clear that the memory manager should clean up after itself. But should it be responsible for the mess left by others? My fix assumed that the answer was "no": the memory manager cleans up what it needs to use, but leaves the rest as it was, making it, at worst, no better than it was and, at best, sparkling clean afterwards. That's not necessarily the right or the best solution, as it introduces 99% probably useless ffree and thus a tiny bit of performance decrease. But I think we agree on this one, as your idea doesn't address the additional garbage left on the stack either beyond the registers that FastMM would be using. Which brings me to the third factor, speed. I haven't done timing analysis and have only performed basic testing. Yes, the ffree statements do take more time than without. But even with the added ffree, the custom Move() functions are between 2-4 times faster than the default System.Move() function. Not too shabby. Looking at the alternatives... #1 My proposed fix of cleaning up as many registers as needed #2 I like your idea of freeing the bottom register and using 7 registers, but I see two flaws. First, it assumes that cleaning N number of FPU registers is more expensive time-wise than the bandwidth decrease of using N less FPU registers, which is uncertain (at least to me). Second, it does not solve the issue but simply pushes it back to a less probable but still possible state. #3 SEE2 copying is interesting. Being introduced in 2000, it should be available about everywhere. But SSE2 works on 16-byte aligned memory, whereas the default for Delphi is 8. That can be changed to 16 in FastMM, but then memory usage increases. Or one could use SSE3 which has optimizations for unaligned data, but then we move to sometime in 2004-2005 the base CPU arch. #4 I know for a fact that any other FPU cleanup opcodes are slower (emms, fninit, etc.) or CPU brand specific (femms). So that's a no-go. #5 The code could check if exceptions are masked and only then clean the FPU stack. It turns out that fstcw is about as long as ffree. So for one register, it's not much worth it. But for functions that use multiple registers, there is a gain to be made in doing fstcw+conditional ffree instead of systematic ffree, at the cost of one fstcw (i.e. the equivalent of one additional ffree) every time. After a few runs, the CPU branch predictor will be trained to take the right branch (which should not change) and it should be almost completely transparent thereafter. #6 Since the state in which the problem happens is pretty rare (FPU garbage + masked exception), the ffree statements could be placed in a conditional define statement so that it can be turned on when needed, e.g. when one knows he will mask exceptions. But that means that the programmer would need to know that, and that's precisely part of the problem - it is silent until your program crashes randomly. #7 The cleanup could be dynamically turned on using a global state variable. A jz over the cleanup code based on a threadvar boolean would inexpensive enough while being flexible and adaptative. The branch predictor should hide much of the branch here as well. The problem is that using a threadvar involves a TLS lookup, which is about as expensive as fstcw or ffree, so that's basically a duplicate of #5, except unnecessarily more complicated. I guess it depends whether one favors speed over safety. I know I am in the latter camp, but that's just me. Mind you, it could be a mix of those options. There are already some stackable options in FastMM, so that could be one of them. For those who don't care at all, statu quo. |
Jumping in to throw my two cents.
I wonder which Delphi version introduced more idiomatic There also is Would one google around there are auxillary functions to
And perhaps more funcitons were added/renamed later. Going close-to-metal would incur many small variations, which perhaps are not required for debug-mode when blazing speed is not required at all costs. That being said, it seems Delphi 7 did not had it: https://codeverge.com/embarcadero.delphi.win32/delphi-2010-and-floating-point-excep/1046214 More mess: rounding and precision settings are probably also be considered part of environment, but they seems to be out of
Depends upon meaning of "broken". I'd assume it was a glue code between Delphi and DLL which was broken, failing to set proper-for-DLL environment for the external function on entry and setting another proper-for-Delphi environment on exit. I believe calling Now when it can go really messy is when it is Delphi Heap manager called directly from external code! BlazeMM heap manager said to implemented So i think the whole "reverse idea" that it can be Delphi code including heap manager which can get called from some external non-Delphi code with different hardware environment is to be kept in mind.
Specifically
I am probably wrong around using
If you go real hardcore you can even patch the code in memory, like
We may bi-sect which Delphi version fixed Pow10 function, but what essentially would be blacklisting perhaps would be of limited use, as there can be other functions involved not necessarily Perhaps, we can make wrappers for all potentially-FPU-using functions FastMM has and make them subject of
|
Perhaps this can be used conditionalls, like FastCode move did.
And then using Debug build of FastMM4 can kick in, or it maybe can even be done automatically on Delphi's
I do no see how this is thread-specific, i believe it to be applicaiton-global state. Especially if we are using some "pool of workers" library like OmniThreadingLibrary or its recent RTL counterpart. What there would be a practical mean to inform FastMM4 on which threads to activate or bypass extra scrutiny? Especially in the very beginning of investigation, when programmer (or worse - user) sees a shroedinbug happingnin occasionally, but has no idea what conditions trigger it? I can see none. It looks an app-global opt-in for me. |
A very interesting analysis. One note about threading... I was thinking that the option could be dynamically turned on or off, thus the threadvar. But if it is something that is set at startup time and doesn't change afterwards, like the IsMultiThread variable, then it could indeed be an ordinary global var. Which would make much more sense. Back to the main topic, it all boils down to assumption and intent. We could debate all day long about whose fault it is that the FPU isn't clean, if the Pow10() function was fixed, in which version, is it the DLL's fault, should floating point exceptions be masked, or the FPU control word backed up and restored upon entering a DLL... But if it is not due to Pow10(), it will be another DLL. If not a DLL, maybe a Windowa API. Are we going to put guard code around every Win32 call? Of course not. There's even the possibility of a realloc as a callback from Windows, as you noted. So we have to accept that, no matter how careful we are and how much guard code we put, the memory manager can't rely on the FPU being clean. This is where it breaks down. The current code assumes that 1) the FPU is clean and 2) if it's not the case, an exception will be raised. If those two assumptions are broken, and we know that it can, and a Move() of the right size is executed, memory corruption occurs. That's for assumptions. So what's the intent of the memory manager? Does it intend to be the fast at all times, at the cost of maybe causing some memory corruption once in a blue moon, or does it intend to be reliable and consistent, at the cost of performance? For me the latter is a no-brainer, but there could be use cases for the former, I guess. Hence the idea of putting the fix in a $define. So, if we decide that it is a problem and that it should be fixed, there are two ways to fix it, keeping in mind that the memory manager doesn't and can't control the state of the FPU before a memory reallocation event occurs. 1- Do not use the FPU. Easy enough. There are several ways to do this. Use the system Move() function, MMX, SSE2... I like your idea how to use SSE2 even for non-aligned data. 2- Clean the FPU before using it. This one is a bit trickier. The main problem is of course performance. Any additional step takes time. For example, your code
is too slow, about 7 times slower than issuing 8 ffree statements. Not to mention that the original CW needs to be restored afterwards, otherwise the reset FPU state might be breaking it for someone else's code. You proposed to make that a debug-only feature, but I'm not too keen on this idea. First, this assumes that problems will be detectable during testing, which is unlikely. Second, and maybe it's my background in working with server-type products talking here, but shipping code with a known critical memory corruption defect is a no-go. There could be a special EFastMMCorrupt8087State exception raised if the problematic situation is detected, but I believe that there should still be a fix for the release code. So ideally the fix would solve the issue 100% of the time with minimal and predictable impact on performance. So let's take a shot at this. With some guard code to optimize, first by having a global variable to enable/disable FPU cleanup, and then a second check to see if cleanup is necessary. Something like this:
Super optimized, right? Well, as it turns out, the guard code is about 3 times more expensive than the cleanup code it is guarding. Oops! So you'd be better off always doing the cleanup. So much for the optimization... Hence my original suggestion of hard-coding ffree st(x) for as many registers as needed, enclosed in a $define for those who would rather have faith and keep the existing behavior. Add to the mix your idea of a specific exception that gets explicitly thrown if a memory corruption is detected. So the fix would look like:
Either that, or use MMX/SSE2 memory copying and don't worry about the FPU. But that goes beyond my current knowledge of assembler code. |
An Invalid floating point operation exception may be raised when reallocating a buffer between 32 and 64 bytes. The reallocation can be explicit, e.g. a call to
ReallocMem()
, but could also be implicit, e.g. when concatenating strings, making it much harder to spot.This is due to the fact that FastMM copies memory blocks between 32 and 64 bytes using the FPU registers. But it assumes that all the registers it is going to use are free. This assumption is broken if faulty code was executed prior to the reallocation.
If the FPU overflow exception mask is reset, which is the case by default in Delphi, an exception is raised.
However, if the overflow exception mask is set, the error is ignored and the register contains a magic "invalid" value instead of the data copied to it. When the register content is restored at the destination memory location, a number of the bytes are incorrect, resulting in memory corruption.
In the POC code below, the
val()
Delphi function errors out, leaving a value on the FPU stack. The subsequentReallocMem()
tries to copy the buffer to its new location. It does so using the 8 FPU registers, one of which is still in use. This raises an exception. If you uncomment theSet8087CW($133f)
, the FPU overflow mask is set, the exception is silently ignored and memory corruption occurs, which can be seen by inspecting the buffer.My suggestion would be to
ffree
as many registers as needed, starting at the bottom of the stack. This way the required registers are free, with the caveat that additional garbage remains, but the situation is not made any worse and it reduces the number of calls to the minimum.Other solutions were considered.
fninit
is not a good solution because it resets the FPU code and status word as well.emms
is 2-4 times slower than freeing the registers.femms
is only supported on AMD.Here is an example of a fixed Move36 function, with un-related code removed for simplicity.
The text was updated successfully, but these errors were encountered: