Avoid zeroing memory on stackalloc? #4384
In your example, the JIT has been told to force a zero initialization. There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here. |
Note that zeroing is not always done even today. The following code prints 42; that value remains on the stack from the earlier call to Foo:
static unsafe void Main() {
Foo();
int* p = stackalloc int[16384];
Console.WriteLine(p[0]);
}
static unsafe void Foo() {
int* p = stackalloc int[16384];
for (int i = 0; i < 16384; i++)
p[i] = 42;
} |
Interesting! I should have looked deeper. If my new understanding is right, the JIT just obeys the method IL header CorILMethod_InitLocals flag when choosing how to allocate, and III.3.47 of the CLI spec suggests that the JIT has no choice but to obey. Darn. I guess any remaining degrees of freedom would be over on the roslyn side of things. |
It's even more interesting if you check the C# specification and you find this text in 19.8:
Go figure. I think that the IL spec is a bit messed up for requiring |
Not setting localsinit for unsafe methods sounds like a good option. I haven't found anything that would restrict that in the specs yet, given that verifiability is already out the window. I think I might go write up an issue on that. Another option might be to introduce a separate IL instruction that does not have the same baggage as localloc, allowing stackalloc to be implemented according to its looser C# spec. I think this might run afoul of the blanket implications of localsinit, though:
At best, it would be a weird corner case. Also, while I'm not familiar with the process behind doing such a thing, I would guess a change to Roslyn's localsinit behavior would be easier if acceptable. |
@mikedn I'm not seeing this (re
char* output = stackalloc char[length];
.method private hidebysig static string MultiBlockAsciiString(
class MemoryBlock block,
int32 offset,
int32 length) cil managed
{
// Code size 17 (0x11)
.maxstack 4
.locals init ([0] char* output)
IL_0000: ldarg.2
IL_0001: conv.u
IL_0002: ldc.i4.2
IL_0003: mul.ovf.un
IL_0004: localloc
IL_0006: stloc.0
IL_0007: ldarg.0
IL_0008: ldarg.1
IL_0009: ldarg.2
IL_000a: ldloc.0
IL_000b: call string ::MultiBlockAsciiIter(
class MemoryBlock,
int32,
int32,
char*)
IL_0010: ret
} |
Correction: |
Haven't quite narrowed it down, but if you call a method it needs to be the first param, and some sorts of work trigger it; possibly Pattern that seems to work: 1. function that declares via stackalloc and does no other work other than Can pass through needed vars from first function to second. Bit of a contortion... See as an example: https://github.com/aspnet/KestrelHttpServer/pull/312/files#diff-7042cc5345ca45a84e36433636d7f10fR15 |
@BruceForstall Could you share where this might be? I saw that CoreCLR generates 0 initialization of stack locals very late in the prolog in one of the Kestrel hotspots. |
@choikwa you can do it with stackalloc, but you have to do a weird pattern of stackallocing in one function and nothing else except passing that variable into another function - then the function doing the stackalloc won't zero. |
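The split pattern described in the last few comments can be sketched roughly as follows. This is only an illustration, not code from the linked Kestrel PR: the names `ScratchDemo`, `SumDigits`, and `SumDigitsCore` are hypothetical, and it uses `Span<byte>` (C# 7.2 or later) instead of raw pointers so that it compiles without `unsafe`. Whether a given JIT version actually skips the zeroing in the outer method is not guaranteed, so measure before relying on it.

```csharp
using System;

static class ScratchDemo
{
    // Outer method: only allocates the scratch buffer and forwards it
    // (as the first argument) to the method that does the work.
    public static int SumDigits(int value)
    {
        Span<byte> scratch = stackalloc byte[16];
        return SumDigitsCore(scratch, value);
    }

    // Inner method: writes every slot before reading it, so it never
    // depends on the initial contents of the buffer.
    // Assumes value is non-negative (an int has at most 10 decimal digits).
    private static int SumDigitsCore(Span<byte> scratch, int value)
    {
        int count = 0;
        do
        {
            scratch[count++] = (byte)(value % 10);
            value /= 10;
        } while (value > 0);

        int sum = 0;
        for (int i = 0; i < count; i++)
            sum += scratch[i];
        return sum;
    }

    static void Main()
    {
        Console.WriteLine(SumDigits(1234));
    }
}
```

The important property is that the inner method treats the buffer as uninitialized, which keeps the code correct whether or not the allocation was zeroed.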
For stackalloc, the code is in CodeGen::genLclHeap(). Zero init of stack locals in the normal prolog is in CodeGen::genZeroInitFrame() |
Is there any particular reason why zero init of stack locals must be done at prologue generation? Why not expose this to the optimizer, which can dead-store-eliminate it or push the init down to where it's actually necessary? |
(I believe) we only zero init local vars with GC types that are untracked (thus don't get individual operation GC state change information). There may be other cases, such as potentially uninitialized locals in verifiable code. |
Cursory search for zero init of locals:
The BOTR doesn't seem to describe why the former is needed or GC's relation to the stack locals. It would be nice if there are no hard restrictions to moving zero-init from prolog to earlier pass. |
We don't enregister anything live in/out of handlers. That's because an exception in a "try" can happen at any time, and the handler needs to be able to find the current value of the variable in that case. The OS/VM doesn't preserve registers from the "try" to the handler, so the only place we can find the variable is on the stack. Thus, all writes to the variable must be to the stack. We could do better here in some cases, but that's the current status. We only need to explicitly zero non-arg, GC-ref types because we're going to report them to the GC, and they must always have a valid value (zero is valid). If we knew the value was always defined before entering every EH region that had a handler using that value, we might be able to avoid force zero initializing it in the prolog. This could be improved with some better analysis. It looks like Compiler::fgInlinePrependStatements() only zero inits locals for cases where the IL for a function requires it for the inlinee. |
@BruceForstall @choikwa This issue is about unnecessary zero-initialization for |
We need to do something about this for |
I have looked into this some more:
The current options that I am thinking about are - from the most preferred to least preferred:
Having a new intrinsic method that does not zero init, or having flexibility to control this at method granularity, seems to be unnecessary. Assembly level granularity should be sufficient for this. |
Probably works; and unsafe blocks.
Probably a bit heavy? MethodImpl flag?
|
This is a variant of 3. I think that folks who care about this would want it for all their code, without needing to annotate every method. |
On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always? |
Right - that's the idea. |
@russellhadley The ILLink option above is the easiest one to start with. Could you please look into it for ILLinker? |
Stephen and I have tried to switch CoreCLR to use a managed version of number parsing and formatting. It is a straightforward port of the C++ version of the code. The C++ version uses stack-allocated scratch buffers. We found that it is impossible to get the C# version close to the C++ version because the zero initialization imposes a 15%-20% penalty. The changes are under the https://github.com/jkotas/coreclr/tree/corertnumbercode branch for now. The microbenchmark tested is to run To unblock progress on the CoreLib work, I am planning to add a workaround to suppress unnecessary zero initialization for CoreLib, linked to this issue. This workaround should be removed once we have a permanent solution for this problem. |
I have hit this perf issue before when using stackalloc in a tight loop for scratch space for crypto blocks (16 bytes). Is there an option for adding a new keyword?
or similar? I realise that the spec says that stackalloc won't necessarily zero, but I expect we are too far down the path to change behaviour in code that probably doesn't have a lot of safeguards anyway. As always, people have probably come to expect it to be zeroed, but an extra or different keyword for it would leave existing code alone and allow an "opt in". |
Also related "JIT: consider optimizing small fixed-sized stackalloc" https://github.com/dotnet/coreclr/issues/8542 to help stackalloc functions inline |
If the input value of output is important then it should be byte* output = null;
SomeInteropFunction(ref output);
If you interop it doesn't change whether you use |
|
Sure so the c might be
So I love the idea of init being ditched. I worry that it will cause the kind of managed/unmanaged bugs that lead to double freeing, segfaults and general nastiness. That is why from where I am sitting it needs to be opt in. I hate to think what horrible interop code there is in some point of sale machine or similar somewhere that this could break. Heck there is a lot of historical finance system interop code I would need to go double check. |
A key word or an attribute would work fine. Even though init "isn't in the spec" it's observed behaviour for what 10 years ? |
Maybe make the assembly opt into this new faster behavior by setting an attribute? That way the core libraries can make use of this enhancement easily. It could look like:
Or similar. The |
I like it, it's opt in so doesn't break anything but you don't have to tag every method. |
With @VSadov's work on ref like types it will be soon allowed. |
Why do we do a tight loop of pushes, rather than a single adjustment of the stack plus a call to ZeroMemory (or equivalent)? Also what are the numbers for the overall perf improvement from stripping the flag? |
CodeGen::genZeroInitFrame() uses different code sequences for different conditions: https://github.com/dotnet/coreclr/blob/6aa119dcdb88ba1157191600a992aa94eee59b22/src/jit/codegencommon.cpp#L6977 |
@BruceForstall, I would think that, for anything that is > 32 bytes, ZeroMemory would probably be faster (especially for large buffers) as it may take advantage of the underlying hardware (such as using It would also be interesting to determine whether non-temporal zeroing would be faster for most cases (I would guess the common scenario is for a user to initialize the bytes themselves, so zeroing without polluting the cache might be better). |
In either case, if we had some numbers for how much stripping the init flag would save, it might be possible to convince the compiler team to add a flag. The init flag impacts all locals, not just stackalloc, and I imagine (for more complicated methods) the JIT isn't able to always elide the zeroing. So knowing the benefit across the board would be useful. |
Note that clearing the initlocals flag will still leave lots of zero-initializations of structs with GC fields even though they can be avoided in many cases. I recently opened several items related to this: dotnet/coreclr#13822 (already fixed), dotnet/coreclr#13823, dotnet/coreclr#13825, dotnet/coreclr#13827. Also, ILLink now has the ability to clear initlocals unconditionally. I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies. |
Already does this; though that causes issues as isn't necessarily the most efficient way https://github.com/dotnet/coreclr/issues/13827 |
Getting perf numbers for this would be great. If the performance improvement is actually measurable for normal applications (or even things like CoreFX, Roslyn, etc) then it becomes worthwhile to suggest the compiler should have support for stripping the flag integrated into it. |
@RussKeldorph, can you please triage/prioritize this issue. Ty |
@ahsonkhan What do you mean by triage/prioritize? (It's already triaged) By when do you need it? Can you summarize why it is more important now? I'm just looking for a summary to help us weigh it against other tasks. @dotnet/jit-contrib |
2.1, and it is already marked as such. So please ignore my comment. I didn't notice the milestone. We have some experimental APIs in corefxlab that use stackalloc spans and this could improve their performance. For example: |
The current plan is:
I do not think there is anything actionable for 2.1 on this issue currently. Setting the milestone to future. |
Do people at Microsoft still write documentation? This is important. |
If you think you would benefit from this feature, I have created a package that makes it easy to control this flag for a specific method, or for all methods within a type/assembly, using a custom attribute: https://github.com/josetr/InitLocals. It goes without saying that you shouldn't use it in places where you have no tests, since you may rely on this zero-initialization behavior. Also, don't forget to benchmark your code, because it's very likely that you don't need it at all. |
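For code that does depend on a zeroed buffer, one way to make that dependency explicit, so that it survives the flag being stripped, is to clear the buffer yourself. The following is a minimal sketch under that assumption; `ExplicitClearDemo` and `CountSetFlags` are hypothetical names, and `Span<T>.Clear()` is the standard library call used for the zeroing.

```csharp
using System;

static class ExplicitClearDemo
{
    // Counts how many distinct in-range indices were supplied.
    // The flags buffer must start out as all zeros for the count to be
    // correct, so it is cleared explicitly instead of relying on the
    // runtime zero-initializing stackalloc memory.
    public static int CountSetFlags(ReadOnlySpan<int> indices)
    {
        Span<byte> flags = stackalloc byte[32];
        flags.Clear();

        foreach (int index in indices)
        {
            // The unsigned comparison rejects negative and too-large indices.
            if ((uint)index < (uint)flags.Length)
                flags[index] = 1;
        }

        int count = 0;
        foreach (byte flag in flags)
            count += flag;
        return count;
    }

    static void Main()
    {
        Console.WriteLine(CountSetFlags(new[] { 1, 5, 5, 40 }));
    }
}
```

The explicit clear costs the same zeroing the runtime would otherwise do, but only in the methods that actually need it.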
The current stackalloc implementation uses a tight loop of push instructions to allocate and fill in the memory. For example,
stackalloc int[16384]
yields:
Given that the use of stackalloc is often performance related, it would be nice if the stack pointer just jumped directly to the end and left the memory as-is. This appears to be permitted by the C# spec, section 18.8: "The content of the newly allocated memory is undefined."
Changing this behavior would nastily break applications that relied on the zeroing (despite spec). I'm not sure how much of a problem this would be. It's a niche feature that tends to be used for very specific reasons, and older JIT versions seemed to leave the memory as-is (according to some postings circa 2003).