Avoid zeroing memory on stackalloc? #4384

RossNordby · 2015-07-23T04:55:11Z

The current stackalloc implementation uses a tight loop of push instructions to allocate and fill in the memory. For example, stackalloc int[16384] yields:

000007FE8D8A05C4  mov         eax,10000h
000007FE8D8A05C9  push        0
000007FE8D8A05CB  push        0
000007FE8D8A05CD  sub         rax,10h
000007FE8D8A05D1  jne         000007FE8D8A05C9

Given that the use of stackalloc is often performance related, it would be nice if the stack pointer just jumped directly to the end and left the memory as-is. This appears to be permitted by the C# spec, section 18.8: "The content of the newly allocated memory is undefined."

Changing this behavior would nastily break applications that relied on the zeroing (despite spec). I'm not sure how much of a problem this would be. It's a niche feature that tends to be used for very specific reasons, and older JIT versions seemed to leave the memory as-is (according to some postings circa 2003).

The text was updated successfully, but these errors were encountered:

BruceForstall · 2015-07-23T06:59:18Z

In your example, the JIT has been told to force a zero initialization. There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

mikedn · 2015-07-23T07:20:54Z

Changing this behavior would nastily break applications that relied on the zeroing (despite spec).

Note that zeroing is not always done even today. The following code prints 42, that value remains on the stack from Foo:

static unsafe void Main() {
    Foo();
    int* p = stackalloc int[16384];
    Console.WriteLine(p[0]);
}
static unsafe void Foo() {
    int* p = stackalloc int[16384];
    for (int i = 0; i < 16384; i++)
        p[i] = 42;
}

RossNordby · 2015-07-24T00:14:46Z

Interesting! I should have looked deeper.

If my new understanding is right, the JIT just obeys the method IL header CorILMethod_InitLocals flag when choosing how to allocate, and III.3.47 of the CLI spec suggests that the JIT has no choice but to obey. Darn.

I guess any remaining degrees of freedom would be over on the roslyn side of things.

mikedn · 2015-07-24T05:33:15Z

It's even more interesting if you check the C# specification and you find this text in 19.8:

The content of the newly allocated memory is undefined.

Go figure. I think that the IL spec is a bit messed up for requiring localalloc to follow the localsinit flag. That's why in my above example no zeroing is done, Main doesn't contain any locals and there's no localsinit flag. Perhaps Roslyn should just refrain from using localsinit in unsafe methods, it's of no use anyway.

RossNordby · 2015-07-24T22:50:03Z

Not setting localsinit for unsafe methods sounds like a good option. I haven't found anything that would restrict that in the specs yet, given that verifiability is already out the window. I think I might go write up an issue on that.

Another option might be to introduce a separate IL instruction that does not have the same baggage as localloc, allowing stackalloc to be implemented according to its looser C# spec. I think this might run afoul of the blanket implications of localsinit, though:

A “localsinit flag” that indicates whether the local variables and memory pool (§I.12.3.2.4) should be initialized by the CLI.

At best, it would be a weird corner case. Also, while I'm not familiar with the process behind doing such a thing, I would guess a change to Roslyn's localsinit behavior would be easier if acceptable.

benaadams · 2015-11-01T06:09:50Z

@mikedn I'm not seeing this (re localsinit); even declaring the stackalloc'd variable results in a localsinit being emitted?

char* output = stackalloc char[length];

.method private hidebysig static string MultiBlockAsciiString(
    class MemoryBlock block,
    int32 offset,
    int32 length) cil managed
{
  // Code size       17 (0x11)
  .maxstack  4
  .locals init ([0] char* output)
  IL_0000:  ldarg.2
  IL_0001:  conv.u
  IL_0002:  ldc.i4.2
  IL_0003:  mul.ovf.un
  IL_0004:  localloc
  IL_0006:  stloc.0
  IL_0007:  ldarg.0
  IL_0008:  ldarg.1
  IL_0009:  ldarg.2
  IL_000a:  ldloc.0
  IL_000b:  call       string ::MultiBlockAsciiIter(
                                                class MemoryBlock,
                                                int32,
                                                int32,
                                                char*)
  IL_0010:  ret
}

benaadams · 2015-11-01T06:36:42Z

Correction: localsinit being dropped does happen but there are some extra caveats; trying to narrow down

benaadams · 2015-11-01T07:03:25Z

Haven't quite narrowed it down; but if you call a method it needs to be first param; and some sorts of work triggers it; possibly for loop even with the index being not local.

Pattern that seems to work:

1..function that declares via statckalloc and does no other work other than
2. returning result of another function that takes the stackalloc as first param

Can pass though needed vars from first function to second. Bit of a contort...

See as example: https://github.com/aspnet/KestrelHttpServer/pull/312/files#diff-7042cc5345ca45a84e36433636d7f10fR15

choikwa · 2016-11-01T22:45:11Z

There is code in the JIT to simply do large frame stack page probing without zero init, but that didn't kick in here.

@BruceForstall Could you share where this might be? I saw that CoreCLR generates 0 initialization of stack locals very late in the prolog in one of the Kestrel hotspots.

benaadams · 2016-11-01T22:57:13Z

@choikwa you can do it with stackalloc, but you have to do a weird pattern of stackallocing in one function and nothing else except passing that variable into another function - then the function doing the stackalloc won't zero.

BruceForstall · 2016-11-01T23:09:43Z

For stackalloc, the code is in CodeGen::genLclHeap(). Zero init of stack locals in the normal prolog is in CodeGen::genZeroInitFrame()

choikwa · 2016-11-01T23:14:47Z

Is there any particular reason why zero init of stack locals must be done at the prologue *generation? Why not expose this to the optimizer which can dead-store-eliminate or push down init. where it's only necessary?

BruceForstall · 2016-11-02T06:05:49Z

(I believe) we only zero init local vars with GC types that are untracked (thus don't get individual operation GC state change information). There may be other cases, such as potentially uninitialized locals in verifiable code.

choikwa · 2016-11-02T18:46:39Z

Cursory search for zero init of locals:

Above method, in the prolog generation by tracking lvMustInit field of LclVarDsc
- tracking seems to be done at Compiler::fgInterBlockLocalVarLiveness().
  
  I am confused by this comment; I would be thrilled if someone could explain to me the need for volatility and must-init.

    /* Mark all pointer variables live on exit from a 'finally'
       block as either volatile for non-GC ref types or as
       'explicitly initialized' (volatile and must-init) for GC-ref types */

Compiler::fgInlinePrependStatements() for zero init of inlinee's locals by inserting assignment trees

The BOTR doesn't seem to describe why the former is needed or GC's relation to the stack locals. It would be nice if there are no hard restrictions to moving zero-init from prolog to earlier pass.

BruceForstall · 2016-11-02T22:57:21Z

We don't enregister anything live in/out of handlers. That's because an exception in a "try" can happen at any time, and the handler needs to be able to find the current value of the variable in that case. The OS/VM doesn't preserve registers from the "try" to the handler, so the only place we can find the variable is on the stack. Thus, all writes to the variable must be to the stack. We could do better here in some cases, but that's the current status.

We only need to explicitly zero non-arg, GC-ref types because we're going to report them to the GC, and they must always have a valid value (zero is valid). If we knew the value was always defined before entering every EH region that had a handler using that value, we might be able to avoid force zero initializing it in the prolog. This could be improved with some better analysis.

It looks like Compiler::fgInlinePrependStatements() only zero inits locals for cases where the IL for a function requires it for the inlinee.

jkotas · 2016-11-02T23:17:55Z

@BruceForstall @choikwa This issue is about unnecessary zero-initialization for stackalloc in C# / localloc in IL.

jkotas · 2017-06-12T20:14:42Z

We need to do something about this for Span<T> to work end-to-end.

jkotas · 2017-06-15T01:55:59Z

I have look into this some more:

Run CoreCLR tests with a hack that suppressed the init-locals bit for everything. No issues in the JIT found. (~30 tests written in IL that were exercising the init-locals bit failed - by design.)
Found one place in corelib that depends on the stackalloc memory to be zero-initialized: https://github.com/dotnet/coreclr/blob/97c58ac4fce27b7796206a59eea0ca27cb49fe1a/src/mscorlib/src/System/RtType.cs#L621
Observed that clearing the init-locals bit helps code quality with “out” arguments. The storage for “out” arguments is always zero-initialized today because of the init-locals bits, but the zero-initialization is not necessary (if the out arguments do not contain GC pointers).

The current options that I am thinking about are - from the most preferred to least preferred:

Change C# compiler to not set the init-locals bit for unsafe methods; or even for everything in assemblies compiled with /unsafe or with a new compiler switch. Works for existing runtimes. Potentially breaks code like the corelib example above.
Introduce option for ILLinker to clear the zero init locals bit on all methods in assembly. It would be the start of turning ILLinker into a full IL-to-IL optimizer. There are number of optimization ideas that we do not have a good place to put today – they are hard/expensive/breaking/complex to implement in either Roslyn or JIT. Proposal: IL optimization step roslyn#15929
Introduce a new CompilationRelaxations.NoZeroInitLocals flag and make folks to explicitly add it if they need fast stackalloc or other benefits of no init-locals bit. It is a bit odd design because of one piece data invalidates other piece of data. Does not work for existing runtimes.

Having a new intrinsic method that does not zero init, or having flexibility to control this at method granularity, seems to be unnecessary. Assembly level granularity should be sufficient for this.

benaadams · 2017-06-15T02:03:02Z

Change C# compiler to not set the init-locals bit for unsafe methods

Probably works; and unsafe blocks.

for everything in assemblies compiled with /unsafe or with a new compiler switch

Probably a bit heavy?

MethodImpl flag?

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

jkotas · 2017-06-15T02:31:55Z

[MethodImpl(MethodImplOptions.NoZeroInitLocals)]

This is variant of 3. I think that folks who care about this would want it for all their code, without need to annotate every method.

benaadams · 2017-06-15T02:48:02Z

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

jkotas · 2017-06-15T02:56:52Z

Right - that's the idea.

jkotas · 2017-06-15T18:12:44Z

@russellhadley The ILLink option above is the easiest one to start with. Could you please look into it for ILLinker?

jkotas · 2017-08-31T20:14:17Z

Stephen and I have tried to switch CoreCLR to use managed version of number parsing and formatting. It is straightforward port of C++ version of code. The C++ version of the code is using stack allocated scratch buffers. We found that it is impossible to get C# version close to C++ version because of the zero initialization imposses 15%-20% penalty. The changes are under https://github.com/jkotas/coreclr/tree/corertnumbercode branch for now. The microbenchmark tested is to run ((Int32)12345).ToString("C"); in a loop.

To unblock progress on the CoreLib work, I planning to add a workaround to suppress unnecessary zero initialization for CoreLib, linked to this issue. This workaround should be removed once we have permanent solution for this problem.

Drawaes · 2017-08-31T20:26:59Z

I have hit this perf issue before when using stackalloc in a tight loop for scratchspace for crypto blocks (16bytes). Is there an option for adding a new keyword?

stackalloc nozero byte[16];

or similar? I realise that the spec says that stackalloc won't necessarily zero, but I expect we are too far down the path to change behaviour in code that probably doesn't have a lot of safe guards anyway. As always people probably have come to expect it to be zeroed but an extra keyword or different keyword for it would leave existing code and allow an "opt in"

benaadams · 2017-08-31T20:38:48Z

Also related "JIT: consider optimizing small fixed-sized stackalloc" https://github.com/dotnet/coreclr/issues/8542 to help stackalloc functions inline

benaadams · 2017-09-02T12:42:33Z

If the input value of output is important then it should be ref and then you init to null (and the C# compiler I believe would force you to). out means its input value is not important.

byte* output = null;
SomeInteropFunction(ref output);

out and ref are functionally the same; though semantically different (i.e ref needs init first, out needs init before return in callee)

If you interop it doesn't change whether you use ref or out so should use the one that agrees; i.e. if the input is important then it should be ref

jakobbotsch · 2017-09-02T12:43:52Z

out output will pass a pointer to the output variable to the method. In either case this will not be a null pointer. In P/invokes your code is equivalent to
SomeInteropFunction(&output);
The value of the 'output' variable does not matter unless the p/invoked function dereferences the pointer.

Drawaes · 2017-09-02T13:15:00Z

Sure so the c might be

void MyMethod ( void** output);

So I love the idea of init being ditched. I worry that it will cause the kind of managed/unmanaged bugs that lead to double freeing, segfaults and general nastiness.

That is why from where I am sitting it needs to be opt in. I hate to think what horrible interop code there is in some point of sale machine or similar somewhere that this could break.

Heck there is a lot of historical finance system interop code I would need to go double check.

Drawaes · 2017-09-02T13:16:59Z

A key word or an attribute would work fine. Even though init "isn't in the spec" it's observed behaviour for what 10 years ?

GSPP · 2017-09-03T14:02:31Z

Maybe make the assembly opt into this new faster behavior by setting an attribute? That way the core libraries can make use of this enhancement easily.

It could look like:

[assembly: CodeOptimizationCompatibility(Level = CodeOptimizationLevel.V5_0)]

Or similar. The Level could trigger many optimizations that are currently not being done for compat reasons. This would be similar to the SQL Server compatibility level. It's a per-database setting enabling general improvements with each new version. In my opinion the SQL Server compatibility level has been a very useful concept.

Drawaes · 2017-09-03T14:19:43Z

I like it, it's opt in so doesn't break anything but you don't have to tag every method.

omariom · 2017-09-07T08:47:56Z

@benaadams

On 1. just realized you can't use stackalloc outside an unsafe block, so it would switch it off for everything always?

With @VSadov's work on ref like types it will be soon allowed.
Span<byte> buffer = stackalloc byte[10];

tannergooding · 2017-10-06T18:54:27Z

Why do we do a tight loop of pushes, rather than a single adjustment of the stack plus a call to ZeroMemory (or equivalent)?

Also what are the numbers for the overall perf improvement from stripping the flag?

BruceForstall · 2017-10-06T19:55:18Z

CodeGen::genZeroInitFrame() uses different code sequences for different conditions: https://github.com/dotnet/coreclr/blob/6aa119dcdb88ba1157191600a992aa94eee59b22/src/jit/codegencommon.cpp#L6977

tannergooding · 2017-10-06T20:11:23Z

@BruceForstall, I would think that. for anything that is > 32 bytes, ZeroMemory would probably be faster (especially for large buffers) as it may take advantage of the underlying hardware (such as using rep movsb or writing up to 512-bits per iteration on hardware that support those sequences as being fast).

It would also be interesting to determine whether non-temporal zeroing would be faster for most cases (I would guess the common scenario is for a user to initialize the bytes themselves, so zeroing without polluting the cache might be better).

tannergooding · 2017-10-06T20:14:55Z

In either case, if we had some numbers for how much stripping the init flag would save, it might be possible to convince the compiler team to add a flag.

The init flag impacts all locals, not just stackalloc, and I imagine (for more complicated methods) the JIT isn't able to always elide the zeroing. So knowing the benefit across the board would be useful.

erozenfeld · 2017-10-06T20:23:28Z

Note that clearing initlocals flag will still leave lots of zero-initializations of structs with GC fields even though they can be avoided in many cases.

I recently opened several items related to this: dotnet/coreclr#13822 (alreday fixed), dotnet/coreclr#13823, dotnet/coreclr#13825, dotnet/coreclr#13827. Also, ILLink now has the ability to clear initlocals unconditionally. I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

benaadams · 2017-10-06T20:24:50Z

it may take advantage of the underlying hardware (such as using rep movsb

Already does this; though that causes issues as isn't necessarily the most efficient way https://github.com/dotnet/coreclr/issues/13827

tannergooding · 2017-10-06T20:31:02Z

I will soon update the version of ILLink used in corefx to have this option. We can try to enable it for all corefx assemblies.

Getting perf numbers for this would be great. If the performance improvement is actually measurable for normal applications (or even things like CoreFX, Roslyn, etc) then it becomes worthwhile to suggest the compiler should have support for stripping the flag integrated into it.

ahsonkhan · 2017-12-15T22:37:49Z

@RussKeldorph, can you please triage/prioritize this issue. Ty

RussKeldorph · 2017-12-16T00:37:37Z

@ahsonkhan What do you mean by triage/prioritize? (It's already triaged) By when do you need it? Can you summarize why it is more important now? I'm just looking for a summary to help us weigh it against other tasks.

@dotnet/jit-contrib

ahsonkhan · 2017-12-16T00:47:09Z

By when do you need it?

2.1, and it is already marked as such. So please ignore my comment. I didn't notice the milestone.

We have some experimental APIs in corefxlab that use stackalloc spans and this could improve their performance.

For example:
https://github.com/dotnet/corefxlab/blob/master/src/System.Buffers.Experimental/System/Buffers/Text/BufferReader.cs#L209
https://github.com/dotnet/corefxlab/blob/master/src/System.Azure.Experimental/System/Azure/CosmosDbAuthorizationHeader.cs#L28
https://github.com/dotnet/corefxlab/blob/master/src/System.Text.Primitives/System/Text/Formatters/Formatter_floats.cs#L73
etc.

jkotas · 2017-12-16T00:57:42Z

The current plan is:

C# will keep zero initializing by default. Changing the default would be too breaking.
We have a set of issues opened to make the zero initialization done by the JIT more efficient or reduce need for it (Move ILVerify tool to dotnet/runtime repo #13827, dotnet/coreclr#13823, dotnet/coreclr#13825)
Folks who really want to get the last bit of performance by avoiding zero initialization can use custom ILLinker step (Add ILLink.CustomsSteps.dll. linker#159) when they know what they are doing. We do this for CoreLib today (via VM hack, but we should switch to the ILLinker), and we plan to experiment with this in CoreFX (Update ILLink version. corefx#25956). Based on the results of these experiments, we may consider introducing a more streamlined way to do this in future. @ahsonkhan You should consider experimenting with it in CoreFXLab as well if you believe that it would help.

I do not think there is anything actionable for 2.1 on this issue currently. Setting the milestone to future.

MusorGaming · 2019-01-01T22:03:19Z

Do people at Microsoft still write documentation?
https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/stackalloc
Doesn't mention Span.
Doesn't say whether it DOES or DOES NOT, in theory or practice zero memory or not.

This is important.

josetr · 2019-03-28T22:56:59Z

If you think you would benefit from this feature I have created a package that makes it easy to control this flag for a specific method or for all methods within a type/assembly using a custom attribute InitLocals(false).

https://github.com/josetr/InitLocals

Goes without saying that you shouldn't use it in places where you have no tests since you may rely on this zero-initialization behavior.

Also, don't forget to benchmark your code because it's very likely that you don't need it at all.

jkotas · 2019-09-16T20:34:24Z

Superseded by:

jkotas assigned russellhadley Jun 15, 2017

jkotas assigned erozenfeld Aug 31, 2017

jkotas referenced this issue in jkotas/coreclr Aug 31, 2017

Clear the init-locals bit for CoreLib to workaround #1279

5efb622

jkotas referenced this issue in jkotas/coreclr Aug 31, 2017

Clear the init-locals bit for CoreLib to workaround #1279

52a2c08

russellhadley assigned RussKeldorph and unassigned russellhadley and erozenfeld Nov 3, 2017

RussKeldorph removed their assignment Dec 16, 2017

jkotas closed this as completed Sep 16, 2019

msftgits transferred this issue from dotnet/coreclr Jan 30, 2020

msftgits added this to the Future milestone Jan 30, 2020

1996v mentioned this issue Oct 14, 2020

How to forcefully zeroing memory allocated by stackalloc? #43391

Closed

ghost locked as resolved and limited conversation to collaborators Jan 5, 2021

Avoid zeroing memory on stackalloc? #4384

Avoid zeroing memory on stackalloc? #4384

Comments

RossNordby commented Jul 23, 2015

BruceForstall commented Jul 23, 2015

mikedn commented Jul 23, 2015

RossNordby commented Jul 24, 2015

mikedn commented Jul 24, 2015

RossNordby commented Jul 24, 2015

benaadams commented Nov 1, 2015

benaadams commented Nov 1, 2015

benaadams commented Nov 1, 2015

choikwa commented Nov 1, 2016 • edited Loading

benaadams commented Nov 1, 2016

BruceForstall commented Nov 1, 2016

choikwa commented Nov 1, 2016 • edited Loading

BruceForstall commented Nov 2, 2016

choikwa commented Nov 2, 2016 • edited Loading

BruceForstall commented Nov 2, 2016

jkotas commented Nov 2, 2016

jkotas commented Jun 12, 2017 • edited Loading

jkotas commented Jun 15, 2017 • edited Loading

benaadams commented Jun 15, 2017 • edited Loading

jkotas commented Jun 15, 2017

benaadams commented Jun 15, 2017

jkotas commented Jun 15, 2017

jkotas commented Jun 15, 2017

jkotas commented Aug 31, 2017

Drawaes commented Aug 31, 2017

benaadams commented Aug 31, 2017

benaadams commented Sep 2, 2017 • edited Loading

jakobbotsch commented Sep 2, 2017

Drawaes commented Sep 2, 2017

Drawaes commented Sep 2, 2017

GSPP commented Sep 3, 2017 • edited Loading

Drawaes commented Sep 3, 2017 • edited Loading

omariom commented Sep 7, 2017

tannergooding commented Oct 6, 2017

BruceForstall commented Oct 6, 2017

tannergooding commented Oct 6, 2017

tannergooding commented Oct 6, 2017

erozenfeld commented Oct 6, 2017

benaadams commented Oct 6, 2017

tannergooding commented Oct 6, 2017

ahsonkhan commented Dec 15, 2017

RussKeldorph commented Dec 16, 2017 • edited Loading

ahsonkhan commented Dec 16, 2017 • edited Loading

jkotas commented Dec 16, 2017 • edited Loading

MusorGaming commented Jan 1, 2019

josetr commented Mar 28, 2019

jkotas commented Sep 16, 2019

choikwa commented Nov 1, 2016 •

edited

Loading

choikwa commented Nov 1, 2016 •

edited

Loading

choikwa commented Nov 2, 2016 •

edited

Loading

jkotas commented Jun 12, 2017 •

edited

Loading

jkotas commented Jun 15, 2017 •

edited

Loading

benaadams commented Jun 15, 2017 •

edited

Loading

benaadams commented Sep 2, 2017 •

edited

Loading

GSPP commented Sep 3, 2017 •

edited

Loading

Drawaes commented Sep 3, 2017 •

edited

Loading

RussKeldorph commented Dec 16, 2017 •

edited

Loading

ahsonkhan commented Dec 16, 2017 •

edited

Loading

jkotas commented Dec 16, 2017 •

edited

Loading