Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swapping with tuple deconstruction compiles to inferior IL code compared to using a temporary variable #53300

Closed
hamarb123 opened this issue May 10, 2021 · 11 comments · Fixed by #65327
Assignees
Labels
Area-Compilers Bug Concept-Continuous Improvement Tenet-Performance Regression in measured performance of the product from goals.
Milestone

Comments

@hamarb123
Copy link

Description

Swapping / reordering variables compiles to slower IL code when using tuple destruction than a temporary variable.
I tested this in Release mode.

The following C# code

int x = 1;
int y = 2;
int z = 3;
(x, y)=(y, x);
Console.WriteLine((x,y,z)); //this line is to stop compiling out the variables

should be equivalent in IL output to

int x = 1;
int y = 2;
int z = 3;
int temp = x;
x = y;
y = temp;
Console.WriteLine((x,y,z)); //this line is to stop compiling out the variables

, but the first compiles to:

// Code size       31 (0x1f)
.maxstack  3
.locals init (int32 V_0,
         int32 V_1,
         int32 V_2,
         int32 V_3)
ldc.i4.1
stloc.0
ldc.i4.2
stloc.1
ldc.i4.3
stloc.2
ldloc.1
ldloc.0
stloc.3
stloc.0
ldloc.3
stloc.1
ldloc.0
ldloc.1
ldloc.2
newobj     instance void valuetype [System.Runtime]System.ValueTuple`3<int32,int32,int32>::.ctor(!0,
                                                                                                           !1,
                                                                                                           !2)
box        valuetype [System.Runtime]System.ValueTuple`3<int32,int32,int32>
call       void [System.Console]System.Console::WriteLine(object)
ret

and the second compiles to

// Code size       29 (0x1d)
.maxstack  3
.locals init (int32 V_0,
         int32 V_1,
         int32 V_2)
ldc.i4.1
stloc.0
ldc.i4.2
stloc.1
ldc.i4.3
stloc.2
ldloc.0
ldloc.1
stloc.0
stloc.1
ldloc.0
ldloc.1
ldloc.2
newobj     instance void valuetype [System.Runtime]System.ValueTuple`3<int32,int32,int32>::.ctor(!0,
                                                                                                           !1,
                                                                                                           !2)
box        valuetype [System.Runtime]System.ValueTuple`3<int32,int32,int32>
call       void [System.Console]System.Console::WriteLine(object)
ret

With the essential code being for (1):

ldloc.1
ldloc.0
stloc.3
stloc.0
ldloc.3
stloc.1

, and for (2):

ldloc.0
ldloc.1
stloc.0
stloc.1

I did test this on .NET 5 locally, but also on an online .NET 6 compiler at https://sharplab.io/, which uses a very recent version.
I also imagine that a similar thing happens for other variants, such as:

(x, y) = (x, x);
(x, y, z, w) = (w, y, z, x);

etc.
Surely these destruction assignments should be as simple as ld all of the variables, and then st them all (as long as there's not too many).
Interestingly, the tuple destruction code essentially uses a temporary variable in IL, whereas the temporary variable code eliminates it.
Not sure if this belongs on roslyn or on here.

@dotnet-issue-labeler
Copy link

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Wraith2
Copy link

Wraith2 commented May 10, 2021

If you check the JIT asm on sharplab you'll find they have the same output.

@hamarb123
Copy link
Author

Yes, that makes sense, but I still think that we should produce the best IL possible, since it will take less time to JIT and will be smaller. Especially since it seems that they bothered to do it for C# temporary variable so it should be easier to implement if it's already implemented elsewhere, and I'm sure that they had a good reason to do it for C# temporary variable.

@Joe4evr
Copy link

Joe4evr commented May 10, 2021

but I still think that we should produce the best IL possible

While that may be reasonable for you to assume, the only responsibility the compiler has is producing correct IL. Producing optimized IL would require much more analysis, most of which can't be done if you're compiling something like a class library (no entry point).

  • It's not always clear what "optimal" IL looks like. One must anticipate the capabilities and limitations of the downstream code generator.

- #15929 (comment)

since it will take less time to JIT and will be smaller

And your metrics for this statement are where?

@hamarb123
Copy link
Author

hamarb123 commented May 10, 2021

I see your point re making correct IL, that's probably fair enough then, but this optimisation has nothing to do whether it's a class library or not, it's just swapping locals, it doesn't rely on any library at all, or the possible behaviour differences between a library and an executable (which shouldn't be different if you ask me since you can reference it even if it has an entry point).
Also, by smaller, I meant the IL is smaller, which it is.
I can take a good guess with the JITting time since it is just the same instruction, but more unnecessary copies of it.

@jeffschwMSFT jeffschwMSFT transferred this issue from dotnet/runtime May 10, 2021
@dotnet-issue-labeler dotnet-issue-labeler bot added Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels May 10, 2021
@jaredpar jaredpar added Bug Concept-Continuous Improvement Tenet-Performance Regression in measured performance of the product from goals. Area-Compilers and removed Area-Compilers untriaged Issues and PRs which have not yet been triaged by a lead labels Jul 13, 2021
@jaredpar jaredpar added this to the Backlog milestone Jul 13, 2021
@Neme12
Copy link
Contributor

Neme12 commented May 4, 2022

I just noticed this conversation in dotnet/runtime: dotnet/runtime#64412 (comment)
Apparently, they avoid using (a, b) = (b, a) to swap variables because of the worse IL. It would be nice if this generated the same code as the classic way of swapping 2 variables, because this is a really nice, clear and explicit way of doing the swapping and helps readability. If it generates worse IL, some projects like dotnet/runtime will avoid using this, which is unfortunate.

@Neme12
Copy link
Contributor

Neme12 commented May 4, 2022

@Joe4evr

While that may be reasonable for you to assume, the only responsibility the compiler has is producing correct IL. Producing optimized IL would require much more analysis, most of which can't be done if you're compiling something like a class library (no entry point).

Yes, the only guarantee the compiler makes is producing correct IL, which means that this isn't a bug, but it's still a reasonable feature request and an improvement, especially given the big focus on performance in today's .NET. Also, this would just be a small focused optimization to eliminate a local or avoid creating it in the first place, nothing to do with interprocedural optimization, so your comment about a class library and entry points is irrelevant.

@stephentoub
Copy link
Member

stephentoub commented Sep 4, 2022

@hamarb123
Copy link
Author

hamarb123 commented Sep 4, 2022

I'd also like to say that in terms of inlining, tuple destruction can be a turn off sometimes.
The shortest version of the method (where you just ld all the destinations and values, and then just st them all - I will elaborate below), which I would think would also be the simplest to implement, generates the smallest code IL Code, and equivalent JIT code to the longer form (which has the better JIT). It also allows the .maxstack 8 tiny header optimisation for reordering up to 4 things.
https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AbEBLDAfAASQAI...

Elaboration on what IL code I propose gets emitted:
For code like (a, b, c, ...) = (d, e, f, ...) the following code would be emitted:
Load a's storage
Load d
Load b's storage
Load e
Load c's storage
Load f
...
Store to c
Store to b
Store to a
This method should also work when one address is in multiple positions due to how it's ordered, so no special checking if they're the same address would be needed.

For a field on the left (a):
Load a's storage - load the object for an instance field on a class, or the address for an instance field on a value type, or none for a static field.
Store to a - stfld or stsfld

For a pointer or byref on the left (a):
Load a's storage - load the value (value being the pointer or byref) itself
Store to a - stind or appropriate short form

For a local variable (a):
Load a's storage - none
Store to a - stloc

And for things on the right, it is just the standard way of loading them onto the stack.

@AlekseyTs AlekseyTs changed the title Swapping with tuple destruction compiles to inferior IL code compared to using a temporary variable Swapping with tuple deconstruction compiles to inferior IL code compared to using a temporary variable Sep 30, 2022
@RikkiGibson RikkiGibson assigned RikkiGibson and unassigned jcouv Oct 27, 2022
@RikkiGibson RikkiGibson removed this from the Backlog milestone Oct 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area-Compilers Bug Concept-Continuous Improvement Tenet-Performance Regression in measured performance of the product from goals.
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

9 participants