Improvements around `fixed` #2418

gfoidl · 2023-03-26T13:11:22Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

avoid length-check in pinning spans
don't use a dummy fixed sized buffer to set the size of the Block8x8 struct
in Block8x8F.Load avoid unnecessary copying of data -- read it directly from the source

Only where it seems profitable. E.g. not when a UnmanagedMemoryStream is constructed of that pointer.

gfoidl

Notes for review.

gfoidl · 2023-03-26T13:12:22Z

src/ImageSharp/ColorSpaces/Companding/SRgbCompanding.cs

@@ -167,7 +167,7 @@ public static float Compress(float channel)
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe void CompandAvx2(Span<Vector4> vectors, float[] table)
    {
-        fixed (float* tablePointer = &table[0])
+        fixed (float* tablePointer = &MemoryMarshal.GetArrayDataReference(table))


This avoids the bound-checks.
See sharplab for example.

gfoidl · 2023-03-26T13:14:31Z

src/ImageSharp/Common/Helpers/SimdUtils.HwIntrinsics.cs

@@ -644,7 +644,7 @@ public static class HwIntrinsics
            ReadOnlySpan<byte> source,
            Span<float> dest)
        {
-            fixed (byte* sourceBase = source)
+            fixed (byte* sourceBase = &MemoryMarshal.GetReference(source))


This avoids the length-check.

fixed (byte* sourceBase = source)

is compiled to (via pattern matching)

fixed (byte* ptr = &span.GetPinnableReference())

and Span<T>.GetPinnableReference does a length-check which we don't need here, so we can omit it.

See sharplab for example.

Do such (out-of-hot-loop) length checks have a measurable impact on perf? The new code feels bloated, I'm not in favor of such changes if there are no tangible gains.

For this method in HwIntrinsics I don't think it's really measurable in (micro-) benchmarks. But it reduces the machine code a bit and the length-check is just not needed at all here.

I'm more on the eager side to squeeze out unnecessary instructions.
But when you want me this change to revert I'm fine with that too.

The new code feels bloated

The C# code is bloated. That's the obvious downside.

We typically call these methods for entire image scanlines (->100's or 1000's of elements in the span). I would assume having an extra bounds check before the loop negligible, unless there are benchmarks proving it otherwise.

The C# code is bloated.

When you turn it into semi-C, yes. But you don't always have to; we need to make informed, individual decisions regarding perf vs simplicity.

OK - I'll revert.

gfoidl · 2023-03-26T13:16:57Z

src/ImageSharp/Formats/Jpeg/Components/Block8x8.cs

@@ -15,25 +15,14 @@ namespace SixLabors.ImageSharp.Formats.Jpeg.Components;
 /// 8x8 matrix of <see cref="short"/> coefficients.
 /// </summary>
 // ReSharper disable once InconsistentNaming
-[StructLayout(LayoutKind.Explicit)]
-internal unsafe partial struct Block8x8
+[StructLayout(LayoutKind.Explicit, Size = 2 * Size)]


The unused fixed sized buffer to specify the size of the struct isn't neede, as the size can be set directly.

Codegen for creating the struct is also a bit better -- only zeroing the memory, instead of stack-cookie + zeroing.
Though that won't matter much -- the reduction of C# code is the motivation for this change.

gfoidl · 2023-03-26T13:19:58Z

src/ImageSharp/Formats/Jpeg/Components/Block8x8F.cs

-        Block8x8F result = default;
-        result.LoadFrom(data);


Going via instance method needs to copy two times, which can be avoided.

In sharplab it's hard to see (no method names, just the address of the call), so here from the JIT directly:

JIT assembly

; Assembly listing for method Foo:Default(System.Span`1[float]):Block8x8F ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; Tier-1 compilation ; optimized code ; rsp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 4 single block inlinees; 0 inlinees without PGO data G_M000_IG01: ;; offset=0000H 56 push rsi 4881EC00010000 sub rsp, 256 488BF1 mov rsi, rcx G_M000_IG02: ;; offset=000BH 488B12 mov rdx, bword ptr [rdx] 488D0C24 lea rcx, bword ptr [rsp] 41B800010000 mov r8d, 256 E80330AF5F call CORINFO_HELP_MEMCPY 488BCE mov rcx, rsi 488D1424 lea rdx, bword ptr [rsp] 41B800010000 mov r8d, 256 E8F12FAF5F call CORINFO_HELP_MEMCPY 488BC6 mov rax, rsi G_M000_IG03: ;; offset=0032H 4881C400010000 add rsp, 256 5E pop rsi C3 ret ; Total bytes of code 59 ; Assembly listing for method Foo:PR(System.Span`1[float]):Block8x8F_PR ; Emitting BLENDED_CODE for X64 CPU with AVX - Windows ; Tier-1 compilation ; optimized code ; rsp based frame ; partially interruptible ; No PGO data ; 0 inlinees with PGO data; 3 single block inlinees; 0 inlinees without PGO data G_M000_IG01: ;; offset=0000H 56 push rsi 488BF1 mov rsi, rcx G_M000_IG02: ;; offset=0004H 488B12 mov rdx, bword ptr [rdx] 488BCE mov rcx, rsi 41B800010000 mov r8d, 256 E80B2FAF5F call CORINFO_HELP_MEMCPY 488BC6 mov rax, rsi G_M000_IG03: ;; offset=0018H 5E pop rsi C3 ret ; Total bytes of code 26

Just wondering: is it the same gain when you compare use cases on call sites, ie.

Block8x8F b = Block8x8F.Load(data)`;

after the PR change vs. typical usage on main:

Block8x8F b = default; b.LoadFrom(data);

?

It depends what's done with the instance of Block8x8F. In this example Default2 shows no difference, as the JIT is smart enough to figure out what's going on. Namely: load the data onto the stack, then operate with the instance method of that type and use data on the stack.
But as soon the Block8x8F is passed around a copy needs to be made. Thus for me the instance LoadFrom has the potential for a perf-trap, and should be removed.

Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange for an value-type*. Here a static method seems the right fit, like int.Parse, Vector128.Load, etc.
As it's an internal type this change seems fine from this view.

* for reference types this is pretty much OK

Besides that from an API point of view having an instance method to load (or better "fill") the block with data is a bit strange

Note that Block8x8F is really huge. If I remember correctly, in early Jpeg code there were cases when existing on-stack blocks were being reused by repopulating it. If it's no longer the case in the current code, I'm happy with the change.

Checked it -- all usages are safe. Block8x8F is either passed by reference, and only copied where needed (e.g. constructors).

existing on-stack blocks were being reused by repopulating

When I understand it right, then this is a simple example of it.

Both variants work and produce the same machine-code.
(It's because the compilers hoists the loop-local block variable, and as it's a value-type the stack space needs to be reserved anyway).
Even if there are two blocks read then it's the same machine code.

I'll see if the C# code here can be simplified.

Similar to a534328

gfoidl · 2023-03-27T17:17:08Z

src/ImageSharp/Formats/Webp/WebpCommonUtils.cs

@@ -25,7 +25,7 @@ public static unsafe bool CheckNonOpaque(Span<Bgra32> row)
            ReadOnlySpan<byte> rowBytes = MemoryMarshal.AsBytes(row);
            int i = 0;
            int length = (row.Length * 4) - 3;
-            fixed (byte* src = rowBytes)
+            fixed (byte* src = &MemoryMarshal.GetReference(rowBytes))


Ups, I've forgotten this one. Couldn't see them in the local diffs.

JimBobSquarePants · 2023-03-29T11:22:10Z

Is this ready for another review?

gfoidl · 2023-03-29T11:41:32Z

Is this ready for another review?

Yes, please.

JimBobSquarePants

Looks good. Thanks!

gfoidl added 3 commits March 26, 2023 15:10

Avoid length-check in pinning spans

a4ad7b0

Only where it seems profitable. E.g. not when a UnmanagedMemoryStream is constructed of that pointer.

Block8x8 set explicit size instead of (unused) fixed sized buffer field

4532194

Optimized Block8x8F

a534328

gfoidl commented Mar 26, 2023

View reviewed changes

gfoidl added 3 commits March 27, 2023 10:40

Merge branch 'main' into fixed-improvements

528469f

Block8x8 Load and CopyTo simplified

f348d70

Similar to a534328

Reverted some pinning changes to have less verbose C# code

4b84aec

gfoidl commented Mar 27, 2023

View reviewed changes

gfoidl and others added 3 commits March 27, 2023 19:19

Reverted Webp/WebpCommonUtils fixed-verbosity

66b2f32

Merge branch 'main' into fixed-improvements

df4ae9e

Merge branch 'main' into fixed-improvements

f65122b

JimBobSquarePants approved these changes Mar 30, 2023

View reviewed changes

JimBobSquarePants merged commit b6736b0 into SixLabors:main Mar 30, 2023

gfoidl deleted the fixed-improvements branch March 30, 2023 09:09

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements around `fixed` #2418

Improvements around `fixed` #2418

gfoidl commented Mar 26, 2023

gfoidl left a comment

gfoidl Mar 26, 2023

gfoidl Mar 26, 2023

antonfirsov Mar 27, 2023

gfoidl Mar 27, 2023

antonfirsov Mar 27, 2023 •

edited

Loading

gfoidl Mar 27, 2023

gfoidl Mar 26, 2023

gfoidl Mar 26, 2023

antonfirsov Mar 27, 2023

gfoidl Mar 27, 2023

antonfirsov Mar 27, 2023

gfoidl Mar 27, 2023

gfoidl Mar 27, 2023

JimBobSquarePants commented Mar 29, 2023

gfoidl commented Mar 29, 2023

JimBobSquarePants left a comment

Improvements around fixed #2418

Improvements around fixed #2418

Conversation

gfoidl commented Mar 26, 2023

Prerequisites

Description

gfoidl left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

antonfirsov Mar 27, 2023 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

JimBobSquarePants commented Mar 29, 2023

gfoidl commented Mar 29, 2023

JimBobSquarePants left a comment

Choose a reason for hiding this comment

Improvements around `fixed` #2418

Improvements around `fixed` #2418

antonfirsov Mar 27, 2023 •

edited

Loading