Save 260k in InitValueNumStoreStatics #85945

markples · 2023-05-08T22:07:23Z

It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This rewrites it as a const table. It adds some redundant information to the tables that we #include/#define in several places but currently includes many assertions that the old and new values match.

Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,735,680. InitValueNumStoreStatics (261k) is replaced by 1.2k of static data.

Resolves #85953

ghost · 2023-05-08T22:07:35Z

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This change outlines a bunch of those operations, which results in big code size savings, but the operations are still happening at initialization time at runtime.

Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,755,648. InitValueNumStoreStatics goes from 241k to 20k.

Author:	markples
Assignees:	markples
Labels:	`area-CodeGen-coreclr`
Milestone:	-

jkotas · 2023-05-08T22:30:31Z

the operations are still happening at initialization time at runtime.

Is it desirable? Would it be better to initialize it at compile instead?

markples · 2023-05-08T22:52:37Z

Is it desirable? Would it be better to initialize it at compile instead?

I can take a closer look. The table creation isn't as simple as some other cases where we've done this.

BruceForstall · 2023-05-08T23:05:29Z

cc @tannergooding

markples · 2023-05-09T02:47:19Z

So this doesn't quite make it static, though I've gotten suspicious about initialization order and need to check that.

jkotas · 2023-05-09T03:50:05Z

src/coreclr/jit/valuenum.h

+        const UINT8& operator[](size_t idx) const { return m_arr[idx]; }
+    private:
+        static constexpr unsigned GetArity(unsigned oper);
+        static constexpr unsigned GetCommutative(unsigned oper);
+        static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);
+
+        UINT8 m_arr[VNF_COUNT];


Suggested change

const UINT8& operator[](size_t idx) const { return m_arr[idx]; }

private:

static constexpr unsigned GetArity(unsigned oper);

static constexpr unsigned GetCommutative(unsigned oper);

static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);

UINT8 m_arr[VNF_COUNT];

const uint8_t& operator[](size_t idx) const { return m_arr[idx]; }

private:

static constexpr unsigned GetArity(unsigned oper);

static constexpr unsigned GetCommutative(unsigned oper);

static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);

uint8_t m_arr[VNF_COUNT];

Nit: We want to stick to std C++ type names where possible.

tannergooding · 2023-05-09T16:11:37Z

What's missing that would allow this all to become constant?

It looks like the bulk of the logic comes from setting the VNFOA_* flags which are essentially all constant anyways and could just be something like:

static uint8_t vnfOpAttribs[VNF_COUNT] {
#define ValueNumFuncDef(vnf, arity, commute, knownNonNull, sharedStatic) \
    static_cast<uint8_t>( \
        (commute ? VNFOA_Commutative : 0) |
        (knownNonNull ? VNFOA_KnownNonNull : 0) |
        (sharedStatic ? VNFOA_SharedStatic : 0) |
        ((arity > 0) ? ((arity << VNFOA_ArityShift) & VNFOA_ArityMask) : VNFOA_ArityMask)
    )
#include "valuenumfuncs.h"
#undef ValueNumFuncDef
};

Then, for the hwintrinsics which can have a non-constant arity, we can simply check for ((vnfOpAttribs[n] >> VNFOA_ArityShift) & VNFOA_ArityMask) == VNFOA_ArityMask and in that case defer to a runtime check when encounted for the relatively few hwintrinsics impacted.

Likewise, the need for arity++ when the simdBaseType needs to be encoded could/should just be a flag in the hwintrinsiclist*.h tables, which also accounts for cases where there are different instructions but the operations are semantically the same. for example, andps vs andpd vs pand are different instructions but all do the same bitwise op with no differing semantics and so for the purposes of VN can be treated the same. But today they get arity++ instead.

The same should be possible for genTreeOps as we should be able to place any necessary info into the gtlist.h defines.

markples · 2023-05-09T19:02:22Z

I think the short answer is that "just" getting it to work is what's needed, but the details could be problematic.

The initial version of this PR was a tiny change that achieved most of the space savings.

The current version doesn't build on Linux due to the compiler there complaining about c++14 extensions. Perhaps there is a path forward (maybe less constexpr). There are probably also ordering issues.

The #define static initializer is the typical pattern in the JIT, but there are a lot of details here:

The dependency on gentree introduces ordering issues. These can be eliminated by not using anything based on gtOperKindTable, but that means duplicating logic. As a simple example, the existing path calls GenTree::OperIsUnary which could instead be manually reprocessing the GTK_ flags in gtlist.h or adding a duplicate arity field. (I think another alternative is to move the various static table initializations to the same file.)
For things that are representational changes in the table, it would probably be best to propose separate PRs so that they are not tangled with the table rewrite.
For non-constant arity, it looks like these are currently encoded as zero (along with actual zero-arity entities). Distinguishing them sounds like a good change, but is it connected to this?
For the arity++ case, the information is already encoded in the table and calculated by Compiler::vnEncodesResultTypeForHWIntrinsic, though this also has an ordering issue. It sounds like you are suggesting adding duplicate information to the table (the result of vnEncodesResultTypeForHWIntrinsic?), which could be deleted or used as an assertion check. Is that correct?
I have a partial prototype of using the #define strategy to replicate the existing table, but it takes 10 minutes to compile. Simplifying the logic (like using an argument rather than calling vnEncodesResultTypeForHWIntrinsic) could help, though I commented out a bunch of logic and still see about 2 minutes. Perhaps I still have some anti-pattern around. I haven't tried with the compiler on Linux yet.

tannergooding · 2023-05-09T19:17:16Z

The dependency on gentree introduces ordering issues. These can be eliminated by not using anything based on gtOperKindTable, but that means duplicating logic. As a simple example, the existing path calls GenTree::OperIsUnary which could instead be manually reprocessing the GTK_ flags in gtlist.h or adding a duplicate arity field. (I think another alternative is to move the various static table initializations to the same file.)

I think for this either duplicating the info or changing how its represented in #define GTNODE would be better. For example, I imagine it would be overall better for GTNODE to have an explicit arity member and for gtOperKindTable to set GTK_UNOP or GTK_BINOP based on this arity. Particularly since the "arity" is something we need in multiple places but the exact way its represented differs between GenTree and ValueNum.

For things that are representational changes in the table, it would probably be best to propose separate PRs so that they are not tangled with the table rewrite.

👍

For non-constant arity, it looks like these are currently encoded as zero (along with actual zero-arity entities). Distinguishing them sounds like a good change, but is it connected to this?

I think its connected in general to the premise of avoiding dynamic initialization of the tables. I think it can be post-poned as a separate fix, however.

For the arity++ case, the information is already encoded in the table and calculated by Compiler::vnEncodesResultTypeForHWIntrinsic, though this also has an ordering issue. It sounds like you are suggesting adding duplicate information to the table (the result of vnEncodesResultTypeForHWIntrinsic?), which could be deleted or used as an assertion check. Is that correct?

Right, I'm suggesting we add a field to the HARDWARE_INTRINSIC define that allows us to determine if the simdBaseType needs to be encoded.

As it is today, this is basically pessimized to be "if any two base types use different instructions, encode the base type". We should fix that longer term, but short term simply adding a field to the hwintrinsiclist*.h defines would avoid all this complex dynamic computation.

I have a partial prototype of using the #define strategy to replicate the existing table, but it takes 10 minutes to compile. Simplifying the logic (like using an argument rather than calling vnEncodesResultTypeForHWIntrinsic) could help, though I commented out a bunch of logic and still see about 2 minutes. Perhaps I still have some anti-pattern around. I haven't tried with the compiler on Linux yet.

I'm surprised it takes this long. We have much more metadata in the actual hwintrinsiclist*.h and instrs*.h tables and those take practically no time at all to compile.

markples · 2023-05-11T23:32:14Z

An update here -

I have added some new data to the tables to support building a const table. For now, the old code is still there and is used to assert that the new method matches the old method. This could be removed in favor of the static_asserts in this PR (and something new to more directly test the vnEncodesResultTypeForHWIntrinsic replacement), though merging this with the assertion and removing it later feels safer.

I haven't made any changes to the logic where we use -1 or 0 as special values, so perhaps nothing needs to be split out from this PR at this point? Perhaps I could expand the tables and add assertions to the old logic as an intermediate point, but I'm less concerned since I'm not changing the special values.

The win is up to 260k now that all of the code is gone. I do still need to handle the merge conflict and haven't seen a Linux build yet.

markples · 2023-05-11T23:37:32Z

The very slow build times from before were due to having lots of conditionals leading to a huge method (the dynamic initializer) with lots of control flow. That might be resolved due to having a "nicer" initializer now, but I've also made those bit sets straight-line code. It does need some comments.

Conflicts in hwintrinsiclistxarch.h and gtlist.h

markples · 2023-05-31T02:05:49Z

This should now do everything at compile-time (except for the validation in debug builds), and the latest merge conflict was fairly simple. It doesn't change the table at all (include the somewhat strange entry for the BOUNDARY value...), so it should be safe change.

markples · 2023-06-01T18:00:09Z

@tannergooding Could you please take a look at this again?

tannergooding · 2023-06-04T02:02:19Z

src/coreclr/jit/hwintrinsiclistarm64.h

+HARDWARE_INTRINSIC(Vector64,      EqualsAny,                                                         8,      2,     false,  {INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid},     HW_Category_Helper,                HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
+HARDWARE_INTRINSIC(Vector64,      ExtractMostSignificantBits,                                        8,      1,     false,  {INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid},     HW_Category_Helper,                HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
+HARDWARE_INTRINSIC(Vector64,      Floor,                                                             8,      1,     false,  {INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid},     HW_Category_Helper,                HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
+HARDWARE_INTRINSIC(Vector64,      get_AllBitsSet,                                                    8,      0,      true,  {INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni},        HW_Category_Helper,                HW_Flag_NoCodeGen|HW_Flag_SpecialImport)


Shouldn't this be encodesExtraTypeArg: false under the current rules?

The current rule is effectively that if all INS_* entries are the same then the type doesn't need to be encoded.

Thank you for the review and especially for looking closely enough to ask this question. I was concerned for a bit because the DEBUG checks should catch any case like this, and they aren't firing. It turns out that the "are the same" logic only applies to xarch. For arm64 the current rule is just that having 2+ entries indicates that the type needs to be encoded.

tannergooding · 2023-06-04T02:02:56Z

src/coreclr/jit/hwintrinsiclistarm64.h

+HARDWARE_INTRINSIC(Vector64,      Floor,                                                             8,      1,     false,  {INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid},     HW_Category_Helper,                HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
+HARDWARE_INTRINSIC(Vector64,      get_AllBitsSet,                                                    8,      0,      true,  {INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni,           INS_mvni},        HW_Category_Helper,                HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
+HARDWARE_INTRINSIC(Vector64,      get_One,                                                           8,      0,     false,  {INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid,        INS_invalid},     HW_Category_Helper,                HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
+HARDWARE_INTRINSIC(Vector64,      get_Zero,                                                          8,      0,      true,  {INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi,           INS_movi},        HW_Category_Helper,                HW_Flag_NoCodeGen|HW_Flag_SpecialImport)


Same for get_Zero here and for V128.

tannergooding

Changes overall LGTM.

There's some cleanup that this will enable for the hwintrinsics in particular and which we can do anytime after this gets merged. There are for example, cases like NI_ISA_And where different instructions might be used but where all operations are equivalent functionality wise and so the type arg doesn't actually need to be encoded.

markples · 2023-06-05T10:42:34Z

For some reason, the build analysis appears to be listing the same failure as known and unknown.

Save 240k in InitValueNumStoreStatics

ae2ecf0

ghost assigned markples May 8, 2023

dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 8, 2023

markples added 2 commits May 8, 2023 17:18

constexpr in gentree

cc2a644

Make valuenum s_vnfOpAttribs a const table

4109e02

markples changed the title ~~Save 240k in InitValueNumStoreStatics~~ Save 250k in InitValueNumStoreStatics May 9, 2023

no std

3754327

jkotas reviewed May 9, 2023

View reviewed changes

markples marked this pull request as draft May 9, 2023 07:21

markples added 7 commits May 9, 2023 18:41

experiment

d11e5db

Expand tables, draft of vn table, restore old code for assertions

34d1e09

fix assertion to not incremental from the new data

3e9ad0c

fix tables

8fb5a28

assertions for tables + fixes

c8b8b0c

cleanup, mainly undoing unnecessary changes from main

ce579fe

more cleanup

5cf6b8b

markples changed the title ~~Save 250k in InitValueNumStoreStatics~~ Save 260k in InitValueNumStoreStatics May 11, 2023

markples added 3 commits May 12, 2023 02:54

Merge remote-tracking branch 'dotnet/main' into jitsize

fa9ca35

Conflicts in hwintrinsiclistxarch.h and gtlist.h

Update header comments

42e59df

Merge remote-tracking branch 'dotnet/main' into jitsize

0180232

markples added 8 commits May 22, 2023 16:36

Merge remote-tracking branch 'dotnet/main' into jitsize

e7c0585

finish merge

69cec94

fix

0965c20

undo whitespace change

38c5e2d

jitformat

c5cceba

attempt to fix constexpr errors from linux build

0db814c

jitformat

baec94c

add helper, jitformat

80130a0

build-analysis bot mentioned this pull request May 23, 2023

Tracking issue for CI build timeouts #76454

Closed

markples added 2 commits May 30, 2023 18:38

Merge remote-tracking branch 'dotnet/main' into jitsize

09b3066

oops

3311223

markples marked this pull request as ready for review May 31, 2023 02:05

Merge remote-tracking branch 'dotnet/main' into jitsize

8c4179e

build-analysis bot mentioned this pull request Jun 3, 2023

AppHost.Bundle.Tests.BundleLocalizedApp.Bundled_Localized_App_Run_Succeeds failing #85993

Closed

tannergooding self-requested a review June 3, 2023 04:28

tannergooding reviewed Jun 4, 2023

View reviewed changes

tannergooding approved these changes Jun 4, 2023

View reviewed changes

markples merged commit 9f8f653 into dotnet:main Jun 5, 2023

ghost locked as resolved and limited conversation to collaborators Jul 5, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Save 260k in InitValueNumStoreStatics #85945

Save 260k in InitValueNumStoreStatics #85945

markples commented May 8, 2023 •

edited

Loading

ghost commented May 8, 2023

jkotas commented May 8, 2023 •

edited

Loading

markples commented May 8, 2023

BruceForstall commented May 8, 2023

markples commented May 9, 2023

jkotas May 9, 2023

tannergooding commented May 9, 2023

markples commented May 9, 2023 •

edited

Loading

tannergooding commented May 9, 2023

markples commented May 11, 2023

markples commented May 11, 2023

markples commented May 31, 2023

markples commented Jun 1, 2023

tannergooding Jun 4, 2023

markples Jun 5, 2023

tannergooding Jun 4, 2023

tannergooding left a comment

markples commented Jun 5, 2023

Save 260k in InitValueNumStoreStatics #85945

Save 260k in InitValueNumStoreStatics #85945

Conversation

markples commented May 8, 2023 • edited Loading

ghost commented May 8, 2023

jkotas commented May 8, 2023 • edited Loading

markples commented May 8, 2023

BruceForstall commented May 8, 2023

markples commented May 9, 2023

jkotas May 9, 2023

Choose a reason for hiding this comment

tannergooding commented May 9, 2023

markples commented May 9, 2023 • edited Loading

tannergooding commented May 9, 2023

markples commented May 11, 2023

markples commented May 11, 2023

markples commented May 31, 2023

markples commented Jun 1, 2023

tannergooding Jun 4, 2023

Choose a reason for hiding this comment

markples Jun 5, 2023

Choose a reason for hiding this comment

tannergooding Jun 4, 2023

Choose a reason for hiding this comment

tannergooding left a comment

Choose a reason for hiding this comment

markples commented Jun 5, 2023

markples commented May 8, 2023 •

edited

Loading

jkotas commented May 8, 2023 •

edited

Loading

markples commented May 9, 2023 •

edited

Loading