Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Save 260k in InitValueNumStoreStatics #85945

Merged
merged 25 commits into from
Jun 5, 2023
Merged

Save 260k in InitValueNumStoreStatics #85945

merged 25 commits into from
Jun 5, 2023

Conversation

markples
Copy link
Member

@markples markples commented May 8, 2023

It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This rewrites it as a const table. It adds some redundant information to the tables that we #include/#define in several places but currently includes many assertions that the old and new values match.

Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,735,680. InitValueNumStoreStatics (261k) is replaced by 1.2k of static data.

Resolves #85953

@ghost ghost assigned markples May 8, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label May 8, 2023
@ghost
Copy link

ghost commented May 8, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

It appears that the big #define for the intrinsics is causing InitValueNumStoreStatics to get big enough that C++ optimization ends up being disabled, which means a lot of constant operations aren't folded. This change outlines a bunch of those operations, which results in big code size savings, but the operations are still happening at initialization time at runtime.

Local size change of release clrjit_win_x64_x64.dll: 2,001,920 -> 1,755,648. InitValueNumStoreStatics goes from 241k to 20k.

Author: markples
Assignees: markples
Labels:

area-CodeGen-coreclr

Milestone: -

@jkotas
Copy link
Member

jkotas commented May 8, 2023

the operations are still happening at initialization time at runtime.

Is it desirable? Would it be better to initialize it at compile instead?

@markples
Copy link
Member Author

markples commented May 8, 2023

Is it desirable? Would it be better to initialize it at compile instead?

I can take a closer look. The table creation isn't as simple as some other cases where we've done this.

@BruceForstall
Copy link
Member

cc @tannergooding

@markples markples changed the title Save 240k in InitValueNumStoreStatics Save 250k in InitValueNumStoreStatics May 9, 2023
@markples
Copy link
Member Author

markples commented May 9, 2023

So this doesn't quite make it static, though I've gotten suspicious about initialization order and need to check that.

Comment on lines 369 to 375
const UINT8& operator[](size_t idx) const { return m_arr[idx]; }
private:
static constexpr unsigned GetArity(unsigned oper);
static constexpr unsigned GetCommutative(unsigned oper);
static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);

UINT8 m_arr[VNF_COUNT];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
const UINT8& operator[](size_t idx) const { return m_arr[idx]; }
private:
static constexpr unsigned GetArity(unsigned oper);
static constexpr unsigned GetCommutative(unsigned oper);
static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);
UINT8 m_arr[VNF_COUNT];
const uint8_t& operator[](size_t idx) const { return m_arr[idx]; }
private:
static constexpr unsigned GetArity(unsigned oper);
static constexpr unsigned GetCommutative(unsigned oper);
static constexpr unsigned GetFunc(int arity, bool commute, bool knownNonNull, bool sharedStatic);
uint8_t m_arr[VNF_COUNT];

Nit: We want to stick to std C++ type names where possible.

@markples markples marked this pull request as draft May 9, 2023 07:21
@tannergooding
Copy link
Member

What's missing that would allow this all to become constant?

It looks like the bulk of the logic comes from setting the VNFOA_* flags which are essentially all constant anyways and could just be something like:

static uint8_t vnfOpAttribs[VNF_COUNT] {
#define ValueNumFuncDef(vnf, arity, commute, knownNonNull, sharedStatic) \
    static_cast<uint8_t>( \
        (commute ? VNFOA_Commutative : 0) |
        (knownNonNull ? VNFOA_KnownNonNull : 0) |
        (sharedStatic ? VNFOA_SharedStatic : 0) |
        ((arity > 0) ? ((arity << VNFOA_ArityShift) & VNFOA_ArityMask) : VNFOA_ArityMask)
    )
#include "valuenumfuncs.h"
#undef ValueNumFuncDef
};

Then, for the hwintrinsics which can have a non-constant arity, we can simply check for ((vnfOpAttribs[n] >> VNFOA_ArityShift) & VNFOA_ArityMask) == VNFOA_ArityMask and in that case defer to a runtime check when encounted for the relatively few hwintrinsics impacted.

Likewise, the need for arity++ when the simdBaseType needs to be encoded could/should just be a flag in the hwintrinsiclist*.h tables, which also accounts for cases where there are different instructions but the operations are semantically the same. for example, andps vs andpd vs pand are different instructions but all do the same bitwise op with no differing semantics and so for the purposes of VN can be treated the same. But today they get arity++ instead.

The same should be possible for genTreeOps as we should be able to place any necessary info into the gtlist.h defines.

@markples
Copy link
Member Author

markples commented May 9, 2023

I think the short answer is that "just" getting it to work is what's needed, but the details could be problematic.

The initial version of this PR was a tiny change that achieved most of the space savings.

The current version doesn't build on Linux due to the compiler there complaining about c++14 extensions. Perhaps there is a path forward (maybe less constexpr). There are probably also ordering issues.

The #define static initializer is the typical pattern in the JIT, but there are a lot of details here:

  • The dependency on gentree introduces ordering issues. These can be eliminated by not using anything based on gtOperKindTable, but that means duplicating logic. As a simple example, the existing path calls GenTree::OperIsUnary which could instead be manually reprocessing the GTK_ flags in gtlist.h or adding a duplicate arity field. (I think another alternative is to move the various static table initializations to the same file.)
  • For things that are representational changes in the table, it would probably be best to propose separate PRs so that they are not tangled with the table rewrite.
  • For non-constant arity, it looks like these are currently encoded as zero (along with actual zero-arity entities). Distinguishing them sounds like a good change, but is it connected to this?
  • For the arity++ case, the information is already encoded in the table and calculated by Compiler::vnEncodesResultTypeForHWIntrinsic, though this also has an ordering issue. It sounds like you are suggesting adding duplicate information to the table (the result of vnEncodesResultTypeForHWIntrinsic?), which could be deleted or used as an assertion check. Is that correct?
  • I have a partial prototype of using the #define strategy to replicate the existing table, but it takes 10 minutes to compile. Simplifying the logic (like using an argument rather than calling vnEncodesResultTypeForHWIntrinsic) could help, though I commented out a bunch of logic and still see about 2 minutes. Perhaps I still have some anti-pattern around. I haven't tried with the compiler on Linux yet.

@tannergooding
Copy link
Member

The dependency on gentree introduces ordering issues. These can be eliminated by not using anything based on gtOperKindTable, but that means duplicating logic. As a simple example, the existing path calls GenTree::OperIsUnary which could instead be manually reprocessing the GTK_ flags in gtlist.h or adding a duplicate arity field. (I think another alternative is to move the various static table initializations to the same file.)

I think for this either duplicating the info or changing how its represented in #define GTNODE would be better. For example, I imagine it would be overall better for GTNODE to have an explicit arity member and for gtOperKindTable to set GTK_UNOP or GTK_BINOP based on this arity. Particularly since the "arity" is something we need in multiple places but the exact way its represented differs between GenTree and ValueNum.

For things that are representational changes in the table, it would probably be best to propose separate PRs so that they are not tangled with the table rewrite.

👍

For non-constant arity, it looks like these are currently encoded as zero (along with actual zero-arity entities). Distinguishing them sounds like a good change, but is it connected to this?

I think its connected in general to the premise of avoiding dynamic initialization of the tables. I think it can be post-poned as a separate fix, however.

For the arity++ case, the information is already encoded in the table and calculated by Compiler::vnEncodesResultTypeForHWIntrinsic, though this also has an ordering issue. It sounds like you are suggesting adding duplicate information to the table (the result of vnEncodesResultTypeForHWIntrinsic?), which could be deleted or used as an assertion check. Is that correct?

Right, I'm suggesting we add a field to the HARDWARE_INTRINSIC define that allows us to determine if the simdBaseType needs to be encoded.

As it is today, this is basically pessimized to be "if any two base types use different instructions, encode the base type". We should fix that longer term, but short term simply adding a field to the hwintrinsiclist*.h defines would avoid all this complex dynamic computation.

I have a partial prototype of using the #define strategy to replicate the existing table, but it takes 10 minutes to compile. Simplifying the logic (like using an argument rather than calling vnEncodesResultTypeForHWIntrinsic) could help, though I commented out a bunch of logic and still see about 2 minutes. Perhaps I still have some anti-pattern around. I haven't tried with the compiler on Linux yet.

I'm surprised it takes this long. We have much more metadata in the actual hwintrinsiclist*.h and instrs*.h tables and those take practically no time at all to compile.

@markples markples changed the title Save 250k in InitValueNumStoreStatics Save 260k in InitValueNumStoreStatics May 11, 2023
@markples
Copy link
Member Author

An update here -

I have added some new data to the tables to support building a const table. For now, the old code is still there and is used to assert that the new method matches the old method. This could be removed in favor of the static_asserts in this PR (and something new to more directly test the vnEncodesResultTypeForHWIntrinsic replacement), though merging this with the assertion and removing it later feels safer.

I haven't made any changes to the logic where we use -1 or 0 as special values, so perhaps nothing needs to be split out from this PR at this point? Perhaps I could expand the tables and add assertions to the old logic as an intermediate point, but I'm less concerned since I'm not changing the special values.

The win is up to 260k now that all of the code is gone. I do still need to handle the merge conflict and haven't seen a Linux build yet.

@markples
Copy link
Member Author

The very slow build times from before were due to having lots of conditionals leading to a huge method (the dynamic initializer) with lots of control flow. That might be resolved due to having a "nicer" initializer now, but I've also made those bit sets straight-line code. It does need some comments.

@markples
Copy link
Member Author

This should now do everything at compile-time (except for the validation in debug builds), and the latest merge conflict was fairly simple. It doesn't change the table at all (include the somewhat strange entry for the BOUNDARY value...), so it should be safe change.

@markples markples marked this pull request as ready for review May 31, 2023 02:05
@markples
Copy link
Member Author

markples commented Jun 1, 2023

@tannergooding Could you please take a look at this again?

HARDWARE_INTRINSIC(Vector64, EqualsAny, 8, 2, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, ExtractMostSignificantBits, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_BaseTypeFromFirstArg|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, Floor, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, get_AllBitsSet, 8, 0, true, {INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be encodesExtraTypeArg: false under the current rules?

The current rule is effectively that if all INS_* entries are the same then the type doesn't need to be encoded.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the review and especially for looking closely enough to ask this question. I was concerned for a bit because the DEBUG checks should catch any case like this, and they aren't firing. It turns out that the "are the same" logic only applies to xarch. For arm64 the current rule is just that having 2+ entries indicates that the type needs to be encoded.

HARDWARE_INTRINSIC(Vector64, Floor, 8, 1, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_SpecialImport|HW_Flag_NoCodeGen)
HARDWARE_INTRINSIC(Vector64, get_AllBitsSet, 8, 0, true, {INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni, INS_mvni}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
HARDWARE_INTRINSIC(Vector64, get_One, 8, 0, false, {INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid, INS_invalid}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
HARDWARE_INTRINSIC(Vector64, get_Zero, 8, 0, true, {INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi, INS_movi}, HW_Category_Helper, HW_Flag_NoCodeGen|HW_Flag_SpecialImport)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for get_Zero here and for V128.

Copy link
Member

@tannergooding tannergooding left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes overall LGTM.

There's some cleanup that this will enable for the hwintrinsics in particular and which we can do anytime after this gets merged. There are for example, cases like NI_ISA_And where different instructions might be used but where all operations are equivalent functionality wise and so the type arg doesn't actually need to be encoded.

@markples
Copy link
Member Author

markples commented Jun 5, 2023

For some reason, the build analysis appears to be listing the same failure as known and unknown.

@markples markples merged commit 9f8f653 into dotnet:main Jun 5, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Jul 5, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[JIT] x64 InitValueNumStoreStatics is 261k
4 participants