Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose various integer intrinsics for Avx512F, Avx512BW, and Avx512CD #85833

Merged
merged 23 commits into from
May 10, 2023

Conversation

tannergooding
Copy link
Member

This exposes some instructions unique to the AVX512 family of instructions making progress towards completing:

Remaining after this are the Scatter/Gather and Shuffle/Permute intrinsics

@ghost ghost assigned tannergooding May 5, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels May 5, 2023
@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@ghost
Copy link

ghost commented May 5, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

Issue Details

This exposes some instructions unique to the AVX512 family of instructions making progress towards completing:

Remaining after this are the Scatter/Gather and Shuffle/Permute intrinsics

Author: tannergooding
Assignees: tannergooding
Labels:

area-CodeGen-coreclr, new-api-needs-documentation

Milestone: -

const TernaryLogicInfo& TernaryLogicInfo::lookup(uint8_t control)
{
// clang-format off
static const TernaryLogicInfo ternaryLogicFlags[256] = {
Copy link
Member Author

@tannergooding tannergooding May 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This table is 768 bytes and is about as small as we can make it.

The way the constants work is we have three keys:

  • A: 0xF0
  • B: 0xCC
  • C: 0xAA

To compute the correct control byte, you simply perform the corresponding operation on these keys. So, if you wanted to do (A & B) ^ C, you would compute (0xF0 & 0xCC) ^ 0xAA or 0x6A.

This table allows us to compute the inverse information, that is given a control, what are the operations it performs. This allows us to determine things like what operands are actually used (so we can correctly compute isRMW) and what operations are performed and in what order (such that we can do constant folding in the future).

The total set of operations supported are:

  • true: AllBitsSet
  • false: Zero
  • not: ~value
  • and: left & right
  • nand: ~(left & right)
  • or: left | right
  • nor: ~(left | right)
  • xor: left ^ right
  • xnor: ~(left ^ right)
  • cndsel: a ? b : c; aka (B & A) | (C & ~A)
  • major: 0 if two+ input bits are 0
  • minor: 1 if two+ input bits are 0

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put this comment in the source?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@tannergooding
Copy link
Member Author

Resolved feedback and fixed an issue where we were double encoding a vvvv register for emitOutputAM. Expecting everything to pass CI again this time.

@tannergooding
Copy link
Member Author

Initial diffs from the explicit vpternlog usage and containment fixes are good.

We see 9k saved bytes in minopts and 10k saved bytes in fullopts for Windows x64.

There is a bit of a TP hit (up to 0.1%) which is showing up for SIMD heavy code paths. This primarily comes from updating IF_RRD_*RD_CNS (for ARD, MRD, SRD) paths to handle hasCodeMI(ins). This ends up impacting several intrinsics including Extract, Shift, Shuffle, as they now need to execute more and are prominently used in various library paths.

Copy link
Member

@jakobbotsch jakobbotsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not super comfortable with the complexity of what happens in the importer, but I don't have a good way of simplifying and don't see anything obviously wrong anymore, so LGTM.

You may want to get a review from @BruceForstall or @kunalspathak for the emitter changes. And perhaps @EgorBo wants to look at the vectorization changes.

/// VPTERNLOGD xmm1 {k1}{z}, xmm2, xmm3/m128, imm8
/// The above native signature does not exist. We provide this additional overload for consistency with the other bitwise APIs.
/// </summary>
public static Vector128<sbyte> TernaryLogic(Vector128<sbyte> a, Vector128<sbyte> b, Vector128<sbyte> c, [ConstantExpected] byte control) => TernaryLogic(a, b, c, control);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's [ConstantExpected] by the way and do we switch-expand if it's called with a variable?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's an analyzer that flags if users aren't passing in a constant (or a value that is itself marked [ConstantExpected]). It's not perfect, but it gets most of the job done and flags the obvious cases where we'll have to pessimize.

Yes, we fallback to the recursive call and switch expand in that. Fixing that to be done late phase (such as lowering) is desirable but requires various ABI fixes so the call is happy with the SIMD args again.

GenTree* control;

control = gtNewIconNode(static_cast<uint8_t>((0xF0 | 0xCC) ^ 0xAA)); // (A | B)) ^ C
xor1 = gtNewSimdTernaryLogicNode(simdType, vec1, toLowerVec1, cnsVec1, control, baseType, simdSize);
Copy link
Member

@EgorBo EgorBo May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be a sort of morph/lower pattern match optimization instead? So we can catch user patterns as well, this pattern doesn't look uncommon, ternarylogic seems to allow to fold many other pattens so would want to have a single place to do them all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're catching this one early because its simple/obvious. We do want to add the morph stuff as well, but that's a much more complex/involved PR. Handling these explicit cases early will also help with JIT throughput and avoid us needing to do the much more expensive general handling logic later.

If we decide just handling it in morph is better, than I can remove these explicit cases once that support is in place.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added some questions/comments. I assume you verified that the latest tests are executing in CI, right?

src/coreclr/jit/gentree.cpp Show resolved Hide resolved
// This table is 768 bytes and is about as small as we can make it.
//
// The way the constants work is we have three keys:
// * A: 0xF0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious, how did you come up with the values of these keys? Are they related to the encoding or just something you picked?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are directly part of the encoding and are covered in the architecture manuals.

// clang-format off
static const TernaryLogicInfo ternaryLogicFlags[256] = {
/* FALSE */ { TernaryLogicOperKind::False, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
/* norABC */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::ABC, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how does this work? When I read norABC I see that as A nor B nor C and having TernaryLogicUseFlags::ABC as op1Use while all the others are left None is slightly confusing to me.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

norABC is ~(A | B | C). They shorthand the ones that do the same operation on all three, where-as the full name would've been something like norAorBC

Most of these control bytes do exactly 2 operations at once and they are always read as oper Input1 Input2. So andAorBC is And(A, Or(B, C)) (aka. A & (B | C)).

The exception is the conditional select instructions which do 3 operations (with the third one always being the condition code that "picks" bits from the first or second operation).

/* FALSE */ { TernaryLogicOperKind::False, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
/* norABC */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::ABC, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
/* andCnorBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::And, TernaryLogicUseFlags::C, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
/* norBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to keep the ordering of use flags representation you have here?

Suggested change
/* norBA */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },
/* norAB */ { TernaryLogicOperKind::Nor, TernaryLogicUseFlags::AB, TernaryLogicOperKind::None, TernaryLogicUseFlags::None, TernaryLogicOperKind::None, TernaryLogicUseFlags::None },

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The names are taken from the architecture docs, so its helpful to keep them matching as it makes it possible to search for the name in the manual.

We use a normalized form in our flags to keep things simpler.

const TernaryLogicInfo& info = TernaryLogicInfo::lookup(control);
TernaryLogicUseFlags useFlags = info.GetAllUseFlags();

if (useFlags != TernaryLogicUseFlags::ABC)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure I completely follow what happens when useFlag == ABC. Can you elaborate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When all three operands are used we have a real ternary instruction and don't have an optimizations or normalization to do. This is because the instruction is already "complete" (aside from constant folding) and can't represent any more state.

For the cases when all operands aren't used, then we have the ability to represent the operation more optimally. This happens by either converting it to the regular unary/binary node (e.g. gtNewSimdBinOpNode(GT_AND, ...)) or by normalizing the format so that the used operands are B/C (which allows us to not be marked RMW and allows greatest chance for containment). It will also simplify the logic that will be added in a future PR to combine unary/binary ops into ternary logic nodes.

Copy link
Member

@kunalspathak kunalspathak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tannergooding
Copy link
Member Author

Found a pre-existing bug in the scalar ROL/ROR logic for x86. I've logged #86027 and worked around it in the tests for the time being

@tannergooding tannergooding merged commit 16559f9 into dotnet:main May 10, 2023
@tannergooding tannergooding deleted the avx512-5 branch May 10, 2023 16:04
@ghost ghost locked as resolved and limited conversation to collaborators Jun 9, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants