proposal: math/bits: add Deposit and Extract functions #45455
I should note that, as another alternative, the compiler could recognize certain mask/shift expressions and turn them into PDEP/PEXT. However, I think it would be easier to teach the compiler to recognize (e.g.) calls to the proposed Extract/Deposit functions.
Are BMI1 and BMI2 part of the instruction set that we assume everywhere for amd64, or will they require CPU feature flag checks? (If they're available everywhere, we should add the peephole optimizations you mentioned.)
@josharian Interestingly, per Wikipedia, PDEP and PEXT may have slower performance than certain software methods on AMD architectures between Excavator (2015) and Zen 3 (2020). https://github.com/zwegner/zp7 is related.
If you adopt these for varint decoding but find yourself on a non-x86 platform, falling back to the unaccelerated forms, would the varint decoding end up slower than it would have been otherwise? Note also that Hacker's Delight calls these operations compress and expand (if I am looking at the right sections). I've not heard Deposit before, but it's possible I am misunderstanding the exact operations being proposed here.
This proposal has been added to the active column of the proposals project.
I'm less worried about teaching the compiler than teaching users and updating old code (and doing that correctly). If we can make the compiler understand how to turn an expression like the latter into some advanced instruction, then we (a) don't need new API, and (b) optimize all the existing code using those kinds of patterns. That seems like a win-win?
I suspect so. That's why I also proposed #45454 to make it possible for code to detect whether it's being compiled for a target that includes these instructions.
Yes, I think those are the same operations. Section 7-5 (Expand, or Generalized Insert) mentions too "This function has also been called unpack, scatter, and deposit." I don't feel strongly about the names. I used "Deposit" and "Extract" because that's what Intel calls them. If there's reason to favor other naming conventions for the same operations, I think that's fine. It's easy enough to cross-reference common names in the documentation.
My concern is that detecting appropriate expressions seems likely to be subtle and fragile, and users will end up needing to double check the compiled output to make sure they and the compiler both got it right. My intuition is also that people using this functionality for performance will want to structure their code differently if the instructions aren't available anyway, whereas users who don't care about performance would prefer more concise code. So I think both of these users would again benefit from having functionality in math/bits, rather than the compiler recognizing special patterns.
It's a bit unfortunate to have a portable API in math/bits and then still have to have build tags to use it effectively. Is there some slightly higher-level API that we can implement efficiently? Or, if this is very specific to varint encoding, can we adjust the implementation in encoding/binary to use the right tricks internally? (Write the expression, perhaps build-tag-guarded, with a test that the compiler is compiling it the right way, but all behind the current package API.) There are other places that use varint encoding too, so fixing it in encoding/binary could potentially help many use cases (for example, import/export data too).
Ack. It's not great. One possibility would be to go the style of syscall / x/sys, and have a "math/bits/amd64" package where this function is declared, and the declaration is already guarded to only be available for amd64.v3 (i.e., no fallback).
Off hand, I'm not sure. I think we need more experience to identify that. I think if we were to generalize the runtime/internal/sys package's intrinsics to an internal/* package, that would give some opportunity to explore optimizations within the Go standard library before committing to a particular API for end users. It's also possible that runtime + encoding/binary's varint stuff end up being the only profitable places for using this. |
If we go the route of math/bits/amd64, which I'm not opposed to, then we are basically walking toward the way that C works: for C, each CPU architecture defines a set of header files that expose CPU-specific intrinsics. In general, performance requirements are a strong incentive to figure out a way to provide access to CPU-specific capabilities, one way or another. Introducing the math/bits package gave us a mechanism for providing that access when we could provide a generic implementation that was fine to use on processors without specific support. Now we need to figure out a mechanism for other capabilities. For example, one approach we could consider would be:

```go
// Deposit64 takes the low bits from val and deposits them into a uint64 result
// at the corresponding bit locations that are set in mask.
// ??? Why not also pass in an initial value whose bits are set?
func Deposit64(val, mask uint64) uint64

// IsDeposit64Fast reports whether the Deposit64 function is faster than doing
// the operation in ordinary Go.
// This will be true iff the processor has specific support for the operation.
// ??? Would be nice if this could be a const, but that would not permit
// runtime CPU detection.
var IsDeposit64Fast bool
```

Likely there are better possible ideas.
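As a rough illustration of how callers might branch on such a flag (the names below are hypothetical stand-ins for the sketched API, not real math/bits identifiers; the fallback is an ordinary bit loop):

```go
package main

import "fmt"

// isDeposit64Fast stands in for the proposed variable; a real
// implementation would set it from CPU feature detection (e.g. BMI2).
var isDeposit64Fast = false

// deposit64Generic is the portable fallback: it spreads the low bits
// of val out to the bit positions that are set in mask.
func deposit64Generic(val, mask uint64) uint64 {
	var res uint64
	i := 0
	for m := mask; m != 0; m &= m - 1 {
		if val&(1<<i) != 0 {
			res |= m & -m // lowest remaining mask bit
		}
		i++
	}
	return res
}

func deposit(val, mask uint64) uint64 {
	if isDeposit64Fast {
		// A real build would call the PDEP-backed Deposit64 here.
	}
	return deposit64Generic(val, mask)
}

func main() {
	// The low four bits of val land in the four set bits of the mask.
	fmt.Printf("%#x\n", deposit(0xf, 0x1111)) // 0x1111
}
```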
I feel like having slower-than-optimal fallback implementations is acceptable here. With suitable build constraints, high-performance code ends up structured the same either way: moving the intrinsic to a different package doesn't change the fact that callers need to use a build constraint to decide whether to use it. And with the functions in math/bits, users who don't need peak performance still get concise, portable code.
PDEP sets the non-masked bits to 0, so we'd need to do more work to peephole optimize that case, but I think that's not a problem. However, users can also just write the extra masking step explicitly.
If it was
I'm still trying to chart a path through this that ends up with APIs that we don't believe overfit too much to a particular architecture, even a dominant one like x86. As was said earlier, it's possible that the only profitable use case for these is varints, in which case we end up with a bit of troublesome API for relatively little applicability. I'd really like to continue to avoid machine-specific packages and APIs. It's incredibly easy to just add "math/bits/amd64". It's a lot more work to find portable APIs, but if we can do that, the result is better for the ecosystem, for portability, and for Go's long-term viability.

encoding/binary and the compiler are already fairly tightly coupled. I don't believe the compiler knows directly about the package's import path, but there was definitely some co-evolution between code patterns the compiler recognizes and code patterns used in that package to make sure they match, with the result that, I believe, things like binary.BigEndian.Uint32 compile into a bswap on x86 and presumably whatever the most efficient forms are on other systems as well.

It looks like binary.Uvarint and binary.PutUvarint would be the places where this would be most impactful. What if the compiler just knew the right inlined implementations, same as it does for math.Sqrt and most of math/bits? Or perhaps there is a middle ground where we adjust the implementations so that the key pattern the compiler can optimize is more recognizable?
I think directly intrinsifying binary.Uvarint and binary.PutUvarint would work. I think it might still be easier to keep them as Go code built on top of internal-only intrinsics, though. I suggested above generalizing runtime/internal/sys (which already provides runtime-specific CPU intrinsics, albeit mostly just math/bits stuff) to internal/*, so that we could experiment with additional intrinsics for use by the runtime, encoding/binary, and maybe other std packages, but not yet directly by end users. Does that sound reasonable as a near-term solution while we work out how (and whether) to publicly expose more intrinsics?
Some additional data that may be helpful: paging through a Sourcegraph search for |
Aren't both instructions useful for UTF-8 rune decoding and encoding too? |
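Plausibly, yes. As an illustration of the encoding direction: producing the three-byte UTF-8 form is a deposit of the rune's bits into the payload positions of `1110xxxx 10xxxxxx 10xxxxxx`, ORed with the fixed leading bits. The sketch below uses a portable software stand-in for PDEP; it is purely illustrative, not an API from the proposal.

```go
package main

import "fmt"

// pdep is a portable stand-in for the PDEP instruction: the low bits
// of val are spread out to the bit positions that are set in mask.
func pdep(val, mask uint64) uint64 {
	var res uint64
	i := 0
	for m := mask; m != 0; m &= m - 1 {
		if val&(1<<i) != 0 {
			res |= m & -m // lowest remaining mask bit
		}
		i++
	}
	return res
}

func main() {
	// Three-byte UTF-8 has the shape 1110xxxx 10xxxxxx 10xxxxxx.
	// Viewed as a big-endian byte triple, the payload bit positions
	// form the mask 0x0f3f3f and the fixed leading bits are 0xe08080.
	r := uint64('€') // U+20AC
	w := pdep(r, 0x0f3f3f) | 0xe08080
	fmt.Printf("% x\n", []byte{byte(w >> 16), byte(w >> 8), byte(w)}) // e2 82 ac
}
```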
Sounds fine. (And anything in internal doesn't need to be in the proposal process.) |
It sounds like we've converged on, at least for now, doing something not user-visible that makes package binary's varint code faster. If that's correct, then this proposal should probably be declined. Do I have that correct? |
Yeah, I think it's fine to decline this proposal for now in favor of internal-only intrinsics. If concrete use cases outside of the standard library arise, I think we can revisit this. |
Based on the discussion above, this proposal seems like a likely decline. |
No change in consensus, so declined. |
The x86-64-v3 instruction set includes the "Bit Manipulation Instruction" sets BMI1 and BMI2. Many of these are useful as mere peephole optimizations (e.g., ANDN for ~x & y, BLSI for x & -x, BLSR for x & (x - 1)), but the "parallel bits deposit" (PDEP) and "parallel bits extract" (PEXT) instructions could benefit from higher-level abstractions.
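For reference, the BMI1 peephole patterns above correspond to plain Go expressions; the instruction named in each comment is what a compiler could select (illustrative only, not an API):

```go
package main

import "fmt"

func main() {
	x, y := uint64(0b1011_0100), uint64(0b1111_0000)

	fmt.Printf("%b\n", ^x&y)    // ANDN: bits set in y but not in x
	fmt.Printf("%b\n", x&-x)    // BLSI: isolate the lowest set bit
	fmt.Printf("%b\n", x&(x-1)) // BLSR: clear the lowest set bit
}
```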
PEXT takes a value and a mask and extracts all the bits included in the mask (like AND), but then compacts them to the bottom of the result word. PDEP performs the opposite operation (taking bits from the bottom of a value and spreading them out according to a mask).
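The semantics can be sketched in pure Go with a bit-by-bit loop (lower-cased names here are illustrative helpers, not the proposed math/bits API):

```go
package main

import "fmt"

// extract64 mimics PEXT: it gathers the bits of val selected by mask
// and packs them into the low bits of the result.
func extract64(val, mask uint64) uint64 {
	var res uint64
	i := 0 // next result bit to fill
	for m := mask; m != 0; m &= m - 1 {
		if val&(m&-m) != 0 { // m & -m is the lowest remaining mask bit
			res |= 1 << i
		}
		i++
	}
	return res
}

// deposit64 mimics PDEP: it spreads the low bits of val out to the
// bit positions that are set in mask.
func deposit64(val, mask uint64) uint64 {
	var res uint64
	i := 0
	for m := mask; m != 0; m &= m - 1 {
		if val&(1<<i) != 0 {
			res |= m & -m
		}
		i++
	}
	return res
}

func main() {
	fmt.Printf("%#x\n", extract64(0xabcd, 0xff00)) // high byte, compacted: 0xab
	fmt.Printf("%#x\n", deposit64(0xab, 0xff00))   // low byte, spread upward: 0xab00
}
```

Note that extract followed by deposit with the same mask reproduces the masked bits, which is a handy property for testing.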
These instructions are especially useful for varint decoding (e.g., extracting and compacting all of the payload bits in a single instruction), which is a very frequent operation in protobuf decoding and is also somewhat hot within the Go runtime.
It's possible today to write them as assembly helper functions, but the function call overhead (assembly functions cannot be inlined) ends up negating much of the potential benefit.
The function signatures would be func Deposit64(val, mask uint64) uint64, func Extract64(val, mask uint64) uint64, etc. See https://go-review.googlesource.com/c/go/+/299491/2/src/math/bits/bmi2.go#24 for generic implementations.
Questions:
These operations are admittedly fairly x86-specific, and it seems questionable whether other CPUs will add them. Is math/bits still the best place for them?
Do we need the full generality of uint{,8,16,32,64} variants? Haswell only provides PEXT{L,Q} and PDEP{L,Q}, but it's easy to implement 8- and 16-bit wrappers with zero-extension instructions.
The generic fallback code requires a loop, but the compiler could recognize Extract/Deposit calls with constant masks and lower them into fixed shift-and-mask sequences. Is this worthwhile?
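To illustrate that lowering, an extract with the constant two-byte varint mask 0x7f7f collapses into two shifts and masks. This is a hand-specialized sketch, not actual compiler output:

```go
package main

import "fmt"

// extractVarint2 is pext(x, 0x7f7f) specialized by hand: keep bits
// 0-6 in place and move bits 8-14 down to positions 7-13.
func extractVarint2(x uint64) uint64 {
	return x&0x7f | x>>1&0x3f80
}

func main() {
	// The varint bytes 0xAC 0x02 (the encoding of 300), loaded
	// little-endian, form the word 0x02ac.
	fmt.Println(extractVarint2(0x02ac)) // 300
}
```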
Alternatively, there are a bunch of ideas from C compilers that we can take inspiration from (e.g., inline assembly like GCC et al; limited inlining of functions written in assembly like SunPro; provide CPU-specific intrinsic functions like ICC et al).