Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement count with SSE 4.2 and AVX2 #202

Merged
merged 10 commits into from
May 15, 2021
Merged

Conversation

0xd34df00d
Copy link
Contributor

This provides a fairly significant performance boost: counting lines via BS.count 10 was taking about 1.1 s with an mmaped 1.8 gigabyte file, and now it takes 0.35 s. For comparison, length . BS.lines takes about 0.4 s (IMO count should definitely be at least as fast), and Unix wc takes about 0.3-0.4 s.

@sjakobi
Copy link
Member

sjakobi commented Feb 3, 2020

Interesting!

I'm wondering why count didn't use memchr already while other functions do:

bytestring/Data/ByteString.hs

Lines 1001 to 1013 in 95fe6bd

split :: Word8 -> ByteString -> [ByteString]
split _ (PS _ _ 0) = []
split w (PS x s l) = loop 0
where
loop !n =
let q = accursedUnutterablePerformIO $ withForeignPtr x $ \p ->
memchr (p `plusPtr` (s+n))
w (fromIntegral (l-n))
in if q == nullPtr
then [PS x (s+n) (l-n)]
else let i = accursedUnutterablePerformIO $ withForeignPtr x $ \p ->
return (q `minusPtr` (p `plusPtr` s))
in PS x (s+n) (i-n) : loop (i+1)

bytestring/Data/ByteString.hs

Lines 1085 to 1089 in 95fe6bd

elemIndex :: Word8 -> ByteString -> Maybe Int
elemIndex c (PS x s l) = accursedUnutterablePerformIO $ withForeignPtr x $ \p -> do
let p' = p `plusPtr` s
q <- memchr p' c (fromIntegral l)
return $! if q == nullPtr then Nothing else Just $! q `minusPtr` p'

c29225b mentions using memchr for count but goes with a different implementation for some reason?!

It would be nice if you could share your benchmark but I guess that's somewhat blocked on #196.

#109 also looks related.

@0xd34df00d
Copy link
Contributor Author

My benchmark is something really silly, along the lines of

import qualified Data.ByteString as BS
import System.IO.Posix.MMap

main :: IO ()
main = do
  contents <- unsafeMMapFile "/huge/file/on/tmpfs.txt"
  print $ BS.count 10 contents

The file size is about 2 gigs, and it's on a tmpfs partition with no swapping, so no IO involved. Since the file is mmaped, there's also no GC pressure at all.

Thanks for the pointer to #109 , I missed it! While I'm at this task I can just as well try to brush up my SIMD intrinsics skills and also implement a version using PCMPESTRM as suggested in one of the comments — it might indeed be beneficial, especially for strings where the character in question occurs often enough.

In case I get decent results, what's bytestring's policy on SIMD, supported architectures and adding flags for those, or C compiler dependency? IIRC given code along the lines of (I don't remember the right attribute syntax off the top of my head though)

__attribute__((target("sse42")))
void foo() { ... }

__attribute__((target("avx")))
void foo() { ... }

/* fallback */
void foo() { ... }

gcc can generate some dispatching code to select the right implementation at runtime according to the current (runtime) CPU architecture.

@0xd34df00d
Copy link
Contributor Author

0xd34df00d commented Feb 6, 2020

Alright, did some experimenting with PCMPESTRM, and the results are quite in favour of it, especially given that its performance doesn't depend on the character frequency.

  1. When looking for '\n' (with a frequency of about 0.8%), memchr-based run time is about 311 ms. With PCMPESTRM it's a bit (but statistically significantly) lower — about 295 ms.
  2. When looking for ' ' that occurs exactly twice as often, memchr runtime degrades to 425 ms, while PCMPESTRM stays at 295 ms.
  3. When looking for , with frequency of about 10% (my test file is a CSV), memchr is dying at about 1150 ms, while PCMPESTRM — you guessed it — stays at 295 ms.

So I guess if every other character would be the character in question, memchr-based solution would be a no-go.

My gut feel is that with PCMPESTRM the performance is limited by CPI, so I'll play around some more to see if using 256-bit-wide registers and instructions would help, so I'm not pushing what I currently have, but just for the reference, it's

__attribute__((target("sse4.2")))
unsigned long fps_count(unsigned char *str, unsigned long len, unsigned char w) {
    __m128i pat = _mm_set1_epi8(w);

    const int mode = _SIDD_SBYTE_OPS | _SIDD_CMP_EQUAL_EACH;

    unsigned long res = 0;

    size_t i = 0; 

    for (; i < len && (intptr_t)(str + i) % 64; ++i) {
        res += str[i] == w;
    }

    for (; i < len - 64; i += 64) {
        __m128i p1 = _mm_load_si128((const __m128i*)(str + i + 16 * 0));
        __m128i p2 = _mm_load_si128((const __m128i*)(str + i + 16 * 1));
        __m128i p3 = _mm_load_si128((const __m128i*)(str + i + 16 * 2));
        __m128i p4 = _mm_load_si128((const __m128i*)(str + i + 16 * 3));
        __m128i r1 = _mm_cmpestrm(p1, 16, pat, 16, mode);
        __m128i r2 = _mm_cmpestrm(p2, 16, pat, 16, mode);
        __m128i r3 = _mm_cmpestrm(p3, 16, pat, 16, mode);
        __m128i r4 = _mm_cmpestrm(p4, 16, pat, 16, mode);
        res += _popcnt64(_mm_extract_epi64(r1, 0));
        res += _popcnt64(_mm_extract_epi64(r2, 0));
        res += _popcnt64(_mm_extract_epi64(r3, 0));
        res += _popcnt64(_mm_extract_epi64(r4, 0));
    }

    for (; i < len; ++i) {
        res += str[i] == w;
    }

    return res;
}

Let's probably start discussing how SIMD extensions-dependent things should be integrated?

@sjakobi
Copy link
Member

sjakobi commented Feb 6, 2020

Wow, those results with PCMPESTRM look very promising!

Let's probably start discussing how SIMD extensions-dependent things should be integrated?

Indeed – I can't be of much help there though! :)

Ping @hvr, @dcoutts!

@0xd34df00d You can also try garnering more feedback on the libraries mailing list.

@0xd34df00d 0xd34df00d changed the title Use memchr in count Implement count with SSE 4.2 and AVX2 Feb 13, 2020
@0xd34df00d
Copy link
Contributor Author

@sjakobi spent some more time at this, and AVX2 looks even better (see the table at https://github.com/0xd34df00d/counting-chars)!

I made some effort to make sure this still works on machines that don't have AVX2 (or even SSE4.2), but I'm not sure about the treatment of compilers other than gcc and clang, so ideally some other pair of eyes better suited for this sort of stuff than me would take a look at this.

@Bodigrim
Copy link
Contributor

@0xd34df00d this is a very inspiring result! Could you please raise a question about SIMD policies on Libraries maillist?

@cartazio
Copy link

THIS IS REAALLY COOL DEADBEEF, though of course i need to read it much close :)

@cartazio
Copy link

for what its worth, theres no simd optimization guidelines for cbits parts of haskell libraries, and the main bottleneck /engineering challenge is having the right dynamic code selection piece, which this PR seems to do!

one important question is always: whats the overhead of the dynamic dispatch on smaller inputs? (eg, is it better to only do the simd when inputs are above some size threshold?) plus stuff about which cpu generations actually benefit (which is tricky!)

return fps_count_naive(str, len, w);
#else
if (len <= 1024) {
return fps_count_naive(str, len, w);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sweet, it does do the fallback for small inputs :)

@cartazio
Copy link

your dynamic dispatch looks correct, in isolation, but, if i'm reading the gcc/clang docs about function multi versioning, i think you're either using it wrong or underusing it. Or i need to read/play with those more and understand the tricks myself.

reference urls:
https://gcc.gnu.org/wiki/FunctionMultiVersioning
https://llvm.org/devmtg/2014-10/Slides/Christopher-Function%20Multiversioning%20Talk.pdf
https://hannes.hauswedell.net/post/2017/12/09/fmv/
https://clang.llvm.org/docs/AttributeReference.html#cpu-dispatch

https://gcc.gnu.org/onlinedocs/gcc-10.1.0/gcc/Function-Multiversioning.html#Function-Multiversioning

to be clear: my fuzzy understanding is that if the file was compiled in c++ mode, then the different attributes but same function name would do an auto select? (and here you're doing a dynamic dispatch via c either way, so they have deliberately different names). --- also clang apparently has different syntax for the attribute stuff..

to be clear, i literally learned this was a feature today :) , so thats great!

@cartazio
Copy link

@cartazio
Copy link

@0xd34df00d
Copy link
Contributor Author

Yay, thank you @cartazio for taking a look at this!

Regarding the dynamic dispatch — yes, gcc surely can and will generate the dispatching code from these attributes. clang, on the other hand, doesn't seem to always do this — at least, in my experience I had to manually write a similar dispaching code about 3 years ago. Maybe things got better since then!

Of course, I'm all for dropping manual detection and dispatch (the less code, the better), but I'm not entirely comfortable doing so given the experience above.

Also, speaking of the overhead of dynamic dispatch — in any case it only happens once per OS/C thread (or what's the right Haskell-lingo for this), and then it's a matter of a single comparison of a thread-local function pointer against 0, which a CPU's branch predictor will hopefully very quickly learn. My primary reason for using the naive function for smaller inputs is to avoid all the overhead for unaligned prologue/epilogue handlers of the SIMD versions.

@cartazio
Copy link

no, no, it looks fine, i'm just not often given the experience of seeing well written dispatching portable simd.

I think that you dont need to do the attribute annotation if you're doing the manual dispatch? otoh, i literally have zero experience with this stuff so not sure :) .

i think the naive fall back for smaller inputs is the right choice though, so that seems fine to me.

@cartazio
Copy link

lets make sure the build goes green and this code works with both recent gcc and clang both! :)

@sjakobi
Copy link
Member

sjakobi commented May 22, 2020

A good compatibility test would be to build this on GHC's CI infrastructure. I've meant to properly document this for a while, but essentially you

  1. Fork GHC on https://gitlab.haskell.org/ghc/ghc
  2. Update the libraries/bytestring submodule to point at your fork of bytestring
  3. Start a CI pipeline for your GHC branch at https://gitlab.haskell.org/username/ghc/pipelines

@cartazio
Copy link

@0xd34df00d some build errors you should fix ;)


cbits/fpstring.c:214:9: error:
606     warning: implicit declaration of function ‘__get_cpuid_count’ [-Wimplicit-function-declaration]
607         if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
608             ^
609    |
610214 |     if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
611    |         ^
612cbits/fpstring.c: In function ‘select_fps_simd_impl’:
613
614cbits/fpstring.c:214:9: error:
615     warning: implicit declaration of function ‘__get_cpuid_count’ [-Wimplicit-function-declaration]
616         if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
617             ^
618    |
619214 |     if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {

@cartazio
Copy link

i think the build is erroring there, but idk :)

@0xd34df00d
Copy link
Contributor Author

@cartazio,

I think that you dont need to do the attribute annotation if you're doing the manual dispatch?

That depends on the compiler flags like -march (I think gcc and clang behave exactly the same here). If gcc/clang generate code for, say, generic x86_64 machine, long story short, they don't know they are allowed to use SSE 4.2 or AVX, and they will error out.

Re the build error, that's interesting, I'll take a look! Might be a different standard lib with different intrinsics exposed. Are you building on mac by any chance?

@sjakobi, thanks for the instructions, will do!

@Bodigrim
Copy link
Contributor

@0xd34df00d Travis build complains about undefined __get_cpuid_count.

@Bodigrim Bodigrim added blocked: patch-needed somebody needs to write a patch or contribute code performance labels Aug 21, 2020
@Bodigrim
Copy link
Contributor

Bodigrim commented Jan 6, 2021

@0xd34df00d are you still interested in this PR?

@Bodigrim
Copy link
Contributor

Bodigrim commented Jan 9, 2021

  • __get_cpuid_count requires gcc-6 and must be guarded by #if __GNUC__ > 5.
  • The patch does not compile on MacOS and FreeBSD with argument to '__builtin_ia32_pcmpestrm128' must be a constant integer failures. That's because clang is stricter than gcc.

These are probably solvable issues, but such amount of conditional compilation makes me uncomfortable to proceed. We are not in a position to test bytestring build against all possible combinations of C compilers and platforms.

@vdukhovni what's your take?

@0xd34df00d
Copy link
Contributor Author

@Bodigrim sorry for long reply, this completely fell out of my attention span. I'm definitely still interested!

Frankly, neither am I comfortable with such amount of platforms that I cannot test this on (especially for such an important and foundational library). Short-term, seems like the best way is to have a cabal flag, controlling whether the hand-written SIMD implementation is used for this function (and possibly others), and disabled by default.

What I'd like to see as an ideal long-term solution is this function implemented purely in terms of GHC's SIMD primops. In fact, I tried that, but some things are missing in the set of available primops (namely, there's an undefined there in place of something, but I did that a while ago so I don't remember what exactly is missing off the top of my head).

@Bodigrim
Copy link
Contributor

Bodigrim commented Jan 9, 2021

Short-term, seems like the best way is to have a cabal flag, controlling whether the hand-written SIMD implementation is used for this function (and possibly others), and disabled by default.

Besides usability concerns (switching a flag of bytestring will cause everything to rebuild, including half of boot libraries), I'm unsure about the intention. Users care less about performance on a local machine than in production, but that's exactly where they have much less control over environment.

Supporting an additional cabal flag is very costly from maintenance perspective. More configurations to test, more potential build failures, more support.

I'm not opposed to SIMD in bytestring in general. Could your patch be rewritten in a more platform-independent way, possibly sacrificing some performance gains? CC @ethercrow who worked on #310.

@0xd34df00d
Copy link
Contributor Author

Ok, implemented this with C11's call_once. Although a bad libc implementation might have this function fairly unoptimized, it's only used for big enough strings, where the actual counting time much exceeds call_once. Although atomics almost certainly wouldt be faster, I'd much rather use a library function designed specifically for this task to avoid placing the burden of reasoning about atomics and the right memory orderings on myself and the next reader of the code.

@vdukhovni
Copy link
Contributor

Well, C11 call_once may not be terribly portable. For example, on FreeBSD 12.2 it requires linking with -lstdthreads. On Fedora 31 it requires linking with -lpthread. Either way, it is not found in just libc.

On MacOS "Big Sur", there is no <threads.h> header file.

@0xd34df00d
Copy link
Contributor Author

I guess this means POSIX's pthread_once is also not an option. Welp, atomics it is then!

@vdukhovni
Copy link
Contributor

Well, pthread_once is an option on Unix-like systems, assuming explicit linking with -lpthread is not an obstacle. It is at least more broadly available, but then perhaps you run into an issue with Windows...

My opinion remains that these sort of primitives belong in GHC itself as new priomops, and not so much in bytestring, but it seems I'm a minority view on this...

@Bodigrim
Copy link
Contributor

Bodigrim commented May 2, 2021

Well, I've heard talks about vectorized operations in GHC for years without much fruition. Even if suddenly there emerges a champion to push this stuff forward, it would still take a long time before bytestring could use it. If we can speed up things without a (potentially infinite) delay, it's a right to do so.

My personal opinion is that vectorized operations would not find way into GHC primitives until there accumulates a critical mass of libraries employing them. Then it would be easy to make a case in favor of.

@0xd34df00d
Copy link
Contributor Author

@vdukhovni I skimmed superficially through what GHC has on SIMD primops some time ago, and, while I more than agree with your sentiment for multiple reasons, I see several obstacles (in fact, I tried writing the above with primops, and stumbled upon a few missing ones).

Firstly, a more technical one: if I understand correctly, currently GHC only allows SIMD primops from GHC.Prim when building with the llvm codegen, which, in this case, means that -fllvm needs to be turned on when building bytestring. Of course, I'm sure that NCG can also be tought about the primops, so that's more of a technicality (that still needs to be accounted for, though).

Secondly, a more conceptual one: I'm not sure how to expose all the interesting instructions in an architecture-agnostic way. The usual things like vector addition are easy — they're present in pretty much any SIMD extension set with the same semantics. But if we are starting to consider things like cmpestrm, that becomes entirely non-obvious — I bet ARM doesn't have a direct substitute for this instruction.

One way to solve the latter is by allowing the user to define their own primops, that might look like

cmpestrm# :: Word8x16# -> Word8x16# -> Imm8# -> Word8x16#
cmpestrm# s1 s2 imm8 = [ghcMagic| pcmpestrm s1 s2 imm8 |]

or something along those lines (where Imm8# stands for some immediate value known at compile-time).

Although I'm not sure the depth of the can of worms it opens: in fact, the above definition should also accept lengths that shall be stored in eax/rax and edx/rdx (that's the contract of pcmpestrm), so it's really a compound instruction, and the codegen needs to know those registers get clobbered.

@Bodigrim
Copy link
Contributor

Bodigrim commented May 8, 2021

Benchmarks look impressive, 70-85% faster for long strings!
If there are no further suggestions (ping @hsyl20), I'll merge it in a couple of days.

@vdukhovni
Copy link
Contributor

Secondly, a more conceptual one: I'm not sure how to expose all the interesting instructions in an architecture-agnostic way. The usual things like vector addition are easy — they're present in pretty much any SIMD extension set with the same semantics. But if we are starting to consider things like cmpestrm, that becomes entirely non-obvious — I bet ARM doesn't have a direct substitute for this instruction.

What I had in mind is not so much primops for non-portable low-level instructions, but in fact direct support for higher-level primitives (like memchr(), ...) in GHC via appropriate low-level primitives when available, or naive loops otherwise. These should work not only for ByteString but also for ByteArray...

@cartazio
Copy link

cartazio commented May 8, 2021 via email

@Bodigrim Bodigrim merged commit 44edcaa into haskell:master May 15, 2021
@Bodigrim
Copy link
Contributor

Thanks, @0xd34df00d!

@Bodigrim Bodigrim added this to the 0.11.2.0 milestone May 16, 2021
Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request May 16, 2021
* Implement SSE4.2 and AVX2-based variants of the `count`

* Address PR comments wrt SIMD-based character counting

* Add test for `count` on longer strings

* Add benchmarks for strict bytestring's count method

* Fix tabs

* Use C11's call_once instead of TLS for choosing counting SIMD function

* Build cbits with -std=c11

* Add a point to `count` benchmarks slightly above the SIMD threshold

* Initialize the `count` SIMD ptr with atomics

* Enable `count` SIMD support only if atomics are available

Turns out a conforming C implementation might be running on a platform
with threads, but with no atomics in the C stdlib.
@chrisdone
Copy link
Member

I wonder about the performance of count combined with par, with the 1GB file split in half and processed in parallel. 350ms is a long time for my other cores to be doing nothing. 🤔

@cartazio
Copy link

cartazio commented May 17, 2021 via email

@vdukhovni
Copy link
Contributor

So if bytestring length is greater than say 100000 or 10_000 or something we should call a safe ffi version?

On Sun, May 16, 2021 at 4:42 PM Chris Done @.***> wrote: I wonder about the performance of count combined with par, with the 1GB file split in half and processed in parallel. 350ms is a long time for my other cores to be doing nothing. 🤔 — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#202 (comment)>, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAABBQW2WKI6KEPQRP7PEO3TOAU2PANCNFSM4KO4U6PQ .

Wouldn't that conflict with the push the to us unpinned memory?

@cartazio
Copy link

cartazio commented May 17, 2021 via email

@vdukhovni
Copy link
Contributor

My impression was that "unsafe" calls are not preempted by the GC, and don't necessarily need pinned memory. While "safe" calls do. Since there are proposals in flight to move ByteString to unpinned ByteArray, I'd expect those to be less amenable for use with "safe" calls.

@hsyl20
Copy link
Contributor

hsyl20 commented May 17, 2021

If there are no further suggestions (ping @hsyl20), I'll merge it in a couple of days.

Sorry for not being very active these days (first baby at home is taking quite some time). I don't have much to add anyway. Good work! I also wish we had a better story for simd primops in GHC but we don't even have a good story for sub-word (Word8#, Word16€, Word32#...) yet so it will take some time.

@cartazio
Copy link

I think bytestring won’t switch to unconditionally unpinned memory , but rather have pinned and unpinned flavors. (It’s a deeply crash heavy breaking change to silently make them unpinned. May as well be new modules or a new package at that point :) )

noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
* Implement SSE4.2 and AVX2-based variants of the `count`

* Address PR comments wrt SIMD-based character counting

* Add test for `count` on longer strings

* Add benchmarks for strict bytestring's count method

* Fix tabs

* Use C11's call_once instead of TLS for choosing counting SIMD function

* Build cbits with -std=c11

* Add a point to `count` benchmarks slightly above the SIMD threshold

* Initialize the `count` SIMD ptr with atomics

* Enable `count` SIMD support only if atomics are available

Turns out a conforming C implementation might be running on a platform
with threads, but with no atomics in the C stdlib.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
blocked: patch-needed somebody needs to write a patch or contribute code performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Faster c_count? Perhaps a GHC rather than bytestring issue? Much faster count function in plain Haskell
10 participants