Implement intersperse using SSE2 #310

ethercrow · 2020-10-23T18:18:09Z

SSE2 is available on every x86_64 CPU, so there is no need for a capability check.

Before:

benchmarked intersperse/intersperse
time                 1.127 μs   (1.104 μs .. 1.149 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 1.103 μs   (1.096 μs .. 1.112 μs)
std dev              26.95 ns   (20.98 ns .. 34.21 ns)

After:

benchmarked intersperse/intersperse
time                 715.9 ns   (697.3 ns .. 730.9 ns)
                     0.997 R²   (0.996 R² .. 0.999 R²)
mean                 688.7 ns   (683.4 ns .. 696.8 ns)
std dev              21.36 ns   (15.97 ns .. 32.29 ns)

ethercrow · 2020-10-23T21:52:43Z

@Bodigrim @sjakobi please have a look.

Bodigrim · 2020-10-24T00:39:07Z

Looks reasonable to me, but I have very limited knowledge of SSE intrinsics. @vdukhovni @0xd34df00d could you possibly help me out here?

vdukhovni · 2020-10-24T01:51:17Z

cbits/fpstring.c

+    const unsigned char *const p_begin = p;
+    const unsigned char *const p_end = p_begin + n - 9;
+    while (p < p_end) {
+      const __m128i eight_src_bytes = _mm_loadl_epi64((__m128i *)p);


What happens when p is not 64-bit aligned? Is there any performance degradation?
Are there OS platforms supported by GHC where __x86_64__ is defined, but the intrinsics are not available?
That is I guess, has this code been tested on FreeBSD, MacOS and Windows, and perhaps even less common platforms like NetBSD, that are not officially supported by GHC, but do maintain ports...

Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)

[ EDIT: FWIW, it compiles on a FreeBSD 12 system, and appears to work correctly in naïve tests... ]

ethercrow · 2020-10-24

> What happens when p is not 64-bit aligned? Is there any performance degradation?

My understanding is that there is no performance degradation on CPUs released in the last ~10 years. And for earlier ones I suspect it's still worth it.

> Are there OS platforms supported by GHC where __x86_64__ is defined, but the intrinsics are not available? That is I guess, has this code been tested on FreeBSD, MacOS and Windows, and perhaps even less common platforms like NetBSD, that are not officially supported by GHC, but do maintain ports...

I think this code should work on all OSes, at least with gcc and clang. Does it ever happen that cbits of a Haskell package are compiled with MSVC? Intrinsics should still be the same but I'd need to verify that the header names are correct.

I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that. I'm not aware of any CI service that offers BSD machines.

> Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)

I just want the concept of using SIMD in Haskell to be more mainstream. The reason I started with ByteString.intersperse is that it's very similar to ascii->utf16 conversion that I accelerated in text here: haskell/text#298. I'd like to continue with other functions like ByteString.reverse.
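For readers unfamiliar with the intrinsics named in the reviewed diff, the following is a minimal sketch of the interleaving idea, not the code from this PR. It assumes the core of the technique is an 8-byte load (`_mm_loadl_epi64`) followed by a byte-wise unpack against a register filled with the separator. The names `intersperse_sketch` and `SKETCH_HAVE_SSE2` are hypothetical, and the loop bound is chosen for this sketch rather than copied from `cbits/fpstring.c`.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics; header name shared by gcc, clang and MSVC */
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper (illustration only).
 * Writes src[0], sep, src[1], sep, ..., src[n-1] into dst.
 * dst must have room for 2*n - 1 bytes (nothing is written when n == 0). */
static void intersperse_sketch(unsigned char *dst, const unsigned char *src,
                               size_t n, unsigned char sep)
{
    size_t i = 0;
    assert(n == 0 || (dst != NULL && src != NULL));
#if SKETCH_HAVE_SSE2
    /* 16 copies of the separator byte. */
    const __m128i seps = _mm_set1_epi8((char)sep);
    /* Each iteration reads src[i..i+7] and writes dst[2i..2i+15].
     * The last written byte is a separator, so src[i+8] must exist:
     * that is the reason for the "i + 9 <= n" bound. */
    while (i + 9 <= n) {
        /* 8-byte load into the low half; no alignment requirement. */
        const __m128i eight = _mm_loadl_epi64((const __m128i *)(src + i));
        /* Interleave: s0, sep, s1, sep, ..., s7, sep. */
        const __m128i pairs = _mm_unpacklo_epi8(eight, seps);
        /* Unaligned 16-byte store. */
        _mm_storeu_si128((__m128i *)(dst + 2 * i), pairs);
        i += 8;
    }
#endif
    /* Scalar tail (and the whole job on non-x86_64 targets). */
    for (; i < n; i++) {
        dst[2 * i] = src[i];
        if (i + 1 < n)
            dst[2 * i + 1] = sep;
    }
}
```

The preprocessor guard reflects the point made at the top of the thread: on x86_64 no runtime capability check is needed, so a compile-time test is enough, and other targets fall through to the scalar loop.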

Bodigrim · 2020-11-04

> My understanding is that there is no performance degradation...

@ethercrow could you please add a benchmark for non-aligned bytestrings?

ethercrow · 2020-11-04

@Bodigrim what's a good way to create one that's surely unaligned? Take a slice of a global one with an odd offset like 7?

Bodigrim · 2020-11-04

Yeah, Data.ByteString.drop 1 should be enough.
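The correctness half of the alignment question can be checked in isolation. The sketch below assumes an 8-byte-aligned buffer and compares `_mm_loadl_epi64` against `memcpy` at every byte offset from 0 to 7, which is the C-level analogue of dropping one byte from a ByteString. It says nothing about speed, which still needs a benchmark like the one requested here. `load8_any_alignment` and `unaligned_loads_agree` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

/* Hypothetical helper: read 8 bytes from p, whatever its alignment,
 * and return them as a uint64_t in native byte order. */
static uint64_t load8_any_alignment(const unsigned char *p)
{
    uint64_t out;
    assert(p != NULL);
#if SKETCH_HAVE_SSE2
    {
        /* 8-byte load into the low half of an XMM register. */
        const __m128i v = _mm_loadl_epi64((const __m128i *)p);
        /* Store the low 8 bytes back out. */
        _mm_storel_epi64((__m128i *)&out, v);
    }
#else
    memcpy(&out, p, sizeof out); /* portable fallback */
#endif
    return out;
}

/* Returns 1 if the load matches memcpy at offsets 0..7, else 0. */
static int unaligned_loads_agree(void)
{
    /* The union forces bytes[] to start on a uint64_t boundary. */
    union { uint64_t align; unsigned char bytes[24]; } buf;
    int k, off;
    for (k = 0; k < 24; k++)
        buf.bytes[k] = (unsigned char)(k * 7 + 1);
    for (off = 0; off < 8; off++) {
        uint64_t expected;
        memcpy(&expected, buf.bytes + off, sizeof expected);
        if (load8_any_alignment(buf.bytes + off) != expected)
            return 0;
    }
    return 1;
}
```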

Bodigrim · 2020-10-24T19:09:39Z

> I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that.

@ethercrow This would be a much appreciated contribution indeed.

ethercrow · 2020-10-24T23:33:49Z

>> I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that.

> @ethercrow This would be a much appreciated contribution indeed.

Here it is: #311

Bodigrim · 2020-10-25T12:38:07Z

> I just want the concept of using SIMD in Haskell to be more mainstream.

I'm sympathetic to this goal. While ideally SIMD should be GHC primops, this will not happen without more demand from the user space. And this demand is to grow from experiments and explorations like this.

ethercrow · 2020-11-04T21:26:13Z

Added a benchmark on unaligned data, performance is the same:

benchmarked intersperse/intersperse
time                 626.8 ns   (624.0 ns .. 631.5 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 632.2 ns   (630.6 ns .. 633.9 ns)
std dev              5.852 ns   (4.659 ns .. 7.552 ns)

benchmarked intersperse/intersperse (unaligned)
time                 628.3 ns   (623.8 ns .. 633.6 ns)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 635.3 ns   (632.3 ns .. 640.8 ns)
std dev              13.55 ns   (9.235 ns .. 23.72 ns)

ethercrow · 2020-11-04T21:41:14Z

Windows build failed, but not because of the intrinsics, I think:

C:\Users\RUNNER~1\AppData\Local\Temp\ghcF6D5.o: DeleteFile "\\\\?\\C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\ghcF6D5.o": permission denied (Access is denied.)

ethercrow · 2020-11-07T18:58:47Z

@Bodigrim rebased onto your windows fixes in master, now looks fine.

Bodigrim · 2020-11-07T19:56:50Z

I do not have particular reservations here, because even I can read this code and understand what's going on and probably even fix something trivial. And I believe that userland packages should pave the way for intrinsics in GHC, so I'm OK with additional complexity for a not-so-important function. But I do not have the expertise to judge cross-platform and cross-architecture issues; this is where I would like to hear from @vdukhovni.

Starting from GHC 8.10.2 FreeBSD is a Tier 1 platform. AFAIK there are no CIs available, but one can create a FreeBSD droplet on DigitalOcean to run tests and destroy it in an hour. Could someone conjure up a shell script for FreeBSD 12.2 zfs x64 to install ghc / cabal and run bytestring tests?

vdukhovni · 2020-11-07T20:30:36Z

> Starting from GHC 8.10.2 FreeBSD is a Tier 1 [you meant to say Tier 2] platform. AFAIK there are no CIs available, but one can create a FreeBSD droplet on DigitalOcean to run tests and destroy it in an hour. Could someone conjure up a shell script for FreeBSD 12.2 zfs x64 to install ghc / cabal and run bytestring tests?

On FreeBSD 12, ghc and cabal-install are available via the "ports" system.

$ pkg search ghc
ghc-8.10.2                     Compiler for the functional language Haskell
$ pkg search cabal
hs-cabal-install-3.2.0.0       Command-line interface for Cabal and Hackage

A job running as "root" can install ports via: "pkg install ghc cabal-install". After that it is not different from what one would do on Linux. I have no experience with "droplets". Are they able to install ports?

Bodigrim · 2020-11-07T20:52:18Z

"Droplet" is just a fancy name for a virtual machine. Thanks, this works:

pkg install git ghc hs-cabal-install
cabal update
git clone https://github.com/haskell/bytestring.git
cd bytestring
cabal build 
(cd tests; cabal test)
(cd bench; cabal bench -O0 --benchmark-options "--quick --min-duration=0 --include-first-iter")

The branch seems to be working fine as well.

Bodigrim · 2020-11-07T20:54:47Z

> [you meant to say Tier 2]

Why? There is an official binary build: https://www.haskell.org/ghc/download_ghc_8_10_2.html#freebsd_x86_64

vdukhovni · 2020-11-07T21:50:57Z

>> [you meant to say Tier 2]

> Why? There is an official binary build: https://www.haskell.org/ghc/download_ghc_8_10_2.html#freebsd_x86_64

Perhaps the Wiki page I found is outdated.

ethercrow · 2020-11-12T21:07:11Z

This should take care of testing on FreeBSD: #322

ethercrow · 2020-11-19T20:34:34Z

This finally has all platforms. Let's merge?

Bodigrim · 2020-11-19T22:47:03Z

@vdukhovni is anything outstanding left here?

vdukhovni · 2020-11-19T22:49:31Z

> @vdukhovni is anything outstanding left here?

I have no outstanding issues. The motivation is a bit weak for this particular instance, as I mentioned at the outset, but if the goal is to start with the simplest (even if not compelling) applications and build from there, then it makes a bit of sense.

Bodigrim · 2020-11-20T19:20:46Z

@ethercrow thanks!

ethercrow marked this pull request as draft October 23, 2020 21:35

ethercrow force-pushed the intersperse-sse2 branch from 2e5d729 to a4b937d Compare October 23, 2020 21:47

ethercrow marked this pull request as ready for review October 23, 2020 21:49

vdukhovni reviewed Oct 24, 2020

ethercrow force-pushed the intersperse-sse2 branch 4 times, most recently from 1b4e1fa to df6b462 Compare November 4, 2020 21:25

ethercrow force-pushed the intersperse-sse2 branch from df6b462 to c908999 Compare November 7, 2020 17:41

ethercrow force-pushed the intersperse-sse2 branch from c908999 to d685c40 Compare November 18, 2020 20:09

Implement intersperse using SSE2

1a1d168

ethercrow force-pushed the intersperse-sse2 branch from d685c40 to 1a1d168 Compare November 19, 2020 16:21

Bodigrim approved these changes Nov 19, 2020

Bodigrim requested a review from sjakobi November 19, 2020 22:55

sjakobi approved these changes Nov 20, 2020

Bodigrim merged commit e278a3d into haskell:master Nov 20, 2020

Bodigrim added this to the 0.11.1.0 milestone Nov 20, 2020

Bodigrim pushed a commit that referenced this pull request Nov 20, 2020

Implement intersperse using SSE2 (#310)

c4b41ed

Bodigrim mentioned this pull request Jan 9, 2021

Implement count with SSE 4.2 and AVX2 #202

Merged

ethercrow mentioned this pull request Mar 29, 2021

SSE2 patches for encoding and decoding functions haskell/text#302

Merged
