Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement intersperse using SSE2 #310

Merged
merged 1 commit into from
Nov 20, 2020

Conversation

ethercrow
Copy link
Contributor

@ethercrow ethercrow commented Oct 23, 2020

SSE2 is available on every x86_64 CPU, so there is no need for a capability check.

Before:

benchmarked intersperse/intersperse
time                 1.127 μs   (1.104 μs .. 1.149 μs)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 1.103 μs   (1.096 μs .. 1.112 μs)
std dev              26.95 ns   (20.98 ns .. 34.21 ns)

After:

benchmarked intersperse/intersperse
time                 715.9 ns   (697.3 ns .. 730.9 ns)
                     0.997 R²   (0.996 R² .. 0.999 R²)
mean                 688.7 ns   (683.4 ns .. 696.8 ns)
std dev              21.36 ns   (15.97 ns .. 32.29 ns)

@ethercrow ethercrow marked this pull request as draft October 23, 2020 21:35
@ethercrow ethercrow marked this pull request as ready for review October 23, 2020 21:49
@ethercrow
Copy link
Contributor Author

@Bodigrim @sjakobi please have a look.

@Bodigrim
Copy link
Contributor

Looks reasonable to me, but I have very limited knowledge of SSE intrinsics. @vdukhovni @0xd34df00d could you possibly help me out here?

const unsigned char *const p_begin = p;
const unsigned char *const p_end = p_begin + n - 9;
while (p < p_end) {
const __m128i eight_src_bytes = _mm_loadl_epi64((__m128i *)p);
Copy link
Contributor

@vdukhovni vdukhovni Oct 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when p is not 64-bit aligned? Is there any performance degradation?
Are there OS platforms supported by GHC where __x86_64__ is defined, but the intrinsics are not available?
That is I guess, has this code been tested on FreeBSD, MacOS and Windows, and perhaps even less common platforms like NetBSD, that are not officially supported by GHC, but do maintain ports...

Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)

[ EDIT: FWIW, it compiles on a FreeBSD 12 system, and appears to work correctly in naïve tests... ]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens when p is not 64-bit aligned? Is there any performance degradation?

My understanding is that there is no performance degradation on CPUs released in the latest ~10 years. And for earlier ones I suspect it's still worth it.

Are there OS platforms supported by GHC where x86_64 is defined, but the intrinsics are not available? That is I guess, has this code been tested on FreeBSD, MacOS and Windows, and perhaps even less common platforms like NetBSD, that are not officially supported by GHC, but do maintain ports...

I think this code should work on all OSes, at least with gcc and clang. Does it ever happen that cbits of a Haskell package are compiled with MSVC? Intrinsics should still be the same but I'd need to verify that the header names are correct.

I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that. I'm not aware of any CI service that offers BSD machines.

Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)

I just want the concept of using SIMD in Haskell to be more mainstream. The reason I started with ByteString.intersperse is that it's very similar to ascii->utf16 conversion that I accelerated in text here: haskell/text#298. I'd like to continue with other functions like ByteString.reverse.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that there is no performance degradation...

@ethercrow could you please add a benchmark for non-aligned bytestrings?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bodigrim what's a good way to create one that's surely unaligned? Take a slice of global one with an odd offset like 7?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, Data.ByteString.drop 1 should be enough.

@Bodigrim
Copy link
Contributor

I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that.

@ethercrow This would be a much appreciated contribution indeed.

@ethercrow
Copy link
Contributor Author

I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that.

@ethercrow This would be a much appreciated contribution indeed.

Here it is #311

@Bodigrim
Copy link
Contributor

I just want the concept of using SIMD in Haskell to be more mainstream.

I'm sympathetic to this goal. While ideally SIMD should be GHC primops, this will not happen without more demand from the user space. And this demand is to grow from experiments and explorations like this.

@ethercrow ethercrow force-pushed the intersperse-sse2 branch 4 times, most recently from 1b4e1fa to df6b462 Compare November 4, 2020 21:25
@ethercrow
Copy link
Contributor Author

Added a benchmark on unaligned data, performance is the same:

benchmarked intersperse/intersperse
time                 626.8 ns   (624.0 ns .. 631.5 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 632.2 ns   (630.6 ns .. 633.9 ns)
std dev              5.852 ns   (4.659 ns .. 7.552 ns)

benchmarked intersperse/intersperse (unaligned)
time                 628.3 ns   (623.8 ns .. 633.6 ns)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 635.3 ns   (632.3 ns .. 640.8 ns)
std dev              13.55 ns   (9.235 ns .. 23.72 ns)

@ethercrow
Copy link
Contributor Author

Windows build failed, but not because of the intrinsics, I think:

C:\Users\RUNNER~1\AppData\Local\Temp\ghcF6D5.o: DeleteFile "\\\\?\\C:\\Users\\RUNNER~1\\AppData\\Local\\Temp\\ghcF6D5.o": permission denied (Access is denied.)

@ethercrow
Copy link
Contributor Author

@Bodigrim rebased onto your windows fixes in master, now looks fine.

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 7, 2020

I do not have particular reservations here, because even I can read this code and understand what's going on and probably even fix something trivial. And I believe that userland package should pave a way for intrinsics in GHC, so I'm OK with additional complexity for a not-so-important function. But I do not have expertise to judge about crossplatform and crossarchitecture issues; this is where I would like to hear from @vdukhovni.

Starting from GHC 8.10.2 FreeBSD is a Tier 1 platform. AFAIK there are no CIs available, but one can create a FreeBSD droplet on DigitalOcean to run tests and destroy it in an hour. Could someone conjure up a shell script for FreeBSD 12.2 zfs x64 to install ghc / cabal and run bytestring tests?

@vdukhovni
Copy link
Contributor

vdukhovni commented Nov 7, 2020

Starting from GHC 8.10.2 FreeBSD is a Tier 1[you meant to say Tier 2] platform. AFAIK there are no CIs available, but one can create a FreeBSD droplet on DigitalOcean to run tests and destroy it in an hour. Could someone conjure up a shell script for FreeBSD 12.2 zfs x64 to install ghc / cabal and run bytestring tests?

On FreeBSD 12, ghc and cabal-install are available via the "ports" system.

$ pkg search ghc
ghc-8.10.2                     Compiler for the functional language Haskell
$ pkg search cabal
hs-cabal-install-3.2.0.0       Command-line interface for Cabal and Hackage

A job running as "root" can install ports via: "pkg install ghc cabal-install". After that it is not different from what one would do on Linux. I have no experience with "droplets". Are they able to install ports?

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 7, 2020

"Droplet" is just a fancy name for a virtual machine. Thanks, this works:

pkg install git ghc hs-cabal-install
cabal update
git clone https://github.com/haskell/bytestring.git
cd bytestring
cabal build 
(cd tests; cabal test)
(cd bench; cabal bench -O0 --benchmark-options "--quick --min-duration=0 --include-first-iter")

The branch seems working fine as well.

@Bodigrim
Copy link
Contributor

Bodigrim commented Nov 7, 2020

[you meant to say Tier 2]

Why? There is an official binary build: https://www.haskell.org/ghc/download_ghc_8_10_2.html#freebsd_x86_64

@vdukhovni
Copy link
Contributor

[you meant to say Tier 2]

Why? There is an official binary build: https://www.haskell.org/ghc/download_ghc_8_10_2.html#freebsd_x86_64

Perhaps the Wiki page I found is outdated

@ethercrow
Copy link
Contributor Author

This should take care of testing on FreeBSD: #322

@ethercrow
Copy link
Contributor Author

This finally has all platforms. Let's merge?

@Bodigrim
Copy link
Contributor

@vdukhovni is anything outstanding left here?

@vdukhovni
Copy link
Contributor

vdukhovni commented Nov 19, 2020

@vdukhovni is anything outstanding left here?

I have no outstanding issues. The motivation is a bit weak for this particular instance, as I mentioned at the outset, but if the goal is to start with the simplest (even if not compelling) applications and build from there, then it makes a bit of sense.

@Bodigrim Bodigrim requested a review from sjakobi November 19, 2020 22:55
@Bodigrim Bodigrim merged commit e278a3d into haskell:master Nov 20, 2020
@Bodigrim
Copy link
Contributor

@ethercrow thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants