Implement intersperse using SSE2 #310
Force-pushed from 2e5d729 to a4b937d.
Looks reasonable to me, but I have very limited knowledge of SSE intrinsics. @vdukhovni @0xd34df00d could you possibly help me out here?
const unsigned char *const p_begin = p;
const unsigned char *const p_end = p_begin + n - 9;
while (p < p_end) {
    const __m128i eight_src_bytes = _mm_loadl_epi64((__m128i *)p);
What happens when p is not 64-bit aligned? Is there any performance degradation?
Are there OS platforms supported by GHC where __x86_64__ is defined, but the intrinsics are not available?
That is, I guess: has this code been tested on FreeBSD, macOS and Windows, and perhaps even on less common platforms like NetBSD that are not officially supported by GHC but do maintain ports?
Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)
[ EDIT: FWIW, it compiles on a FreeBSD 12 system, and appears to work correctly in naïve tests... ]
> What happens when p is not 64-bit aligned? Is there any performance degradation?
My understanding is that there is no performance degradation on CPUs released in the last ~10 years. And for earlier ones I suspect it's still worth it.
> Are there OS platforms supported by GHC where __x86_64__ is defined, but the intrinsics are not available? That is I guess, has this code been tested on FreeBSD, MacOS and Windows, and perhaps even less common platforms like NetBSD, that are not officially supported by GHC, but do maintain ports...
I think this code should work on all OSes, at least with gcc and clang. Does it ever happen that cbits of a Haskell package are compiled with MSVC? Intrinsics should still be the same but I'd need to verify that the header names are correct.
I can make a PR configuring GitHub Actions to build and test bytestring on Linux, Mac and Windows if you're OK with that. I'm not aware of any CI service that offers BSD machines.
> Lastly, I am curious about what sort of applications might care about the performance of intersperse? Is this a graphics thing? When would I want to bulk insert a fixed byte between every other byte in a large-enough buffer to want it done faster with intrinsics? (And yet be doing the task in Haskell...)
I just want the concept of using SIMD in Haskell to be more mainstream. The reason I started with ByteString.intersperse is that it's very similar to the ascii->utf16 conversion that I accelerated in text here: haskell/text#298. I'd like to continue with other functions like ByteString.reverse.
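For readers unfamiliar with the technique under review, here is a minimal sketch of an SSE2 intersperse. It is not the PR's code: the function names, the loop bound and the scalar tail are assumptions made for illustration. Only `_mm_loadl_epi64` appears in the diff excerpt above; the other intrinsics used here (`_mm_set1_epi8`, `_mm_unpacklo_epi8`, `_mm_storeu_si128`) are standard SSE2 operations chosen for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Scalar reference (hypothetical name). For n > 0 it writes 2*n - 1 bytes. */
void intersperse_scalar(unsigned char *dst, const unsigned char *src,
                        size_t n, unsigned char sep) {
    if (n == 0) return;
    for (size_t i = 0; i + 1 < n; i++) {
        *dst++ = src[i];
        *dst++ = sep;
    }
    *dst = src[n - 1];
}

/* SSE2 sketch (hypothetical name). dst must have room for 2*n - 1 bytes. */
void intersperse_sse2(unsigned char *dst, const unsigned char *src,
                      size_t n, unsigned char sep) {
    const __m128i seps = _mm_set1_epi8((char)sep);
    /* Each iteration stores 16 bytes: 8 source bytes, each followed by a
       separator. With at least 9 source bytes left, at least 17 output
       bytes remain, so the 16-byte store stays inside dst. */
    while (n >= 9) {
        const __m128i eight = _mm_loadl_epi64((const __m128i *)src);
        const __m128i mixed = _mm_unpacklo_epi8(eight, seps);
        _mm_storeu_si128((__m128i *)dst, mixed);
        src += 8;
        dst += 16;
        n -= 8;
    }
    /* At least one byte is left here (if n was > 0); finish without SIMD. */
    intersperse_scalar(dst, src, n, sep);
}

/* Usage: both versions must agree on a 21-byte input. */
void intersperse_selfcheck(void) {
    unsigned char src[21], got[41], want[41];
    for (size_t i = 0; i < sizeof src; i++) src[i] = (unsigned char)('a' + i);
    intersperse_sse2(got, src, sizeof src, ',');
    intersperse_scalar(want, src, sizeof src, ',');
    assert(memcmp(got, want, sizeof want) == 0);
}
```

Zero-extending ASCII to UTF-16 has the same shape, with a zero vector in place of the separator vector, which is presumably the similarity mentioned above.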
> My understanding is that there is no performance degradation...
@ethercrow could you please add a benchmark for non-aligned bytestrings?
@Bodigrim what's a good way to create one that's surely unaligned? Take a slice of a global one at an odd offset like 7?
Yeah, Data.ByteString.drop 1 should be enough.
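At the C level, the `drop 1` trick amounts to handing the routine a pointer that is offset by one byte from an aligned address. The sketch below checks only correctness, not speed: it confirms that `_mm_loadl_epi64` returns the expected bytes at every offset from a 16-byte boundary. The helper names are hypothetical and not part of the PR.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h> /* SSE2 intrinsics */

/* Returns 1 if an 8-byte SSE2 load from p yields exactly the bytes at p. */
int loadl_roundtrips(const unsigned char *p) {
    unsigned char out[8];
    const __m128i v = _mm_loadl_epi64((const __m128i *)p);
    _mm_storel_epi64((__m128i *)out, v);
    return memcmp(out, p, 8) == 0;
}

/* Tries offsets 0..15 from a 16-byte-aligned base, so aligned and
   deliberately misaligned addresses are both covered. Returns 1 if every
   load round-trips. */
int check_all_offsets(void) {
    unsigned char raw[48];
    /* Round raw up to the next 16-byte boundary by hand. */
    unsigned char *base = raw + ((16 - ((uintptr_t)raw & 15)) & 15);
    assert(((uintptr_t)base & 15) == 0);
    for (int i = 0; i < 32; i++) base[i] = (unsigned char)(i * 7 + 1);
    for (int off = 0; off < 16; off++) {
        if (!loadl_roundtrips(base + off)) return 0;
    }
    return 1;
}
```

Whether misaligned loads cost anything on a given CPU still has to be measured with a benchmark like the one requested above.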
@ethercrow This would be a much appreciated contribution indeed.
Here it is: #311
I'm sympathetic to this goal. While ideally SIMD should be GHC primops, this will not happen without more demand from user space. And this demand will grow from experiments and explorations like this.
Force-pushed from 1b4e1fa to df6b462.
Added a benchmark on unaligned data, performance is the same:
Windows build failed, but not because of the intrinsics, I think:
Force-pushed from df6b462 to c908999.
@Bodigrim rebased onto your Windows fixes in master, now looks fine.
I do not have particular reservations here, because even I can read this code, understand what's going on, and probably even fix something trivial. And I believe that a userland package should pave the way for intrinsics in GHC, so I'm OK with additional complexity for a not-so-important function. But I do not have the expertise to judge cross-platform and cross-architecture issues; this is where I would like to hear from @vdukhovni. Starting from GHC 8.10.2, FreeBSD is a Tier 1 platform. AFAIK there are no CIs available, but one can create a FreeBSD droplet on DigitalOcean to run tests and destroy it in an hour. Could someone conjure up a shell script for FreeBSD 12.2 zfs x64 to install ghc / cabal and run
On FreeBSD 12, ghc and cabal-install are available via the "ports" system.
A job running as "root" can install ports via: "pkg install ghc cabal-install". After that it is not different from what one would do on Linux. I have no experience with "droplets". Are they able to install ports?
"Droplet" is just a fancy name for a virtual machine. Thanks, this works:
The branch seems to be working fine as well.
Why? There is an official binary build: https://www.haskell.org/ghc/download_ghc_8_10_2.html#freebsd_x86_64
Perhaps the Wiki page I found is outdated.
This should take care of testing on FreeBSD: #322
Force-pushed from c908999 to d685c40.
Force-pushed from d685c40 to 1a1d168.
This finally has all platforms. Let's merge?
@vdukhovni is anything outstanding left here?
I have no outstanding issues. The motivation is a bit weak for this particular instance, as I mentioned at the outset, but if the goal is to start with the simplest (even if not compelling) applications and build from there, then it makes a bit of sense.
@ethercrow thanks!
SSE2 is available on every x86_64 CPU, so there is no need for a capability check.
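In practice this means the SIMD path can be selected at compile time, with no runtime CPUID query. The sketch below is an illustration under assumptions rather than the PR's actual guard: `HAVE_SSE2_BASELINE` and `broadcast16` are hypothetical names. `__x86_64__` is the macro mentioned in the review; `_M_X64` is MSVC's equivalent and is included here as an assumption, since the thread left MSVC unverified.

```c
#include <assert.h>
#include <string.h>

/* SSE2 is part of the x86_64 baseline, so the target macro alone decides. */
#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h> /* SSE2 intrinsics */
#define HAVE_SSE2_BASELINE 1
#else
#define HAVE_SSE2_BASELINE 0
#endif

/* Hypothetical helper: fill 16 bytes with c, using SSE2 when the target
   guarantees it and plain memset everywhere else. */
void broadcast16(unsigned char out[16], unsigned char c) {
#if HAVE_SSE2_BASELINE
    _mm_storeu_si128((__m128i *)out, _mm_set1_epi8((char)c));
#else
    memset(out, c, 16);
#endif
}

/* Usage: the result is the same whichever path was compiled in. */
void broadcast16_selfcheck(void) {
    unsigned char out[16];
    broadcast16(out, ',');
    for (int i = 0; i < 16; i++) assert(out[i] == ',');
}
```

The scalar branch keeps the file compiling on non-x86_64 targets, which is what lets a single source file serve every platform the thread tested.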
Before:
After: