SSE2 patches for encoding and decoding functions #302
Conversation
force-pushed from 56ba209 to 22433a0
@phadej If you're too busy, are there other maintainers we could ping and ask for review?

I'm only merging simple patches; everything nontrivial would need input from @hvr.

@hvr ping

quarterly bump

@ethercrow could you please rebase and rerun relevant benchmarks? First on master with
force-pushed from 22433a0 to 0625ea0
Speed-ups look great! I think what we need here is a multiplatform CI to ensure that there are no gcc/clang issues on various OSes. Something like a separate workflow running jobs:

```yaml
build:
  runs-on: ${{ matrix.os }}
  strategy:
    matrix:
      os: [ubuntu-16.04, ubuntu-18.04, ubuntu-20.04, windows-2019, macos-10.15, macos-11.0]
```

and FreeBSD, all with the latest available (or default) GHC.
@Bodigrim I played with the relevant part on godbolt.org; clang and GCC are both happy with that code. MSVC doesn't define `__x86_64__`. The problem with testing is rather that non-i386 (and non-x86_64) architectures are not tested. One option is to have a way to toggle the
I'd avoid adding new Cabal flags. Ideally users won't know about development or testing concerns.

What will be relevant sooner or later is ARM64. But there isn't ARM-specific code, nor is there an easy (and free, or even cheap) way to have ARM GitHub Actions runners.

I played with the following snippet:

```c
#include <stdint.h>
#include <stddef.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#include <xmmintrin.h>
#endif

#define UTF8_ACCEPT 0
#define UTF8_REJECT 12

static const uint8_t utf8d[] = {
  /*
   * The first part of the table maps bytes to character classes
   * to reduce the size of the transition table and create bitmasks.
   */
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0, 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1, 9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,9,
   7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7, 7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,7,
   8,8,2,2,2,2,2,2,2,2,2,2,2,2,2,2, 2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,
  10,3,3,3,3,3,3,3,3,3,3,3,3,4,3,3, 11,6,6,6,5,8,8,8,8,8,8,8,8,8,8,8,
  /*
   * The second part is a transition table that maps a combination of
   * a state of the automaton and a character class to a state.
   */
   0,12,24,36,60,96,84,12,12,12,48,72, 12,12,12,12,12,12,12,12,12,12,12,12,
  12, 0,12,12,12,12,12, 0,12, 0,12,12, 12,24,12,12,12,12,12,24,12,24,12,12,
  12,12,12,12,12,12,12,24,12,12,12,12, 12,24,12,12,12,12,12,12,12,24,12,12,
  12,12,12,12,12,12,12,36,12,36,12,12, 12,36,12,12,12,12,12,36,12,36,12,12,
  12,36,12,12,12,12,12,12,12,12,12,12,
};

uint32_t
decode(uint32_t *state, uint32_t *codep, uint32_t byte) {
  uint32_t type = utf8d[byte];

  *codep = (*state != UTF8_ACCEPT) ?
    (byte & 0x3fu) | (*codep << 6) :
    (0xff >> type) & (byte);

  return *state = utf8d[256 + *state + type];
}

uint8_t const *
_hs_text_decode_utf8_int(uint16_t *const dest, size_t *destoff,
                         const uint8_t **src, const uint8_t *srcend,
                         uint32_t *codepoint0, uint32_t *state0)
{
  uint16_t *d = dest + *destoff;
  const uint8_t *s = *src, *last = *src;
  uint32_t state = *state0;
  uint32_t codepoint = *codepoint0;

  while (s < srcend) {
#if defined(__i386__) || defined(__x86_64__)
    /*
     * This code will only work on a little-endian system that
     * supports unaligned loads.
     *
     * It gives a substantial speed win on data that is purely or
     * partly ASCII (e.g. HTML), at only a slight cost on purely
     * non-ASCII text.
     */
    if (state == UTF8_ACCEPT) {
#if defined(__x86_64__)
      const __m128i zeros = _mm_set1_epi32(0);
      while (s < srcend - 8) {
        const uint64_t hopefully_eight_ascii_chars = *((uint64_t *) s);
        if ((hopefully_eight_ascii_chars & 0x8080808080808080LL) != 0LL)
          break;
        s += 8;

        /* Load 8 bytes of ASCII data */
        const __m128i eight_ascii_chars = _mm_cvtsi64_si128(hopefully_eight_ascii_chars);
        /* Interleave with zeros */
        const __m128i eight_utf16_chars = _mm_unpacklo_epi8(eight_ascii_chars, zeros);
        /* Store the resulting 8 UTF-16 code units into the destination */
        _mm_storeu_si128((__m128i *)d, eight_utf16_chars);
        d += 8;
      }
#else
      while (s < srcend - 4) {
        codepoint = *((uint32_t *) s);
        if ((codepoint & 0x80808080) != 0)
          break;
        s += 4;

        /*
         * Tried 32-bit stores here, but the extra bit-twiddling
         * slowed the code down.
         */
        *d++ = (uint16_t) (codepoint & 0xff);
        *d++ = (uint16_t) ((codepoint >> 8) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 16) & 0xff);
        *d++ = (uint16_t) ((codepoint >> 24) & 0xff);
      }
#endif
      last = s;
    } /* end if (state == UTF8_ACCEPT) */
#endif

    if (decode(&state, &codepoint, *s++) != UTF8_ACCEPT) {
      if (state != UTF8_REJECT)
        continue;
      break;
    }

    if (codepoint <= 0xffff)
      *d++ = (uint16_t) codepoint;
    else {
      *d++ = (uint16_t) (0xD7C0 + (codepoint >> 10));
      *d++ = (uint16_t) (0xDC00 + (codepoint & 0x3FF));
    }
    last = s;
  }

  *destoff = d - dest;
  *codepoint0 = codepoint;
  *state0 = state;
  *src = last;

  return s;
}
```

Clang and GCC 10.2 generate virtually the same code for the relevant lines. (I don't know whether GHC calls the C compiler with -O1, -O2, or the same -O it was given.) TL;DR: I don't think there are GCC & Clang concerns. It's not C++17 code :)
I do wonder whether making the loop do aligned loads would make it even faster, but eh, benchmarking that would be devastating. In general I wonder whether the code for this exists in some C library; decoding UTF-8 isn't something we should reimplement. (Especially would help if
This came up when discussing the SIMD implementation of
UTF8-based text is a path z-haskell is pursuing https://hackage.haskell.org/package/Z-Data |
Regarding CI on other OSes: I see that text uses haskell-ci for generating the GitHub Actions manifest, but it looks like that tool is Linux-only. I'm not sure; I don't see relevant documentation or mentions of Windows or macOS in its source code.
Separate workflows are easier to manage, because GitHub allows rerunning a workflow, but not an individual job.
Ah, that makes sense. I'll probably get to that over the weekend.
force-pushed from 0625ea0 to a6ab43d
Rebased onto master; tests passed on all newly introduced OSes.
on GHC-8.8.4, https://www.fileformat.info/info/unicode/char/10dd/index.htm; on the GHCs I tested, incl. GHC-9.0.1
It looks like the Georgian script doesn't have a concept of title case (https://en.wikipedia.org/wiki/Georgian_Extended), and thus that test is broken.
Could we merge this? I don't think that fixing t_toTitle_1stNotLower belongs in this PR.
Small note, doesn't affect much, but Data.Char.isLower's behaviour changed in https://gitlab.haskell.org/ghc/ghc/-/commit/14d88380ecb909e7032598aaad4efebb72561784. I specifically ran into issues with results changing when supporting new GHCs for duckling in facebook/duckling#541.
Arrgh, I'm deeply sorry, @ethercrow. I had written a comment below but forgot to press "Submit", and was awaiting your response.
FWIW it's not a blocker, I'm just being curious.
cbits/cbits.c (outdated):

```c
const __m128i zeros = _mm_set1_epi32(0);
while (p < srcend - 3) {
  /* Load 4 bytes of ASCII data */
```
Is it possible to load 8 bytes here?
Good catch, yes, this loop can be the same as in decodeUtf8, but without a check.
I think I implemented this one first as 4-wide, then realized it could be 8-wide when doing decodeUtf8, but forgot to port that back to Latin1.
force-pushed from a6ab43d to 6e1e8f7, then from 6e1e8f7 to 0ad0820
```c
const uint64_t w = eight_chars.halves[0];
if (w & 0xFF80FF80FF80FF80ULL) {
```
Could we just break here, actually? Does it affect performance? I imagine we could compare a whole 128-bit register with a broadcasted 0xFF80, and either break or _mm_packus_epi16 + _mm_storel_epi64.
Yeah, I experimented with where exactly and after how many checks the break can be and chose one of the best performing combinations. Unfortunately I didn't record those combinations and the results might be different on different CPUs.
I mean, it looks asymmetric to exercise each pair of bytes from the first uint64, but not in the second one. I understand that doing the same for the second uint64 makes the code more hairy, so maybe we can stop doing it for the first one as well? It would also allow us to avoid the union stuff.
I tried both symmetric variants and found both to be slower than this one.
My understanding is that:
- if you do too many checks by looking at individual bytes, the time cost of each SIMD loop iteration becomes higher;
- if you do too few checks (by not even looking at bytes in the first half), the probability of a SIMD loop iteration doing useful work decreases, which also hurts overall performance.

This probability differs across input data; extreme cases would be pure ASCII (the SIMD routine always works) and pure Chinese text (the SIMD routine never works). On our test data I observed that inspecting bytes in the first half was a sweet spot.
```c
src += 4;

if (eight_chars.halves[1] & 0xFF80FF80FF80FF80ULL) {
  break;
```
Here you can still pack and store the whole 0'th half using the pext instruction: https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_pext_u64&expand=4330
This PR intentionally uses only SSE2, to be compatible with any 64-bit x86 CPU. Quite a few models don't support pext, or their implementation is not fast. From Wikipedia:

> AMD processors before Zen 3 that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles rather than a single cycle. As a result, if the mask is known, it is often faster to use other instructions on AMD.
Co-authored-by: Kubo Kováč <[email protected]>
Unless there are more comments/suggestions, I'll merge this by the end of the week.
Thanks @ethercrow, performance improvements are much appreciated. And sorry that it took so long.
Fixes haskell#302.
This PR is just a collection of SSE2 patches from those PRs:
@phadej please have a look.