SSE2 patches for encoding and decoding functions #302

Merged
merged 4 commits on Apr 25, 2021
4 changes: 3 additions & 1 deletion benchmarks/haskell/Benchmarks.hs
@@ -44,7 +44,9 @@ main = do
, env (DecodeUtf8.initEnv (tf "ascii.txt")) (DecodeUtf8.benchmark "ascii")
, env (DecodeUtf8.initEnv (tf "russian.txt")) (DecodeUtf8.benchmark "russian")
, env (DecodeUtf8.initEnv (tf "japanese.txt")) (DecodeUtf8.benchmark "japanese")
, EncodeUtf8.benchmark "επανάληψη 竺法蘭共譯"
, env (DecodeUtf8.initEnv (tf "ascii.txt")) (DecodeUtf8.benchmarkASCII)
, EncodeUtf8.benchmark "non-ASCII" "επανάληψη 竺法蘭共譯"
, EncodeUtf8.benchmark "ASCII" "lorem ipsum"
, env (Equality.initEnv (tf "japanese.txt")) Equality.benchmark
, FileRead.benchmark (tf "russian.txt")
, FoldLines.benchmark (tf "russian.txt")
12 changes: 12 additions & 0 deletions benchmarks/haskell/Benchmarks/DecodeUtf8.hs
@@ -17,6 +17,7 @@
module Benchmarks.DecodeUtf8
( initEnv
, benchmark
, benchmarkASCII
) where

import Foreign.C.Types
@@ -62,6 +63,17 @@ benchmark kind ~(bs, lbs) =
, bench "LazyInitLength" $ nf (TL.length . TL.init . TL.decodeUtf8) lbs
]

benchmarkASCII :: Env -> Benchmark
benchmarkASCII ~(bs, lbs) =
bgroup "DecodeASCII"
[ C.bench "strict decodeUtf8" $ nf T.decodeUtf8 bs
, C.bench "strict decodeLatin1" $ nf T.decodeLatin1 bs
, C.bench "strict decodeASCII" $ nf T.decodeASCII bs
, C.bench "lazy decodeUtf8" $ nf TL.decodeUtf8 lbs
, C.bench "lazy decodeLatin1" $ nf TL.decodeLatin1 lbs
, C.bench "lazy decodeASCII" $ nf TL.decodeASCII lbs
]

iconv :: B.ByteString -> IO CInt
iconv (PS fp off len) = withForeignPtr fp $ \ptr ->
time_iconv (ptr `plusPtr` off) (fromIntegral len)
8 changes: 4 additions & 4 deletions benchmarks/haskell/Benchmarks/EncodeUtf8.hs
@@ -18,11 +18,11 @@ import qualified Data.Text.Encoding as T
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TL

benchmark :: String -> Benchmark
benchmark string =
benchmark :: String -> String -> Benchmark
benchmark name string =
bgroup "EncodeUtf8"
[ bench "Text" $ whnf (B.length . T.encodeUtf8) text
, bench "LazyText" $ whnf (BL.length . TL.encodeUtf8) lazyText
[ bench ("Text (" ++ name ++ ")") $ whnf (B.length . T.encodeUtf8) text
, bench ("LazyText (" ++ name ++ ")") $ whnf (BL.length . TL.encodeUtf8) lazyText
]
where
-- The string in different formats
110 changes: 78 additions & 32 deletions cbits/cbits.c
@@ -9,6 +9,11 @@
#include <string.h>
#include <stdint.h>
#include <stdio.h>
#if defined(__x86_64__)
#include <emmintrin.h>
#include <xmmintrin.h>
#endif

#include "text_cbits.h"

void _hs_text_memcpy(void *dest, size_t doff, const void *src, size_t soff,
@@ -82,6 +87,23 @@ _hs_text_decode_latin1(uint16_t *dest, const uint8_t *src,
while (p != srcend && (uintptr_t)p & 0x3)
*dest++ = *p++;

#if defined(__x86_64__)
/* All the intrinsics used here are from SSE2,
* so every x86_64 CPU supports them.
*/
const __m128i zeros = _mm_set1_epi32(0);
while (p < srcend - 7) {
/* Load 8 bytes of Latin-1 data */
const __m128i ascii = _mm_cvtsi64_si128(*((const uint64_t *)p));
/* Interleave with zeros */
const __m128i utf16 = _mm_unpacklo_epi8(ascii, zeros);
/* Store the resulting 16 bytes into destination */
_mm_storeu_si128((__m128i *)dest, utf16);

dest += 8;
p += 8;
}
#else
/* iterate over 32-bit aligned loads */
while (p < srcend - 3) {
const uint32_t w = *((const uint32_t *)p);
@@ -93,6 +115,7 @@

p += 4;
}
#endif
#endif

/* handle unaligned suffix */
@@ -157,24 +180,40 @@ _hs_text_decode_utf8_int(uint16_t *const dest, size_t *destoff,
*/

if (state == UTF8_ACCEPT) {
#if defined(__x86_64__)
const __m128i zeros = _mm_set1_epi32(0);
while (s < srcend - 8) {
const uint64_t hopefully_eight_ascii_chars = *((uint64_t *) s);
if ((hopefully_eight_ascii_chars & 0x8080808080808080LL) != 0LL)
break;
s += 8;

/* Load 8 bytes of ASCII data */
const __m128i eight_ascii_chars = _mm_cvtsi64_si128(hopefully_eight_ascii_chars);
/* Interleave with zeros */
const __m128i eight_utf16_chars = _mm_unpacklo_epi8(eight_ascii_chars, zeros);
/* Store the resulting 16 bytes into destination */
_mm_storeu_si128((__m128i *)d, eight_utf16_chars);
d += 8;
}
#else
while (s < srcend - 4) {
codepoint = *((uint32_t *) s);
if ((codepoint & 0x80808080) != 0)
break;
s += 4;
/*
* Tried 32-bit stores here, but the extra bit-twiddling
* slowed the code down.
*/
*d++ = (uint16_t) (codepoint & 0xff);
*d++ = (uint16_t) ((codepoint >> 8) & 0xff);
*d++ = (uint16_t) ((codepoint >> 16) & 0xff);
*d++ = (uint16_t) ((codepoint >> 24) & 0xff);
}
#endif
last = s;
} /* end if (state == UTF8_ACCEPT) */
#endif

if (decode(&state, &codepoint, *s++) != UTF8_ACCEPT) {
@@ -237,29 +276,36 @@ _hs_text_encode_utf8(uint8_t **destp, const uint16_t *src, size_t srcoff,

ascii:
#if defined(__x86_64__)
while (srcend - src >= 4) {
uint64_t w = *((uint64_t *) src);
while (srcend - src >= 8) {
union { uint64_t halves[2]; __m128i whole; } eight_chars;
eight_chars.whole = _mm_loadu_si128((__m128i *) src);

const uint64_t w = eight_chars.halves[0];
if (w & 0xFF80FF80FF80FF80ULL) {
Contributor: Could we just break here, actually? Does it affect performance? I imagine we could compare a whole 128-bit register with a broadcast 0xFF80, and either break or _mm_packus_epi16 + _mm_storel_epi64.

Contributor Author: Yeah, I experimented with where exactly and after how many checks the break can go, and chose one of the best-performing combinations. Unfortunately I didn't record those combinations, and the results might differ on different CPUs.

Contributor: I mean, it looks asymmetric to check each pair of bytes in the first uint64 but not in the second one. I understand that doing the same for the second uint64 makes the code hairier, so maybe we can stop doing it for the first one as well? It would also let us avoid the union.

Contributor Author: I tried both symmetric variants and found both to be slower than this one.

My understanding is that

  1. if you do too many checks by looking at individual bytes, each iteration of the SIMD loop costs more;
  2. if you do too few checks (by not even looking at the bytes in the first half), the probability that a SIMD loop iteration does useful work drops, which also hurts overall performance.

This probability varies with the input data; the extreme cases are pure ASCII (the SIMD routine always works) and pure Chinese text (the SIMD routine never works). On our test data, inspecting the bytes in the first half was the sweet spot.
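For illustration, here is a minimal, hypothetical sketch of the fully symmetric variant discussed above: test all eight code units at once with SSE2 and, if every one is ASCII, pack with _mm_packus_epi16 and store with _mm_storel_epi64. The helper name and signature are invented for this example and are not part of the PR.

#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: emit UTF-16 code units as bytes, eight at a time,
 * for as long as every unit in the current block is ASCII (< 0x80).
 * Returns the number of units consumed; the caller handles the rest. */
static size_t encode_ascii_prefix_sse2(uint8_t *dest, const uint16_t *src, size_t len)
{
  size_t i = 0;
  const __m128i non_ascii = _mm_set1_epi16((int16_t)0xFF80);
  while (i + 8 <= len) {
    const __m128i chars = _mm_loadu_si128((const __m128i *)(src + i));
    /* A unit is ASCII iff (unit & 0xFF80) == 0; test all eight lanes at once. */
    const __m128i hi = _mm_and_si128(chars, non_ascii);
    const __m128i ok = _mm_cmpeq_epi16(hi, _mm_setzero_si128());
    if (_mm_movemask_epi8(ok) != 0xFFFF)
      break;  /* some unit is >= 0x80: leave the rest to the scalar path */
    /* Narrow the eight 16-bit lanes to bytes and store the low 8 bytes. */
    const __m128i packed = _mm_packus_epi16(chars, chars);
    _mm_storel_epi64((__m128i *)(dest + i), packed);
    i += 8;
  }
  return i;
}

Whether this beats the asymmetric version in the PR depends, as the author notes, on how often an eight-unit block is entirely ASCII in the benchmark data.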

/* One of the first four units is non-ASCII: flush its leading ASCII
 * prefix one unit at a time, then fall back to the scalar path. */
if (!(w & 0x000000000000FF80ULL)) {
*dest++ = w & 0xFFFF;
src++;
if (!(w & 0x00000000FF800000ULL)) {
*dest++ = (w >> 16) & 0xFFFF;
src++;
if (!(w & 0x0000FF8000000000ULL)) {
*dest++ = (w >> 32) & 0xFFFF;
src++;
}
}
}
break;
}
*dest++ = w & 0xFFFF;
*dest++ = (w >> 16) & 0xFFFF;
*dest++ = (w >> 32) & 0xFFFF;
*dest++ = w >> 48;
src += 4;

if (eight_chars.halves[1] & 0xFF80FF80FF80FF80ULL) {
break;
Contributor: Here you can still pack and store the whole 0'th half using the pext instruction:
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_pext_u64&expand=4330

Contributor Author (@ethercrow, Apr 20, 2021): This PR intentionally uses only SSE2 to be compatible with any 64-bit x86 CPU. Quite a few models don't support pext, or their implementation is not fast. From Wikipedia:

"AMD processors before Zen 3 that implement PDEP and PEXT do so in microcode, with a latency of 18 cycles rather than a single cycle. As a result, if the mask is known, it is often faster to use other instructions on AMD."
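For illustration only, a sketch of the pext-based packing the reviewer suggests. It relies on _pext_u64 from BMI2 rather than SSE2, which is exactly the dependency the PR avoids; the helper name is hypothetical.

#include <immintrin.h>  /* _pext_u64 requires BMI2 (e.g. compile with -mbmi2) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: when control reaches the halves[1] check, the first
 * four UTF-16 units (halves[0]) are already known to be ASCII, so they could
 * be squeezed into four bytes with a single PEXT and stored before breaking
 * out of the SIMD loop. */
static inline uint8_t *flush_ascii_half_pext(uint8_t *dest, uint64_t half)
{
  /* Keep only the low byte of each 16-bit unit, compacted into the low 32 bits. */
  const uint32_t packed = (uint32_t)_pext_u64(half, 0x00FF00FF00FF00FFULL);
  memcpy(dest, &packed, sizeof packed);
  return dest + 4;
}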

}

/* All eight units are ASCII: narrow them to bytes and store 8 bytes at once */
const __m128i eight_ascii_chars = _mm_packus_epi16(eight_chars.whole, eight_chars.whole);
_mm_storel_epi64((__m128i *)dest, eight_ascii_chars);

dest += 8;
src += 8;
}
#endif
