Optimize isSpace functions #315

ethercrow · 2020-10-25T19:50:21Z

GHC, unlike GCC, does not optimize expressions like x == 10 || x == 11 || x == 12 || x == 13 into x >= 10 && x <= 13 and further into x - 10 <= 3 (relying on unsigned wraparound), so I did it here manually. It turns out this optimization was already applied years ago to Data.Char.isSpace: https://hackage.haskell.org/package/base-4.12.0.0/docs/src/GHC.Unicode.html#isSpace

I chose the w == 0x20 || w == 0xA0 || w - 0x09 <= 4 order of terms instead of w == 0x20 || w - 0x09 <= 4 || w == 0xA0 because it was faster on my machine. It also uses one fewer register according to the Compiler Explorer. It might be beneficial to adopt this order of terms in Data.Char.isSpace as well.
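
As a rough illustration of the rewrite described above, the sketch below compares a naive chain of equality tests with the range-check form, over every Word8 value. The names isSpaceNaive and isSpaceRange are hypothetical and are not taken from the PR; the naive version is only an assumption about what an unoptimized definition would look like.

```haskell
import Data.Word (Word8)

-- Hypothetical naive definition: one equality test per whitespace byte.
isSpaceNaive :: Word8 -> Bool
isSpaceNaive w =
     w == 0x20 || w == 0xA0
  || w == 0x09 || w == 0x0A || w == 0x0B || w == 0x0C || w == 0x0D

-- Range-check form with the order of terms described above.
-- For w < 0x09 the subtraction wraps around to a large Word8 value,
-- so the single comparison w - 0x09 <= 4 covers exactly 0x09..0x0D.
isSpaceRange :: Word8 -> Bool
isSpaceRange w = w == 0x20 || w == 0xA0 || w - 0x09 <= 4

main :: IO ()
main =
  -- Exhaustive check over all 256 byte values; prints True.
  print (all (\w -> isSpaceNaive w == isSpaceRange w) [minBound .. maxBound])
```

The check only shows that the two forms agree; it says nothing about which order of terms is faster, which depends on the input and the machine.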

I also added a benchmark for words that uses isSpaceWord8 a lot.

Before:

benchmarked words/lots of words
time                 142.2 μs   (141.8 μs .. 142.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 141.9 μs   (141.7 μs .. 142.2 μs)
std dev              746.9 ns   (611.1 ns .. 927.7 ns)

benchmarked words/one huge word
time                 11.73 μs   (11.71 μs .. 11.75 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.74 μs   (11.72 μs .. 11.76 μs)
std dev              62.67 ns   (46.41 ns .. 86.93 ns)

After:

benchmarked words/lots of words
time                 133.1 μs   (132.7 μs .. 133.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 133.1 μs   (132.8 μs .. 133.5 μs)
std dev              1.003 μs   (671.2 ns .. 1.578 μs)

benchmarked words/one huge word
time                 10.11 μs   (10.08 μs .. 10.13 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.10 μs   (10.08 μs .. 10.13 μs)
std dev              71.64 ns   (50.03 ns .. 116.5 ns)

ethercrow · 2020-10-25T20:04:24Z

For the curious, here is how GCC compiles isspace, no jumps at all: https://godbolt.org/z/zb1oso

Bodigrim · 2020-10-25T20:26:56Z

The GCC implementation is equivalent to

import Data.Bits

isSpace :: Word8 -> Bool 
isSpace x = x - 0x09 <= 4 || x .&. 0x7f == 0x20

and one can use GHC.Exts.{isTrue,leWord,eqWord,and,or}# to rewrite it in a jumpless way.
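
The claimed equivalence can be checked exhaustively. The sketch below compares the masked form with Data.Char.isSpace, which for the first 256 code points accepts exactly \t, \n, \v, \f, \r, space and NBSP. The names isSpaceMasked and isSpaceRef are hypothetical helpers introduced only for this check.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, isSpace)
import Data.Word (Word8)

-- The masked form from the comment above: clearing the top bit maps
-- both 0x20 (space) and 0xA0 (NBSP) to 0x20, so one comparison covers both.
isSpaceMasked :: Word8 -> Bool
isSpaceMasked x = x - 0x09 <= 4 || x .&. 0x7f == 0x20

-- Reference predicate: reinterpret the byte as a Latin-1 code point.
isSpaceRef :: Word8 -> Bool
isSpaceRef = isSpace . chr . fromIntegral

main :: IO ()
main =
  -- Exhaustive check over all 256 byte values; prints True.
  print (all (\x -> isSpaceMasked x == isSpaceRef x) [minBound .. maxBound])
```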

ethercrow · 2020-10-25T20:55:04Z

Thanks for the hint, I just tried this:

isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 (W8# w) =
  isTrue# (orI#
    (eqWord# (and# w 0x7f##) 0x20##)       -- ' ' or nbsp
    (leWord# (minusWord# w 0x09##) 4##))  -- \t, \n, \v, \f, \r

It slowed down words/lots of words from 130-ish to 150-ish μs; it looks like the early exit on the space character is really valuable. There was no noticeable effect on words/one huge word.

If I add the 0x20 check to this version, it performs identically to what is already in the PR. So it's probably better not to add all those magic hashes.

Bodigrim · 2020-10-25T22:30:41Z

Well, your benchmark is obviously biased in favor of the short-circuiting version, but I guess it is not different in this aspect from the real-world data. Recently @vdukhovni and I discussed a similar function in haskell-streaming/streaming-bytestring#31 (comment)

vdukhovni · 2020-10-25T22:44:03Z

> Well, your benchmark is obviously biased in favor of the short-circuiting version, but I guess it is not different in this aspect from the real-world data. Recently @vdukhovni and I discussed a similar function in haskell-streaming/streaming-bytestring#31 (comment)

Indeed the proposed function is almost identical to the one in streaming bytestring, except that I optimise for most characters being non-whitespace ASCII characters, by first ruling out most of those:

-- Predicate to test whether a 'Word8' value is either ASCII whitespace,
-- or a unicode NBSP (U+00A0).  Optimised for ASCII text, with spaces
-- as the most frequent whitespace characters.
w8IsSpace :: Word8 -> Bool
w8IsSpace = \ !w8 ->
    -- Avoid the cost of narrowing arithmetic results to Word8,
    -- the conversion from Word8 to Word is free.
    let w :: Word
        !w = fromIntegral w8
     in w - 0x21 > 0x7e   -- not [0x21..0x9f]
        && ( w == 0x20    -- SP
          || w - 0x09 < 5 -- HT, NL, VT, FF, CR
          || w == 0xa0 )  -- NBSP
{-# INLINE w8IsSpace #-}

I am curious how the above compares with this PR on "real world" test data...

[ EDIT: I'm not convinced that the intersperse test case is realistic: a space as every other character is surely not that common. I'd expect short runs (>1) of non-whitespace characters to be more typical of inputs that one wants to split into "words". My attempt with a couple of paragraphs of lorem ipsum shows 6.59 μs for the above vs. 6.80 μs for the version in this PR, but the above is slower on the intersperse and one-long-word tests:

This PR:

benchmarked words/lots of words
time                 228.9 μs   (228.4 μs .. 229.7 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 229.6 μs   (229.1 μs .. 230.6 μs)
std dev              2.142 μs   (999.6 ns .. 3.601 μs)

benchmarked words/one huge word
time                 22.57 μs   (22.51 μs .. 22.70 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 22.53 μs   (22.51 μs .. 22.58 μs)
std dev              101.5 ns   (20.59 ns .. 176.2 ns)

benchmarked words/paragraphs
time                 6.818 μs   (6.737 μs .. 6.909 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.799 μs   (6.779 μs .. 6.826 μs)
std dev              79.99 ns   (60.81 ns .. 98.54 ns)

The alternative from streaming-bytestring:

benchmarked words/lots of words
time                 276.4 μs   (274.3 μs .. 278.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 274.9 μs   (273.2 μs .. 276.3 μs)
std dev              5.218 μs   (3.786 μs .. 7.338 μs)

benchmarked words/one huge word
time                 23.63 μs   (23.63 μs .. 23.64 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 23.63 μs   (23.62 μs .. 23.63 μs)
std dev              17.97 ns   (12.39 ns .. 27.20 ns)

benchmarked words/paragraphs
time                 6.583 μs   (6.572 μs .. 6.593 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.588 μs   (6.584 μs .. 6.594 μs)
std dev              16.07 ns   (11.83 ns .. 23.67 ns)

]

ethercrow · 2020-10-26T13:14:46Z

Changed the test case from intersperse ' ' bigData to lorem ipsum.

BTW how do you run only a subset of benchmarks with cabal bench? For now I'm just temporarily deleting everything that isn't relevant, which is rather cumbersome.

sjakobi · 2020-10-26T17:27:30Z

> BTW how do you run only a subset of benchmarks with cabal bench?

You can pass in options via --benchmark-option[s], e.g.

$ cabal bench bench-bytestring-builder --benchmark-option folds/scanl/1

Try passing in -h to see the various CLI options – --match might be useful.

(vincenthz/hs-gauge#97 is related)

ethercrow · 2020-10-26T19:50:57Z

Viktor's version is the winner on lorem ipsum on my machine as well. So let's adopt that.

Bodigrim · 2020-10-26T20:05:26Z

Data/ByteString/Internal.hs

+    -- the conversion from Word8 to Word is free.
+    let w :: Word
+        !w = fromIntegral w8
+     in w - 0x21 > 0x7e   -- not [x21..0x9f]


This condition discriminates 127 out of 256 possibilities. Could you please benchmark w .&. 0x50 == 0, which discriminates 192 values?

With the & 0x50 test, I get noticeably better results, which also outperform the proposed PR on all the test cases.

benchmarked words/lots of words
time                 221.2 μs   (220.8 μs .. 222.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 220.9 μs   (220.9 μs .. 221.1 μs)
std dev              338.0 ns   (159.2 ns .. 669.3 ns)

benchmarked words/one huge word
time                 18.07 μs   (17.95 μs .. 18.23 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.96 μs   (17.94 μs .. 18.01 μs)
std dev              104.3 ns   (68.50 ns .. 195.1 ns)

benchmarked words/paragraphs
time                 6.243 μs   (6.225 μs .. 6.267 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.236 μs   (6.231 μs .. 6.240 μs)
std dev              15.92 ns   (11.74 ns .. 25.17 ns)

But I get even better results combining both filters:

isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 = \ !w8 ->
    -- Avoid the cost of narrowing arithmetic results to Word8,
    -- the conversion from Word8 to Word is free.
    let w :: Word
        !w = fromIntegral w8
     in w .&. 0x50 == 0      -- Quick non-whitespace filter
        && w - 0x21 > 0x7e   -- Second non-whitespace filter
        && ( w == 0x20       -- SP
          || w - 0x09 < 5    -- HT, NL, VT, FF, CR
          || w == 0xa0 )     -- NBSP

benchmarked words/lots of words
time                 216.5 μs   (215.6 μs .. 218.3 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 215.8 μs   (215.7 μs .. 216.3 μs)
std dev              720.2 ns   (154.3 ns .. 1.500 μs)

benchmarked words/one huge word
time                 16.77 μs   (16.61 μs .. 16.94 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.87 μs   (16.82 μs .. 16.91 μs)
std dev              138.9 ns   (97.75 ns .. 183.6 ns)

benchmarked words/paragraphs
time                 6.060 μs   (6.040 μs .. 6.083 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.048 μs   (6.044 μs .. 6.054 μs)
std dev              16.40 ns   (12.03 ns .. 23.45 ns)

Just reran the PR as-is as a sanity check that nothing changed in the meantime and I get:

benchmarked words/lots of words
time                 228.5 μs   (227.6 μs .. 229.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 229.8 μs   (229.1 μs .. 232.5 μs)
std dev              3.588 μs   (216.7 ns .. 7.228 μs)

benchmarked words/one huge word
time                 22.45 μs   (22.37 μs .. 22.52 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 22.62 μs   (22.54 μs .. 22.80 μs)
std dev              374.5 ns   (181.8 ns .. 612.1 ns)

benchmarked words/paragraphs
time                 6.746 μs   (6.727 μs .. 6.760 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.776 μs   (6.760 μs .. 6.841 μs)
std dev              87.14 ns   (23.18 ns .. 190.9 ns)

Between the two filters only 33 candidate characters are left:

λ> length $ filter (\w -> w .&. 0x50 == 0 && w - 0x21 > 0x7e) [0..255 :: Word8]
33

Cool stuff.

What's more, other than whitespace, almost all are infrequent in text strings (rather than binary data):

[0,1,2,3,4,5,6,7,8 -- controls
,9,10,11,12,13 -- whitespace
,14,15 -- controls
,32,160 -- whitespace
,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175 -- ¡¢£¤¥¦§¨©ª«¬®¯
]
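
The counts quoted in this thread (127 values rejected by the range test, 192 by the mask test, 33 left after both) can be re-derived with a small sketch like the one below. The helper names passesMask, passesRange and count are hypothetical and exist only for this illustration.

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- A byte "passes" a filter when it is still a whitespace candidate.
passesMask, passesRange :: Word8 -> Bool
passesMask  w = w .&. 0x50 == 0   -- bits 4 and 6 must both be clear
passesRange w = w - 0x21 > 0x7e   -- byte must lie outside 0x21..0x9f

-- Number of byte values satisfying a predicate.
count :: (Word8 -> Bool) -> Int
count p = length (filter p [minBound .. maxBound])

main :: IO ()
main = do
  print (256 - count passesRange)                      -- 127 rejected
  print (256 - count passesMask)                       -- 192 rejected
  print (count (\w -> passesMask w && passesRange w))  -- 33 candidates left
```

The mask test fixes two bits, so it keeps a quarter of the 256 values (64) regardless of where they fall; the range test removes one contiguous block of 127 values.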

@ethercrow I see you've switched to the implementation I was testing, are you seeing similar benchmark improvements on your hardware?

Yes, the version with two filters is the fastest for me as well.
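
Because the two filters only reject bytes early, they must not exclude any real whitespace. A minimal sketch of that sanity check follows, comparing the two-filter version with Data.Char.isSpace on every byte interpreted as a Latin-1 code point. It assumes a GHC recent enough to export Word from the Prelude; the agrees helper is hypothetical.

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.Bits ((.&.))
import Data.Char (chr, isSpace)
import Data.Word (Word8)

-- The two-filter predicate as posted in this thread.
isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 = \ !w8 ->
    let w :: Word
        !w = fromIntegral w8
     in w .&. 0x50 == 0      -- quick non-whitespace filter
        && w - 0x21 > 0x7e   -- second non-whitespace filter
        && ( w == 0x20       -- SP
          || w - 0x09 < 5    -- HT, NL, VT, FF, CR
          || w == 0xa0 )     -- NBSP

main :: IO ()
main = print (all agrees [minBound .. maxBound])  -- prints True
  where
    -- Hypothetical helper: both predicates give the same answer for w.
    agrees w = isSpaceWord8 w == isSpace (chr (fromIntegral w))
```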

vdukhovni · 2020-10-26T22:04:38Z

bench/BenchAll.hs

@@ -101,6 +101,9 @@ byteStringChunksData = map (S.pack . replicate (4 ) . fromIntegral) intData
 oldByteStringChunksData :: [OldS.ByteString]
 oldByteStringChunksData = map (OldS.pack . replicate (4 ) . fromIntegral) intData

+{-# NOINLINE loremIpsum #-}
+loremIpsum :: S.ByteString
+loremIpsum = S8.pack "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\nSed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?\n"


FWIW, I would fold this across multiple lines. The version I used was:

paragraphs :: S.ByteString
paragraphs = S8.pack $
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n\
    \incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis\n\
    \nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\n\
    \Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu\n\
    \fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in\n\
    \culpa qui officia deserunt mollit anim id est laborum.\n\
    \\n\
    \Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium\n\
    \doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore\n\
    \veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim\n\
    \ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia\n\
    \consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque\n\
    \porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur,\n\
    \adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et\n\
    \dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis\n\
    \nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid\n\
    \ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea\n\
    \voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem\n\
    \eum fugiat quo voluptas nulla pariatur?"

Though one paragraph is likely sufficient...

vdukhovni

Looks good overall. The only nit (already noted) is folding the long string across multiple lines.

Bodigrim · 2020-10-28T18:19:45Z

@ethercrow Tests for GHC < 7.10 are failing.
https://travis-ci.org/github/haskell/bytestring/jobs/739381052#L285

Data/ByteString/Internal.hs:679:14:
    Not in scope: type constructor or class ‘Word’
    Perhaps you meant ‘Word8’ (imported from Data.Word)

vdukhovni · 2020-10-28T18:22:15Z

For compatibility with older GHC, you'll need to import Word from Data.Word into Data.ByteString.Internal:
old:

import Data.Word                (Word8)

new:

import Data.Word                (Word8, Word)

With that, the CI tests should pass.

Bodigrim · 2020-10-29T18:23:35Z

@ethercrow while we are waiting for @sjakobi to review, do you want me to label this as "hacktoberfest-accepted"?

ethercrow · 2020-10-29T20:19:19Z

> @ethercrow while we are waiting for @sjakobi to review, do you want me to label this as "hacktoberfest-accepted"?

I started looking at bytestring when you posted a hacktoberfest call to arms, but that hacktoberfest context was not important to me, so don't worry about it. Thank you for caring though!

Bodigrim approved these changes Oct 25, 2020

Bodigrim requested a review from sjakobi October 25, 2020 22:32

Optimize isSpace functions

c4e44bd

ethercrow force-pushed the faster-is-space branch from 8c6bdde to c4e44bd Compare October 26, 2020 19:48

Bodigrim reviewed Oct 26, 2020

Additional quick filter in isSpaceWord8

9127d6e

ethercrow force-pushed the faster-is-space branch from 8e96da8 to 9127d6e Compare October 26, 2020 20:55

vdukhovni reviewed Oct 26, 2020

vdukhovni approved these changes Oct 26, 2020

Split lorem ipsum test string into more lines

c1024c7

vdukhovni approved these changes Oct 27, 2020

Fix build with GHC<7.10

8be3627

sjakobi approved these changes Oct 29, 2020

Bodigrim added this to the 0.11.1.0 milestone Oct 29, 2020

Bodigrim merged commit 0055867 into haskell:master Oct 29, 2020

ethercrow deleted the faster-is-space branch October 29, 2020 20:19

Bodigrim added a commit to Bodigrim/bytestring that referenced this pull request Feb 17, 2021

Revert "Optimize isSpace functions (haskell#315)"

cc2287b

This reverts commit 0055867. # Conflicts: # Data/ByteString/Internal.hs # bench/BenchAll.hs

Bodigrim added a commit to Bodigrim/bytestring that referenced this pull request Feb 17, 2021

Revert "Revert "Optimize isSpace functions (haskell#315)""

d54623d

This reverts commit cc2287b.
