Optimize isSpace functions #315

Merged (4 commits) on Oct 29, 2020


@ethercrow ethercrow commented Oct 25, 2020

GHC, unlike GCC, does not optimize expressions like x == 10 || x == 11 || x == 12 || x == 13 into x >= 10 && x <= 13 and further into x - 10 <= 3 (using unsigned arithmetic), so I did it manually here. It turns out this optimization was already applied years ago to Data.Char.isSpace: https://hackage.haskell.org/package/base-4.12.0.0/docs/src/GHC.Unicode.html#isSpace
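
As a minimal sketch of the rewrite described above (the function names are made up for illustration and are not part of the PR), the folded form can be checked against the chain of equality tests over every Word8 value:

```haskell
import Data.Word (Word8)

-- The chain of equality tests from the description above.
inRangeNaive :: Word8 -> Bool
inRangeNaive x = x == 10 || x == 11 || x == 12 || x == 13

-- One subtraction and one comparison. For x < 10 the unsigned
-- subtraction wraps around to a large value, so the test fails.
inRangeFolded :: Word8 -> Bool
inRangeFolded x = x - 10 <= 3

-- Exhaustive check over all 256 Word8 values.
foldedMatchesNaive :: Bool
foldedMatchesNaive =
  all (\x -> inRangeNaive x == inRangeFolded x) [minBound .. maxBound]
```

The trick relies on Word8 arithmetic wrapping on underflow; with a signed type the single comparison would also accept values below 10.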

I chose the w == 0x20 || w == 0xA0 || w - 0x09 <= 4 order of terms instead of w == 0x20 || w - 0x09 <= 4 || w == 0xA0 because it was faster on my machine. It also uses one fewer register according to the Compiler Explorer. It might be beneficial to adopt this order of terms in Data.Char.isSpace as well.

I also added a benchmark for words that uses isSpaceWord8 a lot.

Before:

benchmarked words/lots of words
time                 142.2 μs   (141.8 μs .. 142.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 141.9 μs   (141.7 μs .. 142.2 μs)
std dev              746.9 ns   (611.1 ns .. 927.7 ns)

benchmarked words/one huge word
time                 11.73 μs   (11.71 μs .. 11.75 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.74 μs   (11.72 μs .. 11.76 μs)
std dev              62.67 ns   (46.41 ns .. 86.93 ns)

After:

benchmarked words/lots of words
time                 133.1 μs   (132.7 μs .. 133.7 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 133.1 μs   (132.8 μs .. 133.5 μs)
std dev              1.003 μs   (671.2 ns .. 1.578 μs)

benchmarked words/one huge word
time                 10.11 μs   (10.08 μs .. 10.13 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 10.10 μs   (10.08 μs .. 10.13 μs)
std dev              71.64 ns   (50.03 ns .. 116.5 ns)

@ethercrow
Contributor Author

For the curious, here is how GCC compiles isspace, no jumps at all: https://godbolt.org/z/zb1oso

@Bodigrim
Contributor

Bodigrim commented Oct 25, 2020

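The GCC implementation is equivalent to

import Data.Bits ((.&.))
import Data.Word (Word8)

-- Range test for \t..\r, then a masked test that matches both 0x20 and 0xA0.
isSpace :: Word8 -> Bool
isSpace x = x - 0x09 <= 4 || x .&. 0x7f == 0x20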

and one can use GHC.Exts.{isTrue,leWord,eqWord,and,or}# to rewrite it in a jumpless way.
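
A quick way to convince oneself of the claimed equivalence is an exhaustive comparison. The sketch below uses illustrative names (isSpacePR, isSpaceMasked) that do not appear in the PR; it only assumes the two expressions quoted in this thread:

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Form quoted in the PR description.
isSpacePR :: Word8 -> Bool
isSpacePR w = w == 0x20 || w == 0xA0 || w - 0x09 <= 4

-- Form equivalent to GCC's output: clearing the top bit maps both
-- 0x20 (SP) and 0xA0 (NBSP) to 0x20, so one comparison covers both.
isSpaceMasked :: Word8 -> Bool
isSpaceMasked x = x - 0x09 <= 4 || x .&. 0x7f == 0x20

-- Exhaustive check over all 256 Word8 values.
masksAgree :: Bool
masksAgree =
  all (\w -> isSpacePR w == isSpaceMasked w) [minBound .. maxBound]
```

The masked form trades two equality tests for one AND plus one comparison, which is what makes a jump-free encoding natural.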

@ethercrow
Contributor Author

ethercrow commented Oct 25, 2020

Thanks for the hint, I just tried this:

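{-# LANGUAGE MagicHash #-}
import GHC.Exts (and#, eqWord#, isTrue#, leWord#, minusWord#, orI#)
import GHC.Word (Word8 (W8#))

-- Branch-free variant; assumes W8# wraps a Word#, as in the GHC releases of the time.
isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 (W8# w) =
  isTrue# (orI#
    (eqWord# (and# w 0x7f##) 0x20##)       -- ' ' or nbsp
    (leWord# (minusWord# w 0x09##) 4##))  -- \t, \n, \v, \f, \r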

It slowed down words/lots of words from about 130 μs to about 150 μs, so it looks like the early exit on the space character is really valuable. There was no noticeable effect on words/one huge word.

If I add the 0x20 check to this version, it performs identically to what is already in the PR. So it's probably better not to add all those magic hashes.

@Bodigrim
Contributor

Well, your benchmark is obviously biased in favor of the short-circuiting version, but I guess it is not different in this aspect from the real-world data. Recently @vdukhovni and I discussed a similar function in haskell-streaming/streaming-bytestring#31 (comment)

@Bodigrim Bodigrim requested a review from sjakobi October 25, 2020 22:32
@vdukhovni
Contributor

vdukhovni commented Oct 25, 2020

Well, your benchmark is obviously biased in favor of the short-circuiting version, but I guess it is not different in this aspect from the real-world data. Recently @vdukhovni and I discussed a similar function in haskell-streaming/streaming-bytestring#31 (comment)

Indeed the proposed function is almost identical to the one in streaming bytestring, except that I optimise for most characters being non-whitespace ASCII characters, by first ruling out most of those:

{-# LANGUAGE BangPatterns #-}
import Data.Word (Word8)

-- Predicate to test whether a 'Word8' value is either ASCII whitespace,
-- or a unicode NBSP (U+00A0).  Optimised for ASCII text, with spaces
-- as the most frequent whitespace characters.
w8IsSpace :: Word8 -> Bool
w8IsSpace = \ !w8 ->
    -- Avoid the cost of narrowing arithmetic results to Word8,
    -- the conversion from Word8 to Word is free.
    let w :: Word
        !w = fromIntegral w8
     in w - 0x21 > 0x7e   -- not [0x21..0x9f]
        && ( w == 0x20    -- SP
          || w - 0x09 < 5 -- HT, NL, VT, FF, CR
          || w == 0xa0 )  -- NBSP
{-# INLINE w8IsSpace #-}

I am curious how the above compares with this PR on "real world" test data...

[ EDIT: I'm not convinced that the intersperse test case is realistic: a space as every other character is surely not that common. I'd expect short runs (>1) of non-whitespace characters to be more typical of inputs that one is interested in splitting into "words". My attempt with a couple of paragraphs of lorem ipsum shows 6.59 μs for the above vs. 6.80 μs for the version in this PR, but the above is slower on the intersperse and one-long-word tests:

This PR:

benchmarked words/lots of words
time                 228.9 μs   (228.4 μs .. 229.7 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 229.6 μs   (229.1 μs .. 230.6 μs)
std dev              2.142 μs   (999.6 ns .. 3.601 μs)

benchmarked words/one huge word
time                 22.57 μs   (22.51 μs .. 22.70 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 22.53 μs   (22.51 μs .. 22.58 μs)
std dev              101.5 ns   (20.59 ns .. 176.2 ns)

benchmarked words/paragraphs
time                 6.818 μs   (6.737 μs .. 6.909 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 6.799 μs   (6.779 μs .. 6.826 μs)
std dev              79.99 ns   (60.81 ns .. 98.54 ns)

The alternative from streaming-bytestring:

benchmarked words/lots of words
time                 276.4 μs   (274.3 μs .. 278.8 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 274.9 μs   (273.2 μs .. 276.3 μs)
std dev              5.218 μs   (3.786 μs .. 7.338 μs)

benchmarked words/one huge word
time                 23.63 μs   (23.63 μs .. 23.64 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 23.63 μs   (23.62 μs .. 23.63 μs)
std dev              17.97 ns   (12.39 ns .. 27.20 ns)

benchmarked words/paragraphs
time                 6.583 μs   (6.572 μs .. 6.593 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.588 μs   (6.584 μs .. 6.594 μs)
std dev              16.07 ns   (11.83 ns .. 23.67 ns)

]

@ethercrow
Contributor Author

Changed the test case from intersperse ' ' bigData to lorem ipsum.

BTW how do you run only a subset of benchmarks with cabal bench? For now I'm temporarily deleting everything that isn't relevant, which is rather cumbersome.

@sjakobi
Member

sjakobi commented Oct 26, 2020

BTW how do you run only a subset of benchmarks with cabal bench?

You can pass in options via --benchmark-option[s], e.g.

$ cabal bench bench-bytestring-builder --benchmark-option folds/scanl/1

Try passing in -h to see the various CLI options – --match might be useful.

(vincenthz/hs-gauge#97 is related)

@ethercrow
Contributor Author

Viktor's version is the winner on lorem ipsum on my machine as well. So let's adopt that.

    -- the conversion from Word8 to Word is free.
    let w :: Word
        !w = fromIntegral w8
     in w - 0x21 > 0x7e   -- not [0x21..0x9f]
Contributor

This condition discriminates 127 out of 256 possibilities. Could you please benchmark w .&. 0x50 == 0, which discriminates 192 values?
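
The two counts quoted here can be reproduced with a small sketch. The helper names (rejected, rangeFilter, maskFilter) are illustrative only; the filter expressions are the ones under discussion:

```haskell
import Data.Bits ((.&.))

-- Number of byte values (0..255) that a filter rules out immediately.
rejected :: (Word -> Bool) -> Int
rejected passes = length (filter (not . passes) [0 .. 255])

rangeFilter, maskFilter :: Word -> Bool
rangeFilter w = w - 0x21 > 0x7e   -- passes everything outside [0x21..0x9f]
maskFilter  w = w .&. 0x50 == 0   -- passes bytes with bits 4 and 6 clear
```

Here rejected rangeFilter evaluates to 127 and rejected maskFilter to 192, matching the figures in the comment: the range covers 0x9f - 0x21 + 1 = 127 values, while requiring two fixed bits to be clear keeps only a quarter of the 256 values.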

Contributor

With the & 0x50 test, I get noticeably better results, which also outperform the proposed PR on all the test cases.

benchmarked words/lots of words
time                 221.2 μs   (220.8 μs .. 222.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 220.9 μs   (220.9 μs .. 221.1 μs)
std dev              338.0 ns   (159.2 ns .. 669.3 ns)

benchmarked words/one huge word
time                 18.07 μs   (17.95 μs .. 18.23 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 17.96 μs   (17.94 μs .. 18.01 μs)
std dev              104.3 ns   (68.50 ns .. 195.1 ns)

benchmarked words/paragraphs
time                 6.243 μs   (6.225 μs .. 6.267 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.236 μs   (6.231 μs .. 6.240 μs)
std dev              15.92 ns   (11.74 ns .. 25.17 ns)

Contributor

But I get even better results combining both filters:

{-# LANGUAGE BangPatterns #-}
import Data.Bits ((.&.))
import Data.Word (Word8)

isSpaceWord8 :: Word8 -> Bool
isSpaceWord8 = \ !w8 ->
    -- Avoid the cost of narrowing arithmetic results to Word8,
    -- the conversion from Word8 to Word is free.
    let w :: Word
        !w = fromIntegral w8
     in w .&. 0x50 == 0   -- Quick non-whitespace filter
        && w - 0x21 > 0x7e -- Second non-whitespace filter, rules out [0x21..0x9f]
        && ( w == 0x20    -- SP
          || w - 0x09 < 5 -- HT, NL, VT, FF, CR
          || w == 0xa0 )  -- NBSP

benchmarked words/lots of words
time                 216.5 μs   (215.6 μs .. 218.3 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 215.8 μs   (215.7 μs .. 216.3 μs)
std dev              720.2 ns   (154.3 ns .. 1.500 μs)

benchmarked words/one huge word
time                 16.77 μs   (16.61 μs .. 16.94 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 16.87 μs   (16.82 μs .. 16.91 μs)
std dev              138.9 ns   (97.75 ns .. 183.6 ns)

benchmarked words/paragraphs
time                 6.060 μs   (6.040 μs .. 6.083 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.048 μs   (6.044 μs .. 6.054 μs)
std dev              16.40 ns   (12.03 ns .. 23.45 ns)

Just reran the PR as-is as a sanity check that nothing changed in the meantime and I get:

benchmarked words/lots of words
time                 228.5 μs   (227.6 μs .. 229.0 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 229.8 μs   (229.1 μs .. 232.5 μs)
std dev              3.588 μs   (216.7 ns .. 7.228 μs)

benchmarked words/one huge word
time                 22.45 μs   (22.37 μs .. 22.52 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 22.62 μs   (22.54 μs .. 22.80 μs)
std dev              374.5 ns   (181.8 ns .. 612.1 ns)

benchmarked words/paragraphs
time                 6.746 μs   (6.727 μs .. 6.760 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 6.776 μs   (6.760 μs .. 6.841 μs)
std dev              87.14 ns   (23.18 ns .. 190.9 ns)
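
Since the two pre-filters are purely a speed-up, they must not change the result for any byte. A minimal sketch of that check, using hypothetical names (reference, twoFilter) and assuming the two-filter expression quoted earlier in this comment:

```haskell
import Data.Bits ((.&.))
import Data.Word (Word8)

-- Straightforward definition: SP, NBSP, or one of \t \n \v \f \r.
reference :: Word8 -> Bool
reference w = w == 0x20 || w == 0xA0 || (w >= 0x09 && w <= 0x0D)

-- Two cheap filters in front of the exact comparisons, computed on Word.
twoFilter :: Word8 -> Bool
twoFilter w8 =
  let w = fromIntegral w8 :: Word
   in w .&. 0x50 == 0
        && w - 0x21 > 0x7e
        && (w == 0x20 || w - 0x09 < 5 || w == 0xa0)

-- Exhaustive check over all 256 Word8 values.
filtersPreserveResult :: Bool
filtersPreserveResult =
  all (\w -> reference w == twoFilter w) [minBound .. maxBound]
```

The check passes because every whitespace byte has bits 4 and 6 clear and lies outside [0x21..0x9f], so neither filter can reject it.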

Contributor

Between the two filters only 33 candidate characters are left:

λ> length $ filter (\w -> w .&. 0x50 == 0 && w - 0x21 > 0x7e) [0..255 :: Word8]
33

Contributor Author

Cool stuff.

Contributor

What's more, other than whitespace, almost all are infrequent in text strings (rather than binary data):

[0,1,2,3,4,5,6,7,8 -- controls
,9,10,11,12,13 -- whitespace
,14,15 -- controls
,32,160 -- whitespace
,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175 -- ¡¢£¤¥¦§¨©ª«¬­®¯
]
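
The list above can be regenerated and split into real whitespace and false candidates with a short sketch. The names survivors, isWhitespace and splitSurvivors are illustrative and not taken from the PR:

```haskell
import Data.Bits ((.&.))
import Data.List (partition)
import Data.Word (Word8)

-- Bytes that pass both quick filters and reach the exact comparisons.
survivors :: [Word8]
survivors =
  filter (\w -> w .&. 0x50 == 0 && w - 0x21 > 0x7e) [minBound .. maxBound]

-- Plain reference definition of the whitespace set.
isWhitespace :: Word8 -> Bool
isWhitespace w = w `elem` ([0x09 .. 0x0d] ++ [0x20, 0xa0])

-- (real whitespace, false candidates) among the survivors.
splitSurvivors :: ([Word8], [Word8])
splitSurvivors = partition isWhitespace survivors
```

Of the 33 survivors, 7 are whitespace and the remaining 26 are control characters or Latin-1 punctuation, which is why the slower exact comparisons rarely run on non-whitespace input in ordinary text.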

Contributor

@vdukhovni vdukhovni Oct 26, 2020

@ethercrow I see you've switched to the implementation I was testing. Are you seeing similar benchmark improvements on your hardware?

Contributor Author

Yes, the version with two filters is the fastest for me as well.

@@ -101,6 +101,9 @@ byteStringChunksData = map (S.pack . replicate (4 ) . fromIntegral) intData
oldByteStringChunksData :: [OldS.ByteString]
oldByteStringChunksData = map (OldS.pack . replicate (4 ) . fromIntegral) intData

{-# NOINLINE loremIpsum #-}
loremIpsum :: S.ByteString
loremIpsum = S8.pack "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\nSed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur, adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem eum fugiat quo voluptas nulla pariatur?\n"
Contributor

FWIW, I would fold this across multiple lines. The version I used was:

paragraphs :: S.ByteString
paragraphs = S8.pack $
   "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor\n\
   \incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis\n\
   \nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.\n\
   \Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu\n\
   \fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in\n\
   \culpa qui officia deserunt mollit anim id est laborum.\n\
   \\n\
   \Sed ut perspiciatis unde omnis iste natus error sit voluptatem accusantium\n\
   \doloremque laudantium, totam rem aperiam, eaque ipsa quae ab illo inventore\n\
   \veritatis et quasi architecto beatae vitae dicta sunt explicabo. Nemo enim\n\
   \ipsam voluptatem quia voluptas sit aspernatur aut odit aut fugit, sed quia\n\
   \consequuntur magni dolores eos qui ratione voluptatem sequi nesciunt. Neque\n\
   \porro quisquam est, qui dolorem ipsum quia dolor sit amet, consectetur,\n\
   \adipisci velit, sed quia non numquam eius modi tempora incidunt ut labore et\n\
   \dolore magnam aliquam quaerat voluptatem. Ut enim ad minima veniam, quis\n\
   \nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid\n\
   \ex ea commodi consequatur? Quis autem vel eum iure reprehenderit qui in ea\n\
   \voluptate velit esse quam nihil molestiae consequatur, vel illum qui dolorem\n\
   \eum fugiat quo voluptas nulla pariatur?"

Though one paragraph is likely sufficient...

Contributor

@vdukhovni vdukhovni left a comment

Looks good overall. The only nit (already noted) is folding the long string across multiple lines.

@Bodigrim
Contributor

@ethercrow Tests for GHC < 7.10 are failing.
https://travis-ci.org/github/haskell/bytestring/jobs/739381052#L285

Data/ByteString/Internal.hs:679:14:
    Not in scope: type constructor or class ‘Word’
    Perhaps you meant ‘Word8’ (imported from Data.Word)

@vdukhovni
Contributor

For compatibility with older GHC (before 7.10, where the Prelude does not export Word), you'll need to import Word from Data.Word into Data.ByteString.Internal:
old:

import Data.Word                (Word8)

new:

import Data.Word                (Word8, Word)

With that, the CI tests should pass.

@Bodigrim
Contributor

@ethercrow while we are waiting for @sjakobi to review, do you want me to label this as "hacktoberfest-accepted"?

@Bodigrim Bodigrim added this to the 0.11.1.0 milestone Oct 29, 2020
@Bodigrim Bodigrim merged commit 0055867 into haskell:master Oct 29, 2020
@ethercrow
Contributor Author

@ethercrow while we are waiting for @sjakobi to review, do you want me to label this as "hacktoberfest-accepted"?

I started looking at bytestring when you posted a hacktoberfest call to arms, but that hacktoberfest context was not important to me, so don't worry about it. Thank you for caring though!

@ethercrow ethercrow deleted the faster-is-space branch October 29, 2020 20:19
Bodigrim added a commit to Bodigrim/bytestring that referenced this pull request Feb 17, 2021
This reverts commit 0055867.

# Conflicts:
#	Data/ByteString/Internal.hs
#	bench/BenchAll.hs
Bodigrim added a commit to Bodigrim/bytestring that referenced this pull request Feb 17, 2021

4 participants