Hand SAT optimization for Data.ByteString.unfoldrN #356

Merged: 3 commits merged into haskell:master from the SkamDart/SAT-unfoldrN branch on May 26, 2021

Conversation

SkamDart
Contributor

This follows the approach described in #350: it optimizes unfoldrN by hand-applying the static argument transformation (SAT), floating p out of the inner loop and writing bytes with pokeByteOff.
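
For readers unfamiliar with the technique, here is a minimal sketch of what a hand-applied SAT can look like for a function of this shape. It is not the diff from this PR: the name unfoldrNSat and the helper fill are hypothetical, and the sketch only assumes that createAndTrim' is exported from Data.ByteString.Internal and that pokeByteOff comes from Foreign.Storable. The point is that the buffer pointer p, the bound i and the generator f stay in the enclosing scope, so the recursive loop only passes the two values that actually change.

```haskell
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import qualified Data.ByteString as S
import Data.ByteString.Internal (createAndTrim')
import Data.Word (Word8)
import Foreign.Storable (pokeByteOff)
import System.IO.Unsafe (unsafePerformIO)

-- Hypothetical name: a sketch of the technique, not the library source.
unfoldrNSat :: Int -> (a -> Maybe (Word8, a)) -> a -> (S.ByteString, Maybe a)
unfoldrNSat i f x0
  | i < 0     = (S.empty, Just x0)
  | otherwise = unsafePerformIO (createAndTrim' i fill)
  where
    -- fill receives the buffer pointer once; the inner loop closes over it.
    fill p = go x0 0
      where
        -- Only the seed x and the write offset n change between iterations.
        -- p, i and f are static, so they are not passed as loop arguments.
        go !x !n
          | n == i    = return (0, n, Just x)
          | otherwise = case f x of
              Nothing      -> return (0, n, Nothing)
              Just (w, x') -> do
                pokeByteOff p n w   -- write at offset n instead of advancing p
                go x' (n + 1)

main :: IO ()
main = do
  let step a = Just (a, a + 1)
  -- Compare the sketch against the library function on a small input.
  print (unfoldrNSat 5 step 0 == S.unfoldrN 5 step 0)
```

Whether this shape is faster than passing p through the loop depends on what GHC does with each version, which is why the discussion below asks for benchmarks.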

@Bodigrim
Contributor

Cool! Could you please add a benchmark to bench/BenchAll.hs, so that we can estimate performance gains?

And if you throw {-# INLINE createAndTrim' #-} atop, we'll be able to resolve #128 as well.

@Bodigrim
Contributor

Bodigrim commented Feb 4, 2021

@SkamDart feel free to ping me if you need any guidance on adding a benchmark.

@SkamDart
Contributor Author

SkamDart commented Feb 5, 2021

@Bodigrim - Thanks for reaching out. I will address your comments this weekend and reach out to you if I run into any trouble!

@SkamDart
Contributor Author

SkamDart commented Feb 9, 2021

With optimizations:

benchmarked folds/unfoldrN/1
time                 28.97 ns   (28.90 ns .. 29.05 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 28.99 ns   (28.95 ns .. 29.17 ns)
std dev              231.1 ps   (91.93 ps .. 484.9 ps)

benchmarked folds/unfoldrN/2
time                 29.14 ns   (29.08 ns .. 29.20 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.14 ns   (29.12 ns .. 29.18 ns)
std dev              104.3 ps   (88.14 ps .. 125.2 ps)

benchmarked folds/unfoldrN/4
time                 29.10 ns   (29.01 ns .. 29.19 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.14 ns   (29.11 ns .. 29.17 ns)
std dev              102.6 ps   (81.06 ps .. 142.8 ps)

benchmarked folds/unfoldrN/8
time                 29.13 ns   (29.06 ns .. 29.23 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.13 ns   (29.10 ns .. 29.17 ns)
std dev              128.4 ps   (86.35 ps .. 208.7 ps)

benchmarked folds/unfoldrN/16
time                 29.12 ns   (29.05 ns .. 29.21 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.13 ns   (29.11 ns .. 29.16 ns)
std dev              75.93 ps   (63.56 ps .. 95.53 ps)

benchmarked folds/unfoldrN/32
time                 29.44 ns   (29.39 ns .. 29.48 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.36 ns   (29.33 ns .. 29.39 ns)
std dev              96.41 ps   (80.49 ps .. 121.5 ps)

benchmarked folds/unfoldrN/64
time                 29.70 ns   (29.66 ns .. 29.74 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 29.71 ns   (29.69 ns .. 29.73 ns)
std dev              66.84 ps   (57.23 ps .. 79.83 ps)

benchmarked folds/unfoldrN/128
time                 30.55 ns   (30.51 ns .. 30.61 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 30.60 ns   (30.58 ns .. 30.64 ns)
std dev              96.43 ps   (68.66 ps .. 160.4 ps)

benchmarked folds/unfoldrN/256
time                 32.02 ns   (31.97 ns .. 32.08 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 32.02 ns   (32.00 ns .. 32.05 ns)
std dev              83.63 ps   (63.89 ps .. 109.6 ps)

benchmarked folds/unfoldrN/512
time                 34.14 ns   (34.07 ns .. 34.23 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 34.10 ns   (34.07 ns .. 34.15 ns)
std dev              139.0 ps   (104.2 ps .. 223.7 ps)

benchmarked folds/unfoldrN/1024
time                 37.54 ns   (37.49 ns .. 37.62 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 37.69 ns   (37.65 ns .. 37.73 ns)
std dev              142.4 ps   (120.0 ps .. 176.0 ps)

benchmarked folds/unfoldrN/2048
time                 55.08 ns   (54.89 ns .. 55.28 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 54.93 ns   (54.83 ns .. 55.06 ns)
std dev              375.7 ps   (304.2 ps .. 474.4 ps)

benchmarked folds/unfoldrN/4096
time                 61.73 ns   (61.44 ns .. 62.14 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 61.75 ns   (61.62 ns .. 61.89 ns)
std dev              487.2 ps   (414.7 ps .. 572.1 ps)

benchmarked folds/unfoldrN/8192
time                 70.54 ns   (70.27 ns .. 70.74 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 70.66 ns   (70.51 ns .. 70.88 ns)
std dev              595.2 ps   (354.2 ps .. 1.027 ns)

benchmarked folds/unfoldrN/16384
time                 86.74 ns   (86.13 ns .. 87.33 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 86.39 ns   (86.21 ns .. 86.56 ns)
std dev              587.8 ps   (493.2 ps .. 692.3 ps)

benchmarked folds/unfoldrN/32768
time                 118.8 ns   (117.5 ns .. 120.7 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 118.2 ns   (118.0 ns .. 118.6 ns)
std dev              1.048 ns   (654.2 ps .. 1.647 ns)

benchmarked folds/unfoldrN/65536
time                 181.9 ns   (181.3 ns .. 182.8 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 180.4 ns   (179.9 ns .. 180.9 ns)
std dev              1.542 ns   (1.294 ns .. 2.017 ns)

Without optimizations:

benchmarked folds/unfoldrN/1
time                 35.54 ns   (35.46 ns .. 35.63 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 35.49 ns   (35.46 ns .. 35.53 ns)
std dev              119.9 ps   (101.0 ps .. 147.3 ps)

benchmarked folds/unfoldrN/2
time                 35.68 ns   (35.61 ns .. 35.75 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 35.61 ns   (35.59 ns .. 35.64 ns)
std dev              94.22 ps   (80.22 ps .. 111.7 ps)

benchmarked folds/unfoldrN/4
time                 35.61 ns   (35.55 ns .. 35.67 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 35.64 ns   (35.62 ns .. 35.67 ns)
std dev              89.30 ps   (73.96 ps .. 109.5 ps)

benchmarked folds/unfoldrN/8
time                 35.48 ns   (35.43 ns .. 35.53 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 35.60 ns   (35.57 ns .. 35.63 ns)
std dev              112.2 ps   (92.34 ps .. 138.5 ps)

benchmarked folds/unfoldrN/16
time                 35.59 ns   (35.56 ns .. 35.63 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 35.67 ns   (35.65 ns .. 35.70 ns)
std dev              80.24 ps   (67.84 ps .. 96.71 ps)

benchmarked folds/unfoldrN/32
time                 36.26 ns   (35.89 ns .. 36.83 ns)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 35.86 ns   (35.77 ns .. 36.04 ns)
std dev              420.6 ps   (153.1 ps .. 813.5 ps)

benchmarked folds/unfoldrN/64
time                 36.36 ns   (35.02 ns .. 37.62 ns)
                     0.981 R²   (0.963 R² .. 0.993 R²)
mean                 39.07 ns   (37.94 ns .. 40.95 ns)
std dev              4.470 ns   (2.997 ns .. 6.416 ns)
variance introduced by outliers: 68% (severely inflated)

benchmarked folds/unfoldrN/128
time                 38.39 ns   (37.48 ns .. 40.47 ns)
                     0.974 R²   (0.930 R² .. 1.000 R²)
mean                 38.18 ns   (37.69 ns .. 39.69 ns)
std dev              2.699 ns   (1.077 ns .. 5.350 ns)
variance introduced by outliers: 45% (moderately inflated)

benchmarked folds/unfoldrN/256
time                 39.16 ns   (39.05 ns .. 39.34 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 39.18 ns   (39.12 ns .. 39.29 ns)
std dev              267.3 ps   (163.9 ps .. 412.3 ps)

benchmarked folds/unfoldrN/512
time                 41.50 ns   (41.39 ns .. 41.67 ns)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 41.49 ns   (41.45 ns .. 41.56 ns)
std dev              174.3 ps   (110.7 ps .. 271.6 ps)

benchmarked folds/unfoldrN/1024
time                 52.84 ns   (51.05 ns .. 54.66 ns)
                     0.995 R²   (0.991 R² .. 0.999 R²)
mean                 49.81 ns   (49.50 ns .. 50.50 ns)
std dev              1.419 ns   (854.8 ps .. 2.619 ns)
variance introduced by outliers: 11% (moderately inflated)

benchmarked folds/unfoldrN/2048
time                 66.64 ns   (64.77 ns .. 68.89 ns)
                     0.995 R²   (0.992 R² .. 0.998 R²)
mean                 66.67 ns   (66.02 ns .. 67.29 ns)
std dev              2.273 ns   (2.018 ns .. 2.607 ns)
variance introduced by outliers: 15% (moderately inflated)

benchmarked folds/unfoldrN/4096
time                 73.68 ns   (71.48 ns .. 75.94 ns)
                     0.996 R²   (0.994 R² .. 0.998 R²)
mean                 70.13 ns   (69.54 ns .. 71.00 ns)
std dev              2.253 ns   (1.625 ns .. 2.894 ns)
variance introduced by outliers: 15% (moderately inflated)

benchmarked folds/unfoldrN/8192
time                 76.34 ns   (75.73 ns .. 76.91 ns)
                     0.998 R²   (0.996 R² .. 0.999 R²)
mean                 81.97 ns   (80.98 ns .. 83.08 ns)
std dev              3.619 ns   (3.212 ns .. 4.185 ns)
variance introduced by outliers: 25% (moderately inflated)

benchmarked folds/unfoldrN/16384
time                 93.32 ns   (92.71 ns .. 93.97 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 93.63 ns   (93.38 ns .. 94.03 ns)
std dev              1.083 ns   (792.7 ps .. 1.563 ns)

benchmarked folds/unfoldrN/32768
time                 126.5 ns   (125.9 ns .. 127.1 ns)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 126.8 ns   (126.5 ns .. 127.4 ns)
std dev              1.464 ns   (1.091 ns .. 2.295 ns)

benchmarked folds/unfoldrN/65536
time                 193.6 ns   (192.7 ns .. 194.8 ns)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 195.0 ns   (194.3 ns .. 196.0 ns)
std dev              2.865 ns   (1.599 ns .. 4.461 ns)

@@ -416,6 +416,8 @@ main = do
nf (S.foldl' (\acc x -> acc + fromIntegral x) (0 :: Int)) s) foldInputs
, bgroup "foldr'" $ map (\s -> bench (show $ S.length s) $
nf (S.foldr' (\x acc -> fromIntegral x + acc) (0 :: Int)) s) foldInputs
, bgroup "unfoldrN" $ map (\s -> bench (show $ S.length s) $
nf (S.unfoldrN (S.length s) (\_ -> Nothing) ) s) foldInputs
Contributor

This is a strange choice of benchmark: because the generator returns Nothing immediately, you are essentially building an empty string and never invoke the inner loop of unfoldrN at all. Please put something like:

nf (S.unfoldrN (S.length s) (\a -> Just (a, a + 1))) 0
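
To make the concern concrete, here is a small sketch comparing the two kinds of generator. The names stopImmediately and countUp are hypothetical, introduced only for illustration. With a generator that returns Nothing on the first call, unfoldrN produces no bytes regardless of the requested length, so a benchmark built on it mostly measures allocation and trimming rather than the loop.

```haskell
module Main (main) where

import qualified Data.ByteString as S
import Data.Word (Word8)

-- Hypothetical generator mirroring the original benchmark:
-- it stops before producing any byte.
stopImmediately :: () -> Maybe (Word8, ())
stopImmediately _ = Nothing

-- Hypothetical generator in the spirit of the review suggestion:
-- it always yields a byte and a new seed (wrapping around at 255).
countUp :: Word8 -> Maybe (Word8, Word8)
countUp a = Just (a, a + 1)

main :: IO ()
main = do
  let n = 8
  print (S.length (fst (S.unfoldrN n stopImmediately ())))  -- prints 0
  print (S.length (fst (S.unfoldrN n countUp 0)))           -- prints 8
```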

@Bodigrim
Contributor

Bodigrim commented Feb 9, 2021

Please rebase your branch on top of master. Then rerun the benchmarks as follows:

git checkout <state with benchmarks but without other changes>
cabal bench --benchmark-options '--csv baseline.csv -p folds.unfoldrN'
git checkout <your-branch>
cabal bench --benchmark-options '--baseline baseline.csv -p folds.unfoldrN'
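
For context, a standalone benchmark that the -p folds.unfoldrN pattern would select might look roughly like the sketch below. It assumes the tasty-bench package (its Test.Tasty.Bench module provides defaultMain, bgroup, bench and nf), and foldInputs here is a hypothetical stand-in for the inputs defined in bench/BenchAll.hs, chosen so the sizes run from 1 to 65536 as in the reports in this thread.

```haskell
module Main (main) where

import qualified Data.ByteString as S
import Test.Tasty.Bench (bench, bgroup, defaultMain, nf)

-- Hypothetical stand-in for the inputs used in bench/BenchAll.hs:
-- zero-filled strings of length 1, 2, 4, ..., 65536.
foldInputs :: [S.ByteString]
foldInputs = [S.replicate (2 ^ k) 0 | k <- [0 .. 16 :: Int]]

main :: IO ()
main = defaultMain
  [ bgroup "folds"
      [ bgroup "unfoldrN" $ map (\s -> bench (show $ S.length s) $
          -- The generator always yields a byte, so the inner loop runs
          -- S.length s times for each input.
          nf (S.unfoldrN (S.length s) (\a -> Just (a, a + 1))) 0) foldInputs
      ]
  ]
```

Running the same benchmark once with --csv on the baseline commit and once with --baseline on the branch is what yields the "faster than baseline" annotations seen in the later reports.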

@Bodigrim
Contributor

@SkamDart could you please rebase your branch?

@SkamDart SkamDart force-pushed the SkamDart/SAT-unfoldrN branch from 33dbbd5 to 9f73055 Compare March 1, 2021 07:25
@sjakobi
Member

sjakobi commented Mar 11, 2021

Please rebase your branch on top of master. Then rerun the benchmarks as follows:

git checkout <state with benchmarks but without other changes>
cabal bench --benchmark-options '--csv baseline.csv -p folds.unfoldrN'
git checkout <your-branch>
cabal bench --benchmark-options '--baseline baseline.csv -p folds.unfoldrN'

Using my suggested benchmark fix:

    unfoldrN
      1:     OK (0.15s)
         61 ns ± 5.4 ns, 18% faster than baseline
      2:     OK (0.67s)
         80 ns ± 1.4 ns, 15% faster than baseline
      4:     OK (0.26s)
        119 ns ± 5.4 ns, 10% faster than baseline
      8:     OK (0.22s)
        205 ns ±  18 ns
      16:    OK (0.38s)
        353 ns ±  10 ns
      32:    OK (0.35s)
        648 ns ±  21 ns
      64:    OK (0.15s)
        1.2 μs ± 118 ns
      128:   OK (0.16s)
        2.4 μs ± 185 ns
      256:   OK (0.16s)
        4.8 μs ± 353 ns
      512:   OK (0.32s)
        9.5 μs ± 354 ns
      1024:  OK (0.32s)
         19 μs ± 689 ns
      2048:  OK (0.32s)
         38 μs ± 1.4 μs,  4% faster than baseline
      4096:  OK (0.30s)
         76 μs ± 2.8 μs
      8192:  OK (0.16s)
        152 μs ±  11 μs
      16384: OK (0.30s)
        305 μs ±  12 μs
      32768: OK (0.32s)
        623 μs ±  22 μs,  4% faster than baseline
      65536: OK (0.32s)
        1.2 ms ±  49 μs

So there's hardly an effect, but it doesn't seem to hurt either.

@Bodigrim
Contributor

@SkamDart are you still interested in carrying on with this PR?

@SkamDart SkamDart force-pushed the SkamDart/SAT-unfoldrN branch from 9f73055 to 3c7f9cc Compare May 26, 2021 08:39
@SkamDart
Contributor Author

@Bodigrim appreciate your patience through the radio silence on my end.
Here are the results of the benchmark on my machine.

    unfoldrN
      1:     OK (0.52s)
         60 ns ± 3.1 ns, 34% faster than baseline
      2:     OK (0.36s)
         81 ns ± 4.8 ns, 31% faster than baseline
      4:     OK (0.27s)
        123 ns ± 7.2 ns, 26% faster than baseline
      8:     OK (0.13s)
        232 ns ±  22 ns,  9% faster than baseline
      16:    OK (0.43s)
        410 ns ±  26 ns, 15% faster than baseline
      32:    OK (0.78s)
        738 ns ±  43 ns, 11% faster than baseline
      64:    OK (0.20s)
        1.5 μs ± 124 ns
      128:   OK (0.20s)
        3.0 μs ± 201 ns
      256:   OK (12.91s)
        6.1 μs ± 110 ns
      512:   OK (0.38s)
         11 μs ± 366 ns, 11% faster than baseline
      1024:  OK (0.39s)
         23 μs ± 1.3 μs
      2048:  OK (0.34s)
         40 μs ± 2.1 μs, 16% faster than baseline
      4096:  OK (0.16s)
         78 μs ± 6.8 μs, 18% faster than baseline
      8192:  OK (0.33s)
        158 μs ±  10 μs, 14% faster than baseline
      16384: OK (0.68s)
        333 μs ±  23 μs, 15% faster than baseline
      32768: OK (2.76s)
        655 μs ±  10 μs,  9% faster than baseline
      65536: OK (0.16s)
        1.3 ms ± 114 μs

Member

@sjakobi sjakobi left a comment

Thanks! :)

@Bodigrim Bodigrim linked an issue May 26, 2021 that may be closed by this pull request
@Bodigrim Bodigrim merged commit a4b3781 into haskell:master May 26, 2021
Bodigrim pushed a commit to Bodigrim/bytestring that referenced this pull request Jun 10, 2021
* appropriate unfoldrN benchmark

* Hand SAT optimization for Data.ByteString.unfoldrN

* inline createAndTrim'
noughtmare pushed a commit to noughtmare/bytestring that referenced this pull request Dec 12, 2021
* appropriate unfoldrN benchmark

* Hand SAT optimization for Data.ByteString.unfoldrN

* inline createAndTrim'
@Bodigrim Bodigrim added this to the 0.11.2.0 milestone May 4, 2022
Development

Successfully merging this pull request may close these issues.

Benchmark unfoldrN and inline createAndTrim'
3 participants