This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Vectorise BitArray #41896

Merged (6 commits) on Nov 7, 2019

@Gnbrkm41 Gnbrkm41 commented Oct 18, 2019

Fixes https://github.com/dotnet/corefx/issues/41762 and https://github.com/dotnet/corefx/issues/37946
Related #39173

This PR continues from the previous PR from @BruceForstall (#39173) in an attempt to speed up various operations of BitArray by vectorisation and using AVX2 256-bit wide instructions.

The performance differences compared to before the optimizations were applied are as follows, when operating on arrays of size 4/512/32768 (Threshold 5%):

summary:
better: 17, geomean: 3.773
worse: 3, geomean: 2.094
total diff: 20

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4) | 4.76 | 0.78 | 3.72 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 4) | 1.79 | 0.92 | 1.64 | |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 4) | 1.08 | 9.83 | 10.61 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 32768) | 88.64 | 116382.63 | 1312.93 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 32768) | 33.88 | 323823.66 | 9557.69 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 512) | 25.40 | 5096.80 | 200.63 | |
| System.Collections.Tests.Perf_BitArray.BitArrayBoolArrayCtor(Size: 512) | 14.80 | 407.56 | 27.54 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 32768) | 7.83 | 3679.64 | 469.90 | |
| System.Collections.Tests.Perf_BitArray.BitArrayNot(Size: 512) | 6.55 | 61.56 | 9.39 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToBoolArray(Size: 4) | 1.98 | 71.92 | 36.26 | |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 512) | 1.82 | 19.69 | 10.83 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512) | 1.75 | 53.17 | 30.31 | |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 512) | 1.66 | 17.88 | 10.78 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32768) | 1.63 | 3078.29 | 1883.64 | |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 512) | 1.55 | 17.15 | 11.06 | |
| System.Collections.Tests.Perf_BitArray.BitArrayAnd(Size: 32768) | 1.49 | 1259.86 | 847.41 | |
| System.Collections.Tests.Perf_BitArray.BitArrayXor(Size: 32768) | 1.48 | 1261.98 | 853.47 | |
| System.Collections.Tests.Perf_BitArray.BitArrayOr(Size: 32768) | 1.47 | 1238.11 | 842.76 | |
| System.Collections.Tests.Perf_BitArray.BitArrayLeftShift(Size: 4) | 1.16 | 4.63 | 3.98 | |
| System.Collections.Tests.Perf_BitArray.BitArrayCopyToByteArray(Size: 4) | 1.10 | 23.14 | 21.08 | |

Regarding the slowdown of BitArraySetAll, I have re-run the benchmarks with various sizes to see at which point the new implementation outruns the current implementation.
(Threshold 5%)

summary:
better: 6, geomean: 1.345
worse: 2, geomean: 1.729
total diff: 8

| Slower | diff/base | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 4) | 2.69 | 1.39 | 3.74 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 16) | 1.11 | 3.16 | 3.50 | |

| Faster | base/diff | Base Median (ns) | Diff Median (ns) | Modality |
| --- | ---: | ---: | ---: | --- |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 512) | 1.76 | 53.79 | 30.51 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 64) | 1.35 | 7.03 | 5.23 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 96) | 1.31 | 9.52 | 7.25 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 128) | 1.30 | 11.93 | 9.21 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 256) | 1.22 | 22.09 | 18.09 | |
| System.Collections.Tests.Perf_BitArray.BitArraySetAll(Size: 32) | 1.20 | 4.08 | 3.40 | |

This suggests that the new implementation may be faster when filling a BitArray that contains more than 32 elements. One thing to note, though, is that the numbers for small sizes seem to fluctuate, so I suppose the results may be inaccurate. (Even for this benchmark, I would expect the numbers to be similar for sizes 4/16/32, since they are all stored in one int and should therefore be just a single copy of an int; but they all give different results.)

Furthermore, since the current implementation of SetAll operates on the whole backing array, it may copy unnecessarily into the unused region when BitArray.Length has been set to make the BitArray smaller but the backing array hasn't been resized because the new length does not meet the _ShrinkThreshold (in int counts):

private const int _ShrinkThreshold = 256;

The new implementation uses the GetInt32ArrayLengthFromBitLength(Length) method to calculate the extent of the used area and only copies to that region. Unfortunately, since this happens for smaller arrays as well, this check by itself seems to result in approximately 0.7x slowdown when the array has fewer than 32 elements.
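The idea of filling only the used region can be sketched roughly as below. This is a minimal illustration, not the merged code: `SetAllSketch`, `Int32LengthFromBitLength` and the standalone `SetAll(int[], int, bool)` signature are hypothetical stand-ins for the private members of BitArray, and the handling of stray bits in the last int is deliberately left out.

```csharp
using System;

// Hypothetical stand-in for the relevant part of BitArray; names are assumptions.
static class SetAllSketch
{
    private const int BitsPerInt32 = 32;

    // Number of ints needed to hold bitLength bits, rounding up.
    // The cast to uint keeps the result correct even when the addition wraps around.
    public static int Int32LengthFromBitLength(int bitLength)
    {
        if (bitLength < 0)
            throw new ArgumentOutOfRangeException(nameof(bitLength));
        return (int)((uint)(bitLength - 1 + BitsPerInt32) >> 5);
    }

    // Fills only the ints that back the first bitLength bits,
    // leaving the rest of the (possibly oversized) backing array untouched.
    public static void SetAll(int[] backing, int bitLength, bool value)
    {
        int used = Int32LengthFromBitLength(bitLength);
        backing.AsSpan(0, used).Fill(value ? -1 : 0);
    }
}
```

For example, with a four-int backing array and a logical length of 33 bits, only the first two ints are written. The extra length computation is the fixed cost that shows up in the small-size benchmarks above.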

Regarding the use of AVX2, I found that AVX2 generally improved performance despite the concerns about downclocking. Below is an example comparison between the various code paths for BitArray.CopyTo(Array, int) with bool arrays (see https://github.com/dotnet/corefx/issues/41762#issuecomment-542658154 and https://github.com/dotnet/corefx/issues/41762#issuecomment-542831649 for benchmarks of And/Or/Xor/Not and BitArray(bool[])):

// * Summary *
BenchmarkDotNet=v0.11.5.1159-nightly, OS=Windows 10.0.18999
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-alpha1-014899
  [Host]              : .NET Core 5.0.0-alpha1.19507.3 (CoreCLR 5.0.19.50101, CoreFX 5.0.19.50407), X64 RyuJIT
  Job-VRDFCM          : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  AVX2 Disabled       : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
  Intrinsics Disabled : .NET Core ? (CoreCLR 5.0.19.51405, CoreFX 5.0.19.51801), X64 RyuJIT
| Method | Job | EnvironmentVariables | PowerPlanMode | Toolchain | IterationTime | MaxIterationCount | MinIterationCount | WarmupCount | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | --- | --- | --- | --- |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 4 | 35.90 ns | 0.218 ns | 0.203 ns | 35.84 ns | 35.60 ns | 36.26 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 4 | 35.90 ns | 0.142 ns | 0.132 ns | 35.89 ns | 35.65 ns | 36.13 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 4 | 73.08 ns | 0.333 ns | 0.311 ns | 73.06 ns | 72.56 ns | 73.76 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 512 | 191.24 ns | 1.056 ns | 0.988 ns | 191.25 ns | 189.72 ns | 192.84 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 512 | 220.78 ns | 0.819 ns | 0.766 ns | 220.75 ns | 219.00 ns | 221.94 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 512 | 5,221.48 ns | 16.221 ns | 12.664 ns | 5,227.59 ns | 5,202.04 ns | 5,235.70 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Default | Empty | 00000000-0000-0000-0000-000000000000 | CoreRun | 250.0000 ms | 20 | 15 | 1 | 32768 | 9,776.74 ns | 89.691 ns | 74.896 ns | 9,756.03 ns | 9,696.10 ns | 9,976.58 ns | - | - | - | - |
| BitArrayCopyToBoolArray | AVX2 Disabled | COMPlus_EnableAVX2=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 32768 | 11,746.34 ns | 50.812 ns | 42.431 ns | 11,735.14 ns | 11,679.19 ns | 11,834.34 ns | - | - | - | - |
| BitArrayCopyToBoolArray | Intrinsics Disabled | COMPlus_EnableHWIntrinsic=0 | 8c5e7fda-e8bf-4a96-9a85-a6e23a8c635c | Before | Default | Default | Default | Default | 32768 | 331,026.89 ns | 936.011 ns | 781.612 ns | 331,110.79 ns | 330,069.53 ns | 332,353.66 ns | - | - | - | - |


// The extracted bits can be anywhere between 0 and 255, so we normalise the value to either 0 or 1
// to ensure compatibility with "C# bool" (0 for false, 1 for true, rest undefined)
Vector256<byte> normalized = Avx2.Min(extracted, ones);

It seems you don't do this kind of normalization for the BitArray(bool[]) constructor.

@Gnbrkm41 Gnbrkm41 Oct 18, 2019

This is handled by comparing the bytes with zero (checking if the bytes are false) then negating the result: 72477e7#diff-e2f01cf03382b7d63fc3a67ad77fcedcR140-R142
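A rough sketch of the compare-with-zero approach described in this reply, using 128-bit SSE2 for brevity, might look like the following. It is only an illustration under stated assumptions: `BoolPackingSketch` and `Pack16` are hypothetical names, the input is taken as raw bytes rather than `bool[]`, and the scalar branch is there so the sketch also runs where SSE2 is unavailable.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical helper; not part of BitArray.
static class BoolPackingSketch
{
    // Packs the first 16 bytes into 16 bits: bit k is set when bytes[k] != 0.
    // Any non-zero byte counts as true, so no separate normalisation step is needed.
    public static int Pack16(ReadOnlySpan<byte> bytes)
    {
        if (bytes.Length < 16)
            throw new ArgumentException("Need at least 16 bytes.", nameof(bytes));

        if (Sse2.IsSupported)
        {
            Vector128<byte> v = MemoryMarshal.Read<Vector128<byte>>(bytes);
            // 0xFF in every lane whose byte is zero (that is, false).
            Vector128<byte> isFalse = Sse2.CompareEqual(v, Vector128<byte>.Zero);
            // One bit per lane, set where the byte was zero.
            int falseMask = Sse2.MoveMask(isFalse);
            // Negate to get the "true" bits and keep only the 16 lanes.
            return ~falseMask & 0xFFFF;
        }

        // Scalar fallback with the same semantics.
        int result = 0;
        for (int k = 0; k < 16; k++)
        {
            if (bytes[k] != 0)
                result |= 1 << k;
        }
        return result;
    }
}
```

With the input bytes 1, 0, 255, 0, 2 followed by zeros, bits 0, 2 and 4 are set, so the result is 21 regardless of which non-zero values represented true.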

{
    for (; (i + Vector256<byte>.Count) <= m_length; i += Vector256<byte>.Count)
    {
        int bits = m_array[i / BitsPerInt32];

Again, you can load m_array as a Vector and spawn vectors for each integer

@adamsitnik adamsitnik added the tenet-performance Performance related issue label Oct 18, 2019
@@ -275,16 +309,34 @@ public unsafe BitArray And(BitArray value)
if (Length != value.Length || (uint)count > (uint)thisArray.Length || (uint)count > (uint)valueArray.Length)
throw new ArgumentException(SR.Arg_ArrayLengthsDiffer);

// Unroll loop for count less than Vector256 size.

After the vectorized version there's a sequential loop to process the remaining elements. Why not jump to this switch instead of the loop?

(Of course, keep the loop if no Avx2 or Sse2 is available.)
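One way to read this suggestion is sketched below. It is an assumption-laden illustration rather than the code that was merged: `AndTailSketch` is a hypothetical type, the spans stand in for the backing arrays, and `MemoryMarshal.Cast` is used instead of pointers so that the sketch compiles without unsafe code.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

// Hypothetical sketch of an And with an unrolled tail; not the BitArray implementation.
static class AndTailSketch
{
    public static void And(Span<int> left, ReadOnlySpan<int> right)
    {
        if (left.Length != right.Length)
            throw new ArgumentException("Array lengths differ.");

        int i = 0;

        if (Avx2.IsSupported)
        {
            // View whole groups of 8 ints as 256-bit vectors; the tail is handled below.
            Span<Vector256<int>> l = MemoryMarshal.Cast<int, Vector256<int>>(left);
            ReadOnlySpan<Vector256<int>> r = MemoryMarshal.Cast<int, Vector256<int>>(right);
            for (int v = 0; v < l.Length; v++)
                l[v] = Avx2.And(l[v], r[v]);
            i = l.Length * Vector256<int>.Count;
        }

        // Scalar loop kept for hardware without AVX2.
        // After the vector path at most 7 ints remain, so it does not run there.
        for (; left.Length - i > 7; i++)
            left[i] &= right[i];

        // Unrolled tail: each case falls through to the next smaller one.
        switch (left.Length - i)
        {
            case 7: left[i + 6] &= right[i + 6]; goto case 6;
            case 6: left[i + 5] &= right[i + 5]; goto case 5;
            case 5: left[i + 4] &= right[i + 4]; goto case 4;
            case 4: left[i + 3] &= right[i + 3]; goto case 3;
            case 3: left[i + 2] &= right[i + 2]; goto case 2;
            case 2: left[i + 1] &= right[i + 1]; goto case 1;
            case 1: left[i] &= right[i]; break;
        }
    }
}
```

Whether the switch actually beats a short loop for the tail would need to be measured; the benchmarks in this PR do not isolate it.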

BruceForstall and others added 5 commits November 5, 2019 00:39
1. Use AVX2, if available, for And/Or/Xor
2. Vectorize Not
3. Use Span<T>.Fill() for SetAll()
4. Add more test sizes to account for And/Or/Xor/Not loop unrolling cases
* Fix bugs present in BitArray(bool[])
* Vectorise CopyTo(Array, int) when copying to a bool[]
* Add test data for random values & larger array
* Use Vector128/256.Create and store it in static readonly field instead of loading from PE header
@danmoseley

@Gnbrkm41 thanks for your work on this PR. As you probably saw, this repo will move to a new one, so we are hoping to finish as many active PRs as possible by 11/13 so they don't have to be manually re-created. Are you able to keep moving this along?

Gnbrkm41 commented Nov 5, 2019

I'll push a few commits today; I've been investigating whether fetching and storing int elements in bulk using vector instructions is worth it, per @EgorBo's comment, but it feels like either my code is not good enough or it isn't worth it, because the results seemed worse than the current version.

I think it'll be fine to get this merged with the current logic (that is, after I push my commits). Alternatively I'm also fine with digging more into it then just re-opening the PR after the consolidation.

@adamsitnik

@tannergooding @BruceForstall could you please take a look? I would love to merge it before repo consolidation

@tannergooding tannergooding left a comment

Overall looks good/correct to me.

@maryamariyan

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your corefx repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/corefx <path to the patch created in step 1>

@adamsitnik adamsitnik merged commit a4f0447 into dotnet:master Nov 7, 2019
@adamsitnik

@Gnbrkm41 thank you!
