diff --git a/EigenRand/doc.h b/EigenRand/doc.h index 1f725ac..0e7e682 100644 --- a/EigenRand/doc.h +++ b/EigenRand/doc.h @@ -12,7 +12,7 @@ You can get 5~10 times speed by just replacing old Eigen's Random or unvectorizable c++11 random number generators with EigenRand. - EigenRand currently supports only x86-64 architecture (SSE, AVX, AVX2) and ARM64 NEON (experimental). + EigenRand currently supports only x86-64 architecture (SSE, AVX, AVX2) and ARM64 NEON. EigenRand is distributed under the MIT License. @@ -264,136 +264,37 @@ * * @page performance Performance - * The following charts show the relative speed-up of EigenRand compared to Reference(C++ std or Eigen functions). Detailed results are below the charts. - - @section performance_1 Overview of Results at x86-64 Architecture - - \image html perf_no_vect.png - - \image html perf_sse2.png - - \image html perf_avx.png - - \image html perf_avx2.png - - \image html perf_mv_part1.png - - \image html perf_mv_part2.png - - * The following result is a measure of the time in seconds it takes to generate 1M random numbers. It shows the average of 20 times. - - @section performance_2 Overview of Results at ARM64 NEON (experimental) - - \image html perf_neon_v0.3.90.png - - \image html perf_mv_part1_neon_v0.3.90.png - - \image html perf_mv_part2_neon_v0.3.90.png - - * The following result is a measure of the time in seconds it takes to generate 1M random numbers. It shows the average of 20 times. - - @section performance_3 Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz (Ubuntu 16.04, gcc7.5) - -| | C++ std (or Eigen) | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| `balanced`* | 9.0 | 5.9 | 1.5 | 1.4 | 1.3 | 0.9 | -| `balanced`(double)* | 8.7 | 6.4 | 3.3 | 2.9 | 1.7 | 1.7 | -| `binomial(20, 0.5)` | 400.8 | 118.5 | 32.7 | 36.6 | 30.0 | 22.7 | -| `binomial(50, 0.01)` | 71.7 | 22.5 | 7.7 | 8.3 | 7.9 | 6.6 | -| `binomial(100, 0.75)` | 340.5 | 454.5 | 91.7 | 111.5 | 106.3 | 86.4 | -| `cauchy` | 36.1 | 54.4 | 6.1 | 7.1 | 4.7 | 3.9 | -| `chiSquared` | 80.5 | 249.5 | 64.6 | 58.0 | 29.4 | 28.8 | -| `discrete`(int32) | - | 14.0 | 2.9 | 2.6 | 2.4 | 1.7 | -| `discrete`(fp32) | - | 21.9 | 4.3 | 4.0 | 3.6 | 3.0 | -| `discrete`(fp64) | 72.4 | 21.4 | 6.9 | 6.5 | 4.9 | 3.7 | -| `exponential` | 31.0 | 25.3 | 5.5 | 5.3 | 3.3 | 2.9 | -| `extremeValue` | 66.0 | 60.1 | 11.9 | 10.7 | 6.5 | 5.8 | -| `fisherF(1, 1)` | 178.1 | 35.1 | 33.2 | 39.3 | 22.9 | 18.7 | -| `fisherF(5, 5)` | 141.8 | 415.2 | 136.47 | 172.4 | 92.4 | 74.9 | -| `gamma(0.2, 1)` | 207.8 | 211.4 | 54.6 | 51.2 | 26.9 | 27.0 | -| `gamma(5, 3)` | 80.9 | 60.0 | 14.3 | 13.3 | 11.4 | 8.0 | -| `gamma(10.5, 1)` | 81.1 | 248.6 | 63.3 | 58.5 | 29.2 | 28.4 | -| `geometric` | 43.0 | 22.4 | 6.7 | 7.4 | 5.8 | | -| `lognormal` | 66.3 | 55.4 | 12.8 | 11.8 | 6.2 | 6.2 | -| `negativeBinomial(10, 0.5)` | 312.0 | 301.4 | 82.9 | 100.6 | 95.3 | 77.9 | -| `negativeBinomial(20, 0.25)` | 483.4 | 575.9 | 125.0 | 158.2 | 148.4 | 119.5 | -| `normal(0, 1)` | 38.1 | 28.5 | 6.8 | 6.2 | 3.8 | 3.7 | -| `normal(2, 3)` | 37.6 | 29.0 | 7.3 | 6.6 | 4.0 | 3.9 | -| `poisson(1)` | 31.8 | 25.2 | 9.8 | 10.8 | 9.7 | 8.2 | -| `poisson(16)` | 231.8 | 274.1 | 66.2 | 80.7 | 74.4 | 64.2 | -| `randBits` | 5.2 | 5.4 | 1.4 | 1.3 | 1.1 | 1.0 | -| `studentT(1)` | 122.7 | 120.1 | 15.3 | 19.2 | 12.6 | 9.4 | -| `studentT(20)` | 102.2 | 111.1 | 15.4 | 19.2 | 12.2 | 9.4 | -| `uniformInt(0~63)` | 22.4 | 4.7 | 1.7 | 1.6 | 1.4 | 1.1 | -| `uniformInt(0~100k)` | 21.8 | 10.1 | 6.2 | 6.7 | 6.6 | 5.4 | -| `uniformReal` | 12.9 | 5.7 | 1.4 | 1.2 | 1.4 | 0.7 | -| `weibull` | 41.0 | 35.8 | 17.7 | 15.5 | 8.5 | 8.5 | - -* Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. - -| | C++ std | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| Mersenne Twister(int32) | 4.7 | 5.6 | 4.0 | 3.7 | 3.5 | 3.6 | -| Mersenne Twister(int64) | 5.4 | 5.3 | 4.0 | 3.9 | 3.4 | 2.6 | - -| | Python 3.6 + scipy 1.5.2 + numpy 1.19.2 | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| `Dirichlet(4)` | 6.47 | 6.60 | 2.39 | 2.49 | 1.34 | 1.67 | -| `Dirichlet(100)` | 75.95 | 189.97 | 66.60 | 72.11 | 38.86 | 34.98 | -| `InvWishart(4)` | 140.18 | 7.62 | 4.21 | 4.54 | 3.58 | 3.39 | -| `InvWishart(50)` | 1510.47 | 1737.4 | 697.39 | 733.69 | 604.59 | 554.006 | -| `Multinomial(4, t=20)` | 3.32 | 4.12 | 0.95 | 1.06 | 1.00 | 1.03 | -| `Multinomial(4, t=1000)` | 3.51 | 192.51 | 35.99 | 39.58 | 27.84 | 35.45 | -| `Multinomial(100, t=20)` | 69.19 | 4.80 | 2.00 | 2.20 | 2.28 | 2.09 | -| `Multinomial(100, t=1000)` | 139.74 | 179.43 | 49.48 | 56.19 | 40.78 | 43.18 | -| `MvNormal(4)` | 2.32 | 0.96 | 0.36 | 0.37 | 0.25 | 0.30 | -| `MvNormal(100)` | 49.09 | 57.18 | 17.17 | 18.51 | 10.82 | 11.03 | -| `Wishart(4)` | 71.19 | 5.28 | 2.70 | 2.93 | 2.04 | 1.94 | -| `Wishart(50)` | 1185.26 | 1360.49 | 492.91 | 517.44 | 359.03 | 324.60 | - - @section performance_4 AMD Ryzen 7 3700x CPU @ 3.60GHz (Windows 10, MSVC2017) - -| | C++ std (or Eigen) | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:| -| `balanced`* | 20.8 | 1.9 | 2.0 | 1.4 | -| `balanced`(double)* | 21.7 | 4.1 | 2.7 | 3.0 | -| `binomial(20, 0.5)` | 416.0 | 27.7 | 28.9 | 29.1 | -| `binomial(50, 0.01)` | 37.8 | 6.3 | 6.0 | 6.6 | -| `binomial(100, 0.75)` | 309.1 | 72.4 | 66.0 | 67.0 | -| `cauchy` | 42.2 | 4.8 | 5.1 | 2.7 | -| `chiSquared` | 153.8 | 33.5 | 21.2 | 17.0 | -| `discrete`(int32) | - | 2.4 | 2.3 | 2.5 | -| `discrete`(fp32) | - | 2.6 | 2.3 | 3.5 | -| `discrete`(fp64) | 55.8 | 5.1 | 4.7 | 4.3 | -| `exponential` | 33.4 | 6.4 | 2.8 | 2.2 | -| `extremeValue` | 39.4 | 7.8 | 4.6 | 4.0 | -| `fisherF(1, 1)` | 103.9 | 25.3 | 14.9 | 11.7 | -| `fisherF(5, 5)` | 295.7 | 85.5 | 58.3 | 44.8 | -| `gamma(0.2, 1)` | 128.8 | 31.9 | 18.3 | 15.8 | -| `gamma(5, 3)` | 156.1 | 9.7 | 8.0 | 5.0 | -| `gamma(10.5, 1)` | 148.5 | 33.1 | 21.1 | 17.2 | -| `geometric` | 27.1 | 6.6 | 4.3 | 4.1 | -| `lognormal` | 104.0 | 6.6 | 4.7 | 3.5 | -| `negativeBinomial(10, 0.5)` | 462.1 | 60.0 | 56.4 | 58.6 | -| `negativeBinomial(20, 0.25)` | 357.6 | 84.5 | 80.6 | 78.4 | -| `normal(0, 1)` | 48.8 | 4.2 | 3.7 | 2.3 | -| `normal(2, 3)` | 48.8 | 4.5 | 3.8 | 2.4 | -| `poisson(1)` | 46.4 | 7.9 | 7.4 | 8.2 | -| `poisson(16)` | 192.4 | 43.2 | 40.4 | 40.9 | -| `randBits` | 4.2 | 1.7 | 1.5 | 1.8 | -| `studentT(1)` | 107.0 | 12.3 | 6.8 | 5.7 | -| `studentT(20)` | 107.1 | 12.3 | 6.8 | 5.8 | -| `uniformInt(0~63)` | 31.2 | 1.1 | 1.0 | 1.2 | -| `uniformInt(0~100k)` | 27.7 | 5.6 | 5.6 | 5.4 | -| `uniformReal` | 30.7 | 1.1 | 1.0 | 0.6 | -| `weibull` | 46.5 | 10.6 | 6.4 | 5.2 | - -* Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. - -| | C++ std | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:| -| Mersenne Twister(int32) | 5.0 | 3.4 | 3.4 | 3.3 | -| Mersenne Twister(int64) | 5.1 | 3.9 | 3.9 | 3.3 | + * The following charts show the relative speed-up of EigenRand compared to references(equivalent functions of C++ std or Eigen for univariate distributions and Scipy for multivariate distributions). + +Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. + +Cases filled with orange are generators that are slower than reference functions. + + @section performance_1 Windows 2019, MSVC 19.29.30147, Intel(R) Xeon(R) Platinum 8171M CPU, AVX2, Eigen 3.4.0 + + \image html perf_avx2_win.png width=80% + + \image html perf_avx2_win_mv1.png width=80% + + \image html perf_avx2_win_mv2.png width=80% + + @section performance_2 Ubuntu 18.04, gcc 7.5.0, Intel(R) Xeon(R) Platinum 8370C CPU, AVX2, Eigen 3.4.0 + + \image html perf_avx2_ubu.png width=80% + + \image html perf_avx2_ubu_mv1.png width=80% + + \image html perf_avx2_ubu_mv2.png width=80% + + @section performance_3 macOS Monterey 12.2.1, clang 13.1.6, Apple M1 Pro, NEON, Eigen 3.4.0 + + \image html perf_neon_mac.png width=80% + + \image html perf_neon_mac_mv1.png width=80% + + \image html perf_neon_mac_mv2.png width=80% + + You can see the detailed numerical values used to plot the above charts on the Action Results of GitHub repository. * */ diff --git a/README.md b/README.md index f062b99..660ac3a 100644 --- a/README.md +++ b/README.md @@ -105,259 +105,27 @@ https://bab2min.github.io/eigenrand/ | `Eigen::Rand::P8_mt19937_64` | a vectorized version of Mersenne Twister algorithm. Since it generates eight 64bit random integers simultaneously, the random values are the same regardless of architecture. | | ## Performance -The following charts show the relative speed-up of EigenRand compared to references(equivalent functions of C++ std or Eigen). - -![Perf_no_vect](/doxygen/images/perf_no_vect.png) -![Perf_no_vect](/doxygen/images/perf_sse2.png) -![Perf_no_vect](/doxygen/images/perf_avx.png) -![Perf_no_vect](/doxygen/images/perf_avx2.png) - -The following charts are about multivariate distributions. -![Perf_no_vect](/doxygen/images/perf_mv_part1.png) -![Perf_no_vect](/doxygen/images/perf_mv_part2.png) - - -The following result is a measure of the time in seconds it takes to generate 1M random numbers. -It shows the average of 20 times. - -### Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz (Ubuntu 16.04, gcc5.4) - -| | C++ std (or Eigen) | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| `balanced`* | 9.0 | 5.9 | 1.5 | 1.4 | 1.3 | 0.9 | -| `balanced`(double)* | 8.7 | 6.4 | 3.3 | 2.9 | 1.7 | 1.7 | -| `binomial(20, 0.5)` | 400.8 | 118.5 | 32.7 | 36.6 | 30.0 | 22.7 | -| `binomial(50, 0.01)` | 71.7 | 22.5 | 7.7 | 8.3 | 7.9 | 6.6 | -| `binomial(100, 0.75)` | 340.5 | 454.5 | 91.7 | 111.5 | 106.3 | 86.4 | -| `cauchy` | 36.1 | 54.4 | 6.1 | 7.1 | 4.7 | 3.9 | -| `chiSquared` | 80.5 | 249.5 | 64.6 | 58.0 | 29.4 | 28.8 | -| `discrete`(int32) | - | 14.0 | 2.9 | 2.6 | 2.4 | 1.7 | -| `discrete`(fp32) | - | 21.9 | 4.3 | 4.0 | 3.6 | 3.0 | -| `discrete`(fp64) | 72.4 | 21.4 | 6.9 | 6.5 | 4.9 | 3.7 | -| `exponential` | 31.0 | 25.3 | 5.5 | 5.3 | 3.3 | 2.9 | -| `extremeValue` | 66.0 | 60.1 | 11.9 | 10.7 | 6.5 | 5.8 | -| `fisherF(1, 1)` | 178.1 | 35.1 | 33.2 | 39.3 | 22.9 | 18.7 | -| `fisherF(5, 5)` | 141.8 | 415.2 | 136.47 | 172.4 | 92.4 | 74.9 | -| `gamma(0.2, 1)` | 207.8 | 211.4 | 54.6 | 51.2 | 26.9 | 27.0 | -| `gamma(5, 3)` | 80.9 | 60.0 | 14.3 | 13.3 | 11.4 | 8.0 | -| `gamma(10.5, 1)` | 81.1 | 248.6 | 63.3 | 58.5 | 29.2 | 28.4 | -| `geometric` | 43.0 | 22.4 | 6.7 | 7.4 | 5.8 | | -| `lognormal` | 66.3 | 55.4 | 12.8 | 11.8 | 6.2 | 6.2 | -| `negativeBinomial(10, 0.5)` | 312.0 | 301.4 | 82.9 | 100.6 | 95.3 | 77.9 | -| `negativeBinomial(20, 0.25)` | 483.4 | 575.9 | 125.0 | 158.2 | 148.4 | 119.5 | -| `normal(0, 1)` | 38.1 | 28.5 | 6.8 | 6.2 | 3.8 | 3.7 | -| `normal(2, 3)` | 37.6 | 29.0 | 7.3 | 6.6 | 4.0 | 3.9 | -| `poisson(1)` | 31.8 | 25.2 | 9.8 | 10.8 | 9.7 | 8.2 | -| `poisson(16)` | 231.8 | 274.1 | 66.2 | 80.7 | 74.4 | 64.2 | -| `randBits` | 5.2 | 5.4 | 1.4 | 1.3 | 1.1 | 1.0 | -| `studentT(1)` | 122.7 | 120.1 | 15.3 | 19.2 | 12.6 | 9.4 | -| `studentT(20)` | 102.2 | 111.1 | 15.4 | 19.2 | 12.2 | 9.4 | -| `uniformInt(0~63)` | 22.4 | 4.7 | 1.7 | 1.6 | 1.4 | 1.1 | -| `uniformInt(0~100k)` | 21.8 | 10.1 | 6.2 | 6.7 | 6.6 | 5.4 | -| `uniformReal` | 12.9 | 5.7 | 1.4 | 1.2 | 1.4 | 0.7 | -| `weibull` | 41.0 | 35.8 | 17.7 | 15.5 | 8.5 | 8.5 | +The following charts show the relative speed-up of EigenRand compared to references(equivalent functions of C++ std or Eigen for univariate distributions and Scipy for multivariate distributions). * Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. +* Cases filled with orange are generators that are slower than reference functions. -| | C++ std | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| Mersenne Twister(int32) | 4.7 | 5.6 | 4.0 | 3.7 | 3.5 | 3.6 | -| Mersenne Twister(int64) | 5.4 | 5.3 | 4.0 | 3.9 | 3.4 | 2.6 | - -| | Python 3.6 + scipy 1.5.2 + numpy 1.19.2 | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:|---:| -| `Dirichlet(4)` | 6.47 | 6.60 | 2.39 | 2.49 | 1.34 | 1.67 | -| `Dirichlet(100)` | 75.95 | 189.97 | 66.60 | 72.11 | 38.86 | 34.98 | -| `InvWishart(4)` | 140.18 | 7.62 | 4.21 | 4.54 | 3.58 | 3.39 | -| `InvWishart(50)` | 1510.47 | 1737.4 | 697.39 | 733.69 | 604.59 | 554.006 | -| `Multinomial(4, t=20)` | 3.32 | 4.12 | 0.95 | 1.06 | 1.00 | 1.03 | -| `Multinomial(4, t=1000)` | 3.51 | 192.51 | 35.99 | 39.58 | 27.84 | 35.45 | -| `Multinomial(100, t=20)` | 69.19 | 4.80 | 2.00 | 2.20 | 2.28 | 2.09 | -| `Multinomial(100, t=1000)` | 139.74 | 179.43 | 49.48 | 56.19 | 40.78 | 43.18 | -| `MvNormal(4)` | 2.32 | 0.96 | 0.36 | 0.37 | 0.25 | 0.30 | -| `MvNormal(100)` | 49.09 | 57.18 | 17.17 | 18.51 | 10.82 | 11.03 | -| `Wishart(4)` | 71.19 | 5.28 | 2.70 | 2.93 | 2.04 | 1.94 | -| `Wishart(50)` | 1185.26 | 1360.49 | 492.91 | 517.44 | 359.03 | 324.60 | - - -### Intel(R) Xeon(R) CPU E5-1650 v2 @ 3.50GHz (macOS 10.15, clang-1103) - -| | C++ std (or Eigen) | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | -|---|---:|---:|---:|---:|---:| -| `balanced`* | 6.5 | 7.3 | 1.1 | 1.4 | 1.1 | -| `balanced`(double)* | 6.6 | 7.5 | 2.6 | 3.3 | 2.4 | -| `binomial(20, 0.5)` | 38.8 | 164.9 | 27.7 | 29.3 | 24.9 | -| `binomial(50, 0.01)` | 21.9 | 27.6 | 6.6 | 7.0 | 6.3 | -| `binomial(100, 0.75)` | 52.2 | 421.9 | 93.6 | 94.8 | 89.1 | -| `cauchy` | 36.0 | 30.4 | 5.6 | 5.8 | 4.0 | -| `chiSquared` | 84.4 | 152.2 | 44.1 | 48.7 | 26.2 | -| `discrete`(int32) | - | 12.4 | 2.1 | 2.6 | 2.2 | -| `discrete`(fp32) | - | 23.2 | 3.4 | 3.7 | 3.4 | -| `discrete`(fp64) | 48.6 | 22.9 | 4.2 | 5.0 | 4.6 | -| `exponential` | 22.0 | 18.0 | 4.1 | 4.9 | 3.2 | -| `extremeValue` | 36.2 | 32.0 | 8.7 | 9.5 | 5.1 | -| `fisherF(1, 1)` | 158.2 | 73.1 | 32.3 | 32.1 | 18.1 | -| `fisherF(5, 5)` | 177.3 | 310.1 | 127.0 | 121.8 | 74.3 | -| `gamma(0.2, 1)` | 69.8 | 80.4 | 28.5 | 33.8 | 19.2 | -| `gamma(5, 3)` | 83.9 | 53.3 | 10.6 | 12.4 | 8.6 | -| `gamma(10.5, 1)` | 83.2 | 150.4 | 43.3 | 48.4 | 26.2 | -| `geometric` | 39.6 | 19.0 | 4.3 | 4.4 | 4.1 | -| `lognormal` | 43.8 | 40.7 | 9.0 | 10.8 | 5.7 | -| `negativeBinomial(10, 0.5)` | 217.4 | 274.8 | 71.6 | 73.7 | 68.2 | -| `negativeBinomial(20, 0.25)` | 192.9 | 464.9 | 112.0 | 111.5 | 105.7 | -| `normal(0, 1)` | 32.6 | 28.6 | 5.5 | 6.5 | 3.8 | -| `normal(2, 3)` | 32.9 | 30.5 | 5.7 | 6.7 | 3.9 | -| `poisson(1)` | 37.9 | 31.0 | 7.5 | 7.8 | 7.1 | -| `poisson(16)` | 92.4 | 243.3 | 55.6 | 57.7 | 53.7 | -| `randBits` | 6.5 | 6.5 | 1.1 | 1.3 | 1.1 | -| `studentT(1)` | 115.0 | 54.1 | 15.5 | 15.7 | 8.3 | -| `studentT(20)` | 121.2 | 53.8 | 15.8 | 16.0 | 8.2 | -| `uniformInt(0~63)` | 20.2 | 9.8 | 1.8 | 1.8 | 1.6 | -| `uniformInt(0~100k)` | 25.7 | 16.1 | 8.1 | 8.5 | 7.2 | -| `uniformReal` | 12.7 | 7.0 | 1.0 | 1.2 | 1.1 | -| `weibull` | 23.1 | 19.2 | 11.6 | 13.6 | 7.6 | +### Windows 2019, MSVC 19.29.30147, Intel(R) Xeon(R) Platinum 8171M CPU, AVX2, Eigen 3.4.0 +![Perf_AVX2_Win](/doxygen/images/perf_avx2_win.png) +![Perf_AVX2_Win_Mv1](/doxygen/images/perf_avx2_win_mv1.png) +![Perf_AVX2_Win_Mv1](/doxygen/images/perf_avx2_win_mv2.png) -* Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. - -| | C++ std | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | -|---|---:|---:|---:|---:|---:| -| Mersenne Twister(int32) | 6.2 | 6.4 | 1.7 | 2.0 | 1.8 | -| Mersenne Twister(int64) | 6.4 | 6.3 | 2.5 | 3.1 | 2.4 | - - -| | Python 3.6 + scipy 1.5.2 + numpy 1.19.2 | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (SSSE3) | EigenRand (AVX) | -|---|---:|---:|---:|---:|---:| -| `Dirichlet(4)` | 3.54 | 3.29 | 1.25 | 1.25 | 0.83 | -| `Dirichlet(100)` | 57.63 | 145.32 | 49.71 | 49.50 | 29.13 | -| `InvWishart(4)` | 210.92 | 7.53 | 3.72 | 3.66 | 3.10 | -| `InvWishart(50)` | 1980.73 | 1446.40 | 560.40 | 559.73 | 457.07 | -| `Multinomial(4, t=20)` | 2.60 | 5.22 | 1.48 | 1.50 | 1.42 | -| `Multinomial(4, t=1000)` | 3.90 | 208.75 | 29.19 | 29.50 | 27.70 | -| `Multinomial(100, t=20)` | 47.71 | 7.09 | 3.71 | 3.63 | 3.60 | -| `Multinomial(100, t=1000)` | 128.69 | 215.19 | 44.48 | 44.63 | 43.76 | -| `MvNormal(4)` | 2.04 | 1.05 | 0.35 | 0.34 | 0.19 | -| `MvNormal(100)` | 48.69 | 47.10 | 16.25 | 16.12 | 11.41 | -| `Wishart(4)` | 81.11 | 13.24 | 9.87 | 9.81 | 5.90 | -| `Wishart(50)` | 1419.02 | 1087.40 | 448.06 | 442.97 | 328.20 | - - -### Intel(R) Xeon(R) Platinum 8171M CPU @ 2.60GHz (Windows Server 2019, MSVC2019) - -| | C++ std (or Eigen) | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:| -| `balanced`* | 20.7 | 7.2 | 3.3 | 4.0 | 2.2 | -| `balanced`(double)* | 21.9 | 8.8 | 6.7 | 4.3 | 4.3 | -| `binomial(20, 0.5)` | 718.3 | 141.0 | 38.1 | 30.2 | 32.7 | -| `binomial(50, 0.01)` | 61.5 | 21.4 | 7.5 | 6.5 | 8.0 | -| `binomial(100, 0.75)` | 495.9 | 1042.5 | 100.6 | 95.2 | 93.0 | -| `cauchy` | 71.6 | 30.0 | 6.8 | 6.4 | 3.0 | -| `chiSquared` | 243.0 | 147.3 | 63.5 | 34.1 | 24.0 | -| `discrete`(int32) | - | 12.4 | 3.5 | 2.7 | 2.2 | -| `discrete`(fp32) | - | 19.2 | 5.1 | 3.6 | 3.7 | -| `discrete`(fp64) | 83.9 | 19.0 | 6.7 | 7.4 | 4.6 | -| `exponential` | 58.7 | 16.0 | 6.8 | 6.4 | 3.0 | -| `extremeValue` | 64.6 | 27.7 | 13.5 | 9.8 | 5.5 | -| `fisherF(1, 1)` | 178.7 | 75.2 | 35.3 | 28.4 | 17.5 | -| `fisherF(5, 5)` | 491.0 | 298.4 | 125.8 | 87.4 | 60.5 | -| `gamma(0.2, 1)` | 211.7 | 69.3 | 43.7 | 24.7 | 18.7 | -| `gamma(5, 3)` | 272.5 | 42.3 | 17.6 | 17.2 | 8.5 | -| `gamma(10.5, 1)` | 237.8 | 146.2 | 63.7 | 33.8 | 23.5 | -| `geometric` | 49.3 | 17.0 | 7.0 | 5.8 | 5.4 | -| `lognormal` | 169.8 | 37.6 | 12.7 | 7.2 | 5.0 | -| `negativeBinomial(10, 0.5)` | 752.7 | 462.3 | 87.0 | 83.0 | 81.6 | -| `negativeBinomial(20, 0.25)` | 611.4 | 855.3 | 123.7 | 125.3 | 116.6 | -| `normal(0, 1)` | 78.4 | 21.1 | 6.9 | 4.6 | 2.9 | -| `normal(2, 3)` | 77.2 | 22.3 | 6.8 | 4.8 | 3.1 | -| `poisson(1)` | 77.4 | 28.9 | 10.0 | 8.1 | 10.1 | -| `poisson(16)` | 312.9 | 485.5 | 63.6 | 61.5 | 60.5 | -| `randBits` | 6.0 | 6.2 | 3.1 | 2.7 | 2.7 | -| `studentT(1)` | 175.8 | 53.9 | 17.3 | 12.5 | 7.7 | -| `studentT(20)` | 173.2 | 55.5 | 17.9 | 12.7 | 7.6 | -| `uniformInt(0~63)` | 39.1 | 5.2 | 2.0 | 1.4 | 1.6 | -| `uniformInt(0~100k)` | 38.5 | 12.3 | 7.6 | 6.0 | 7.7 | -| `uniformReal` | 53.4 | 5.7 | 1.9 | 2.3 | 1.0 | -| `weibull` | 75.1 | 44.3 | 18.5 | 14.3 | 7.9 | - -* Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. - -| | C++ std | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:| -| Mersenne Twister(int32) | 6.5 | 6.4 | 5.6 | 5.1 | 4.5 | -| Mersenne Twister(int64) | 6.6 | 6.5 | 6.9 | 5.9 | 5.1 | - - -| | Python 3.6 + scipy 1.5.2 + numpy 1.19.2 | EigenRand (No Vect.) | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:|---:| -| `Dirichlet(4)` | 4.27 | 3.20 | 2.31 | 1.43 | 1.25 | -| `Dirichlet(100)` | 69.61 | 150.33 | 67.01 | 47.34 | 32.47 | -| `InvWishart(4)` | 482.87 | 14.52 | 8.88 | 13.17 | 11.28 | -| `InvWishart(50)` | 2222.72 | 2211.66 | 902.34 | 775.36 | 610.60 | -| `Multinomial(4, t=20)` | 2.99 | 5.41 | 1.99 | 1.92 | 1.78 | -| `Multinomial(4, t=1000)` | 4.23 | 235.84 | 49.73 | 42.41 | 40.76 | -| `Multinomial(100, t=20)` | 58.20 | 9.12 | 5.84 | 6.02 | 5.98 | -| `Multinomial(100, t=1000)` | 130.54 | 234.40 | 72.99 | 66.36 | 55.28 | -| `MvNormal(4)` | 2.25 | 1.89 | 0.35 | 0.32 | 0.25 | -| `MvNormal(100)` | 57.71 | 68.80 | 24.40 | 18.28 | 13.05 | -| `Wishart(4)` | 70.18 | 16.25 | 4.49 | 3.97 | 3.07 | -| `Wishart(50)` | 1471.29 | 1641.73 | 628.58 | 485.68 | 349.81 | - - -### AMD Ryzen 7 3700x CPU @ 3.60GHz (Windows 10, MSVC2017) - -| | C++ std (or Eigen) | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:| -| `balanced`* | 20.8 | 1.9 | 2.0 | 1.4 | -| `balanced`(double)* | 21.7 | 4.1 | 2.7 | 3.0 | -| `binomial(20, 0.5)` | 416.0 | 27.7 | 28.9 | 29.1 | -| `binomial(50, 0.01)` | 37.8 | 6.3 | 6.0 | 6.6 | -| `binomial(100, 0.75)` | 309.1 | 72.4 | 66.0 | 67.0 | -| `cauchy` | 42.2 | 4.8 | 5.1 | 2.7 | -| `chiSquared` | 153.8 | 33.5 | 21.2 | 17.0 | -| `discrete`(int32) | - | 2.4 | 2.3 | 2.5 | -| `discrete`(fp32) | - | 2.6 | 2.3 | 3.5 | -| `discrete`(fp64) | 55.8 | 5.1 | 4.7 | 4.3 | -| `exponential` | 33.4 | 6.4 | 2.8 | 2.2 | -| `extremeValue` | 39.4 | 7.8 | 4.6 | 4.0 | -| `fisherF(1, 1)` | 103.9 | 25.3 | 14.9 | 11.7 | -| `fisherF(5, 5)` | 295.7 | 85.5 | 58.3 | 44.8 | -| `gamma(0.2, 1)` | 128.8 | 31.9 | 18.3 | 15.8 | -| `gamma(5, 3)` | 156.1 | 9.7 | 8.0 | 5.0 | -| `gamma(10.5, 1)` | 148.5 | 33.1 | 21.1 | 17.2 | -| `geometric` | 27.1 | 6.6 | 4.3 | 4.1 | -| `lognormal` | 104.0 | 6.6 | 4.7 | 3.5 | -| `negativeBinomial(10, 0.5)` | 462.1 | 60.0 | 56.4 | 58.6 | -| `negativeBinomial(20, 0.25)` | 357.6 | 84.5 | 80.6 | 78.4 | -| `normal(0, 1)` | 48.8 | 4.2 | 3.7 | 2.3 | -| `normal(2, 3)` | 48.8 | 4.5 | 3.8 | 2.4 | -| `poisson(1)` | 46.4 | 7.9 | 7.4 | 8.2 | -| `poisson(16)` | 192.4 | 43.2 | 40.4 | 40.9 | -| `randBits` | 4.2 | 1.7 | 1.5 | 1.8 | -| `studentT(1)` | 107.0 | 12.3 | 6.8 | 5.7 | -| `studentT(20)` | 107.1 | 12.3 | 6.8 | 5.8 | -| `uniformInt(0~63)` | 31.2 | 1.1 | 1.0 | 1.2 | -| `uniformInt(0~100k)` | 27.7 | 5.6 | 5.6 | 5.4 | -| `uniformReal` | 30.7 | 1.1 | 1.0 | 0.6 | -| `weibull` | 46.5 | 10.6 | 6.4 | 5.2 | +### Ubuntu 18.04, gcc 7.5.0, Intel(R) Xeon(R) Platinum 8370C CPU, AVX2, Eigen 3.4.0 +![Perf_AVX2_Ubu](/doxygen/images/perf_avx2_ubu.png) +![Perf_AVX2_Ubu_Mv1](/doxygen/images/perf_avx2_ubu_mv1.png) +![Perf_AVX2_Ubu_Mv1](/doxygen/images/perf_avx2_ubu_mv2.png) -* Since there is no equivalent class to `balanced` in C++11 std, we used Eigen::DenseBase::Random instead. +### macOS Monterey 12.2.1, clang 13.1.6, Apple M1 Pro, NEON, Eigen 3.4.0 +![Perf_NEON_mac](/doxygen/images/perf_neon_mac.png) +![Perf_NEON_mac_Mv1](/doxygen/images/perf_neon_mac_mv1.png) +![Perf_NEON_mac_Mv1](/doxygen/images/perf_neon_mac_mv2.png) -| | C++ std | EigenRand (SSE2) | EigenRand (AVX) | EigenRand (AVX2) | -|---|---:|---:|---:|---:| -| Mersenne Twister(int32) | 5.0 | 3.4 | 3.4 | 3.3 | -| Mersenne Twister(int64) | 5.1 | 3.9 | 3.9 | 3.3 | - -### ARM64 NEON (Cortex-A73) -Currently, Support for ARM64 NEON is experimental and the result may be sub-optimal. -Also keep in mind that NEON does not support vectorization of double type. -So if you use double type generators, they would fallback into scalar computations. - -![Perf_no_vect](/doxygen/images/perf_neon_v0.3.90.png) - -The following charts are about multivariate distributions. -![Perf_no_vect](/doxygen/images/perf_mv_part1_neon_v0.3.90.png) -![Perf_no_vect](/doxygen/images/perf_mv_part2_neon_v0.3.90.png) - -Cases filled with orange are generators that are slower than reference functions. +You can see the detailed numerical values used to plot the above charts on the [Action](https://github.com/bab2min/EigenRand/actions/workflows/release.yml) page. ## Accuracy Since vectorized mathematical functions may have a loss of precision, I measured how well the generated random number fits its actual distribution. @@ -389,6 +157,11 @@ MIT License ## History +### 0.5.0 (2023-01-31) +* Improved the performance of `MultinomialGen`. +* Implemented vectorization over parameters to some distributions. +* Optimized the performance of `double`-type generators on NEON architecture. + ### 0.4.1 (2022-08-13) * Fixed a bug where double-type generation with std::mt19937 fails compilation. * Fixed a bug where `UniformIntGen` in scalar mode generates numbers in the wrong range. diff --git a/doxygen/images/perf_avx.png b/doxygen/images/perf_avx.png deleted file mode 100644 index ea36ab0..0000000 Binary files a/doxygen/images/perf_avx.png and /dev/null differ diff --git a/doxygen/images/perf_avx2.png b/doxygen/images/perf_avx2.png deleted file mode 100644 index 5225775..0000000 Binary files a/doxygen/images/perf_avx2.png and /dev/null differ diff --git a/doxygen/images/perf_avx2_ubu.png b/doxygen/images/perf_avx2_ubu.png new file mode 100644 index 0000000..a4f9249 Binary files /dev/null and b/doxygen/images/perf_avx2_ubu.png differ diff --git a/doxygen/images/perf_avx2_ubu_mv1.png b/doxygen/images/perf_avx2_ubu_mv1.png new file mode 100644 index 0000000..f368cf4 Binary files /dev/null and b/doxygen/images/perf_avx2_ubu_mv1.png differ diff --git a/doxygen/images/perf_avx2_ubu_mv2.png b/doxygen/images/perf_avx2_ubu_mv2.png new file mode 100644 index 0000000..31b0995 Binary files /dev/null and b/doxygen/images/perf_avx2_ubu_mv2.png differ diff --git a/doxygen/images/perf_avx2_win.png b/doxygen/images/perf_avx2_win.png new file mode 100644 index 0000000..bcd515c Binary files /dev/null and b/doxygen/images/perf_avx2_win.png differ diff --git a/doxygen/images/perf_avx2_win_mv1.png b/doxygen/images/perf_avx2_win_mv1.png new file mode 100644 index 0000000..56e36d5 Binary files /dev/null and b/doxygen/images/perf_avx2_win_mv1.png differ diff --git a/doxygen/images/perf_avx2_win_mv2.png b/doxygen/images/perf_avx2_win_mv2.png new file mode 100644 index 0000000..1ff905f Binary files /dev/null and b/doxygen/images/perf_avx2_win_mv2.png differ diff --git a/doxygen/images/perf_mv_part1.png b/doxygen/images/perf_mv_part1.png deleted file mode 100644 index c34c35f..0000000 Binary files a/doxygen/images/perf_mv_part1.png and /dev/null differ diff --git a/doxygen/images/perf_mv_part1_neon_v0.3.90.png b/doxygen/images/perf_mv_part1_neon_v0.3.90.png deleted file mode 100644 index ac13a62..0000000 Binary files a/doxygen/images/perf_mv_part1_neon_v0.3.90.png and /dev/null differ diff --git a/doxygen/images/perf_mv_part2.png b/doxygen/images/perf_mv_part2.png deleted file mode 100644 index aaee92a..0000000 Binary files a/doxygen/images/perf_mv_part2.png and /dev/null differ diff --git a/doxygen/images/perf_mv_part2_neon_v0.3.90.png b/doxygen/images/perf_mv_part2_neon_v0.3.90.png deleted file mode 100644 index 649a4ca..0000000 Binary files a/doxygen/images/perf_mv_part2_neon_v0.3.90.png and /dev/null differ diff --git a/doxygen/images/perf_neon_mac.png b/doxygen/images/perf_neon_mac.png new file mode 100644 index 0000000..2e83700 Binary files /dev/null and b/doxygen/images/perf_neon_mac.png differ diff --git a/doxygen/images/perf_neon_mac_mv1.png b/doxygen/images/perf_neon_mac_mv1.png new file mode 100644 index 0000000..e5b89c7 Binary files /dev/null and b/doxygen/images/perf_neon_mac_mv1.png differ diff --git a/doxygen/images/perf_neon_mac_mv2.png b/doxygen/images/perf_neon_mac_mv2.png new file mode 100644 index 0000000..d19e495 Binary files /dev/null and b/doxygen/images/perf_neon_mac_mv2.png differ diff --git a/doxygen/images/perf_neon_v0.3.90.png b/doxygen/images/perf_neon_v0.3.90.png deleted file mode 100644 index 4eeb4ce..0000000 Binary files a/doxygen/images/perf_neon_v0.3.90.png and /dev/null differ diff --git a/doxygen/images/perf_no_vect.png b/doxygen/images/perf_no_vect.png deleted file mode 100644 index 26a0e5c..0000000 Binary files a/doxygen/images/perf_no_vect.png and /dev/null differ diff --git a/doxygen/images/perf_sse2.png b/doxygen/images/perf_sse2.png deleted file mode 100644 index 4023083..0000000 Binary files a/doxygen/images/perf_sse2.png and /dev/null differ