Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use a separate thread for tiered compilation background work #45901

Merged
merged 2 commits into from
Jan 26, 2021

Conversation

kouvel
Copy link
Member

@kouvel kouvel commented Dec 10, 2020

  • Makes it easier to manage how much time is spent for performing background work like rejitting and allows yielding more frequently with just Sleep without incurring thread pool overhead, which is useful in CPU-limited cases
  • A min/max range is determined for how long background work will be done before yielding the thread. The max is the same as before, 50 ms. For now the min is processor count ms (capped to the max), such that in CPU-limited cases the thread would yield more frequently in order to not monopolize too much of the limited CPU resources for background work, and in cases with a larger number of processors where the background work is typically less intrusive to foreground work it would yield less frequently.
  • At the same time, progress should be made on background work such that steady-state perf would be reached in reasonable time. Yielding too frequently can slow down the background work too much. The sleep duration is measured to identify oversubscribed situations to yield less frequently and make faster progress on the background work.
  • Due to less time spent rejitting in some CPU-limited cases, steady-state performance may be reached a bit later in favor of smaller spikes along the way
  • When the portable thread pool is enabled, a side effect of using a managed worker thread for tiering background work was that several GC-heavy microbenchmarks regressed. Tiering was the only thing using the thread pool in those tests and stack-walking the managed thread was slower due to the presence of GC refs. It's not too concerning, the benchmarks are just measuring something different from before, but in any case this change also resolves that issue. Fixes [Perf] Widespread regressions after threadpool change #44211.

@kouvel kouvel added this to the 6.0.0 milestone Dec 10, 2020
@kouvel kouvel requested a review from noahfalk December 10, 2020 18:40
@kouvel kouvel self-assigned this Dec 10, 2020
@kouvel
Copy link
Member Author

kouvel commented Dec 10, 2020

MusicStore Windows affinity: 1 proc

Before

============= Startup Performance ============

Server start (ms):   457
1st Request (ms):    378
Total (ms):          835

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)    Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   ------   -----------   ------------   --------------   -----------   ------
    2-  100                 1047   465.57          1.73           2.15             1.80         17.26     8.46
  101-  250                 1366   470.16          1.76           2.13             1.82         39.01    11.67
  251-  500                 1919   452.32          1.43           2.21             1.70        126.60    22.62
  501-  750                 2651   341.64          1.23           2.93             1.37        126.52    26.84
  751- 1000                 3104   551.91          1.21           1.81             1.27        126.01    27.54
 1001- 1500                 3950   590.90          1.13           1.69             1.19        203.36    24.06
 1501- 2000                 4562   816.66          1.13           1.22             1.19          4.78     0.77
 2001- 3000                 5772   826.59          1.14           1.21             1.19          2.25     0.36
 3001- 5000                 8189   827.25          1.11           1.21             1.19          2.52     0.26
 5001-10000                14281   820.76          1.11           1.22             1.19          7.73     0.20

After

============= Startup Performance ============

Server start (ms):   466
1st Request (ms):    374
Total (ms):          840

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)    Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   ------   -----------   ------------   --------------   -----------   ------
    2-  100                 1054   460.95          1.74           2.17             1.80         19.82     9.28
  101-  250                 1354   501.21          1.73           2.00             1.78          9.51     3.58
  251-  500                 1872   482.59          1.51           2.07             1.68          7.27     3.55
  501-  750                 2300   584.15          1.47           1.71             1.52          6.67     2.69
  751- 1000                 2856   448.89          1.38           2.23             1.50          7.18     4.24
 1001- 1500                 3743   563.79          1.30           1.77             1.38         16.48     3.28
 1501- 2000                 4765   489.63          1.16           2.04             1.27         11.07     3.47
 2001- 3000                 6008   804.51          1.14           1.24             1.19          5.11     0.84
 3001- 5000                 8447   819.91          1.03           1.22             1.19          2.84     0.30
 5001-10000                14519   823.37          1.08           1.21             1.19          2.33     0.17

MusicStore Linux --cpuset-cpus=1

Before

============= Startup Performance ============

Server start (ms):  1019
1st Request (ms):    293
Total (ms):         1312

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)     Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   -------   -----------   ------------   --------------   -----------   ------
    2-  100                 1478    595.83          1.27           1.68             1.34         15.55    10.37
  101-  250                 1743    564.91          1.27           1.77             1.31         58.03    21.51
  251-  500                 2193    556.11          1.07           1.80             1.11         54.65    20.05
  501-  750                 2528    745.39          1.03           1.34             1.10         53.40    15.60
  751- 1000                 3002    527.89          0.82           1.89             1.04         53.85    21.84
 1001- 1500                 3537    933.26          0.77           1.07             0.83         55.08    14.04
 1501- 2000                 4054    967.10          0.73           1.03             0.79         52.34    14.08
 2001- 3000                 5067    988.04          0.68           1.01             0.72         56.17    10.85
 3001- 5000                 6557   1341.62          0.67           0.75             0.72          2.00     0.30
 5001-10000                10239   1358.02          0.67           0.74             0.72          2.22     0.18

After

============= Startup Performance ============

Server start (ms):  1025
1st Request (ms):    304
Total (ms):         1329

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)     Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   -------   -----------   ------------   --------------   -----------   ------
    2-  100                 1495    596.10          1.26           1.68             1.33         15.62    10.48
  101-  250                 1765    555.33          1.26           1.80             1.32          9.63     4.94
  251-  500                 2178    604.41          1.05           1.65             1.16          3.68     2.67
  501-  750                 2542    688.24          1.05           1.45             1.12         11.79     5.19
  751- 1000                 2987    560.69          0.90           1.78             2.08          9.78     2.93
 1001- 1500                 3572    854.85          0.79           1.17             0.85         14.11     4.00
 1501- 2000                 4102    944.62          0.75           1.06             0.83         10.81     3.28
 2001- 3000                 5194    915.07          0.68           1.09             0.78         10.63     2.58
 3001- 5000                 6659   1365.22          0.67           0.73             0.71          1.96     0.30
 5001-10000                10359   1351.50          0.68           0.74             0.72          2.41     0.21

MusicStore Linux --cpus=1

Before

============= Startup Performance ============

Server start (ms):  1009
1st Request (ms):    306
Total (ms):         1315

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)    Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   ------   -----------   ------------   --------------   -----------   ------
    2-  100                 1834   190.48          1.46           5.25             1.66        103.80    31.16
  101-  250                 2101   561.62          1.34           1.78             1.43         15.38     8.29
  251-  500                 2800   357.80          0.99           2.79             1.16        107.74    24.82
  501-  750                 3260   543.99          1.04           1.84             1.14        104.71    23.62
  751- 1000                 3768   492.22          0.88           2.03             1.05        109.44    25.01
 1001- 1500                 4420   765.76          0.89           1.31             1.08         20.41     6.90
 1501- 2000                 5078   760.81          0.84           1.31             1.05         21.16     7.49
 2001- 3000                 6318   806.30          0.92           1.24             0.97         21.58     5.36
 3001- 5000                 8918   769.09          0.92           1.30             1.05         23.24     3.70
 5001-10000                15282   785.76          0.86           1.27             0.99         23.13     2.44

After

============= Startup Performance ============

Server start (ms):   991
1st Request (ms):    302
Total (ms):         1293

========== Steady State Performance ==========

  Requests    Aggregate Time(ms)    Req/s   Req Min(ms)   Req Mean(ms)   Req Median(ms)   Req Max(ms)   SEM(%)
-----------   ------------------   ------   -----------   ------------   --------------   -----------   ------
    2-  100                 1502   471.88          1.51           2.12             1.65         18.35    10.84
  101-  250                 1790   521.18          1.43           1.92             1.51         23.63     9.68
  251-  500                 2420   397.07          1.30           2.52             1.48         57.73    17.26
  501-  750                 3045   399.72          1.31           2.50             1.43         58.17    17.55
  751- 1000                 3564   481.83          1.06           2.08             1.22         56.80    18.52
 1001- 1500                 4523   521.37          1.03           1.92             1.15         59.15    12.78
 1501- 2000                 5450   539.66          0.96           1.85             1.10         57.93    13.85
 2001- 3000                 6684   809.82          0.90           1.23             0.97         22.16     5.41
 3001- 5000                 9178   802.16          0.93           1.25             0.97         22.73     3.90
 5001-10000                15481   793.19          0.79           1.26             0.98         24.00     2.45

Other testing

  • Ran CscRoslyn in CPU-limited cases, as it has a lot of methods to rejit, to verify that steady-state perf is reached in reasonable time. No change in startup or steady-state perf.
  • Ran ASP.NET JsonPlatform in CPU-limited cases to look for regressions. Didn't see any unexpected differences, just the expected slight differences in rejit timing. No change in startup or steady-state perf.
  • Ran some things on non-CPU-limited environments, no noticeable change there

@kouvel
Copy link
Member Author

kouvel commented Dec 10, 2020

The diff of tieredcompilation.cpp is not well aligned, I have shared a better-aligned diff, download and open in a browser.

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a little fishiness in the cross-thread signaling but other than that all looked good to me.

src/coreclr/vm/tieredcompilation.cpp Show resolved Hide resolved
src/coreclr/vm/tieredcompilation.cpp Show resolved Hide resolved
- Makes it easier to manage how much time is spend for performing background work like rejitting and allows yielding more frequently with just Sleep without incurring thread pool overhead, which is useful in CPU-limited cases
- A min/max range is determined for how long background work will be done before yielding the thread. The max is the same as before, 50 ms. For now the min is `processor count` ms (capped to the max), such that in CPU-limited cases the thread would yield more frequently in order to not monopolize too much of the limited CPU resources for background work, and in cases with a larger number of processors where the background work is typically less intrusive to foreground work it would yield less frequently.
- At the same time, progress should be made on background work such that steady-state perf would be reached in reasonable time. Yielding too frequently can slow down the background work too much. The sleep duration is measured to identify oversubscribed situations to yield less frequently and make faster progress on the background work.
- Due to less time spent rejitting in some CPU-limited cases, steady-state performance may be reached a bit later in favor of fewer spikes along the way
- When the portable thread pool is enabled, a side effect of using a managed worker thread for tiering background work was that several GC-heavy microbenchmarks regressed. Tiering was the only thing using the thread pool in those tests and stack-walking the managed thread was slower due to the presence of GC refs. It's not too concerning, the benchmarks are just measuring something different from before, but in any case this change also resolves that issue. Fixes dotnet#44211.
@kouvel
Copy link
Member Author

kouvel commented Jan 19, 2021

Rebased to fix conflicts

Copy link
Member

@noahfalk noahfalk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks Kount!

@kouvel kouvel merged commit 59ba160 into dotnet:master Jan 26, 2021
@kouvel kouvel deleted the TierTpFix branch January 26, 2021 15:51
@ghost ghost locked as resolved and limited conversation to collaborators Feb 25, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Perf] Widespread regressions after threadpool change
2 participants