
Cache numDeletesToMerge in MergePolicy #35594

Closed
wants to merge 11 commits into from

Conversation

s1monw
Contributor

@s1monw s1monw commented Nov 15, 2018

With update- and refresh-heavy use cases we might execute a lot of queries when we flush or refresh, given that we have a retention policy to evaluate for the MergePolicy to make the right decision. Having values that are not absolutely up to date isn't crucial for the merge policy to be accurate, yet computing them can cause significant slowdowns in indexing speed. In order to find a middle ground, this change caches the results of the merge policy for 1 minute by default.
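For illustration, the caching idea boils down to something like the following (a minimal sketch only, not the actual code in this PR; the class and method names are assumptions made up for the example):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

/**
 * Minimal sketch: reuse the last numDeletesToMerge result for a fixed interval
 * instead of recomputing it on every flush/refresh.
 */
final class CachedNumDeletesToMerge {
    private final long cacheIntervalNanos;
    private final LongSupplier nanoClock; // injectable clock, e.g. System::nanoTime
    private long lastComputedAtNanos;
    private int cachedValue;
    private boolean hasValue;

    CachedNumDeletesToMerge(long interval, TimeUnit unit, LongSupplier nanoClock) {
        this.cacheIntervalNanos = unit.toNanos(interval);
        this.nanoClock = nanoClock;
    }

    /** Returns the cached count while it is fresh enough, otherwise recomputes it. */
    synchronized int numDeletesToMerge(IntSupplier expensiveComputation) {
        long now = nanoClock.getAsLong();
        if (hasValue == false || now - lastComputedAtNanos >= cacheIntervalNanos) {
            cachedValue = expensiveComputation.getAsInt(); // the costly soft-deletes count
            lastComputedAtNanos = now;
            hasValue = true;
        }
        return cachedValue;
    }
}
```

The 1 minute default mentioned above would correspond to `new CachedNumDeletesToMerge(1, TimeUnit.MINUTES, System::nanoTime)`.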

@s1monw s1monw added >enhancement v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.6.0 labels Nov 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dliappis
Contributor

dliappis commented Nov 16, 2018

I benchmarked this in terms of impact on total_time in merges and refreshes. I also measured the impact on indexing throughput, even though the PR is about the merge policy; as expected there was no impact there. All this was done on the AWS environment we used for most of the benchmarks in #31717.

I used a subset of http_logs (7780628 docs, about 860 MB index size) to iterate faster, ran baseline metrics using the current es#master (commit: 2da4bc8), and compared performance between 3 scenarios: soft_deletes: off, soft_deletes: on without replication, and finally with CCR replication on.

The indexing scenario uses 25% conflicts with a bias towards more recent ids (recency: 0.75), on_conflict is index, and 8 bulk indexing clients were used. All other ES settings were left at their defaults.
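In Rally track-parameter terms that corresponds roughly to the following (a sketch based on the params file posted further down in this thread; the client-count parameter name is an assumption):

```json
{
  "conflict-probability": 25,
  "recency": 0.75,
  "on_conflict": "index",
  "bulk_indexing_clients": 8
}
```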

ES master commit 2da4bc8:

| Metric                               | soft_deletes: off |   soft_deletes: on |       Δ % |
| -----------------------------------  | ----------------- | ------------------ |    ------ |
| indices_merges_total_time_in_millis  |            235093 |             279914 | 19.065221 |
| indices_refresh_total_time_in_millis |             77861 |             162428 | 108.61278 |

This PR

| Metric                               | soft_deletes: off |   soft_deletes: on |       Δ % |
| -----------------------------------  | ----------------- | ------------------ |    ------ |
| indices_merges_total_time_in_millis  |            232987 |             246994 | 6.0119234 |
| indices_refresh_total_time_in_millis |             77051 |             143383 | 86.088435 |

So the Δ% between soft_deletes: off and soft_deletes: on went down with this PR for both merges and refresh total time, by a relative change of 68.5% and 20.74% respectively (merges: (19.07 − 6.01) / 19.07 ≈ 68.5%; refresh: (108.61 − 86.09) / 108.61 ≈ 20.7%).

For the record, the throughput values are attached below:

Indexing throughput es master commit `2da4bc8`

soft_deletes: off vs soft_deletes: on without replication:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   125385 |    119569 | -5815.55 | docs/s |

soft_deletes_off vs CCR on:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   119569 |    112103 | -7466.44 | docs/s |

Indexing throughput for this PR

soft_deletes: off vs soft_deletes: on without replication:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   124466 |    118745 | -5721.42 | docs/s |

soft_deletes_off vs CCR on:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   118745 |    115427 | -3317.93 | docs/s |

Member

@dnhatn dnhatn left a comment

LGTM.

@dliappis
Contributor

dliappis commented Dec 13, 2018

To get a better understanding of the impact of an update workload (index on conflict) we benchmarked ES master at commit 4a825e2, which contains 3fb5a12, including the PRs apache/lucene-solr#521 and apache/lucene-solr#522 with specific improvements for updates.

The http_logs track was used in a 1-node scenario, and we compared performance with soft_deletes_enabled: false (baseline) vs soft_deletes_enabled: true (contender):

$ rally compare --baseline=20181212T141753Z --contender=20181212T122139Z
Auto-updating Rally from origin
    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/
Comparing baseline
  Race timestamp: 2018-12-12 14:17:53
  Challenge: update
  Car: 4gheap
with contender
  Race timestamp: 2018-12-12 12:21:39
  Challenge: update
  Car: 4gheap
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
|                               Metric |   Task |   Baseline |   Contender |     Diff |   Unit |
|-------------------------------------:|-------:|-----------:|------------:|---------:|-------:|
|                  Total indexing time |        |    312.699 |     290.881 | -21.8186 |    min |
|          Min indexing time per shard |        |    2.20418 |      1.9393 | -0.26488 |    min |
|       Median indexing time per shard |        |     9.5667 |       8.821 |  -0.7457 |    min |
|          Max indexing time per shard |        |    255.243 |     240.141 | -15.1021 |    min |
|                     Total merge time |        |    91.5886 |     80.1072 | -11.4814 |    min |
|             Min merge time per shard |        |  0.0940667 |           0 | -0.09407 |    min |
|          Median merge time per shard |        |    1.37393 |      1.0618 | -0.31213 |    min |
|             Max merge time per shard |        |    82.4002 |     74.0482 | -8.35202 |    min |
|            Total merge throttle time |        |    51.4853 |     37.3038 | -14.1815 |    min |
|    Min merge throttle time per shard |        |  0.0154167 |           0 | -0.01542 |    min |
| Median merge throttle time per shard |        |   0.587067 |    0.298817 | -0.28825 |    min |
|    Max merge throttle time per shard |        |    47.0388 |      35.527 | -11.5117 |    min |
|                   Total refresh time |        |    11.5231 |     20.1915 |  8.66843 |    min |
|           Min refresh time per shard |        |   0.137133 |    0.172817 |  0.03568 |    min |
|        Median refresh time per shard |        |   0.407183 |     0.56185 |  0.15467 |    min |
|           Max refresh time per shard |        |    9.27443 |     17.0656 |  7.79118 |    min |
|                     Total flush time |        |     4.5164 |     16.6682 |  12.1518 |    min |
|             Min flush time per shard |        |  0.0171833 | 0.000716667 | -0.01647 |    min |
|          Median flush time per shard |        |   0.143683 |    0.310483 |   0.1668 |    min |
|             Max flush time per shard |        |    3.73527 |      14.657 |  10.9217 |    min |
|                   Total Young Gen GC |        |    204.049 |     165.049 |      -39 |      s |
|                     Total Old Gen GC |        |      3.946 |       4.178 |    0.232 |      s |
|                           Store size |        |    20.4358 |     21.6128 |  1.17702 |     GB |
|                        Translog size |        |    3.72426 |     3.82259 |  0.09833 |     GB |
|                      Totally written |        |    148.353 |     140.922 | -7.43103 |     GB |
|               Heap used for segments |        |    56.2567 |     53.1571 |  -3.0996 |     MB |
|             Heap used for doc values |        |  0.0203896 |   0.0575466 |  0.03716 |     MB |
|                  Heap used for terms |        |    46.0091 |     42.6506 | -3.35855 |     MB |
|                  Heap used for norms |        | 0.00909424 |  0.00994873 |  0.00085 |     MB |
|                 Heap used for points |        |    4.46687 |     4.56505 |  0.09818 |     MB |
|          Heap used for stored fields |        |    5.75117 |     5.87392 |  0.12276 |     MB |
|                        Segment count |        |        149 |         163 |       14 |        |
|                       Min Throughput | update |    53042.5 |     55003.4 |  1960.96 | docs/s |
|                    Median Throughput | update |    56923.4 |     59198.8 |  2275.42 | docs/s |
|                       Max Throughput | update |    59490.9 |     61646.3 |  2155.38 | docs/s |
|              50th percentile latency | update |    432.729 |     400.974 | -31.7553 |     ms |
|              90th percentile latency | update |    1974.45 |     1939.48 | -34.9747 |     ms |
|              99th percentile latency | update |    3122.52 |     3062.97 | -59.5558 |     ms |
|            99.9th percentile latency | update |    3795.65 |      9986.2 |  6190.55 |     ms |
|           99.99th percentile latency | update |    4990.56 |     14176.1 |  9185.52 |     ms |
|             100th percentile latency | update |    5564.62 |     17730.7 |    12166 |     ms |
|         50th percentile service time | update |    432.729 |     400.974 | -31.7553 |     ms |
|         90th percentile service time | update |    1974.45 |     1939.48 | -34.9747 |     ms |
|         99th percentile service time | update |    3122.52 |     3062.97 | -59.5558 |     ms |
|       99.9th percentile service time | update |    3795.65 |      9986.2 |  6190.55 |     ms |
|      99.99th percentile service time | update |    4990.56 |     14176.1 |  9185.52 |     ms |
|        100th percentile service time | update |    5564.62 |     17730.7 |    12166 |     ms |
|                           error rate | update |          0 |           0 |        0 |      % |
-------------------------------
[INFO] SUCCESS (took 1 seconds)
-------------------------------

I also captured the total counts for merge/refresh/flush (from the _stats endpoint, under _all/primaries) after the end of each benchmark:

soft_deletes_enabled: false:

"merges" : {
  "total" : 332,
},
"refresh" : {
  "total" : 463,
},
"flush" : {
  "total" : 96,
},

soft_deletes_enabled: true:

"merges" : {
  "total" : 188,
},
"refresh" : {
  "total" : 262,
},
"flush" : {
  "total" : 97,
},

With soft_deletes enabled:

- Time spent on merges showed a 12.6% decrease, and the median also decreased. Total refresh time was higher by 75%; however, looking at the median refresh time per shard, the impact with soft_deletes_enabled: true is approx 4%.
- Memory usage patterns differ with soft_deletes enabled, which can explain the reduced number of merge/refresh operations.
- Median indexing throughput showed a 4% increase with soft_deletes enabled, but this is likely due to run-to-run variation; more benchmark iterations would probably be needed to determine with certainty whether there are concrete improvements in indexing throughput.

Rally command:

esrally --keep-cluster-running --runtime-jdk=8 --car="4gheap" --target-hosts=192.168.14.3:39200 --pipeline=from-sources-complete --revision=4a825e2e86582fa2d693d64246c2fb6669f42fdb --track=http_logs --challenge=update --track-params=./custom_track_params.json

Additional parameters used (for contender "index.soft_deletes.enabled" was changed to true):

{
  "ingest_percentage": 100,
  "on_conflict": "index",
  "conflict-probability": 25,
  "recency": 0.75,
  "index_settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.soft_deletes.enabled": false
  }
}

@s1monw
Contributor Author

s1monw commented Dec 13, 2018

@dliappis I take from these numbers that we are more memory efficient now with soft deletes, which I expected, and that this caused fewer refreshes. We are refreshing only about 56% as often compared to hard deletes, which means we are filling up buffers quite heavily, causing more load per refresh and resulting in the higher time spent in refresh, i.e. Max refresh time per shard (note this is not the ideal name for this; I think it is actually the max cumulative per-shard refresh time across all shards, @danielmitterdorfer correct me if I am wrong).

I personally think we can close this PR since the numbers look much better now. Yet there is one more experiment that I would like to run if possible: I would love to see how the numbers behave if we set index.search.idle.after: 1d, which basically means we will refresh every second. This should be set on both baseline and contender. @dliappis would you be able to run this experiment?

@dliappis
Contributor

> would love to see how numbers behave if we set index.search.idle.after: 1d which basically means we will refresh every second.

@s1monw I guess the setting value should be 1s rather than 1d, correct?

@s1monw
Contributor Author

s1monw commented Dec 14, 2018

@dliappis I meant 1d to make sure we don’t mark it search idle, so we have a number that is based on refreshing every second.
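Concretely, that just means adding the setting to the index_settings block of the params file above (a sketch, assuming the same params file is reused for both baseline and contender):

```json
"index_settings": {
  "index.number_of_shards": 1,
  "index.number_of_replicas": 0,
  "index.search.idle.after": "1d",
  "index.soft_deletes.enabled": false
}
```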

@dliappis
Contributor

Thanks for the clarification @s1monw

dliappis added a commit to dliappis/rally that referenced this pull request Dec 14, 2018
Use cumulative instead of total for summary reports and comparisons
and clarify that min/median/max are across primary shards.

Relates: elastic/elasticsearch#35594 (comment)
dliappis added a commit to elastic/rally that referenced this pull request Dec 14, 2018
Due to #608 it's likely we need to benchmark scenarios without using
the node-stats telemetry device. At the same time we want to get a
general idea of how many refreshes/merges/flushes happened (in total)
by accessing the index stats.

Add total count for merges/refresh/flush in summary output; this is
collected from `_all/primaries` in `_stats`.

Also note that these values are cumulative across primary shards, and clarify
the description for min/median/max in the summary report.

Finally, fix a bug where index stats with time/count == 0 were skipped
in the summary.

Closes #614

Relates: #615
Relates: elastic/elasticsearch#35594 (comment)
@dliappis
Contributor

dliappis commented Dec 14, 2018

Benchmark results using index.search.idle.after: 1d on both baseline (soft_deletes.enabled: false) and contender (soft_deletes.enabled: true):

(Note: the column names reflect the metric renaming done in elastic/rally#615, but they display the same fields as the ones in the previous results.)

$ rally compare --baseline=20181214T110634Z --contender=20181214T123444Z
|                                                     Metric |   Task |   Baseline |   Contender |     Diff |   Unit |
|-----------------------------------------------------------:|-------:|-----------:|------------:|---------:|-------:|
|                 Cumulative indexing time of primary shards |        |    308.177 |     326.459 |   18.282 |    min |
|          Min cumulative indexing time across primary shard |        |    2.00722 |      2.0033 | -0.00392 |    min |
|       Median cumulative indexing time across primary shard |        |    9.06298 |     10.1192 |  1.05623 |    min |
|          Max cumulative indexing time across primary shard |        |    255.239 |     270.216 |  14.9772 |    min |
|                    Cumulative merge time of primary shards |        |    121.077 |     121.541 |   0.4647 |    min |
|             Min cumulative merge time across primary shard |        |   0.280383 |    0.345317 |  0.06493 |    min |
|          Median cumulative merge time across primary shard |        |    2.07707 |     2.49417 |   0.4171 |    min |
|             Max cumulative merge time across primary shard |        |    107.752 |     106.739 | -1.01252 |    min |
|           Cumulative merge throttle time of primary shards |        |    57.6583 |      51.627 | -6.03135 |    min |
|    Min cumulative merge throttle time across primary shard |        |  0.0226667 |   0.0349667 |   0.0123 |    min |
| Median cumulative merge throttle time across primary shard |        |     0.5813 |     0.47885 | -0.10245 |    min |
|    Max cumulative merge throttle time across primary shard |        |    53.3289 |     48.1736 |  -5.1553 |    min |
|                  Cumulative refresh time of primary shards |        |    16.6727 |     42.8803 |  26.2076 |    min |
|           Min cumulative refresh time across primary shard |        |     0.1117 |      0.1885 |   0.0768 |    min |
|        Median cumulative refresh time across primary shard |        |      0.465 |     1.07048 |  0.60548 |    min |
|           Max cumulative refresh time across primary shard |        |    14.0625 |     36.9627 |  22.9001 |    min |
|                    Cumulative flush time of primary shards |        |    2.96627 |     6.26865 |  3.30238 |    min |
|             Min cumulative flush time across primary shard |        | 0.00248333 |      0.0009 | -0.00158 |    min |
|          Median cumulative flush time across primary shard |        |  0.0874333 |      0.1419 |  0.05447 |    min |
|             Max cumulative flush time across primary shard |        |    2.51173 |      5.4797 |  2.96797 |    min |
|                                         Total Young Gen GC |        |    157.059 |     154.685 |   -2.374 |      s |
|                                           Total Old Gen GC |        |      3.229 |        2.77 |   -0.459 |      s |
|                                                 Store size |        |    20.7935 |     22.5187 |  1.72515 |     GB |
|                                              Translog size |        |    3.74946 |     3.71403 | -0.03543 |     GB |
|                                            Totally written |        |    190.833 |     189.605 | -1.22774 |     GB |
|                                     Heap used for segments |        |    55.3721 |     51.2664 | -4.10573 |     MB |
|                                   Heap used for doc values |        |  0.0352287 |   0.0641937 |  0.02896 |     MB |
|                                        Heap used for terms |        |    45.3404 |     40.8353 | -4.50513 |     MB |
|                                        Heap used for norms |        | 0.00958252 |  0.00964355 |    6e-05 |     MB |
|                                       Heap used for points |        |    4.41825 |     4.56084 |  0.14258 |     MB |
|                                Heap used for stored fields |        |    5.56862 |     5.79642 |   0.2278 |     MB |
|                                              Segment count |        |        157 |         158 |        1 |        |
|                                             Min Throughput | update |    55747.5 |     53696.4 | -2051.15 | docs/s |
|                                          Median Throughput | update |    60765.5 |     59023.2 | -1742.27 | docs/s |
|                                             Max Throughput | update |    63406.6 |     63144.2 | -262.366 | docs/s |
|                                    50th percentile latency | update |    463.521 |     461.751 | -1.76977 |     ms |
|                                    90th percentile latency | update |    1893.82 |     1919.34 |  25.5193 |     ms |
|                                    99th percentile latency | update |    2495.46 |     2718.14 |  222.689 |     ms |
|                                  99.9th percentile latency | update |    2984.23 |     3777.08 |  792.846 |     ms |
|                                 99.99th percentile latency | update |    3484.53 |     4924.91 |  1440.38 |     ms |
|                                   100th percentile latency | update |    3771.14 |     6076.91 |  2305.77 |     ms |
|                               50th percentile service time | update |    463.521 |     461.751 | -1.76977 |     ms |
|                               90th percentile service time | update |    1893.82 |     1919.34 |  25.5193 |     ms |
|                               99th percentile service time | update |    2495.46 |     2718.14 |  222.689 |     ms |
|                             99.9th percentile service time | update |    2984.23 |     3777.08 |  792.846 |     ms |
|                            99.99th percentile service time | update |    3484.53 |     4924.91 |  1440.38 |     ms |
|                              100th percentile service time | update |    3771.14 |     6076.91 |  2305.77 |     ms |
|                                                 error rate | update |          0 |           0 |        0 |      % |


-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

Refresh/merge/flush counts:

baseline:

"merges" : {
  "total" : 2131,
},
"refresh" : {
  "total" : 3688,
},
"flush" : {
  "total" : 98,
},

contender:

"merges" : {
  "total" : 1364,
},
"refresh" : {
  "total" : 2364,
},
"flush" : {
  "total" : 98,
},
