
Cache numDeletesToMerge in MergePolicy #35594

Closed
wants to merge 11 commits into from

Conversation

s1monw
Contributor

@s1monw s1monw commented Nov 15, 2018

With update- and refresh-heavy use cases we might execute a lot of queries when we flush or refresh, given that we have a retention policy to evaluate for the MergePolicy to make the right decision. Having values that are not absolutely up to date isn't crucial for the merge policy to be accurate, yet computing them can cause significant slowdowns in indexing speed. In order to find a middle ground, this change caches the results of the merge policy for 1 minute by default.
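For illustration, the caching idea boils down to something like the following (a minimal sketch only, not the actual code in this PR; the class and method names are assumptions made up for the example):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.IntSupplier;
import java.util.function.LongSupplier;

/**
 * Minimal sketch: reuse the last numDeletesToMerge result for a fixed interval
 * instead of recomputing it on every flush/refresh.
 */
final class CachedNumDeletesToMerge {
    private final long cacheIntervalNanos;
    private final LongSupplier nanoClock; // injectable clock, e.g. System::nanoTime
    private long lastComputedAtNanos;
    private int cachedValue;
    private boolean hasValue;

    CachedNumDeletesToMerge(long interval, TimeUnit unit, LongSupplier nanoClock) {
        this.cacheIntervalNanos = unit.toNanos(interval);
        this.nanoClock = nanoClock;
    }

    /** Returns the cached count while it is fresh enough, otherwise recomputes it. */
    synchronized int numDeletesToMerge(IntSupplier expensiveComputation) {
        long now = nanoClock.getAsLong();
        if (hasValue == false || now - lastComputedAtNanos >= cacheIntervalNanos) {
            cachedValue = expensiveComputation.getAsInt(); // the costly soft-deletes count
            lastComputedAtNanos = now;
            hasValue = true;
        }
        return cachedValue;
    }
}
```

The 1 minute default mentioned above would correspond to `new CachedNumDeletesToMerge(1, TimeUnit.MINUTES, System::nanoTime)`.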

@s1monw s1monw added >enhancement v7.0.0 :Distributed Indexing/CCR Issues around the Cross Cluster State Replication features v6.6.0 labels Nov 15, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@dliappis
Contributor

dliappis commented Nov 16, 2018

I benchmarked this in terms of impact on total_time in merges and refreshes. I also measured the impact on indexing throughput, even though the PR is about the merge policy; as expected there was no impact there. All this was done on the AWS environment we used for most of the benchmarks in #31717.

I used a subset of http_logs (7780628 docs, about 860 MB index size) to iterate faster, ran baseline metrics using the current es#master (commit: 2da4bc8), and compared performance between 3 scenarios: soft_deletes: off, soft_deletes: on without replication, and finally with CCR replication on.

The indexing scenario uses 25% conflicts with a bias towards more recent ids (recency: 0.75), on_conflict is index, and 8 bulk indexing clients were used. All other ES settings were left at their defaults.
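In Rally track-parameter terms that corresponds roughly to the following (a sketch based on the params file posted further down in this thread; the client-count parameter name is an assumption):

```json
{
  "conflict-probability": 25,
  "recency": 0.75,
  "on_conflict": "index",
  "bulk_indexing_clients": 8
}
```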

ES master commit 2da4bc8:

| Metric                               | soft_deletes: off |   soft_deletes: on |       Δ % |
| -----------------------------------  | ----------------- | ------------------ |    ------ |
| indices_merges_total_time_in_millis  |            235093 |             279914 | 19.065221 |
| indices_refresh_total_time_in_millis |             77861 |             162428 | 108.61278 |

This PR

| Metric                               | soft_deletes: off |   soft_deletes: on |       Δ % |
| -----------------------------------  | ----------------- | ------------------ |    ------ |
| indices_merges_total_time_in_millis  |            232987 |             246994 | 6.0119234 |
| indices_refresh_total_time_in_millis |             77051 |             143383 | 86.088435 |

So the Δ% between soft_deletes: off and soft_deletes: on went down with this PR for both merges and refresh total time, by a relative change of 68.5% and 20.74% respectively (merges: (19.07 − 6.01) / 19.07 ≈ 68.5%; refresh: (108.61 − 86.09) / 108.61 ≈ 20.7%).

For the record, the throughput values are attached below:

Indexing throughput es master commit `2da4bc8`

soft_deletes: off vs soft_deletes: on without replication:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   125385 |    119569 | -5815.55 | docs/s |

soft_deletes_off vs CCR on:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   119569 |    112103 | -7466.44 | docs/s |

Indexing throughput for this PR

soft_deletes: off vs soft_deletes: on without replication:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   124466 |    118745 | -5721.42 | docs/s |

soft_deletes_off vs CCR on:

| Metric            | Task                             | Baseline | Contender |     Diff |   Unit |
| ----------------- | -------------------------------- | -------- | --------- | -------- | ------ |
| Median Throughput | bulk-leader-index-sequential-ids |   118745 |    115427 | -3317.93 | docs/s |

Member

@dnhatn dnhatn left a comment

LGTM.

@dliappis
Contributor

dliappis commented Dec 13, 2018

To get a better understanding of the impact of an update workload (index on conflict) we benchmarked ES master at commit 4a825e2, which contains 3fb5a12, including the PRs apache/lucene-solr#521 and apache/lucene-solr#522 with specific improvements for updates.

The http_logs track was used in a 1-node scenario, and we compared performance with soft_deletes_enabled: false (baseline) vs soft_deletes_enabled: true (contender):

$ rally compare --baseline=20181212T141753Z --contender=20181212T122139Z
Auto-updating Rally from origin
    ____        ____
   / __ \____ _/ / /_  __
  / /_/ / __ `/ / / / / /
 / _, _/ /_/ / / / /_/ /
/_/ |_|\__,_/_/_/\__, /
                /____/
Comparing baseline
  Race timestamp: 2018-12-12 14:17:53
  Challenge: update
  Car: 4gheap
with contender
  Race timestamp: 2018-12-12 12:21:39
  Challenge: update
  Car: 4gheap
------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
|                               Metric |   Task |   Baseline |   Contender |     Diff |   Unit |
|-------------------------------------:|-------:|-----------:|------------:|---------:|-------:|
|                  Total indexing time |        |    312.699 |     290.881 | -21.8186 |    min |
|          Min indexing time per shard |        |    2.20418 |      1.9393 | -0.26488 |    min |
|       Median indexing time per shard |        |     9.5667 |       8.821 |  -0.7457 |    min |
|          Max indexing time per shard |        |    255.243 |     240.141 | -15.1021 |    min |
|                     Total merge time |        |    91.5886 |     80.1072 | -11.4814 |    min |
|             Min merge time per shard |        |  0.0940667 |           0 | -0.09407 |    min |
|          Median merge time per shard |        |    1.37393 |      1.0618 | -0.31213 |    min |
|             Max merge time per shard |        |    82.4002 |     74.0482 | -8.35202 |    min |
|            Total merge throttle time |        |    51.4853 |     37.3038 | -14.1815 |    min |
|    Min merge throttle time per shard |        |  0.0154167 |           0 | -0.01542 |    min |
| Median merge throttle time per shard |        |   0.587067 |    0.298817 | -0.28825 |    min |
|    Max merge throttle time per shard |        |    47.0388 |      35.527 | -11.5117 |    min |
|                   Total refresh time |        |    11.5231 |     20.1915 |  8.66843 |    min |
|           Min refresh time per shard |        |   0.137133 |    0.172817 |  0.03568 |    min |
|        Median refresh time per shard |        |   0.407183 |     0.56185 |  0.15467 |    min |
|           Max refresh time per shard |        |    9.27443 |     17.0656 |  7.79118 |    min |
|                     Total flush time |        |     4.5164 |     16.6682 |  12.1518 |    min |
|             Min flush time per shard |        |  0.0171833 | 0.000716667 | -0.01647 |    min |
|          Median flush time per shard |        |   0.143683 |    0.310483 |   0.1668 |    min |
|             Max flush time per shard |        |    3.73527 |      14.657 |  10.9217 |    min |
|                   Total Young Gen GC |        |    204.049 |     165.049 |      -39 |      s |
|                     Total Old Gen GC |        |      3.946 |       4.178 |    0.232 |      s |
|                           Store size |        |    20.4358 |     21.6128 |  1.17702 |     GB |
|                        Translog size |        |    3.72426 |     3.82259 |  0.09833 |     GB |
|                      Totally written |        |    148.353 |     140.922 | -7.43103 |     GB |
|               Heap used for segments |        |    56.2567 |     53.1571 |  -3.0996 |     MB |
|             Heap used for doc values |        |  0.0203896 |   0.0575466 |  0.03716 |     MB |
|                  Heap used for terms |        |    46.0091 |     42.6506 | -3.35855 |     MB |
|                  Heap used for norms |        | 0.00909424 |  0.00994873 |  0.00085 |     MB |
|                 Heap used for points |        |    4.46687 |     4.56505 |  0.09818 |     MB |
|          Heap used for stored fields |        |    5.75117 |     5.87392 |  0.12276 |     MB |
|                        Segment count |        |        149 |         163 |       14 |        |
|                       Min Throughput | update |    53042.5 |     55003.4 |  1960.96 | docs/s |
|                    Median Throughput | update |    56923.4 |     59198.8 |  2275.42 | docs/s |
|                       Max Throughput | update |    59490.9 |     61646.3 |  2155.38 | docs/s |
|              50th percentile latency | update |    432.729 |     400.974 | -31.7553 |     ms |
|              90th percentile latency | update |    1974.45 |     1939.48 | -34.9747 |     ms |
|              99th percentile latency | update |    3122.52 |     3062.97 | -59.5558 |     ms |
|            99.9th percentile latency | update |    3795.65 |      9986.2 |  6190.55 |     ms |
|           99.99th percentile latency | update |    4990.56 |     14176.1 |  9185.52 |     ms |
|             100th percentile latency | update |    5564.62 |     17730.7 |    12166 |     ms |
|         50th percentile service time | update |    432.729 |     400.974 | -31.7553 |     ms |
|         90th percentile service time | update |    1974.45 |     1939.48 | -34.9747 |     ms |
|         99th percentile service time | update |    3122.52 |     3062.97 | -59.5558 |     ms |
|       99.9th percentile service time | update |    3795.65 |      9986.2 |  6190.55 |     ms |
|      99.99th percentile service time | update |    4990.56 |     14176.1 |  9185.52 |     ms |
|        100th percentile service time | update |    5564.62 |     17730.7 |    12166 |     ms |
|                           error rate | update |          0 |           0 |        0 |      % |
-------------------------------
[INFO] SUCCESS (took 1 seconds)
-------------------------------

I also captured the total counts for merge/refresh/flush (from the _stats endpoint, under _all/primaries) after the end of each benchmark:

soft_deletes_enabled: false:

"merges" : {
  "total" : 332,
},
"refresh" : {
  "total" : 463,
},
"flush" : {
  "total" : 96,
},

soft_deletes_enabled: true:

"merges" : {
  "total" : 188,
},
"refresh" : {
  "total" : 262,
},
"flush" : {
  "total" : 97,
},

With soft_deletes enabled:

- Time spent on merges showed a 12.6% decrease, and the median also decreased. Total refresh time was higher by 75%; however, looking at the median refresh time per shard, the impact with soft_deletes_enabled: true is approx 4%.
- Memory usage patterns differ with soft_deletes enabled, which can explain the reduced number of merge/refresh operations.
- Median indexing throughput showed a 4% increase with soft_deletes enabled, but this is likely due to run-to-run variation; more benchmark iterations would probably be needed to determine with certainty whether there are concrete improvements in indexing throughput.

Rally command:

esrally --keep-cluster-running --runtime-jdk=8 --car="4gheap" --target-hosts=192.168.14.3:39200 --pipeline=from-sources-complete --revision=4a825e2e86582fa2d693d64246c2fb6669f42fdb --track=http_logs --challenge=update --track-params=./custom_track_params.json

Additional parameters used (for contender "index.soft_deletes.enabled" was changed to true):

{
  "ingest_percentage": 100,
  "on_conflict": "index",
  "conflict-probability": 25,
  "recency": 0.75,
  "index_settings": {
    "index.number_of_shards": 1,
    "index.number_of_replicas": 0,
    "index.soft_deletes.enabled": false
  }
}

@s1monw
Contributor Author

s1monw commented Dec 13, 2018

@dliappis I take from these numbers that we are more memory efficient now with soft deletes, which I expected, and that this caused fewer refreshes. We are refreshing only about 56% as often compared to hard deletes, which means we are filling up buffers quite heavily, causing more load per refresh and resulting in the higher time spent in refresh, i.e. Max refresh time per shard (note this is not the ideal name for this; I think it is actually the max cumulative per-shard refresh time across all shards, @danielmitterdorfer correct me if I am wrong).

I personally think we can close this PR since the numbers look much better now. Yet there is one more experiment that I would like to run if possible: I would love to see how the numbers behave if we set index.search.idle.after: 1d, which basically means we will refresh every second. This should be set on both baseline and contender. @dliappis would you be able to run this experiment?

@dliappis
Contributor

> would love to see how numbers behave if we set index.search.idle.after: 1d which basically means we will refresh every second.

@s1monw I guess the setting value should be 1s rather than 1d, correct?

@s1monw
Contributor Author

s1monw commented Dec 14, 2018

@dliappis I meant 1d to make sure we don’t mark it search idle, so we have a number that is based on refreshing every second.
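Concretely, that just means adding the setting to the index_settings block of the params file above (a sketch, assuming the same params file is reused for both baseline and contender):

```json
"index_settings": {
  "index.number_of_shards": 1,
  "index.number_of_replicas": 0,
  "index.search.idle.after": "1d",
  "index.soft_deletes.enabled": false
}
```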

@dliappis
Contributor

Thanks for the clarification @s1monw

dliappis added a commit to dliappis/rally that referenced this pull request Dec 14, 2018
Use cumulative instead of total for summary reports and comparisons
and clarify that min/median/max are across primary shards.

Relates: elastic/elasticsearch#35594 (comment)
dliappis added a commit to elastic/rally that referenced this pull request Dec 14, 2018
Due to #608 it's likely we need to benchmark scenarios without using
the node-stats telemetry device. At the same time we want to get a
general idea of how many refreshes/merges/flushes happened (in total)
by accessing the index stats.

Add total count for merges/refresh/flush in summary output; this is
collected from `_all/primaries` in `_stats`.

Also note that these values are cumulative across primary shards, and clarify
the description for min/median/max in the summary report.

Finally, fix a bug where index stats with time/count == 0 were skipped
in the summary.

Closes #614

Relates: #615
Relates: elastic/elasticsearch#35594 (comment)
@dliappis
Contributor

dliappis commented Dec 14, 2018

Benchmark results using index.search.idle.after: 1d on both baseline (soft_deletes.enabled: false) and contender (soft_deletes.enabled: true):

(Note: the column names reflect the metric renaming done in elastic/rally#615, but they display the same fields as the ones in the previous results.)

$ rally compare --baseline=20181214T110634Z --contender=20181214T123444Z
|                                                     Metric |   Task |   Baseline |   Contender |     Diff |   Unit |
|-----------------------------------------------------------:|-------:|-----------:|------------:|---------:|-------:|
|                 Cumulative indexing time of primary shards |        |    308.177 |     326.459 |   18.282 |    min |
|          Min cumulative indexing time across primary shard |        |    2.00722 |      2.0033 | -0.00392 |    min |
|       Median cumulative indexing time across primary shard |        |    9.06298 |     10.1192 |  1.05623 |    min |
|          Max cumulative indexing time across primary shard |        |    255.239 |     270.216 |  14.9772 |    min |
|                    Cumulative merge time of primary shards |        |    121.077 |     121.541 |   0.4647 |    min |
|             Min cumulative merge time across primary shard |        |   0.280383 |    0.345317 |  0.06493 |    min |
|          Median cumulative merge time across primary shard |        |    2.07707 |     2.49417 |   0.4171 |    min |
|             Max cumulative merge time across primary shard |        |    107.752 |     106.739 | -1.01252 |    min |
|           Cumulative merge throttle time of primary shards |        |    57.6583 |      51.627 | -6.03135 |    min |
|    Min cumulative merge throttle time across primary shard |        |  0.0226667 |   0.0349667 |   0.0123 |    min |
| Median cumulative merge throttle time across primary shard |        |     0.5813 |     0.47885 | -0.10245 |    min |
|    Max cumulative merge throttle time across primary shard |        |    53.3289 |     48.1736 |  -5.1553 |    min |
|                  Cumulative refresh time of primary shards |        |    16.6727 |     42.8803 |  26.2076 |    min |
|           Min cumulative refresh time across primary shard |        |     0.1117 |      0.1885 |   0.0768 |    min |
|        Median cumulative refresh time across primary shard |        |      0.465 |     1.07048 |  0.60548 |    min |
|           Max cumulative refresh time across primary shard |        |    14.0625 |     36.9627 |  22.9001 |    min |
|                    Cumulative flush time of primary shards |        |    2.96627 |     6.26865 |  3.30238 |    min |
|             Min cumulative flush time across primary shard |        | 0.00248333 |      0.0009 | -0.00158 |    min |
|          Median cumulative flush time across primary shard |        |  0.0874333 |      0.1419 |  0.05447 |    min |
|             Max cumulative flush time across primary shard |        |    2.51173 |      5.4797 |  2.96797 |    min |
|                                         Total Young Gen GC |        |    157.059 |     154.685 |   -2.374 |      s |
|                                           Total Old Gen GC |        |      3.229 |        2.77 |   -0.459 |      s |
|                                                 Store size |        |    20.7935 |     22.5187 |  1.72515 |     GB |
|                                              Translog size |        |    3.74946 |     3.71403 | -0.03543 |     GB |
|                                            Totally written |        |    190.833 |     189.605 | -1.22774 |     GB |
|                                     Heap used for segments |        |    55.3721 |     51.2664 | -4.10573 |     MB |
|                                   Heap used for doc values |        |  0.0352287 |   0.0641937 |  0.02896 |     MB |
|                                        Heap used for terms |        |    45.3404 |     40.8353 | -4.50513 |     MB |
|                                        Heap used for norms |        | 0.00958252 |  0.00964355 |    6e-05 |     MB |
|                                       Heap used for points |        |    4.41825 |     4.56084 |  0.14258 |     MB |
|                                Heap used for stored fields |        |    5.56862 |     5.79642 |   0.2278 |     MB |
|                                              Segment count |        |        157 |         158 |        1 |        |
|                                             Min Throughput | update |    55747.5 |     53696.4 | -2051.15 | docs/s |
|                                          Median Throughput | update |    60765.5 |     59023.2 | -1742.27 | docs/s |
|                                             Max Throughput | update |    63406.6 |     63144.2 | -262.366 | docs/s |
|                                    50th percentile latency | update |    463.521 |     461.751 | -1.76977 |     ms |
|                                    90th percentile latency | update |    1893.82 |     1919.34 |  25.5193 |     ms |
|                                    99th percentile latency | update |    2495.46 |     2718.14 |  222.689 |     ms |
|                                  99.9th percentile latency | update |    2984.23 |     3777.08 |  792.846 |     ms |
|                                 99.99th percentile latency | update |    3484.53 |     4924.91 |  1440.38 |     ms |
|                                   100th percentile latency | update |    3771.14 |     6076.91 |  2305.77 |     ms |
|                               50th percentile service time | update |    463.521 |     461.751 | -1.76977 |     ms |
|                               90th percentile service time | update |    1893.82 |     1919.34 |  25.5193 |     ms |
|                               99th percentile service time | update |    2495.46 |     2718.14 |  222.689 |     ms |
|                             99.9th percentile service time | update |    2984.23 |     3777.08 |  792.846 |     ms |
|                            99.99th percentile service time | update |    3484.53 |     4924.91 |  1440.38 |     ms |
|                              100th percentile service time | update |    3771.14 |     6076.91 |  2305.77 |     ms |
|                                                 error rate | update |          0 |           0 |        0 |      % |


-------------------------------
[INFO] SUCCESS (took 0 seconds)
-------------------------------

Refresh/merge/flush counts:

baseline:

"merges" : {
  "total" : 2131,
},
"refresh" : {
  "total" : 3688,
},
"flush" : {
  "total" : 98,
},

contender:

"merges" : {
  "total" : 1364,
},
"refresh" : {
  "total" : 2364,
},
"flush" : {
  "total" : 98,
},
