Icinga2 2.11.2-1 HA Master sync overhead #7711

Closed
dmitriy-terzeman opened this issue Dec 13, 2019 · 9 comments
Labels
area/distributed Distributed monitoring (master, satellites, clients)

Comments

@dmitriy-terzeman
dmitriy-terzeman commented Dec 13, 2019

Describe the bug

An Icinga2 master in HA mode is about 3 times slower than an Icinga2 master in standalone mode.
Under high load I see memory leaks in the Icinga2 master HA setup; with the same or larger workloads, the leaks do not reproduce on an Icinga2 master in standalone mode.

To Reproduce

  1. Configure distributed setup with standalone master with 8 CPU cores/16 GB RAM
  2. Configure 500000 services with 1 minute frequency
  3. Collect memory/cpu metrics from it
  4. Collect "ApiListener, RelayQueue" related logs from /var/log/icinga2/icinga2.log
  5. Configure distributed setup with masters in HA mode with 8 CPU cores/16 GB RAM on each master
  6. Configure 500000 services with 1 minute frequency
  7. Collect memory/cpu metrics from it
  8. Collect "ApiListener, RelayQueue" related logs from /var/log/icinga2/icinga2.log
  9. Compare both clusters performance metrics and stability
  10. Compare "ApiListener, RelayQueue" message speed and queue on both installations

Test results, with a comparison and attached graphs/logs, are described in https://community.icinga.com/t/icinga2-at-large-scale/2178. The comparison between Test 3 and Test 5 is the most important.

Expected behavior

I expect nearly the same performance from an Icinga2 master in HA mode as I get in standalone mode.

Log comparison

"ApiListener, RelayQueue" related logs

HA mode, 300000 services, 1 min frequency:

[2019-11-29 02:55:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1, rate: 5216.35/s (312981/min 312981/5min 312981/15min);
[2019-11-29 02:55:10 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 80304, rate: 7876.15/s (472569/min 472569/5min 472569/15min); empty in 9 seconds
[2019-11-29 02:55:20 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 179869, rate: 10479.8/s (628785/min 628785/5min 628785/15min); empty in 18 seconds
[2019-11-29 02:55:30 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 283860, rate: 13006.9/s (780414/min 780414/5min 780414/15min); empty in 27 seconds
[2019-11-29 02:55:40 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 399336, rate: 15455.5/s (927329/min 927329/5min 927329/15min); empty in 34 seconds
[2019-11-29 02:55:50 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 518236, rate: 13332.3/s (799938/min 1071679/5min 1071679/15min); empty in 43 seconds
[2019-11-29 02:56:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 637251, rate: 14938.8/s (896329/min 1211154/5min 1211154/15min); empty in 53 seconds
[2019-11-29 02:56:10 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 756500, rate: 14645.1/s (878709/min 1354390/5min 1354390/15min); empty in 1 minute and 3 seconds
[2019-11-29 02:56:20 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 879603, rate: 14372.7/s (862363/min 1493768/5min 1493768/15min); empty in 1 minute and 11 seconds
[2019-11-29 02:56:30 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1014816, rate: 13901.6/s (834096/min 1616707/5min 1616707/15min); empty in 1 minute and 15 seconds
[2019-11-29 02:56:40 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1140154, rate: 13726/s (823560/min 1753195/5min 1753195/15min); empty in 1 minute and 30 seconds
[2019-11-29 02:56:50 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1262784, rate: 13699.8/s (821989/min 1895614/5min 1895614/15min); empty in 1 minute and 42 seconds
[2019-11-29 02:57:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1383192, rate: 13746.4/s (824783/min 2038166/5min 2038166/15min); empty in 1 minute and 54 seconds
[2019-11-29 02:57:10 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1499154, rate: 13778.1/s (826688/min 2183558/5min 2183558/15min); empty in 2 minutes and 9 seconds
[2019-11-29 02:57:20 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1615627, rate: 13874.6/s (832476/min 2328664/5min 2328664/15min); empty in 2 minutes and 18 seconds
[2019-11-29 02:57:30 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1732766, rate: 14126.9/s (847616/min 2466894/5min 2466894/15min); empty in 2 minutes and 27 seconds
[2019-11-29 02:57:40 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1856457, rate: 14151.9/s (849114/min 2604760/5min 2604760/15min); empty in 2 minutes and 30 seconds
[2019-11-29 02:57:50 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 1975574, rate: 14228.4/s (853703/min 2751306/5min 2751306/15min); empty in 2 minutes and 45 seconds
[2019-11-29 02:58:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2094517, rate: 14195.5/s (851733/min 2892344/5min 2892344/15min); empty in 2 minutes and 56 seconds
[2019-11-29 02:58:10 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2212559, rate: 14104.3/s (846259/min 3032364/5min 3032364/15min); empty in 3 minutes and 7 seconds
[2019-11-29 02:58:20 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2342521, rate: 13928.9/s (835732/min 3166993/5min 3166993/15min); empty in 3 minutes
[2019-11-29 02:58:30 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2461257, rate: 13863.3/s (831796/min 3300993/5min 3300993/15min); empty in 3 minutes and 27 seconds
[2019-11-29 02:58:40 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2583046, rate: 13924.9/s (835495/min 3443022/5min 3443022/15min); empty in 3 minutes and 32 seconds
[2019-11-29 02:58:50 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2700537, rate: 13973/s (838377/min 3592124/5min 3592124/15min); empty in 3 minutes and 49 seconds
[2019-11-29 02:59:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2815752, rate: 14053.7/s (843222/min 3737734/5min 3737734/15min); empty in 4 minutes and 4 seconds
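
To put a number on how fast the backlog builds in the HA log above, the sketch below takes only the first and last lines of that excerpt and divides the change in `items` by the elapsed time. The function name `net_growth_per_second` is made up for this illustration and is not part of Icinga 2; the timestamps and item counts are copied from the log lines.

```python
from datetime import datetime

# Timestamp layout used in the log lines above, e.g. "2019-11-29 02:55:00 -0700".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def net_growth_per_second(t0, items0, t1, items1):
    """Average change in queued items per second between two log samples.

    Hypothetical helper for this thread only, not an Icinga 2 API.
    """
    start = datetime.strptime(t0, LOG_TIME_FORMAT)
    end = datetime.strptime(t1, LOG_TIME_FORMAT)
    elapsed = (end - start).total_seconds()
    return (items1 - items0) / elapsed


# First and last samples of the HA excerpt (02:55:00 and 02:59:00).
growth = net_growth_per_second(
    "2019-11-29 02:55:00 -0700", 1,
    "2019-11-29 02:59:00 -0700", 2815752,
)
print(f"net queue growth: {growth:.0f} items/s")
```

Over these four minutes the queue grows by roughly 11.7 thousand items per second on average, which matches the steadily increasing "empty in" estimates in the log.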

Standalone mode, 500000 services, 1 min frequency:

[2019-12-12 06:24:02 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1341, rate: 42606.6/s (2556396/min 12794928/5min 38846066/15min);
[2019-12-12 06:24:12 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 179, rate: 42690.1/s (2561407/min 12807613/5min 38914856/15min);
[2019-12-12 06:24:22 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 42512.4/s (2550745/min 12792016/5min 38937682/15min);
[2019-12-12 06:24:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 10, rate: 42540.4/s (2552425/min 12805184/5min 38961697/15min);
[2019-12-12 06:24:52 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 42, rate: 42437.2/s (2546230/min 12800608/5min 39000197/15min);
[2019-12-12 06:25:22 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 24, rate: 42100.5/s (2526029/min 12795742/5min 39073217/15min);
[2019-12-12 06:25:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 12, rate: 41916.2/s (2514970/min 12796443/5min 39074377/15min);
[2019-12-12 06:25:52 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1, rate: 42035.8/s (2522148/min 12814652/5min 39099364/15min);
[2019-12-12 06:26:22 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 258, rate: 42451.9/s (2547115/min 12839445/5min 39172641/15min);
[2019-12-12 06:26:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1, rate: 42479.3/s (2548759/min 12844078/5min 39203415/15min);
[2019-12-12 06:26:42 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 271, rate: 42487/s (2549220/min 12821474/5min 39210678/15min);
[2019-12-12 06:26:52 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 40, rate: 42239.2/s (2534349/min 12802741/5min 39196745/15min);
[2019-12-12 06:27:02 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 2, rate: 42231/s (2533858/min 12786579/5min 39141632/15min);
[2019-12-12 06:27:22 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 202, rate: 41704.4/s (2502264/min 12766433/5min 39016238/15min);
[2019-12-12 06:27:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 27, rate: 41635.7/s (2498142/min 12758562/5min 38964173/15min);
[2019-12-12 06:28:12 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1, rate: 42651.8/s (2559105/min 12794018/5min 38824881/15min);
[2019-12-12 06:28:22 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1, rate: 42846.8/s (2570810/min 12801761/5min 38791080/15min);
[2019-12-12 06:28:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 5, rate: 42858.4/s (2571507/min 12788852/5min 38735522/15min);
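
For comparing the two installations (step 10 above), a small parser can pull `items` and `rate` out of the "ApiListener, RelayQueue" lines. This is a minimal sketch: the regular expression is written against the line format shown in this issue only, the helper names `parse_relay_queue` and `summarize` are hypothetical, and only two lines per mode are pasted in as sample input. In practice the lines would be read from /var/log/icinga2/icinga2.log.

```python
import re

# Matches the WorkQueue status lines as they appear in this issue.
LINE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] information/WorkQueue: #\d+ "
    r"\(ApiListener, RelayQueue\) items: (?P<items>\d+), "
    r"rate: (?P<rate>\d+(?:\.\d+)?)/s"
)


def parse_relay_queue(lines):
    """Return (timestamp, items, rate_per_second) for every matching line."""
    samples = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            samples.append(
                (match.group("ts"), int(match.group("items")), float(match.group("rate")))
            )
    return samples


def summarize(samples):
    """Peak queue length and mean reported rate over the parsed samples."""
    items = [sample[1] for sample in samples]
    rates = [sample[2] for sample in samples]
    return {
        "samples": len(samples),
        "max_items": max(items),
        "mean_rate": sum(rates) / len(rates),
    }


# Two sample lines per mode, copied from the logs above.
ha_lines = [
    "[2019-11-29 02:55:10 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 80304, rate: 7876.15/s (472569/min 472569/5min 472569/15min); empty in 9 seconds",
    "[2019-11-29 02:59:00 -0700] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 2815752, rate: 14053.7/s (843222/min 3737734/5min 3737734/15min); empty in 4 minutes and 4 seconds",
]
standalone_lines = [
    "[2019-12-12 06:24:02 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 1341, rate: 42606.6/s (2556396/min 12794928/5min 38846066/15min);",
    "[2019-12-12 06:28:32 -0700] information/WorkQueue: #5 (ApiListener, RelayQueue) items: 5, rate: 42858.4/s (2571507/min 12788852/5min 38735522/15min);",
]

for label, lines in (("HA", ha_lines), ("standalone", standalone_lines)):
    print(label, summarize(parse_relay_queue(lines)))
```

Feeding the full excerpts through the same functions gives the peak queue length and mean rate per mode, which is the comparison the logs are meant to support.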

Your Environment

icinga2 - The Icinga 2 network monitoring daemon (version: 2.11.2-1)

Copyright (c) 2012-2019 Icinga GmbH (https://icinga.com/)
License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl2.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

System information:
  Platform: CentOS Linux
  Platform version: 7 (Core)
  Kernel: Linux
  Kernel version: 3.10.0-1062.4.1.el7.x86_64
  Architecture: x86_64

Build information:
  Compiler: GNU 4.8.5
  Build host: runner-LTrJQZ9N-project-322-concurrent-0

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

icinga2 feature list

Disabled features: checker compatlog debuglog elasticsearch gelf graphite ido-mysql influxdb livestatus notification opentsdb perfdata statusdata syslog
Enabled features: api command mainlog
@dnsmichi
Contributor

Likely related to #7687 @lippserd @Al2Klimov

@Al2Klimov
Member

Well, each node queues its cluster messages in RAM and caches them on the HDD. The more messages per unit of time, the more traffic and I/O happens. And if your resources aren't enough for that traffic and I/O, your RAM and HDD usage grows (of course).

It's kind of the same as the "your DB isn't able to keep up" you get from IDO in setups that are too large. IMAO it's neither an actual memory leak nor related to the crash.

@dmitriy-terzeman
Author

I didn't see any disk usage or caching on HDD on the icinga2 masters during any of the tests (replay log disabled), only an increase in RAM usage.
I'm also seeing memory leaks in the HA setup while the secondary master is down. As I understand it, the app should drop old messages (those marked for cluster sync) and handle almost the same workload as a standalone instance.

@Al2Klimov
Member

By the way: On OOM Linux kills the waster, but the JSON-RPC crash is about a SEGV.

@dnsmichi added the area/distributed label (Distributed monitoring: master, satellites, clients) on Dec 18, 2019
@dnsmichi
Contributor

@lippserd observed a memory leak within the replay log handling, which is seemingly gone with the git master (and highly likely fixed with the JSON library update). This could play a role here.

@dmitriy-terzeman
Author

I'll be able to provide additional test results once 2.12 / 2.12 RC is released.

@Al2Klimov
Member

It's been released, please test it.

@Al2Klimov added the needs feedback label (We'll only proceed once we hear from you again) on Mar 18, 2020
@dmitriy-terzeman
Author

Hello @Al2Klimov, yeah, I already saw the blog post. Great news, and thanks for all the work that has been done.
I'm planning to start tests on Mon 23 and will share the results in 1-2 weeks.

@Al2Klimov
Member

IMAO the long lack of external feedback indicates that the feedback will never come. Therefore closing this one.

Feel free to re-open if the problem persists with the latest Icinga 2 version as long as you provide the desired information.

@Al2Klimov removed the needs feedback label (We'll only proceed once we hear from you again) on Sep 7, 2021