Improve performance with large Grafana installations #2

jangaraj · 2019-05-06T18:27:03Z

1.) Dashboard search runs serially
This is a slow approach for Grafana instances with many dashboards, e.g.:

2019-05-06 18:12:11,809 [grafana_wtf.core    ] INFO   : Found 1000 dashboards
2019-05-06 18:12:11,809 [grafana_wtf.core    ] INFO   : Fetching dashboards
 20%|███████████████████████▏                  | 198/1000 [00:56<05:12,  2.57it/s

Python multiprocessing (parallel runs) can increase search speed in this case. There should be a configuration option to limit the number of search processes; otherwise it can amount to a "DDoS attack".
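A minimal sketch of the bounded parallel fetching suggested above. It uses a thread pool rather than multiprocessing, on the assumption that the work is dominated by waiting on HTTP responses. `fetch_dashboard` and `fetch_all` are hypothetical names for illustration, not grafana-wtf functions, and the fetch itself is simulated.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_dashboard(uid):
    # Hypothetical stand-in for one HTTP request to Grafana;
    # a real implementation would call the dashboard API here.
    return {"uid": uid, "title": f"Dashboard {uid}"}


def fetch_all(uids, concurrency=5):
    # Cap the number of in-flight requests so the client does not
    # overwhelm the Grafana server.
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map keeps results in the same order as the input.
        return list(pool.map(fetch_dashboard, uids))


dashboards = fetch_all([f"uid-{i}" for i in range(20)], concurrency=4)
```

The `concurrency` argument is the knob that keeps parallel fetching from turning into an accidental denial of service.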

2.) Actually, the Grafana instance from the first example contains more than 1k dashboards, but the API has a default limit of 1k results. There should be a config/env variable to configure this limit.

Thanks.

amotl · 2019-05-07T17:19:53Z

Dear Jan,

thanks for using grafana-wtf and thank you very much for your suggestions. I recognize that, from the perspective of a Grafana installation that large, grafana-wtf in its current form is really just a start.

While it already has an appropriate caching subsystem which saves you from hitting Grafana each and every time to do searches on the JSON artefacts, the time-to-live is currently hardcoded to five minutes or so. This essentially renders it useless for your scenario, so you might not even have noticed that there is such machinery under the hood at all.

That said, I definitely second your suggestion about improving raw performance by parallelizing requests and will take this into account for the next iteration on the code.

Also thanks for pointing out the API limit which caps the maximum number of returned results to 1000 as you say. I assume there will be appropriate paging or offset/limit parameters then which should be unlocked by grafana-wtf?

Cheers,
Andreas.

jangaraj · 2019-05-07T17:57:58Z

Yes, there is a limit parameter: "Limit the number of returned results (max 5000)" (doc)
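For installations beyond a single request's cap, the paging idea raised above could look roughly like this. The sketch assumes a page-numbered search alongside the limit parameter; `search_page` is a hypothetical stand-in that simulates a server holding 1302 dashboards, and `search_all` is likewise an illustrative name, not part of grafana-wtf.

```python
def search_page(page, limit):
    # Hypothetical stand-in for one search request (pages are 1-based).
    # It simulates a server that holds 1302 dashboards.
    total = 1302
    start = (page - 1) * limit
    return [{"uid": f"uid-{i}"} for i in range(start, min(start + limit, total))]


def search_all(limit=5000, fetch=search_page):
    # Keep requesting pages until one comes back shorter than the limit,
    # which signals that the result set is exhausted.
    results = []
    page = 1
    while True:
        batch = fetch(page, limit)
        results.extend(batch)
        if len(batch) < limit:
            break
        page += 1
    return results
```

When the total happens to be an exact multiple of the limit, the loop issues one extra request that returns an empty page and then stops.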

amotl · 2019-05-07T20:08:49Z

Dear Jan,

we just released grafana-wtf-0.5.0 which might improve the situation for you slightly. When requesting dashboards, we are now using limit=5000 and there's a new --cache-ttl option you might want to play around with.

--cache-ttl accepts the cache expiration time in seconds as well as a special literal like --cache-ttl=infinite to turn off cache expiration altogether, essentially caching forever. On the other hand, using --cache-ttl=0 disables caching completely, so all resources are requested again on every run. The value still defaults to 300 seconds.

Increasing the cache expiration time might give you a more reasonable balance between freshness and waiting time; at least you are now in control.
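To illustrate the semantics described above (a number of seconds, a literal for "never expire", and 0 for "never cache"), here is a small sketch. `parse_cache_ttl` and `is_fresh` are hypothetical helpers, not the actual grafana-wtf implementation, and accepting both "inf" and "infinite" is an assumption.

```python
def parse_cache_ttl(value):
    # Returns None for "never expire", otherwise a non-negative
    # number of seconds. Accepting both spellings is an assumption.
    text = str(value).strip().lower()
    if text in ("inf", "infinite"):
        return None
    seconds = int(text)
    if seconds < 0:
        raise ValueError("cache TTL must not be negative")
    return seconds


def is_fresh(age_seconds, ttl):
    # A TTL of None keeps entries forever; a TTL of 0 makes every
    # entry stale immediately, which disables caching in effect.
    if ttl is None:
        return True
    return age_seconds < ttl
```

With the default of 300 seconds, an entry that is 299 seconds old is still served from the cache, while one that is 300 seconds old is fetched again.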

With kind regards,
Andreas.

amotl · 2019-05-07T22:26:50Z

Dear Jan,

we just added the --concurrency option which significantly improves performance. It defaults to "5" concurrent requests. Also, --debug has been improved to be able to watch the list of dashboard names grafana-wtf is downloading.

So, you might want to invoke grafana-wtf like

time grafana-wtf find '#299c46' --concurrency=20 --cache-ttl=inf --debug

to warm up your local cache by running twenty requests in parallel and to keep cache content forever. For subsequent invocations, things should be faster than before¹ while trading in a bit of data freshness.

We will be happy to hear about the outcome.

With kind regards,
Andreas.

amotl · 2019-05-07T23:42:14Z

¹ Saying that, the SQLite cache and the search method will probably soon become the bottlenecks when operating on the larger data set grafana-wtf will be able to ingest after improving the http transport layer.

Regarding the cache backend, we have been able to improve cache performance for Luftdatenpumpe by using Redis. Regarding the searching itself, we might try PyPy first and will likely have to move on to Go or Rust if this doesn't help.

Let me know if this works out reasonably for you and whether you see a chance that we can tune the current Python implementation to cope with installations holding as many dashboards as yours.

jangaraj · 2019-05-08T04:51:58Z

👍 Cold cache test:

$ time grafana-wtf --cache-ttl=600 --concurrency=50 find this-string-doesnt-exist
...
2019-05-08 04:36:54,312 [grafana_wtf.core      ] INFO   : Found 1302 dashboards
  0%|      | 0/1302 [00:00<?, ?it/s]2019-05-08 04:36:54,321 [grafana_wtf.core      ]
INFO   : Fetching dashboards in parallel with 50 concurrent requests
...
real    0m35.215s
user    0m7.111s
sys     0m9.171s

Warm cache for the same search:

real    0m5.284s
user    0m2.970s
sys     0m2.382s

Good job.

BTW: according to the docs, the user needs to use export GRAFANA_URL=https://daq.example.org/grafana/. But that trailing slash is a problem in my case (I use https://domain.org/). Could you please handle that in the code, so that the user can use a URL with or without a trailing slash?
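One way to make the trailing slash irrelevant is to normalize both sides before joining. This is only a sketch of the idea; `join_url` is a hypothetical helper and the `/api/search` path is used purely as an example.

```python
def join_url(base, path):
    # Strip slashes on both sides of the seam, then insert exactly one,
    # so the base URL works with or without a trailing slash.
    return base.rstrip("/") + "/" + path.lstrip("/")


with_slash = join_url("https://daq.example.org/grafana/", "/api/search")
without_slash = join_url("https://daq.example.org/grafana", "api/search")
```

Both calls produce the same URL, so the value of GRAFANA_URL no longer has to follow one particular convention.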

amotl · 2019-05-08T06:00:08Z

Happy to see this kind of speedup on the large installation you are operating there. Thanks for letting us know and enjoy your searches.

amotl changed the title ~~Performance~~ Improve performance with large Grafana installations May 7, 2019

amotl mentioned this issue May 7, 2019

Add support for result paging #3

Open

jangaraj closed this as completed May 8, 2019

amotl mentioned this issue May 8, 2019

Compensate for leading slash in API URL inserted by grafana_api #4

Closed
