
Using InfluxDB output affects the test #182

Closed
geekdave opened this issue Apr 18, 2017 · 4 comments
Comments

@geekdave
Contributor

geekdave commented Apr 18, 2017

Maybe this is just the observer effect in action, but I thought I'd run it by the experts 😉

Summary

Turning on InfluxDB output seems to:

  1. reduce the total iterations in a given run
  2. increase latency (http_req_duration)

With InfluxDB output turned on

d@k2med:~/k6-v0.11.0-linux64$ ./k6 run --vus 50 --duration 10s --out influxdb=http://influxdb.xxxxxxx.com:8086/k6  http://punchingbag.xxxxxxx.com:8080

          /\      |‾‾|  /‾‾/  /‾/
     /\  /  \     |  |_/  /  / /
    /  \/    \    |   _  |  /  ‾‾\
   /          \   |  | \  \ | (_) |
  / __________ \  |__|  \__\ \___/  Welcome to k6 v0.11.0!

  execution: local
     output: influxdb (influxdb.xxxxxxx.com:8086)
     script: [cmdline] (url)

   duration: 10s, iterations: 0
        vus: 50, max: 50

    web ui: http://127.0.0.1:6565/

      done [==========================================================]        10s / 10s

    http_req_blocked......: avg=28.12µs, max=29.02ms, med=1.2µs, min=755ns, p90=2.13µs, p95=2.47µs
    http_req_connecting...: avg=1.32µs, max=8.76ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_duration.....: avg=5.77ms, max=219.84ms, med=3.35ms, min=450.81µs, p90=14.2ms, p95=19.74ms
    http_req_looking_up...: avg=24.64µs, max=22.57ms, med=0s, min=241.86µs, p90=0s, p95=0s
    http_req_receiving....: avg=172.49µs, max=53.8ms, med=44.54µs, min=27.08µs, p90=141.06µs, p95=246.64µs
    http_req_sending......: avg=32.72µs, max=29ms, med=9.5µs, min=5.67µs, p90=20.67µs, p95=26.85µs
    http_req_waiting......: avg=5.54ms, max=219.57ms, med=3.22ms, min=376.48µs, p90=13.59ms, p95=18.96ms
    http_reqs.............: 86361
    iterations............: 86361
    vus...................: 50
    vus_max...............: 50

With InfluxDB output turned off

d@k2med:~/k6-v0.11.0-linux64$ ./k6 run --vus 50 --duration 10s http://punchingbag.xxxxxxx.com:8080

          /\      |‾‾|  /‾‾/  /‾/
     /\  /  \     |  |_/  /  / /
    /  \/    \    |   _  |  /  ‾‾\
   /          \   |  | \  \ | (_) |
  / __________ \  |__|  \__\ \___/  Welcome to k6 v0.11.0!

  execution: local
     output: -
     script: [cmdline] (url)

   duration: 10s, iterations: 0
        vus: 50, max: 50

    web ui: http://127.0.0.1:6565/

      done [==========================================================]        10s / 10s

    http_req_blocked......: avg=17.31µs, max=23.35ms, med=1.17µs, min=734ns, p90=2µs, p95=2.39µs
    http_req_connecting...: avg=728ns, max=3.12ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_duration.....: avg=3.96ms, max=48.55ms, med=2.49ms, min=430.59µs, p90=8.73ms, p95=14.25ms
    http_req_looking_up...: avg=18.78µs, max=24.53ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_receiving....: avg=165.59µs, max=25.16ms, med=41.03µs, min=27µs, p90=177.64µs, p95=316.99µs
    http_req_sending......: avg=24.2µs, max=24.45ms, med=9.16µs, min=5.34µs, p90=20.07µs, p95=36.15µs
    http_req_waiting......: avg=3.75ms, max=48.5ms, med=2.4ms, min=386.03µs, p90=7.98ms, p95=13.31ms
    http_reqs.............: 125726
    iterations............: 125726
    vus...................: 50
    vus_max...............: 50

Possible solutions

  1. Use a UDP protocol like statsd to reduce the overhead on the load generator
  2. Batch up the metrics into 1 sec buckets and only emit the aggregates for those buckets

Either of these would also significantly reduce the load on the InfluxDB server itself (mine is running on a micro instance and crashes under the load of 50 VUs).
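
For illustration, here's a rough Go sketch of what option 1 could look like on the sender side: fire-and-forget UDP in the statsd line format, so the load generator never waits on the metrics backend. The address, metric name and values below are made up for the example, not anything k6 does today.

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Hypothetical statsd endpoint; nothing here is real k6 code.
        conn, err := net.Dial("udp", "127.0.0.1:8125")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // statsd timing metric: "<name>:<value>|ms". UDP is fire-and-forget,
        // so a slow or overloaded backend cannot slow the sender down
        // (at the cost of silently dropped packets).
        latencyMs := 6
        if _, err := fmt.Fprintf(conn, "k6.http_req_duration:%d|ms", latencyMs); err != nil {
            log.Println("metric dropped:", err)
        }
    }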

I'm happy to dive in and work on a solution, if it would be helpful. Thanks!

@liclac
Contributor

liclac commented Apr 19, 2017

Hm, interesting - InfluxDB collection does introduce a bit of load on the emitting system because every k6 sample has to be turned into an InfluxDB sample, but it shouldn't be nearly this noticeable - it spawns a separate goroutine that's shipped samples every ~10ms, completely out of band from any VU goroutines… but you may be onto something here.

The collector does batch samples up in 1s buckets, but that's a rather arbitrary number - perhaps tweaking it could yield better performance? UDP is an option as well, but as I recall, when we tried feeding InfluxDB samples that way it actually resulted in a significant number of them straight-up vanishing into the ether.

This is something I'm really interested in optimizing as much as possible, because it will be integral to clustered setups, but there are some other deadlines coming up before I can get on it - the InfluxDB code isn't nearly as optimized as it deserves to be… but I can certainly help in any way I can.
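
As a rough illustration of the out-of-band pattern described above (this is not the actual k6 collector code; all names are made up): VU goroutines hand samples to a buffered channel and never wait on the backend, while a separate goroutine drains the channel and flushes whatever accumulated once per second.

    package main

    import (
        "fmt"
        "time"
    )

    type sample struct {
        metric string
        value  float64
    }

    func main() {
        // Buffered channel: emitting a sample is just a channel send,
        // so VU goroutines don't block on the metrics backend.
        samples := make(chan sample, 10000)

        // Collector goroutine, out of band from the VUs: batch samples
        // and flush once per second (a real collector would write the
        // batch to InfluxDB here).
        go func() {
            var batch []sample
            flush := time.NewTicker(1 * time.Second)
            defer flush.Stop()
            for {
                select {
                case s := <-samples:
                    batch = append(batch, s)
                case <-flush.C:
                    if len(batch) > 0 {
                        fmt.Printf("flushing %d samples\n", len(batch))
                        batch = nil
                    }
                }
            }
        }()

        // A stand-in "VU" emitting samples.
        for i := 0; i < 3000; i++ {
            samples <- sample{metric: "http_req_duration", value: 3.35}
            time.Sleep(time.Millisecond)
        }
        time.Sleep(2 * time.Second) // let the collector flush before exiting
    }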

Just to be absolutely sure, check that the usual suspects aren't skewing the results here unnecessarily:

  • There are no lingering TIME_WAIT sockets (check ss -s on Linux)
  • The first test didn't populate caches that are now serving things faster to the second one
  • There are no cron jobs or similar messing things up

@geekdave
Contributor Author

geekdave commented May 8, 2017

@liclac Just revisiting this. It's worth noting that my InfluxDB server is currently residing on a very underpowered t2.micro instance. If Influx is slow to respond to each 1s bucket that's posted to it from the k6 client, would that explain the slowdown? I.e., is k6 blocking on responses from Influx in some way?

@ghost

ghost commented Jun 20, 2017

Currently I am trying to get an environment up on my MacBook Pro Retina 2015 (2.5GHz i7, 16GB RAM). I will try to execute a similar scenario and post my results here.

@jordandlaman

I think I am encountering a similar issue. When running my volume tests with InfluxDB plus Grafana/Chronograf visualizing the data, the test shows serious gaps in the requests per second. At first I believed I was being rate limited by Cognito, but we replaced Cognito with our own JWT authorizer Lambda and it is extremely performant. The other thing that happens is that the response times for all the endpoints start skyrocketing: they usually sit in the 90-150 ms range but balloon up to 20s across the board. I did not encounter this problem a few months ago while running tests, and I have had to disable outputting results to InfluxDB for the tests to return metrics worth anything. Luckily we have CloudWatch, but it isn't ideal since CloudWatch doesn't have insight into the response times of the endpoints outside of AWS.

mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write
took more than 1 second it would not start a second write and would
instead wait for the first one to finish. This generally led to the
write times getting bigger and bigger as more and more data was
written, until the maximum body size that InfluxDB accepts was reached,
at which point it would return an error and k6 would drop that data.

With this commit there is a configurable number of parallel writes
(10 by default) that are triggered every 1 second (also now
configurable); if those get exhausted, the samples are queued per
second instead of being combined into one big chunk that has a chance
of hitting the max body size.

I tested with a simple script doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB
of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and
practically all the data got lost unless the max body size was raised;
even then a lot of the data was lost while the memory usage kept
growing.

After this commit the RAM usage was ~2.4GB (or less in some of the
tests) with 6273 RPS and there was no loss of data.

Even with this commit, a 2-hour run of that simple script dies after
1 hour and 35 minutes using around 15GB (the test system has 16GB). I
can't be sure about data loss, as InfluxDB ate 32GB of memory trying to
visualize it.

Some minor problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem.
2. We could probably batch things better if we kept all the unsent
samples together.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be handled better with some more sophisticated
queueing code at a later time.

closes #1081, fixes #1100, fixes #182
mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write
took more than 1 second it would not start a second write and would
instead wait for it. This generally led to the write times getting
bigger and bigger as more and more data was written, until the maximum
body size that InfluxDB accepts was reached, at which point it would
return an error and k6 would drop that data.

With this commit there is a configurable number of parallel writes
(10 by default), starting again every 1 second (also now configurable).
Additionally, if we reach the 10 concurrent writes, instead of sending
all the data that accumulates we just queue the samples that were
generated. This should considerably help with not hitting the max body
size of InfluxDB.

I tested with a simple script doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB
of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and
practically all the data got lost unless the max body size was raised;
even then a lot of the data was lost while the memory usage kept
growing.

After this commit the RAM usage was ~2.4GB (or less in some of the
tests) with 6273 RPS and there was no loss of data.

Even with this commit, a 2-hour run of that simple script dies after
1 hour and 35 minutes using around 15GB (the test system has 16GB). I
can't be sure about data loss, as InfluxDB ate 32GB of memory trying to
visualize it and I had to kill it ;(.

Some problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem, but still a good idea to fix.
2. We could probably batch things better if we kept all the unsent
samples together and cut them into, say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be handled better with some more sophisticated
queueing code at a later time.

closes #1081, fixes #1100, fixes #182
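
For readers following along, here is a hedged Go sketch of the strategy the commit message describes: a capped number of concurrent writes triggered every second, with per-second batches queued when all slots are busy. It is illustrative only; the names and types are not the actual k6 InfluxDB output code.

    package main

    import (
        "fmt"
        "time"
    )

    // batch stands in for one second's worth of InfluxDB line-protocol data.
    type batch []string

    func main() {
        const parallelWrites = 10 // configurable in the approach described above
        sem := make(chan struct{}, parallelWrites)
        var queued []batch // per-second batches waiting for a free write slot

        collect := func() batch { return batch{"http_req_duration value=3.35"} } // stand-in
        write := func(b batch) { time.Sleep(100 * time.Millisecond) }            // stand-in for the HTTP POST

        ticker := time.NewTicker(1 * time.Second)
        defer ticker.Stop()
        for range ticker.C { // runs until the process is interrupted
            queued = append(queued, collect()) // always queue this second's samples

            select {
            case sem <- struct{}{}: // a write slot is free
                toSend := queued
                queued = nil
                go func() {
                    defer func() { <-sem }()
                    for _, b := range toSend {
                        write(b) // each batch stays one second's worth, not one huge body
                    }
                }()
            default:
                // All writers are busy: keep the batches queued per second
                // instead of merging them into an ever-growing request body.
                fmt.Printf("%d batches queued, all %d writers busy\n",
                    len(queued), parallelWrites)
            }
        }
    }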