
Using InfluxDB output affects the test #182

Closed
geekdave opened this issue Apr 18, 2017 · 4 comments
Comments

@geekdave
Contributor

geekdave commented Apr 18, 2017

Maybe this is just the observer effect in action, but I thought I'd run it by the experts 😉

Summary

Turning on InfluxDB output seems to:

  1. reduce the total iterations in a given run
  2. increase latency (http_req_duration)

With InfluxDB output turned on

d@k2med:~/k6-v0.11.0-linux64$ ./k6 run --vus 50 --duration 10s --out influxdb=http://influxdb.xxxxxxx.com:8086/k6  http://punchingbag.xxxxxxx.com:8080

          /\      |‾‾|  /‾‾/  /‾/
     /\  /  \     |  |_/  /  / /
    /  \/    \    |   _  |  /  ‾‾\
   /          \   |  | \  \ | (_) |
  / __________ \  |__|  \__\ \___/  Welcome to k6 v0.11.0!

  execution: local
     output: influxdb (influxdb.xxxxxxx.com:8086)
     script: [cmdline] (url)

   duration: 10s, iterations: 0
        vus: 50, max: 50

    web ui: http://127.0.0.1:6565/

      done [==========================================================]        10s / 10s

    http_req_blocked......: avg=28.12µs, max=29.02ms, med=1.2µs, min=755ns, p90=2.13µs, p95=2.47µs
    http_req_connecting...: avg=1.32µs, max=8.76ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_duration.....: avg=5.77ms, max=219.84ms, med=3.35ms, min=450.81µs, p90=14.2ms, p95=19.74ms
    http_req_looking_up...: avg=24.64µs, max=22.57ms, med=0s, min=241.86µs, p90=0s, p95=0s
    http_req_receiving....: avg=172.49µs, max=53.8ms, med=44.54µs, min=27.08µs, p90=141.06µs, p95=246.64µs
    http_req_sending......: avg=32.72µs, max=29ms, med=9.5µs, min=5.67µs, p90=20.67µs, p95=26.85µs
    http_req_waiting......: avg=5.54ms, max=219.57ms, med=3.22ms, min=376.48µs, p90=13.59ms, p95=18.96ms
    http_reqs.............: 86361
    iterations............: 86361
    vus...................: 50
    vus_max...............: 50

With InfluxDB output turned off

d@k2med:~/k6-v0.11.0-linux64$ ./k6 run --vus 50 --duration 10s http://punchingbag.xxxxxxx.com:8080

          /\      |‾‾|  /‾‾/  /‾/
     /\  /  \     |  |_/  /  / /
    /  \/    \    |   _  |  /  ‾‾\
   /          \   |  | \  \ | (_) |
  / __________ \  |__|  \__\ \___/  Welcome to k6 v0.11.0!

  execution: local
     output: -
     script: [cmdline] (url)

   duration: 10s, iterations: 0
        vus: 50, max: 50

    web ui: http://127.0.0.1:6565/

      done [==========================================================]        10s / 10s

    http_req_blocked......: avg=17.31µs, max=23.35ms, med=1.17µs, min=734ns, p90=2µs, p95=2.39µs
    http_req_connecting...: avg=728ns, max=3.12ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_duration.....: avg=3.96ms, max=48.55ms, med=2.49ms, min=430.59µs, p90=8.73ms, p95=14.25ms
    http_req_looking_up...: avg=18.78µs, max=24.53ms, med=0s, min=0s, p90=0s, p95=0s
    http_req_receiving....: avg=165.59µs, max=25.16ms, med=41.03µs, min=27µs, p90=177.64µs, p95=316.99µs
    http_req_sending......: avg=24.2µs, max=24.45ms, med=9.16µs, min=5.34µs, p90=20.07µs, p95=36.15µs
    http_req_waiting......: avg=3.75ms, max=48.5ms, med=2.4ms, min=386.03µs, p90=7.98ms, p95=13.31ms
    http_reqs.............: 125726
    iterations............: 125726
    vus...................: 50
    vus_max...............: 50

Possible solutions

  1. Use a UDP protocol like statsd to reduce the overhead on the load generator
  2. Batch up the metrics into 1 sec buckets and only emit the aggregates for those buckets

Either of these would also significantly reduce the load on the InfluxDB server itself (mine is running on a micro instance and crashes under the load of 50 VUs).
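
For illustration, here's a rough Go sketch of what option 1 could look like on the sender side: fire-and-forget UDP in the statsd line format, so the load generator never waits on the metrics backend. The address, metric name and values below are made up for the example, not anything k6 does today.

    package main

    import (
        "fmt"
        "log"
        "net"
    )

    func main() {
        // Hypothetical statsd endpoint; nothing here is real k6 code.
        conn, err := net.Dial("udp", "127.0.0.1:8125")
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()

        // statsd timing metric: "<name>:<value>|ms". UDP is fire-and-forget,
        // so a slow or overloaded backend cannot slow the sender down
        // (at the cost of silently dropped packets).
        latencyMs := 6
        if _, err := fmt.Fprintf(conn, "k6.http_req_duration:%d|ms", latencyMs); err != nil {
            log.Println("metric dropped:", err)
        }
    }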

I'm happy to dive in and work on a solution, if it would be helpful. Thanks!

@liclac
Contributor

liclac commented Apr 19, 2017

Hm, interesting - InfluxDB collection does introduce a bit of load on the emitting system because every k6 sample has to be turned into an InfluxDB sample, but it shouldn't be nearly this noticeable - it spawns a separate goroutine that's shipped samples every ~10ms, completely out of band from any VU goroutines… but you may be onto something here.

The collector does batch samples up in 1s buckets, but that's a rather arbitrary number - perhaps tweaking it could yield better performance? UDP is an option as well, but as I recall, when we tried feeding InfluxDB samples that way it actually resulted in a significant number of them straight-up vanishing into the ether.

This is something I'm really interested in optimizing as much as possible, because it will be integral to clustered setups, but there are some other deadlines coming up before I can get on it - the InfluxDB code isn't nearly as optimized as it deserves to be… but I can certainly help in any way I can.
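
As a rough illustration of the out-of-band pattern described above (this is not the actual k6 collector code; all names are made up): VU goroutines hand samples to a buffered channel and never wait on the backend, while a separate goroutine drains the channel and flushes whatever accumulated once per second.

    package main

    import (
        "fmt"
        "time"
    )

    type sample struct {
        metric string
        value  float64
    }

    func main() {
        // Buffered channel: emitting a sample is just a channel send,
        // so VU goroutines don't block on the metrics backend.
        samples := make(chan sample, 10000)

        // Collector goroutine, out of band from the VUs: batch samples
        // and flush once per second (a real collector would write the
        // batch to InfluxDB here).
        go func() {
            var batch []sample
            flush := time.NewTicker(1 * time.Second)
            defer flush.Stop()
            for {
                select {
                case s := <-samples:
                    batch = append(batch, s)
                case <-flush.C:
                    if len(batch) > 0 {
                        fmt.Printf("flushing %d samples\n", len(batch))
                        batch = nil
                    }
                }
            }
        }()

        // A stand-in "VU" emitting samples.
        for i := 0; i < 3000; i++ {
            samples <- sample{metric: "http_req_duration", value: 3.35}
            time.Sleep(time.Millisecond)
        }
        time.Sleep(2 * time.Second) // let the collector flush before exiting
    }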

Just to be absolutely sure, check that the usual suspects aren't skewing the results here unnecessarily:

  • There are no lingering TIME_WAIT sockets (check ss -s on Linux)
  • The first test didn't populate caches that are now serving things faster to the second one
  • There are no cron jobs or similar messing things up

@geekdave
Contributor Author

geekdave commented May 8, 2017

@liclac Just revisiting this. It's worth noting that my InfluxDB server is currently residing on a very underpowered t2.micro instance. If Influx is slow to respond to each 1s bucket that's posted to it from the k6 client, would that explain the slowdown? I.e., is k6 blocking on responses from Influx in some way?

@ghost

ghost commented Jun 20, 2017

Currently I am trying to get an environment up on my MacBook Pro Retina 2015 (2.5GHz i7, 16GB RAM). I will try to execute a similar scenario and post my results here.

@jordandlaman

I think I am encountering a similar issue. When running my volume tests with InfluxDB plus Grafana/Chronograf visualizing the data, the test shows serious gaps in the requests per second. At first I believed I was being rate limited by Cognito, but we replaced Cognito with our own JWT authorizer Lambda and it is extremely performant. The other thing that happens is that the response times for all the endpoints start skyrocketing: they usually sit in the 90-150 ms range but balloon up to 20s across the board. I did not encounter this problem a few months ago while running tests, and I have had to disable outputting results to InfluxDB for the tests to return metrics worth anything. Luckily we have CloudWatch, but it isn't ideal since CloudWatch doesn't have insight into the response times of the endpoints outside of AWS.

mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write
took more than 1 second it would not start a second write and would
instead wait for the first one to finish. This generally led to the
write times getting bigger and bigger as more and more data was
written, until the maximum body size that InfluxDB accepts was reached,
at which point it would return an error and k6 would drop that data.

With this commit there is a configurable number of parallel writes
(10 by default) that are triggered every 1 second (also now
configurable); if those get exhausted, the samples are queued per
second instead of being combined into one big chunk that has a chance
of hitting the max body size.

I tested with a simple script doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB
of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and
practically all the data got lost unless the max body size was raised;
even then a lot of the data was lost while the memory usage kept
growing.

After this commit the RAM usage was ~2.4GB (or less in some of the
tests) with 6273 RPS and there was no loss of data.

Even with this commit, a 2-hour run of that simple script dies after
1 hour and 35 minutes using around 15GB (the test system has 16GB). I
can't be sure about data loss, as InfluxDB ate 32GB of memory trying to
visualize it.

Some minor problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem.
2. We could probably batch things better if we kept all the unsent
samples together.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be handled better with some more sophisticated
queueing code at a later time.

closes #1081, fixes #1100, fixes #182
mstoykov added a commit that referenced this issue Aug 13, 2019
Previously, k6 would write to InfluxDB every second, but if that write
took more than 1 second it would not start a second write and would
instead wait for it. This generally led to the write times getting
bigger and bigger as more and more data was written, until the maximum
body size that InfluxDB accepts was reached, at which point it would
return an error and k6 would drop that data.

With this commit there is a configurable number of parallel writes
(10 by default), starting again every 1 second (also now configurable).
Additionally, if we reach the 10 concurrent writes, instead of sending
all the data that accumulates we just queue the samples that were
generated. This should considerably help with not hitting the max body
size of InfluxDB.

I tested with a simple script doing batch requests for an empty local
file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB
of memory usage.

Previous to this commit the RAM usage was ~5.7GB for 5736 RPS, and
practically all the data got lost unless the max body size was raised;
even then a lot of the data was lost while the memory usage kept
growing.

After this commit the RAM usage was ~2.4GB (or less in some of the
tests) with 6273 RPS and there was no loss of data.

Even with this commit, a 2-hour run of that simple script dies after
1 hour and 35 minutes using around 15GB (the test system has 16GB). I
can't be sure about data loss, as InfluxDB ate 32GB of memory trying to
visualize it and I had to kill it ;(.

Some problems with this solution are that:
1. We use a lot of goroutines if things start slowing down - probably
not a big problem, but still a good idea to fix.
2. We could probably batch things better if we kept all the unsent
samples together and cut them into, say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is
stopped (with Ctrl+C) or finishes naturally, waiting for those writes
can take a considerable amount of time - in the above example the
4-minute tests generally took around 5 minutes :(

All of those can be handled better with some more sophisticated
queueing code at a later time.

closes #1081, fixes #1100, fixes #182
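
For readers following along, here is a hedged Go sketch of the strategy the commit message describes: a capped number of concurrent writes triggered every second, with per-second batches queued when all slots are busy. It is illustrative only; the names and types are not the actual k6 InfluxDB output code.

    package main

    import (
        "fmt"
        "time"
    )

    // batch stands in for one second's worth of InfluxDB line-protocol data.
    type batch []string

    func main() {
        const parallelWrites = 10 // configurable in the approach described above
        sem := make(chan struct{}, parallelWrites)
        var queued []batch // per-second batches waiting for a free write slot

        collect := func() batch { return batch{"http_req_duration value=3.35"} } // stand-in
        write := func(b batch) { time.Sleep(100 * time.Millisecond) }            // stand-in for the HTTP POST

        ticker := time.NewTicker(1 * time.Second)
        defer ticker.Stop()
        for range ticker.C { // runs until the process is interrupted
            queued = append(queued, collect()) // always queue this second's samples

            select {
            case sem <- struct{}{}: // a write slot is free
                toSend := queued
                queued = nil
                go func() {
                    defer func() { <-sem }()
                    for _, b := range toSend {
                        write(b) // each batch stays one second's worth, not one huge body
                    }
                }()
            default:
                // All writers are busy: keep the batches queued per second
                // instead of merging them into an ever-growing request body.
                fmt.Printf("%d batches queued, all %d writers busy\n",
                    len(queued), parallelWrites)
            }
        }
    }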