Using InfluxDB output affects the test #182
Hm, interesting - InfluxDB collection does introduce a bit of load on the emitting system because every k6 sample has to be turned into an InfluxDB sample, but it shouldn't be nearly this noticeable - it spawns a separate goroutine that's shipped samples every ~10ms, completely out of band from any VU goroutines… but you may be onto something here. The collector does batch samples up in 1s buckets, but that's a rather arbitrary number - perhaps tweaking it could yield better performance? UDP is an option as well, but as I recall, when we tried feeding InfluxDB samples that way it actually resulted in a significant number of them straight-up vanishing into the ether. This is something I'm really interested in optimizing as much as possible, because it will be integral to clustered setups, but there are some other deadlines coming up before I can get on it - the InfluxDB code isn't nearly as optimized as it deserves to be… but I can certainly help in any way I can. Just to be absolutely sure, check that the usual suspects aren't skewing the results here unnecessarily:
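For context, here is a minimal Go sketch of the out-of-band collection pattern described above, assuming a hypothetical Sample type and writeBatch helper rather than the actual k6 collector API: VU goroutines hand samples to a buffered channel, and a single collector goroutine drains it and flushes batches on a fixed interval (the "1s bucket" is just that interval).

```go
// Sketch only: Sample and writeBatch are hypothetical stand-ins,
// not the real k6 InfluxDB collector types.
package main

import (
	"log"
	"time"
)

type Sample struct {
	Metric string
	Value  float64
	Time   time.Time
}

// writeBatch stands in for the HTTP POST to InfluxDB's write endpoint.
func writeBatch(batch []Sample) error {
	log.Printf("writing %d samples", len(batch))
	return nil
}

func runCollector(in <-chan Sample, flushEvery time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(flushEvery)
	defer ticker.Stop()

	var buffer []Sample
	for {
		select {
		case s := <-in:
			// Buffering here keeps VU goroutines from blocking on network I/O.
			buffer = append(buffer, s)
		case <-ticker.C:
			if len(buffer) == 0 {
				continue
			}
			if err := writeBatch(buffer); err != nil {
				log.Printf("flush failed: %v", err)
			}
			buffer = nil
		case <-stop:
			// Final flush of anything still buffered.
			if len(buffer) > 0 {
				_ = writeBatch(buffer)
			}
			return
		}
	}
}

func main() {
	samples := make(chan Sample, 1000)
	stop := make(chan struct{})
	go runCollector(samples, time.Second, stop) // the batch/bucket interval

	for i := 0; i < 100; i++ {
		samples <- Sample{Metric: "http_req_duration", Value: float64(i), Time: time.Now()}
	}
	time.Sleep(1500 * time.Millisecond)
	close(stop)
}
```

Note that in this sketch the flush happens inline in the collector loop, so a slow InfluxDB server stalls the next flush; that blocking behaviour is exactly what the later comments in this thread run into.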
@liclac Just revisiting this. It's worth noting that my InfluxDB server is currently residing on a very underpowered t2.micro instance. If Influx is slow to respond to each 1s bucket that's posted to it from the k6 client, would that explain the slowdown? i.e. is k6 blocking on responses from Influx in some way?
Currently I am trying to get an environment up on my MacBook Pro Retina 2015 (2.5 GHz i7, 16GB RAM). I will try to execute a similar scenario and post my results here.
I think I am encountering a similar issue. When running my volume tests with InfluxDB plus Grafana/Chronograf visualizing the data, the test shows serious gaps in the requests per second. At first I believed I was being rate limited by Cognito, but we replaced Cognito with our own JWT authorizer Lambda and it is extremely performant. The other thing that happens is that the response time for all the endpoints starts skyrocketing: they usually sit in the 90-150 ms range but will balloon up to 20s across the board. I did not encounter this problem a few months ago while running tests, and I have had to disable outputting results to InfluxDB for the tests to return metrics worth anything. Luckily we have CloudWatch, but it isn't ideal, since CloudWatch doesn't have insight into the response times of endpoints outside of AWS.
Previously, k6 would write to InfluxDB every second, but if a write took more than one second it wouldn't start a second write and would instead wait for the first one to finish. This generally led to write times getting longer and longer as more and more data accumulated, until the maximum body size InfluxDB accepts was reached, at which point InfluxDB returned an error and k6 dropped that data. With this commit there is a configurable number of parallel writes (10 by default), triggered every 1 second (also now configurable). Additionally, if all concurrent writes are already in flight, instead of combining everything that has accumulated into one big request we just keep queueing the generated samples. This should considerably help with not hitting InfluxDB's maximum body size.

I tested with a simple script doing batch requests for an empty local file with 40 VUs. Without an output it was getting 8.1K RPS with 650MB of memory usage. Before this commit, RAM usage was ~5.7GB at 5736 RPS, and practically all the data gets lost if you don't raise the max body size; even then a lot of data is lost while memory usage keeps growing. After this commit, RAM usage was ~2.4GB (or less in some of the tests) at 6273 RPS, with no loss of data. Even with this commit, running that simple script for 2 hours dies after 1 hour and 35 minutes using around 15GB (the test system has 16). I can't be sure about data loss there, as InfluxDB ate 32GB of memory trying to visualize it and I had to kill it ;(.

Some problems with this solution:
1. We use a lot of goroutines if things start slowing down - probably not a big problem, but still a good idea to fix.
2. We could probably batch better if we kept all the unsent samples together and cut them into, say, 50k-sample chunks.
3. By far the biggest: because the writes are slow, if the test is stopped (with Ctrl+C) or finishes naturally, waiting for those writes can take a considerable amount of time - in the example above, the 4-minute tests generally took around 5 minutes :(

All of these can be handled better with more sophisticated queueing code at a later time.

closes #1081, fixes #1100, fixes #182
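For illustration, here is a minimal Go sketch of the scheme that commit message describes, again using hypothetical Sample and writeBatch stand-ins rather than the actual k6 InfluxDB output code: a ticker fires every push interval, at most ten writes run in parallel, and when every slot is busy the pending samples simply stay queued instead of being merged into one ever-growing request body.

```go
// Sketch only: Sample, writeBatch, and pusher are hypothetical stand-ins,
// not the real k6 output implementation.
package main

import (
	"log"
	"sync"
	"time"
)

type Sample struct {
	Metric string
	Value  float64
}

// writeBatch stands in for a single POST to InfluxDB's write endpoint.
func writeBatch(batch []Sample) error {
	time.Sleep(2 * time.Second) // simulate a slow InfluxDB server
	log.Printf("wrote %d samples", len(batch))
	return nil
}

type pusher struct {
	mu    sync.Mutex
	queue []Sample
	slots chan struct{} // semaphore limiting concurrent writes
}

func (p *pusher) add(s Sample) {
	p.mu.Lock()
	p.queue = append(p.queue, s)
	p.mu.Unlock()
}

// flush runs on every tick. If no write slot is free, the samples simply
// stay queued until a later tick, so a single request body never balloons.
func (p *pusher) flush() {
	select {
	case p.slots <- struct{}{}: // acquire a write slot
	default:
		return // all writers busy; keep queueing
	}
	p.mu.Lock()
	batch := p.queue
	p.queue = nil
	p.mu.Unlock()
	if len(batch) == 0 {
		<-p.slots
		return
	}
	go func() {
		defer func() { <-p.slots }() // release the slot when the write finishes
		if err := writeBatch(batch); err != nil {
			log.Printf("write failed: %v", err)
		}
	}()
}

func main() {
	p := &pusher{slots: make(chan struct{}, 10)} // 10 parallel writes by default
	ticker := time.NewTicker(time.Second)        // push interval, also configurable
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			p.flush()
		}
	}()

	for i := 0; i < 50000; i++ {
		p.add(Sample{Metric: "http_reqs", Value: 1})
	}
	// Problem 3 from the list above: a real shutdown has to wait for all
	// in-flight writes, which is what stretches a 4-minute test to ~5 minutes.
	time.Sleep(5 * time.Second)
}
```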
Maybe this is just the observer effect in action, but I thought I'd run it by the experts 😉
Summary
Turning on InfluxDB output seems to:
[charts: http_req_duration with InfluxDB output turned on vs. with InfluxDB output turned off]
Possible solutions
Either of these would also significantly reduce the load on the InfluxDB server itself (mine is running on a micro instance and crashes under the load of 50 VUs).
I'm happy to dive in and work on a solution, if it would be helpful. Thanks!