Investigate telegraf integration in k6 #1064
A slight update to the vendor numbers above. I saw that their [prune] settings are:

```toml
[prune]
go-tests = true
unused-packages = true
non-go = true
```

... and re-run
After a quick look through some of the open k6 issues, if we have a nice and configurable telegraf output, we can:
As I mentioned above, the TOML telegraf configuration may be a bit tricky to integrate with k6 in a nice way, especially when it comes to bundling it in archives. However, I think that, overall, the actual configuration complexity it introduces would be orders of magnitude lower than the alternative. And by alternative, I mean k6 having a bespoke implementation (with a custom configuration) for each possible metric output that people may want... So, I think this would make #883 and maybe even #587 easier, especially if we decide to eventually deprecate some of our existing outputs like kafka and datadog. It will also make #743 easier, since we can just point people to the telegraf documentation for a lot of things.
As an addendum, I don't think we should replace every single metric output with telegraf. For example, the CSV output (#321 / #1067), once the PR is fixed, seems to be a good output to have natively in k6. It's simple, has no external dependencies, and since we want performance out of it, it's probably not worth the overhead to convert between our metrics and the telegraf ones... Moreover, I don't immediately see a way to do it with telegraf.
As mentioned in this discussion #478 (comment), we might investigate https://openmetrics.io/ as a base to work from. Connected issue: #858
For the record, I'm biased as a Prometheus and OpenMetrics contributor. 😄 The Prometheus/OpenMetrics data model and metrics output format are very widely supported by a variety of monitoring systems. Adopting Prometheus doesn't lock you into Prometheus. The format is already supported by InfluxDB/Telegraf, Datadog, Elastic, etc.
This isn't my job anymore, but please remember that k6's metrics pipeline is performance-critical, and doing this would essentially involve serialising metrics into one format, asking Telegraf to deserialise it, do its own processing on it, and then re-serialise it into another one. This is heavyweight even as a standalone process (I've seen the CPU graphs of a production workload running through it), but it's at least justified there; doing it in-process just seems roundabout to me. (#858 seems like a much more sensible way to do it if you want one output to rule them all; if you really like Telegraf, you can point OpenMetrics at its Prometheus ingress.)
The Prometheus/OpenMetrics format was designed for fast and cheap encoding and parsing. Originally, Prometheus used JSON for metrics but, like you said, handling metrics should be low overhead, and JSON was just too CPU intensive. The currently used format was created to reduce CPU use when encoding and decoding metrics. It's extremely efficient, so the overhead in Telegraf should be quite minimal.

Also, the Prometheus client_golang library is extremely efficient itself. Incrementing a counter is a 12-15 nanosecond CPU operation; histogram observations are about 25ns. InfluxDB was involved with creating OpenMetrics, so you can be assured that they're going to support it well.

The amount of data and overhead we're talking about is extremely small. While I understand that you would be concerned about the overhead of exchanging data, the amount we're talking about here is extremely tiny. If you're running into excess CPU use when handling metric samples, you may have other problems going on. IMO, going with telegraf as a library is a much heavier-weight option.
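For reference, a minimal Go sketch of the two client_golang operations whose cost is quoted above (counter increments and histogram observations). The metric names are made up for illustration and are not anything k6 actually registers.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	reqs = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "http_reqs_total", // hypothetical metric name
		Help: "Total HTTP requests made.",
	})
	reqDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "http_req_duration_seconds", // hypothetical metric name
		Help:    "HTTP request durations.",
		Buckets: prometheus.DefBuckets, // static buckets -> constant memory
	})
)

func main() {
	prometheus.MustRegister(reqs, reqDuration)

	start := time.Now()
	// ... perform a request ...
	reqs.Inc()                                       // the ~12-15ns counter increment
	reqDuration.Observe(time.Since(start).Seconds()) // the ~25ns histogram observation
}
```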
@liclac, thanks for chiming in! 🎊
There seems to be some misunderstanding here. We want to investigate using telegraf as a library, not as a standalone process, so we're not going to serialize metrics before we pass them on to it. For the simplest use cases, just using the telegraf outputs, the overhead should be just wrapping our metric samples.

If it was just about outputs, you'd probably be right that just using something like OpenMetrics would probably be better, especially in the long term. As I mentioned elsewhere, I have nothing against supporting OpenMetrics natively - it's worth evaluating, and it'd make sense to support it if the industry is headed in that direction. But this situation isn't either/or, we can happily do both... 😄

The biggest reason OpenMetrics by itself is not sufficient is that it's just another data format, which doesn't solve some of the problems we've had. The basic problem is that k6 produces a lot of data. Currently we emit at least 8 metric samples for every HTTP request we make. Soon, it may be 9, once we start tracking DNS times again... Double that (or more) when there are redirects... And as you know, this isn't counting other metrics that are emitted for every iteration. And while it's probably worth it, at least for filtering data, to implement something natively in k6 (#570), or to extend what we currently have, I also want to stress again that this is an investigation.

We likely won't pursue telegraf integration if it turns out that it's actually super heavy and affects running k6 tests too much. It's probably worth it to investigate other approaches as well... For example, if the real-time data processing turns out to be too heavy, we can evaluate dumping all k6 metrics on disk in an efficient format (binary or OpenMetrics or whatever) and then post-processing them after the load test is done, so we don't affect the actual test execution. Or something else...
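As an illustration of the "wrapping our metric samples" idea above, here is a rough Go sketch of converting a k6-style sample into the name/tags/fields/timestamp shape that telegraf outputs consume. Both the `Sample` and `LineMetric` types are simplified stand-ins, not the real k6 or telegraf definitions.

```go
package main

import (
	"fmt"
	"time"
)

// Simplified stand-in for a k6 metric sample.
type Sample struct {
	Metric string
	Time   time.Time
	Tags   map[string]string
	Value  float64
}

// Simplified stand-in for the name/tags/fields/time tuple telegraf works with.
type LineMetric struct {
	Name   string
	Tags   map[string]string
	Fields map[string]interface{}
	Time   time.Time
}

// wrapSample converts a k6-style sample into the telegraf-like shape.
func wrapSample(s Sample) LineMetric {
	return LineMetric{
		Name:   s.Metric,
		Tags:   s.Tags,
		Fields: map[string]interface{}{"value": s.Value},
		Time:   s.Time,
	}
}

func main() {
	s := Sample{
		Metric: "http_req_duration",
		Time:   time.Now(),
		Tags:   map[string]string{"method": "GET", "status": "200"},
		Value:  87.3, // milliseconds
	}
	fmt.Printf("%+v\n", wrapSample(s))
}
```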
I think I'm beginning to understand a bit more of what's going on here. What is currently being done seems to be using InfluxDB metrics to generate event sample logs, rather than metrics for monitoring k6 itself. These sound a lot more like event logs than metrics.

Part of the reason we created the Prometheus/OpenMetrics libraries the way they are is that each event inside the system doesn't generate any kind of visible output. We intentionally don't care about individual events, only counters of events. As you generate more and more events per second, you might start to consider throwing away the data from each event and recording it in a histogram datatype. You lose the individual event granularity, but as you scale up the traffic, the metric output stays constant. If you want all event data for deep analysis, you might consider a structured logging output.

We had the same issue in GitLab. As our traffic grew, our implementation of InfluxDB metric events for each request was overwhelming. We're phasing out our InfluxDB use and moving this data to JSON structured logs for deep analysis and Prometheus metrics for real-time monitoring.
Again, for us this isn't either/or 😄 For example, if you don't use any metric outputs and just rely on the end-of-test summary, we currently throw away most of the data, and once we move to using HDR histograms (#763) or something similar, we'll throw away all of the data and have a constant memory footprint.

But using histograms for some period (say, 1s, 10s, etc.) is a whole other kettle of fish 😄 Definitely worth investigating, but much more complicated. Also conveniently, telegraf has a histogram aggregator like that... See why I don't mind spending some time investigating what they're doing and maybe integrating what currently exists in k6 as a stop-gap until we have something better? 😄
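A toy Go sketch of the "histograms for some period" idea: keep only per-bucket counts for the current window and flush them periodically instead of retaining raw samples. The bucket bounds and the window handling here are arbitrary choices for illustration, not how k6 or telegraf's histogram aggregator actually work.

```go
package main

import (
	"fmt"
	"sort"
)

// windowedHist keeps only per-bucket counts, so memory stays constant no
// matter how many values are observed during a window.
type windowedHist struct {
	bounds []float64 // upper bucket bounds (e.g. response times in ms), sorted
	counts []uint64  // one counter per bound, plus a final overflow bucket
}

func newWindowedHist(bounds []float64) *windowedHist {
	return &windowedHist{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe finds the first bucket whose bound is >= v and increments it.
func (h *windowedHist) observe(v float64) {
	h.counts[sort.SearchFloat64s(h.bounds, v)]++
}

// flush returns the counts for the finished window and resets them.
func (h *windowedHist) flush() []uint64 {
	out := h.counts
	h.counts = make([]uint64, len(h.bounds)+1)
	return out
}

func main() {
	h := newWindowedHist([]float64{50, 100, 250, 500, 1000})
	for _, d := range []float64{42, 87, 310, 1200} { // stand-ins for measured durations (ms)
		h.observe(d)
	}
	// In a real output this would run on a 1s/10s ticker; here we flush once.
	fmt.Println("bucket counts for this window:", h.flush())
}
```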
This makes sense, and we also currently have something like this. The JSON output fills that gap (though it could probably use some performance improvements, even after we merge #1114), and we'll soon have a CSV output (#1067)...
I'm not sure what you mean by this. With histograms, it's the same as with normal counters: you always have every data point, but with histograms you keep things at bucketed granularity. Maybe you're thinking of storing percentile summaries that have a decay factor? These can't be aggregated, as the math doesn't work out. We support them in Prometheus, but we don't encourage them because of the aggregation issue. Prometheus supports arbitrary histograms, but not as efficiently as it could. There is some work in progress to directly support "HDR Histograms" in both Prometheus and OpenMetrics.
I'm also not sure exactly what you mean here, sorry 😄. Can you share a link with more information? I'm starting to get the impression that we're talking about slightly different things, or more likely, discussing similar things from very different perspectives...

I'll try to backtrack... a little bit (😅) and explain how metric values are generated and used in k6, as well as what is currently lacking from my perspective, leading up to this issue and others. Sorry for the wall of text... 😊

So, in k6, a lot of actions emit values for different metrics:
These are the different measurements k6 makes (or will soon make). Once they are measured, they currently have 3 possible purposes:
For a single k6 test run, a user may use any combination of the above 3 points, including none or all of them. And when we're discussing Prometheus/OpenMetrics, in my head it's only relevant to point 3 above, unless I'm missing something. What's more, it doesn't address one of the biggest issues k6 currently has with external outputs (point 3) - that k6 just produces too much data... All of the measurements from the first list, including all of their tags, are currently just directly sent to the external outputs. There's currently no way to filter (#570), restrict (#884 (comment)) or aggregate that data stream yet. As you can see from the linked issues, we plan to implement features like that.

Regarding Prometheus/OpenMetrics - on the one hand, as you say, it would probably reduce the CPU requirements for encoding all of the metrics, and it's basically becoming the standard format, which is nice, but it solves few of our actual problems 😄. On the other hand, the pull model of Prometheus (where AFAIK we have to expose an endpoint for Prometheus to pull the data, instead of us pushing it) seems like a very poor fit for k6. First, because we don't want to keep all of the raw data in k6's memory until something scrapes it, and also because k6 test runs usually aren't very long. I guess that's what the pushgateway is for, but it's still something that needs consideration...

Finally, to get back to HDR histograms and what I meant by "histograms for some period (say, 1s, 10s, etc.)". Currently, the implementation of the end-of-test summary stats and the thresholds (points 1 and 2 in the list above) is somewhat naive. First, to be able to calculate percentiles, it keeps all of the raw measurements in memory. But the thresholds have another restriction. You can delay their evaluation, and you can even filter by tags in them like this:

```javascript
import http from "k6/http";
import { check, sleep } from "k6";
export let options = {
thresholds: {
// We want the 99.9th percentile of all HTTP request durations to be less than 500ms
"http_req_duration": ["p(99.9)<500"],
// Requests with the staticAsset tag should finish even faster
"http_req_duration{staticAsset:yes}": ["p(99)<250"],
// Global failure rate should be less than 1%
"checks": ["rate<0.01"],
// Abort the test early if static file failures climb over 5%, but wait 10s to evaluate that
"checks{staticAsset:yes}": [
{ threshold: "rate<=0.05", abortOnFail: true, delayAbortEval: "10s" },
],
},
duration: "1m",
vus: 3,
};
export default function () {
let requests = [
["GET", "https://test.loadimpact.com/"],
["GET", "https://test.loadimpact.com/404", null, { tags: { staticAsset: "yes" } }],
["GET", "https://test.loadimpact.com/style.css", null, { tags: { staticAsset: "yes" } }],
["GET", "https://test.loadimpact.com/images/logo.png", null, { tags: { staticAsset: "yes" } }]
];
let responses = http.batch(requests);
requests.forEach((req, i) => {
check(responses[i], {
"status is 200": (resp) => resp.status === 200,
}, req[3] ? req[3].tags : {});
});
sleep(Math.random() * 2 + 1); // Random sleep between 1s and 3s
}
```

But they are only ever evaluated for all emitted metrics since the start of the test run - you can't restrict them to a time window. I think that's what you meant by "decay factor" above, but my understanding of the proper terms in this area is somewhat poor, so I might be mistaken. In any case, I created a separate issue (#1136) for tracking this potential feature, but we have a lot of work to do before we tackle it.
In Prometheus client libraries, we have two main "Observation" methods: "Histogram" and "Summary". When you observe a value with either one, the difference is that Summary tracks pre-computed percentile buckets like 50th, 90th, 99th, etc., while Histogram tracks static buckets like 0.001s, 0.05s, 1s, etc. With Histogram mode, the memory use is constant, as you just need a few float64 values to track each statically defined bucket.
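A small Go sketch contrasting the two observation types described above, using client_golang; the metric names, buckets, and objectives are arbitrary examples, not anything defined by k6.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	// Histogram: statically defined buckets, constant memory, aggregatable
	// across instances.
	durHist = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "req_duration_seconds",
		Help:    "Request durations (bucketed).",
		Buckets: []float64{0.001, 0.01, 0.05, 0.1, 0.5, 1, 5},
	})

	// Summary: pre-computed quantiles (50th/90th/99th) with an allowed error;
	// cheap to read, but quantiles from different instances can't be aggregated.
	durSummary = prometheus.NewSummary(prometheus.SummaryOpts{
		Name:       "req_duration_quantiles_seconds",
		Help:       "Request durations (pre-computed quantiles).",
		Objectives: map[float64]float64{0.5: 0.05, 0.9: 0.01, 0.99: 0.001},
	})
)

func main() {
	prometheus.MustRegister(durHist, durSummary)
	// Both are fed the same way by the calling code:
	durHist.Observe(0.042)
	durSummary.Observe(0.042)
}
```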
Prometheus metrics are always implemented as "now", so the typical concern with polling isn't valid, as the memory use is constant. There is no buffering. I find it's much easier to convert the pull-based counters into push-based data by having the client library set a push timer internally. This helps make the metric load for push-based systems more consistent.

Take your example of "every second, the number of active VUs": in a Prometheus-style metrics library, you would track the number of active VUs in real time. Prometheus could pull at whatever frequency, or you can set a push timer. From a code perspective, you only need to increment and decrement the active gauge; the client library handles the rest.
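A sketch of the "increment/decrement a gauge and let a push timer do the rest" approach, using client_golang's push package; the Pushgateway URL, job name, and gauge name are placeholders rather than anything k6 actually uses.

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

var activeVUs = prometheus.NewGauge(prometheus.GaugeOpts{
	Name: "active_vus", // hypothetical gauge for currently active VUs
	Help: "Currently active virtual users.",
})

func main() {
	// Push the current value on a timer instead of waiting to be scraped.
	pusher := push.New("http://localhost:9091", "k6_test_run").Collector(activeVUs)
	go func() {
		for range time.Tick(5 * time.Second) {
			_ = pusher.Push() // error handling elided in this sketch
		}
	}()

	// From the code's perspective, only increment/decrement is needed.
	activeVUs.Inc() // a VU starts
	defer activeVUs.Dec()

	time.Sleep(12 * time.Second) // stand-in for actual test execution
}
```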
Ah, thanks, this cleared up many misconceptions that I had... 😊

How do you deal with different metric value tags? I'm asking because we heavily annotate any metric values we measure during the execution of a k6 script, and it's impossible to predict the set of possible tag values for any given metric. For example, each of the values for the 8+ metric samples we emit per HTTP request comes with its own set of tags. For the InfluxDB output, we've somewhat handled the over-abundance of tags by making it configurable which ones are sent as tags and which as fields.
We use a metric vector data structure to allocate the mapping of labels, so having lots of label/tag options isn't a big problem here, as long as the tags aren't completely unlimited like client IPs, user IDs, etc. URLs should be shortened to eliminate any unique IDs. For example, I have a couple of apps that emit about 30-40k different metrics per instance of the app server. This is well within the limit of a single Prometheus server's capacity (10M metrics per server starts to be a memory bottleneck).
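For illustration, a Go sketch of the metric vector idea: labels are declared up front, each distinct label combination gets its own series, and cardinality stays bounded as long as label values (URL templates instead of raw URLs with unique IDs) are finite. The label names here only loosely mirror k6's tags.

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var reqDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "http_req_duration_seconds", // hypothetical metric name
		Help:    "HTTP request durations by tag combination.",
		Buckets: prometheus.DefBuckets,
	},
	// Keep label values bounded: URL templates/names, not raw URLs with IDs.
	[]string{"method", "status", "name"},
)

func main() {
	prometheus.MustRegister(reqDuration)

	// "/posts/{id}" instead of "/posts/12345" keeps the cardinality finite.
	reqDuration.WithLabelValues("GET", "200", "/posts/{id}").Observe(0.087)
	reqDuration.WithLabelValues("GET", "404", "/missing").Observe(0.013)
}
```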
Any news on this feature? It would be great to have other outputs like Prometheus or something else.
Sorry @piclemx, no progress on this or on a standalone Prometheus output yet. Add a 👍 and watch the issue for news; it will be referenced when we make some progress.
Hey guys, first of all thanks so much for the work on k6, it's an awesome project. ⭐ I have two questions regarding this even though I'm not, as others may be, eagerly expecting this feature. Instead, I'm more worried about the current functionality vs. what this addition might bring or take away.
I confess I don't know much about telegraf.
Furthermore, regarding the problems you describe about histograms: particularly when using Datadog this might not be as much of an issue, since Datadog supports a custom distribution metric type.
It's highly likely I'm missing something here. Apologies in advance if some assumption doesn't make sense. Also, I appreciate it may not be possible to accommodate every use case from the get-go. Just interested in understanding what the path may be moving forward. Thanks again for all your hard work on k6!
We haven't investigated things thoroughly yet, so nothing is decided. As a general rule, if we're going to deprecate something, it will take at least one k6 version (usually more) where we warn people about it, and usually there needs to be a good alternative that covers the same use cases. In this specific case, if we add a telegraf output natively, and if we decide to deprecate any current k6 outputs in favor of it, they will almost surely be available for at least one more version, in tandem with the telegraf output. That said, I'm not very familiar with Datadog, but if what you say is true, it's unlikely we'll deprecate the current k6 Datadog output in favor of a telegraf-based one, unless the latter reaches full feature parity with what we currently offer.
Just to add to the discussion, I did a small PoC with Telegraf, and indeed it works very well with k6. The only problem I found that could be a blocker is that, by aggregating the metrics before sending them upstream, you lose the percentile metrics, since the whole sample set isn't available to calculate them properly. But even for that, there is a solution being discussed in influxdata/telegraf#6440 that would implement the t-digest algorithm, which gives a very close percentile approximation online. Also, looking into the Telegraf project, I don't see it as an API/library that could be used in k6's code; for me, it would be better to recommend Telegraf as a sidecar tool to aggregate k6 metrics. If this is something worth doing, I would be glad to help document it or show some examples of how to integrate both tools.
Yeah, the possibility of using telegraf internally as a Go library is far from decided. On the one hand, the interfaces look simple enough that we might be able to reuse them. On the other hand, there are bound to be complications, and as I wrote above, "two main sticking points I foresee are the configuration and metric mismatches between it and k6". And, as you've mentioned, using it as a sidecar tool has been possible for a long time, basically since k6 could output metrics to InfluxDB, and isn't very inconvenient. It's not as efficient, but for smaller load tests, it should be good enough.
If you're willing, that'd be awesome! ❤️ Our docs are in a public repo; for example, here are the ones for the outputs. Every page in the docs should have a link for suggesting edits.
Hi @arukiidou, I do think it's now unlikely that we will ever put the whole of telegraf in k6. Since this was proposed, we now have output extensions, effectively letting people write an extension to output to whatever they want. This can also be used to make a telegraf output extension 🎉, though I doubt the k6 team itself will do it. Additionally, even if somebody makes such an extension and uses it, it might be better to just run telegraf on the side. In both cases there will be additional work, and the configuration will be the telegraf one (or a subset of it, I guess). This fairly old comment of mine showcases how to use it for a number of additional features we don't currently support as well.
After #1060 and #1032 (comment), I think it makes some sense to investigate potentially integrating telegraf in k6. And I don't just mean sending metrics from k6 to a telegraf process, since that should currently be possible with no extra changes in k6 - telegraf has an InfluxDB listener input plugin that k6 can send metrics to.
Rather, since telegraf is a Go program with a seemingly fairly modular and clean architecture, it may be worth investigating if we can't use parts of it as a library in k6. If we can figure out a way to use it, we'd pretty much have a very universal metrics output "for free":
The two main sticking points I foresee are the configuration and metric mismatches between it and k6. So, instead of a huge refactoring effort, my initial idea is to investigate if we can't just add a `telegraf` output in k6 that can accept any telegraf options. That is, instead of one new k6 output type per one telegraf output type, we could have a single "universal" k6 output type that could, via configuration, be used to filter/aggregate/output k6 metrics in every way telegraf supports. This way, we don't have to refactor a lot of k6 - we can transparently convert the k6 metrics to whatever telegraf expects.

The configuration would be trickier, since telegraf expects its configuration in a TOML format... And I'm not sure there's any way to change that, considering that even the simple `file` output just has `toml` struct tags and apparently that's enough, since the constructor just returns the empty struct (which I assume the config is unmarshaled into). We can try to convert JSON to TOML, though I don't think it's worth it, since the k6 outputs can't be configured from the exported script `options` yet anyway (#587). Instead, we probably should stick with the TOML config and just pass it via the CLI, like how the JSON output file is currently being specified: `k6 run script.js --out telegraf=my_telegraf.conf`, or something like that.

Another thing we should evaluate is how big of a dependency telegraf would be. The current repo has ~200k Go LoC, but its vendor has around 5 million... I think a lot of those would be dropped, since we won't need any of its 150+ input plugins and other things, but there's still a good chance that this dependency would actually be bigger than the rest of k6 😄 Even so, I don't think that would be a huge issue, since with the number of plugins it has, I assume that the base APIs are very stable... It's just something that we need to keep in mind, given that we vendor our dependencies in our repo (and they don't).
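To make the "single universal telegraf output" idea above more concrete, here is a very rough Go sketch of what such an output could look like from k6's side. Every name in it (`Sample`, `SampleSink`, `TelegrafOutput`, `parseTelegrafTOML`) is a hypothetical placeholder, not a real k6 or telegraf API, and the actual wiring into telegraf's plugin registry is left out.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical shape of a k6 metric sample the output would receive.
type Sample struct {
	Metric string
	Time   time.Time
	Tags   map[string]string
	Value  float64
}

// Hypothetical interface a wrapped telegraf output plugin would satisfy.
type SampleSink interface {
	Write(samples []Sample) error
	Close() error
}

// Hypothetical universal output: configured by a TOML file passed on the CLI
// (e.g. `--out telegraf=my_telegraf.conf`), it would build the configured
// telegraf processors/aggregators/outputs and forward converted samples to them.
type TelegrafOutput struct {
	configPath string
	sinks      []SampleSink
}

func NewTelegrafOutput(configPath string) (*TelegrafOutput, error) {
	// parseTelegrafTOML is a placeholder for "unmarshal the TOML into the
	// plugins' config structs and instantiate them via telegraf's registry".
	sinks, err := parseTelegrafTOML(configPath)
	if err != nil {
		return nil, fmt.Errorf("couldn't load telegraf config %q: %w", configPath, err)
	}
	return &TelegrafOutput{configPath: configPath, sinks: sinks}, nil
}

// Collect forwards a batch of converted k6 samples to every configured sink.
func (o *TelegrafOutput) Collect(samples []Sample) {
	for _, sink := range o.sinks {
		_ = sink.Write(samples) // error handling elided in this sketch
	}
}

func parseTelegrafTOML(path string) ([]SampleSink, error) {
	return nil, fmt.Errorf("not implemented in this sketch")
}

func main() {
	if _, err := NewTelegrafOutput("my_telegraf.conf"); err != nil {
		fmt.Println(err)
	}
}
```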