Add support for collecting Kubernetes pod log file in nodes #2266
I have some ideas on how to tackle that and some experience with doing it via Fluentbit & co. Would be happy to work on the design and implementation |
@pmm-sumo great! Please go ahead, once you have the design please submit here and let's review it. |
Here's a link to the draft design doc and a gist with a sample config |
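For readers who can't follow the gist link, the approach being designed is built on the collector's filelog receiver pointed at the node's pod log directory. Below is a minimal sketch of that shape; the paths and options shown are illustrative assumptions, not the actual contents of the gist:

```yaml
receivers:
  filelog:
    # Kubernetes nodes keep container logs under /var/log/pods/<namespace>_<pod>_<uid>/<container>/
    include:
      - /var/log/pods/*/*/*.log
    # Record the file path so pod/namespace/container metadata can be derived from it later
    include_file_path: true
    start_at: beginning
```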
@pmm-sumo thank you for the research and the design. Here is what I think would be good to do so that the k8s support is complete:
|
The kudos should go to @sumo-drosiek, who did all the hard work here :) There are more details in the document, but we've started doing performance tests by simply tailing a local file with content such as (BTW, I would like these to become fully automated):
The preliminary results are as follows:
I am wondering if they should be considered OK for this phase, or if we should rather first optimize the pipeline. I think #2330 and a dedicated receiver could help. What do you think @tigrannajaryan @djaglowski ? |
@pmm-sumo can you also add CPU usage numbers for each case? The logs/sec look quite low to me. As a comparison, we can do 10,000 log records/second when reading a simple filelog while consuming only about 20% of CPU: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/10180/workflows/fcb249dc-bd31-4b20-b8b3-7d31956ccce3/jobs/85000 |
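For reference, the "simple filelog" baseline referenced here is essentially a bare tail-and-export pipeline with no Kubernetes-specific operators, roughly like the sketch below. The file path and exporter endpoint are placeholders, and the syntax follows a recent collector release rather than the exact version used in the linked CI job:

```yaml
receivers:
  filelog:
    include:
      - /tmp/perf-test.log   # placeholder for the file written by the load generator
    start_at: beginning

exporters:
  otlp:
    endpoint: localhost:4317  # placeholder backend endpoint
    tls:
      insecure: true

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp]
```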
@tigrannajaryan The tests should rather be considered relative to each other, to show the ratio between different configurations (autodetect k8s format, etc.). I will add some tests to the testbed |
@tigrannajaryan To summarize stanza's implementation of plugins:
|
@sumo-drosiek thank you for doing the tests. Just to make it clear: I want to understand how slow our k8s log reading implementation is compared to the potential maximum we could get if we hand-coded the parsing of k8s formats. The upper bound for that maximum is approximately what we get with the simplest filelog receiving test. I believe it is important to know how much worse our k8s solution is doing in comparison, so that we can then decide whether it makes sense to spend time on improving performance, because there are potentially large gains possible. |
@tigrannajaryan @djaglowski I performed more tests today, and here are the results: |
Summary (filelog): auto detection without any additional features already decreased performance. It's pretty strange to me that every operator in the pipeline decreases performance by ~30%. On the other hand, all of the operations are performed for each log. Configuration used for tests: https://github.com/sumo-drosiek/opentelemetry-collector-contrib/blob/2de61745a2ec906f5557c286bcd74fdf1435f6a7/testbed/datasenders/k8s.go |
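The tested configuration is not reproduced inline in this thread, but the features described (format autodetection, file path parsing, moving attributes to where they belong) roughly correspond to a pipeline like the sketch below. It uses the operator names and field syntax of a recent filelog receiver (`router`, `json_parser`, `regex_parser`, `move`); the syntax at the time of this discussion differed in places, and the paths, regexes, and attribute names are illustrative assumptions rather than the tested config:

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    start_at: beginning
    include_file_path: true
    operators:
      # Route each entry based on its on-disk format:
      #   Docker json-file:  {"log":"the line\n","stream":"stdout","time":"2021-02-16T21:09:00.123456789Z"}
      #   CRI (containerd):  2021-02-16T21:09:00.123456789Z stdout F the line
      - type: router
        id: get-format
        routes:
          - output: parser-docker
            expr: 'body matches "^\\{"'
          - output: parser-cri
            expr: 'body matches "^[^ ]+ (stdout|stderr)"'
      - type: json_parser
        id: parser-docker
        output: extract-metadata-from-filepath
      - type: regex_parser
        id: parser-cri
        regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
        output: extract-metadata-from-filepath
      # File names follow /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log
      - type: regex_parser
        id: extract-metadata-from-filepath
        parse_from: attributes["log.file.path"]
        regex: '^.*/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[a-f0-9\-]+)/(?P<container_name>[^/]+)/.*\.log$'
      # Put the extracted fields where they belong, as resource attributes
      - type: move
        from: attributes.namespace
        to: resource["k8s.namespace.name"]
      - type: move
        from: attributes.pod_name
        to: resource["k8s.pod.name"]
      - type: move
        from: attributes.container_name
        to: resource["k8s.container.name"]
```

Each operator in this chain runs per log entry, which is consistent with the observation above that every additional operator costs roughly 30% of throughput.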
@sumo-drosiek excellent, thank you for doing the tests. So, to make it clear: the most feature-rich version (which autodetects the format, correctly parses the file path, and puts attributes where they belong) performs about 3x slower than a simple filelog. This is not great, but it is not terrible. It is something we can work with, improve, and optimize. |
@djaglowski is this expected, or is there a better way to achieve this with less performance impact? |
@sumo-drosiek Thanks for the detailed findings. This will be very helpful in optimizing. I'm confident we can improve it quite a bit.

@tigrannajaryan I'm a little surprised, but I'm hopeful that much of this can be improved substantially and rapidly in the codebase. I don't think 30% is going to end up as a typical cost for most operators. Unfortunately, I'm not aware of any immediate config changes that would yield improvements.

Some details on how we might optimize the operators to improve this use case:

Autodetection, routing, if-condition evaluation, and filename parsing all boil down to regex evaluation. I don't think a lot of time has been spent optimizing this. There may be some low-hanging fruit there that would improve all of these in one go. Additionally, in all the cases above, we could almost eliminate the need to apply the regex by supporting a caching mechanism. Each filename should only need to be parsed once. Similarly, autodetection, routing, and if-condition evaluation are only needed once per file to determine if the file is CRI/Docker. Some more thought is needed on config design, but likely we can add a

The cost to move labels is surprising to me. I'm guessing there is an inefficient allocation that takes place on each entry. I suspect this can be improved quite a bit. In cases where each entry needs to be parsed by regex, we might expect a notable cost per operator, but I suspect we can improve from 30%. |
OK, to summarize: we have a solution for k8s pod logs that works and is not too bad performance-wise. @sumo-drosiek will you be able to submit a PR with your k8s perf tests to the repo and add a k8s collection config to the examples directory? I think we should use the most complete autodetect config even if it is the slowest. It would be nice to include perf numbers in the README of the example. Once we do that, we can consider the basic requirements of this issue met, and further improvements can be made as needed by filing separate issues. |
I was thinking about caching the regex too (which perhaps could be invalidated, e.g. when it doesn't match the line). Dan's idea makes a lot of sense to me, though I wasn't sure if it could be done without a dedicated operator (which would also hide the complexity of the config). In either case - after submitting the PR, do you think it's time to move on to Helm chart support @tigrannajaryan ? |
Yes, I think so. Great progress! :-) |
@pmm-sumo Hello, will you be working on logging helm chart support right away? I will have some bandwidth around next week. Let me know if you would like me to take on this task. |
@rockb1017 @tigrannajaryan we've been preparing to tackle #2536 with @sumo-drosiek (and we have some experience with a fairly large Helm chart we help develop). We'll be more than happy to cooperate too, or to do the review and focus on something else :) |
Notes from the SIG meeting:
|
The PR with the performance test (the most complex one so far) is ready (#2564). Should I add the rest of the tests to the testbed, or are we fine with the most complex one? I am going to add |
I would support adding the other scenarios as well. Since we understand there to be significant performance differences between the scenarios, they will be quite helpful for validating potential improvements (e.g. this issue). Perhaps later we can reduce the number of tests once the significant performance gaps between scenarios are closed. |
There is one more thing that I listed above that I think is very important to have:
This will be a correctness test, not a performance test. Ideally all our k8s tests should use the same configuration (the correctness test and the performance test). |
I think the worst case perf test is sufficient for now, no need to add other cases.
Great, please do! |
Oh, I see @djaglowski wants to see perf tests for other cases as well. I don't mind if you want to do it. |
I believe this is now done. Any additional requests can be filed as new issues. Closing this. |
We need to be able to collect logs from Kubernetes. It may be implemented as a separate receiver or as a filelog receiver with a plugin that defines the settings for collecting Kubernetes logs. This needs to be discussed.
What we want to collect:
It may be that 1 and 2 require different approaches, in which case this issue may be best split into 2 separate issues and addressed individually.
Depends on #2268