Add support for collecting Kubernetes pod log file in nodes #2266

Closed

tigrannajaryan opened this issue Feb 1, 2021 · 25 comments
@tigrannajaryan
Member

tigrannajaryan commented Feb 1, 2021

We need to be able to collect logs from Kubernetes. It may be implemented as a separate receiver or as a filelog receiver with a plugin that defines the settings for collecting Kubernetes logs. This needs to be discussed.

What we want to collect:

  1. Required: pod logs that Kubernetes makes available on the nodes.
  2. Need to decide whether we need: Kubernetes system component logs.

It may be that 1 and 2 require different approaches, in which case this issue may be best split into 2 separate issues and addressed individually.

Depends on #2268
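
For context, a minimal sketch of what collecting the pod logs with the existing filelog receiver might look like; the paths, parser, and exporter here are illustrative assumptions, not a settled design:

```yaml
receivers:
  filelog:
    # Pod log files that the container runtime writes on each node.
    include:
      - /var/log/pods/*/*/*.log
    # Expose the file path so pod/namespace/container metadata can be derived from it.
    include_file_path: true
    operators:
      # containerd/CRI-O write plain-text lines such as:
      #   2021-02-16T09:21:15.518430714Z stdout F example log line
      # Docker writes JSON lines instead, so a complete config also needs format detection.
      - type: regex_parser
        regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'

exporters:
  logging: {}

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]
```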

@pmm-sumo
Contributor

I have some ideas on how to tackle this and some experience with doing it via Fluentbit & co. I would be happy to work on the design and implementation.

@tigrannajaryan
Member Author

@pmm-sumo great! Please go ahead; once you have the design, please submit it here and let's review it.

@pmm-sumo
Contributor

pmm-sumo commented Feb 19, 2021

Here's a draft design doc link and a gist with a sample config.

@tigrannajaryan
Member Author

tigrannajaryan commented Feb 24, 2021

@pmm-sumo thank you for the research and the design. Here is what I think would be good to do so that the k8s support is complete:

  1. Since the design uses the filelog receiver, there is no new receiver type to be added. However, the config to collect k8s logs is quite complicated and needs to have automated tests of its own. I would want to see an automated test that uses the proposed configuration and collects logs from real files in this format, then verifies that the collected log pdata has the expected structure. It would be best to have testdata that contains golden log files in each of the supported container formats and unit tests that verify the parsing.

  2. Ideally we would want to have automated tests that verify the operation on a real k8s cluster with real pods running on it, producing logs, and with the actual Otelcontribcol deployed as a daemonset collecting the logs. This is likely not easy to do. Please think about it and see if there is a way to have such an automated test with reasonable effort.

  3. I also would like to see performance tests, preferably automated. Again, performance tests on a real k8s cluster would be great, but they may be difficult to automate. So perhaps what we can do is run performance tests manually on a real k8s cluster and then add automated performance tests that try to mimic k8s collection from a node but without the real k8s cluster (e.g. implement a file log generator that outputs log files mimicking the output of several pod logs, and then use the Collector to collect them). For automated perf tests we would want to use our testbed.
    If the performance turns out to be bad due to the complicated parsing configuration (e.g. it is several times slower than collecting comparable file logs with a simpler parser), then we may need to implement a k8s log receiver with a manually coded parser to get better performance.

  4. Assuming that the performance is good we still need to decide how we can hide the complexity of the configuration from the end user. One possibility is the plugins approach that Stanza uses. I would like to discuss this. (@djaglowski I haven't had a chance to properly look into how Stanza implements plugins. Can you please point me to the relevant docs/code that I can read to understand it?)

@pmm-sumo
Contributor

pmm-sumo commented Feb 25, 2021

The kudos should go to @sumo-drosiek, who did all the hard work here :)

There are more details in the document, but we've started doing performance tests by simply tailing a local file with content such as the following (BTW, I would like these tests to become fully automated):

2021-02-16T09:21:15.518430714Z stdout F example: 0 troublemakers unluxuriantly Lapointe unheroical synthesize
2021-02-16T09:21:16.519721603Z stdout F example: 1 basement's nonbulbar trellis-sheltered thesaurismosis snakeflower
2021-02-16T09:21:17.521307236Z stdout F example: 2 Owasco cornbrash subchannels actinocutitis abjudicate
...

The preliminary results are as follows:

Processed in total 1018334 logs in 300s: 3387.013557 logs/sec - autodetection (~0.36MB/s)
Processed in total 1337153 logs in 299s: 4460.928373 logs/sec - containerd directly (~0.47MB/s)
Processed in total 1784265 logs in 299s: 5959.308995 logs/sec - without moving attributes (log parser + file path parser)
Processed in total 5604660 logs in 299s: 18694.579597 logs/sec - only log parser

I am wondering whether these should be considered OK for this phase, or whether we should first optimize the pipeline. I think #2330 and a dedicated receiver could help. What do you think @tigrannajaryan @djaglowski ?
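
For reference, the difference between the scenarios above is roughly this: "containerd directly" applies a single regex_parser for the format shown in the sample, while "autodetection" first routes each entry by format. A sketch of the autodetection part, assuming a router operator (the exact expression syntax varies by operator version and is an approximation here):

```yaml
operators:
  # Decide, per entry, which container runtime format the line is in.
  - type: router
    routes:
      # Docker writes JSON lines such as {"log":"...","stream":"stdout","time":"..."}
      - expr: 'body matches "^\\{"'
        output: parser-docker
      # containerd writes the plain-text CRI format shown in the sample above.
      - expr: 'body matches "^[^ Z]+Z"'
        output: parser-containerd
  - type: json_parser
    id: parser-docker
  - type: regex_parser
    id: parser-containerd
    regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
```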

@tigrannajaryan
Member Author

@pmm-sumo can you also add CPU usage numbers for each case?

The logs/sec look quite low to me. As a comparison, we can do 10,000 log records/second when reading a simple filelog while only consuming about 20% of CPU: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/10180/workflows/fcb249dc-bd31-4b20-b8b3-7d31956ccce3/jobs/85000

@sumo-drosiek
Member

sumo-drosiek commented Feb 25, 2021

@tigrannajaryan The tests should rather be considered relative to each other, i.e. they show the ratio between different configurations (autodetect k8s format, etc.).

I will add some tests to `make e2e-test` 👍

@djaglowski
Member

@tigrannajaryan To summarize Stanza's implementation of plugins (a rough sketch follows the list):

  • The functionality primarily depends on Go's text/template package. Here is where plugins are actually rendered.
  • Plugin files are expected to be .yaml, even though they are technically not yaml until rendered. The .yaml naming convention was chosen over alternatives (such as .yaml.template) in order to allow IDEs to apply syntax highlighting. This works quite well in most cases.
  • There is one custom function, default, that was added to the templating schema. This allows setting default values, as seen here.
  • Stanza expects that each operator has a unique id. In plugins this is solved by prepending the plugin's own id to the ids of the operators that it contains (my_regex_parser becomes my_plugin_instance.my_regex_parser).
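
A rough sketch of what such a plugin file might look like, based on the points above; the file layout and the default function usage are approximations, not copied from a real plugin:

```yaml
# my_plugin.yaml - a Go text/template that renders into a stanza operator pipeline.
version: 0.0.1
title: My Plugin
description: Tails a file and parses each line with a regex.
parameters:
  - name: log_path
    type: string
pipeline:
  - type: file_input
    include:
      # The custom default function supplies a fallback when the user does not set log_path.
      - '{{ .log_path | default "/var/log/example.log" }}'
  - type: regex_parser
    # Rendered ids get the plugin instance id prepended, e.g. my_plugin_instance.my_regex_parser.
    id: my_regex_parser
    regex: '^(?P<time>\S+) (?P<message>.*)$'
```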

@tigrannajaryan
Member Author

@sumo-drosiek thank you for doing the tests.

Just to make it clear: I want to understand how slow our k8s log reading implementation is compared to the potential maximum we could get if we hand-coded the parsing of k8s formats. The upper bound for that maximum is approximately what we get with the simplest filelog receiving test. I believe it is important to know how much worse our k8s solution is doing in comparison, so that we can then decide whether it makes sense to spend time on improving performance, because there are potentially large gains possible.

@sumo-drosiek
Member

@tigrannajaryan @djaglowski I performed more tests today, and here are the results:


**Table**

Test                                    |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|Received Items/s|
----------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|--------------:|
LogMaxDPS/OTLP                          |PASS  |     15s|   133.5|   134.3|         81|         87|   3624300|       3624300|         241620|
LogMaxDPS/filelog                       |FAIL  |     26s|   173.7|   179.0|         76|         86|   4358100|       3879882|         149226|
LogMaxDPS/k8d_containerd                |FAIL  |     26s|   177.0|   180.0|         78|         89|   4583000|       3386024|         130231|
LogMaxDPS/k8s_auto_detection            |FAIL  |     26s|   157.2|   168.7|         83|         98|   4199000|        930119|          35773|
LogMaxDPS/k8s_auto_detection_routed     |FAIL  |     26s|   152.3|   157.7|         84|         96|   4264800|       1029652|          39602|
LogMaxDPS/k8s_auto_routed_only          |FAIL  |     26s|   165.2|   173.3|         80|         91|   4266900|       2247512|          86442|
LogMaxDPS/k8s_auto_routed_filepath_only |FAIL  |     26s|   157.1|   160.3|         81|         92|   4529400|       1482458|          57017|
Log10kDPS/OTLP                          |PASS  |     16s|    16.5|    18.7|         60|         74|    149900|        149900|           9993|
Log10kDPS/filelog                       |PASS  |     15s|    16.4|    17.7|         61|         75|    150000|        150000|          10000|
Log10kDPS/k8d_containerd                |PASS  |     15s|    19.3|    21.0|         63|         77|    149900|        149900|           9993|
Log10kDPS/k8s_auto_detection            |PASS  |     15s|    44.3|    46.0|         65|         80|    149900|        149900|           9993|
Log10kDPS/k8s_auto_detection_routed     |PASS  |     15s|    37.1|    37.7|         65|         79|    150000|        150000|          10000|
Log10kDPS/k8s_auto_routed_only          |PASS  |     15s|    22.6|    23.3|         62|         76|    150000|        150000|          10000|
Log10kDPS/k8s_auto_routed_filepath_only |PASS  |     15s|    31.0|    32.3|         65|         79|    150000|        150000|          10000|

**Summary**

filelog: 149226 lps
containerd regex: 130231 lps
auto detection (using ifs): 35773 lps
auto detection (using routed): 39602 lps
auto detection (using routed; without parsing filepath; without moving attributes): 86442 lps
auto detection (using routed; with parsing filepath; without moving attributes): 57017 lps


**Description**

Auto detection without any additional features decreased performance from 130231 (which is similar to the pure filelog test) to 86442 (-33%). In addition, extracting metadata from the file path decreased it to 57017 (-34%), and additional operations (moving labels to attributes) decreased it further to 39602 (-30%).

It's pretty strange to me that every operator in the pipeline decreases performance by ~30%. On the other hand, all of these operations are performed for each log.


**Sources**

Configuration used for tests: https://github.com/sumo-drosiek/opentelemetry-collector-contrib/blob/2de61745a2ec906f5557c286bcd74fdf1435f6a7/testbed/datasenders/k8s.go
Tests implementation: https://github.com/sumo-drosiek/opentelemetry-collector-contrib/blob/drosiek-stanza/testbed/tests_unstable_exe/log_test.go

@tigrannajaryan
Member Author

@sumo-drosiek excellent, thank you for doing the tests.

So, to make it clear: the most feature-rich version, which autodetects the format, correctly parses the file path, and puts attributes where they belong, performs about 3x slower than a simple filelog. This is not great, but it is not terrible. It is something we can work with, improve, and optimize.

@tigrannajaryan
Member Author

> It's pretty strange to me that every operator in the pipeline decreases performance by ~30%. On the other hand, all of these operations are performed for each log.

@djaglowski is this expected, or is there a better way to achieve this with less performance impact?

@djaglowski
Member

@sumo-drosiek Thanks for the detailed findings. This will be very helpful in optimizing. I'm confident we can improve it quite a bit.

@tigrannajaryan I'm a little surprised, but I'm hopeful that much of this can be improved substantially and rapidly in the codebase. I don't think 30% is going to end up as a typical cost for most operators.

Unfortunately, I'm not aware of any immediate config changes that would yield improvements.


Some details on how we might optimize the operators to improve this use case:

Autodetection, routing, if condition evaluation, and filename parsing all boil down to regex evaluation. I don't think a lot of time has been spent optimizing this. There may be some low hanging fruit there that would improve all of these in one go.

Additionally, in all the cases above, we could almost eliminate the need to apply the regex by supporting a caching mechanism. Each filename should only need to be parsed once. Similarly, autodetection, routing, and if condition evaluation are only needed once per file to determine whether the file is CRI/Docker. Some more thought is needed on config design, but likely we can add a cacheable: true field to regex_parser. If nothing else, we could build a dedicated k8s_file_input operator that includes such a caching mechanism. Either way, we should be able to eliminate much of the loss here. (A rough sketch of this idea is at the end of this comment.)

The cost to move labels is surprising to me. I'm guessing there is an inefficient allocation that takes place on each entry. I suspect this can be improved quite a bit.


In cases where each entry needs to be parsed by regex, we might expect a notable cost per operator, but I suspect we can improve from 30%.
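
For illustration, the cacheable field proposed above might eventually look something like this; the field and the parse_from reference are hypothetical, and the path layout assumed is the standard /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log:

```yaml
operators:
  # The file path only needs to be parsed once per file, not once per log entry.
  - type: regex_parser
    parse_from: file_path   # field reference is an assumption; depends on operator syntax
    regex: '^.*/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^/]+)/(?P<container_name>[^/]+)/.*\.log$'
    cacheable: true         # hypothetical field from the proposal above; cache keyed by the input value
```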

@tigrannajaryan
Member Author

OK, to summarize: we have a solution for k8s pod log that works and is not too bad performance-wise.

@sumo-drosiek will you be able to submit a PR with your k8s perf tests to the repo and add a k8s collection config to the examples directory? I think we should use the most complete autodetect config even if it is the slowest. It would be nice to include the perf numbers in the README of the example.

Once we do that we can consider that the basic requirements of this issue are met and further improvements can be done as needed by filing separate issues.

@pmm-sumo
Contributor

pmm-sumo commented Mar 1, 2021

I was thinking about caching the regex too (the cache could perhaps be invalidated, e.g. when it doesn't match the line). Dan's idea makes a lot of sense to me, though I wasn't sure if it could be done without a dedicated operator (which would also hide the complexity of the config).

In either case, after submitting the PR, do you think it's time to move on to Helm chart support, @tigrannajaryan?

@tigrannajaryan
Member Author

> In either case, after submitting the PR, do you think it's time to move on to Helm chart support, @tigrannajaryan?

Yes, I think so. Great progress! :-)

@rockb1017
Contributor

@pmm-sumo Hello, will you be working on logging Helm chart support right away? I will have some bandwidth around next week. Let me know if you would like me to take on this task.

@tigrannajaryan tigrannajaryan added this to the Basic Logs Support milestone Mar 3, 2021
@pmm-sumo
Contributor

pmm-sumo commented Mar 3, 2021

@rockb1017 @tigrannajaryan we've been preparing to tackle #2536 with @sumo-drosiek (and we have some experience with a fairly large Helm chart whose development we help with). We will be more than happy to cooperate too, or to do the review and focus on something else :)

@tigrannajaryan
Member Author

Notes from the SIG meeting:

  1. Let's make sure we have a reliable way to exclude our own logs.
  2. It would be great to put the log record in the Body as a string, instead of putting it inside a map with a "log" property (see the sketch below).
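
Both notes map naturally onto the filelog config. A sketch, assuming the collector's own pods are named otel-collector-* and that the parsed log line ends up in a "log" attribute (the operator names and field references are approximate):

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    # Note 1: exclude the collector's own pod logs to avoid a feedback loop.
    exclude:
      - /var/log/pods/*_otel-collector-*/*/*.log
    operators:
      - type: regex_parser
        regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
      # Note 2: keep the log line itself as the string Body instead of leaving it under "log".
      - type: move
        from: attributes.log
        to: body
```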

@sumo-drosiek
Member

sumo-drosiek commented Mar 5, 2021

The PR with the performance test (the most complex one for now) is ready (#2564). Should I add the rest of the tests to the testbed, or are we fine with the most complex one?

I am going to add a README.md, an example, and more tests eventually in another PR, as I want to unblock Helm chart development ASAP.

@djaglowski
Member

djaglowski commented Mar 5, 2021

I would support adding the other scenarios as well. Since we understand there to be significant performance differences between the scenarios, it will be quite helpful for validating potential improvements (e.g. this issue).

Perhaps later we could reduce the number of tests once the significant performance gaps between scenarios are closed.

@tigrannajaryan
Member Author

There is one more thing that I listed above that I think is very important to have:

> the config to collect k8s logs is quite complicated and needs to have automated tests of its own. I would want to see an automated test that uses the proposed configuration and collects logs from real files in this format, then verifies that the collected log pdata has the expected structure. It would be best to have testdata that contains golden log files in each of the supported container formats and unit tests that verify the parsing.

This will be a correctness test, not a performance test.

Ideally all our k8s tests should use the same configuration (the correctness test and the performance test).

@tigrannajaryan
Member Author

> The PR with the performance test (the most complex one for now) is ready (#2564). Should I add the rest of the tests to the testbed, or are we fine with the most complex one?

I think the worst case perf test is sufficient for now, no need to add other cases.

> I am going to add a README.md, an example, and more tests eventually in another PR, as I want to unblock Helm chart development ASAP.

Great, please do!

@tigrannajaryan
Member Author

Oh, I see @djaglowski wants to see perf tests for other cases as well. I don't mind if you want to do it.

tigrannajaryan pushed a commit that referenced this issue Mar 9, 2021
…iver (#2564)

Add performance test for filelog based kubernetes container logs receiver

This PR contains one performance test (for kubernetes containerd logs with format autodetection)

**Link to tracking Issue:**
#2266
tigrannajaryan pushed a commit that referenced this issue Mar 16, 2021
**Description:** 
This PR adds example configuration to handle kubernetes container logs using filelog receiver
This depends on the #2564

**Link to tracking Issue:**
#2266 

**Testing:**
N/A
tests which were performed to get results are going to be issued in another PR

**Documentation:**
README.md
@tigrannajaryan
Member Author

I believe this is now done. Any additional requests can be filed as new issues. Closing this.

pmatyjasek-sumo referenced this issue in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021

pmatyjasek-sumo referenced this issue in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021