Add support for collecting Kubernetes pod log file in nodes #2266

Closed

tigrannajaryan opened this issue Feb 1, 2021 · 25 comments
@tigrannajaryan
Member

tigrannajaryan commented Feb 1, 2021

We need to be able to collect logs from Kubernetes. It may be implemented as a separate receiver or as a filelog receiver with a plugin that defines the settings for collecting Kubernetes logs. This needs to be discussed.

What we want to collect:

  1. Required: pod logs that Kubernetes makes available on the nodes.
  2. Need to decide whether we need: Kubernetes system component logs.

It may be that 1 and 2 require different approaches, in which case this issue may be best split into 2 separate issues and addressed individually.

Depends on #2268
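
For context, a minimal sketch of what collecting the pod logs with the existing filelog receiver might look like; the paths, parser, and exporter here are illustrative assumptions, not a settled design:

```yaml
receivers:
  filelog:
    # Pod log files that the container runtime writes on each node.
    include:
      - /var/log/pods/*/*/*.log
    # Expose the file path so pod/namespace/container metadata can be derived from it.
    include_file_path: true
    operators:
      # containerd/CRI-O write plain-text lines such as:
      #   2021-02-16T09:21:15.518430714Z stdout F example log line
      # Docker writes JSON lines instead, so a complete config also needs format detection.
      - type: regex_parser
        regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'

exporters:
  logging: {}

service:
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [logging]
```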

@pmm-sumo
Contributor

I have some ideas on how to tackle this and some experience with doing it via Fluentbit & co. I would be happy to work on the design and implementation.

@tigrannajaryan
Member Author

@pmm-sumo great! Please go ahead; once you have the design, please submit it here and let's review it.

@pmm-sumo
Contributor

pmm-sumo commented Feb 19, 2021

Here's a draft design doc link and a gist with a sample config.

@tigrannajaryan
Member Author

tigrannajaryan commented Feb 24, 2021

@pmm-sumo thank you for the research and the design. Here is what I think would be good to do so that the k8s support is complete:

  1. Since the design uses the filelog receiver, there is no new receiver type to be added. However, the config to collect k8s logs is quite complicated and needs to have automated tests of its own. I would want to see an automated test that uses the proposed configuration and collects logs from real files in this format, then verifies that the collected log pdata has the expected structure. It would be best to have testdata that contains golden log files in each of the supported container formats and unit tests that verify the parsing.

  2. Ideally we would want to have automated tests that verify the operation on a real k8s cluster with real pods running on it, producing logs, and with the actual Otelcontribcol deployed as a daemonset collecting the logs. This is likely not easy to do. Please think about it and see if there is a way to have such an automated test with reasonable effort.

  3. I also would like to see performance tests, preferably automated. Again, performance tests on a real k8s cluster would be great, but they may be difficult to automate. So perhaps what we can do is run performance tests manually on a real k8s cluster and then add automated performance tests that try to mimic k8s collection from a node but without the real k8s cluster (e.g. implement a file log generator that outputs log files mimicking the output of several pod logs, and then use the Collector to collect them). For automated perf tests we would want to use our testbed.
    If the performance turns out to be bad due to the complicated parsing configuration (e.g. it is several times slower than collecting comparable file logs with a simpler parser), then we may need to implement a k8s log receiver with a manually coded parser to get better performance.

  4. Assuming that the performance is good we still need to decide how we can hide the complexity of the configuration from the end user. One possibility is the plugins approach that Stanza uses. I would like to discuss this. (@djaglowski I haven't had a chance to properly look into how Stanza implements plugins. Can you please point me to the relevant docs/code that I can read to understand it?)

@pmm-sumo
Contributor

pmm-sumo commented Feb 25, 2021

The kudos should go to @sumo-drosiek, who did all the hard work here :)

There are more details in the document, but we've started doing performance tests by simply tailing a local file with content such as the following (BTW, I would like these tests to become fully automated):

2021-02-16T09:21:15.518430714Z stdout F example: 0 troublemakers unluxuriantly Lapointe unheroical synthesize
2021-02-16T09:21:16.519721603Z stdout F example: 1 basement's nonbulbar trellis-sheltered thesaurismosis snakeflower
2021-02-16T09:21:17.521307236Z stdout F example: 2 Owasco cornbrash subchannels actinocutitis abjudicate
...

The preliminary results are as follows:

Processed in total 1018334 logs in 300s: 3387.013557 logs/sec - autodetection (~0.36MB/s)
Processed in total 1337153 logs in 299s: 4460.928373 logs/sec - containerd directly (~0.47MB/s)
Processed in total 1784265 logs in 299s: 5959.308995 logs/sec - without moving attributes (log parser + file path parser)
Processed in total 5604660 logs in 299s: 18694.579597 logs/sec - only log parser

I am wondering whether these should be considered OK for this phase, or whether we should first optimize the pipeline. I think #2330 and a dedicated receiver could help. What do you think @tigrannajaryan @djaglowski ?
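
For reference, the difference between the scenarios above is roughly this: "containerd directly" applies a single regex_parser for the format shown in the sample, while "autodetection" first routes each entry by format. A sketch of the autodetection part, assuming a router operator (the exact expression syntax varies by operator version and is an approximation here):

```yaml
operators:
  # Decide, per entry, which container runtime format the line is in.
  - type: router
    routes:
      # Docker writes JSON lines such as {"log":"...","stream":"stdout","time":"..."}
      - expr: 'body matches "^\\{"'
        output: parser-docker
      # containerd writes the plain-text CRI format shown in the sample above.
      - expr: 'body matches "^[^ Z]+Z"'
        output: parser-containerd
  - type: json_parser
    id: parser-docker
  - type: regex_parser
    id: parser-containerd
    regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
```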

@tigrannajaryan
Member Author

@pmm-sumo can you also add CPU usage numbers for each case?

The logs/sec look quite low to me. As a comparison, we can do 10,000 log records/second when reading a simple filelog while only consuming about 20% of CPU: https://app.circleci.com/pipelines/github/open-telemetry/opentelemetry-collector-contrib/10180/workflows/fcb249dc-bd31-4b20-b8b3-7d31956ccce3/jobs/85000

@sumo-drosiek
Member

sumo-drosiek commented Feb 25, 2021

@tigrannajaryan The tests should rather be considered relative to each other, i.e. they show the ratio between different configurations (autodetect k8s format, etc.).

I will add some tests to `make e2e-test` 👍

@djaglowski
Member

@tigrannajaryan To summarize Stanza's implementation of plugins (a rough sketch follows the list):

  • The functionality primarily depends on Go's text/template package. Here is where plugins are actually rendered.
  • Plugin files are expected to be .yaml, even though they are technically not yaml until rendered. The .yaml naming convention was chosen over alternatives (such as .yaml.template) in order to allow IDEs to apply syntax highlighting. This works quite well in most cases.
  • There is one custom function, default, that was added to the templating schema. This allows setting default values, as seen here.
  • Stanza expects that each operator has a unique id. In plugins this is solved by prepending the plugin's own id to the ids of the operators that it contains (my_regex_parser becomes my_plugin_instance.my_regex_parser).
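
A rough sketch of what such a plugin file might look like, based on the points above; the file layout and the default function usage are approximations, not copied from a real plugin:

```yaml
# my_plugin.yaml - a Go text/template that renders into a stanza operator pipeline.
version: 0.0.1
title: My Plugin
description: Tails a file and parses each line with a regex.
parameters:
  - name: log_path
    type: string
pipeline:
  - type: file_input
    include:
      # The custom default function supplies a fallback when the user does not set log_path.
      - '{{ .log_path | default "/var/log/example.log" }}'
  - type: regex_parser
    # Rendered ids get the plugin instance id prepended, e.g. my_plugin_instance.my_regex_parser.
    id: my_regex_parser
    regex: '^(?P<time>\S+) (?P<message>.*)$'
```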

@tigrannajaryan
Member Author

@sumo-drosiek thank you for doing the tests.

Just to make it clear: I want to understand how slow our k8s log reading implementation is compared to the potential maximum we could get if we hand-coded the parsing of k8s formats. The upper bound for that maximum is approximately what we get with the simplest filelog receiving test. I believe it is important to know how much worse our k8s solution is doing in comparison, so that we can then decide whether it makes sense to spend time on improving performance, because there are potentially large gains possible.

@sumo-drosiek
Member

@tigrannajaryan @djaglowski I performed more tests today, and here are the results:


**Table**

Test                                    |Result|Duration|CPU Avg%|CPU Max%|RAM Avg MiB|RAM Max MiB|Sent Items|Received Items|Received Items/s|
----------------------------------------|------|-------:|-------:|-------:|----------:|----------:|---------:|-------------:|--------------:|
LogMaxDPS/OTLP                          |PASS  |     15s|   133.5|   134.3|         81|         87|   3624300|       3624300|         241620|
LogMaxDPS/filelog                       |FAIL  |     26s|   173.7|   179.0|         76|         86|   4358100|       3879882|         149226|
LogMaxDPS/k8d_containerd                |FAIL  |     26s|   177.0|   180.0|         78|         89|   4583000|       3386024|         130231|
LogMaxDPS/k8s_auto_detection            |FAIL  |     26s|   157.2|   168.7|         83|         98|   4199000|        930119|          35773|
LogMaxDPS/k8s_auto_detection_routed     |FAIL  |     26s|   152.3|   157.7|         84|         96|   4264800|       1029652|          39602|
LogMaxDPS/k8s_auto_routed_only          |FAIL  |     26s|   165.2|   173.3|         80|         91|   4266900|       2247512|          86442|
LogMaxDPS/k8s_auto_routed_filepath_only |FAIL  |     26s|   157.1|   160.3|         81|         92|   4529400|       1482458|          57017|
Log10kDPS/OTLP                          |PASS  |     16s|    16.5|    18.7|         60|         74|    149900|        149900|           9993|
Log10kDPS/filelog                       |PASS  |     15s|    16.4|    17.7|         61|         75|    150000|        150000|          10000|
Log10kDPS/k8d_containerd                |PASS  |     15s|    19.3|    21.0|         63|         77|    149900|        149900|           9993|
Log10kDPS/k8s_auto_detection            |PASS  |     15s|    44.3|    46.0|         65|         80|    149900|        149900|           9993|
Log10kDPS/k8s_auto_detection_routed     |PASS  |     15s|    37.1|    37.7|         65|         79|    150000|        150000|          10000|
Log10kDPS/k8s_auto_routed_only          |PASS  |     15s|    22.6|    23.3|         62|         76|    150000|        150000|          10000|
Log10kDPS/k8s_auto_routed_filepath_only |PASS  |     15s|    31.0|    32.3|         65|         79|    150000|        150000|          10000|

**Summary**

filelog: 149226 lps
containerd regex: 130231 lps
auto detection (using ifs): 35773 lps
auto detection (using routed): 39602 lps
auto detection (using routed; without parsing filepath; without moving attributes): 86442 lps
auto detection (using routed; with parsing filepath; without moving attributes): 57017 lps


**Description**

Auto detection without any additional features decreased performance from 130231 (which is similar to the pure filelog test) to 86442 (-33%). In addition, extracting metadata from the file path decreased it to 57017 (-34%), and additional operations (moving labels to attributes) decreased it further to 39602 (-30%).

It's pretty strange to me that every operator in the pipeline decreases performance by ~30%. On the other hand, all of these operations are performed for each log.


**Sources**

Configuration used for tests: https://github.com/sumo-drosiek/opentelemetry-collector-contrib/blob/2de61745a2ec906f5557c286bcd74fdf1435f6a7/testbed/datasenders/k8s.go
Tests implementation: https://github.com/sumo-drosiek/opentelemetry-collector-contrib/blob/drosiek-stanza/testbed/tests_unstable_exe/log_test.go

@tigrannajaryan
Member Author

@sumo-drosiek excellent, thank you for doing the tests.

So, to make it clear: the most feature-rich version, which autodetects the format, correctly parses the file path, and puts attributes where they belong, performs about 3x slower than a simple filelog. This is not great, but it is not terrible. It is something we can work with, improve, and optimize.

@tigrannajaryan
Member Author

> It's pretty strange to me that every operator in the pipeline decreases performance by ~30%. On the other hand, all of these operations are performed for each log.

@djaglowski is this expected, or is there a better way to achieve this with less performance impact?

@djaglowski
Member

@sumo-drosiek Thanks for the detailed findings. This will be very helpful in optimizing. I'm confident we can improve it quite a bit.

@tigrannajaryan I'm a little surprised, but I'm hopeful that much of this can be improved substantially and rapidly in the codebase. I don't think 30% is going to end up as a typical cost for most operators.

Unfortunately, I'm not aware of any immediate config changes that would yield improvements.


Some details on how we might optimize the operators to improve this use case:

Autodetection, routing, if condition evaluation, and filename parsing all boil down to regex evaluation. I don't think a lot of time has been spent optimizing this. There may be some low hanging fruit there that would improve all of these in one go.

Additionally, in all the cases above, we could almost eliminate the need to apply the regex by supporting a caching mechanism. Each filename should only need to be parsed once. Similarly, autodetection, routing, and if condition evaluation are only needed once per file to determine whether the file is CRI/Docker. Some more thought is needed on config design, but likely we can add a cacheable: true field to regex_parser. If nothing else, we could build a dedicated k8s_file_input operator that includes such a caching mechanism. Either way, we should be able to eliminate much of the loss here. (A rough sketch of this idea is at the end of this comment.)

The cost to move labels is surprising to me. I'm guessing there is an inefficient allocation that takes place on each entry. I suspect this can be improved quite a bit.


In cases where each entry needs to be parsed by regex, we might expect a notable cost per operator, but I suspect we can improve from 30%.
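
For illustration, the cacheable field proposed above might eventually look something like this; the field and the parse_from reference are hypothetical, and the path layout assumed is the standard /var/log/pods/<namespace>_<pod>_<uid>/<container>/<restart>.log:

```yaml
operators:
  # The file path only needs to be parsed once per file, not once per log entry.
  - type: regex_parser
    parse_from: file_path   # field reference is an assumption; depends on operator syntax
    regex: '^.*/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^/]+)/(?P<container_name>[^/]+)/.*\.log$'
    cacheable: true         # hypothetical field from the proposal above; cache keyed by the input value
```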

@tigrannajaryan
Member Author

OK, to summarize: we have a solution for k8s pod log that works and is not too bad performance-wise.

@sumo-drosiek will you be able to submit a PR with your k8s perf tests to the repo and add a k8s collection config to the examples directory? I think we should use the most complete autodetect config even if it is the slowest. It would be nice to include the perf numbers in the README of the example.

Once we do that we can consider that the basic requirements of this issue are met and further improvements can be done as needed by filing separate issues.

@pmm-sumo
Contributor

pmm-sumo commented Mar 1, 2021

I was thinking about caching the regex too (the cache could perhaps be invalidated, e.g. when it doesn't match the line). Dan's idea makes a lot of sense to me, though I wasn't sure if it could be done without a dedicated operator (which would also hide the complexity of the config).

In either case, after submitting the PR, do you think it's time to move on to Helm chart support, @tigrannajaryan?

@tigrannajaryan
Member Author

> In either case, after submitting the PR, do you think it's time to move on to Helm chart support, @tigrannajaryan?

Yes, I think so. Great progress! :-)

@rockb1017
Contributor

@pmm-sumo Hello, will you be working on logging Helm chart support right away? I will have some bandwidth around next week. Let me know if you would like me to take on this task.

@tigrannajaryan tigrannajaryan added this to the Basic Logs Support milestone Mar 3, 2021
@pmm-sumo
Contributor

pmm-sumo commented Mar 3, 2021

@rockb1017 @tigrannajaryan we've been preparing to tackle #2536 with @sumo-drosiek (and we have some experience with a fairly large Helm chart whose development we help with). We will be more than happy to cooperate too, or to do the review and focus on something else :)

@tigrannajaryan
Member Author

Notes from the SIG meeting:

  1. Let's make sure we have a reliable way to exclude our own logs.
  2. It would be great to put the log record in the Body as a string, instead of putting it inside a map with a "log" property (see the sketch below).
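
Both notes map naturally onto the filelog config. A sketch, assuming the collector's own pods are named otel-collector-* and that the parsed log line ends up in a "log" attribute (the operator names and field references are approximate):

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    # Note 1: exclude the collector's own pod logs to avoid a feedback loop.
    exclude:
      - /var/log/pods/*_otel-collector-*/*/*.log
    operators:
      - type: regex_parser
        regex: '^(?P<time>\S+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]*) ?(?P<log>.*)$'
      # Note 2: keep the log line itself as the string Body instead of leaving it under "log".
      - type: move
        from: attributes.log
        to: body
```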

@sumo-drosiek
Member

sumo-drosiek commented Mar 5, 2021

The PR with the performance test (the most complex one for now) is ready (#2564). Should I add the rest of the tests to the testbed, or are we fine with the most complex one?

I am going to add a README.md, an example, and more tests eventually in another PR, as I want to unblock Helm chart development ASAP.

@djaglowski
Member

djaglowski commented Mar 5, 2021

I would support adding the other scenarios as well. Since we understand there to be significant performance differences between the scenarios, it will be quite helpful for validating potential improvements (e.g. this issue).

Perhaps later we could reduce the number of tests once the significant performance gaps between scenarios are closed.

@tigrannajaryan
Member Author

There is one more thing that I listed above that I think is very important to have:

> the config to collect k8s logs is quite complicated and needs to have automated tests of its own. I would want to see an automated test that uses the proposed configuration and collects logs from real files in this format, then verifies that the collected log pdata has the expected structure. It would be best to have testdata that contains golden log files in each of the supported container formats and unit tests that verify the parsing.

This will be a correctness test, not a performance test.

Ideally all our k8s tests should use the same configuration (the correctness test and the performance test).

@tigrannajaryan
Member Author

> The PR with the performance test (the most complex one for now) is ready (#2564). Should I add the rest of the tests to the testbed, or are we fine with the most complex one?

I think the worst case perf test is sufficient for now, no need to add other cases.

> I am going to add a README.md, an example, and more tests eventually in another PR, as I want to unblock Helm chart development ASAP.

Great, please do!

@tigrannajaryan
Member Author

Oh, I see @djaglowski wants to see perf tests for other cases as well. I don't mind if you want to do it.

tigrannajaryan pushed a commit that referenced this issue Mar 9, 2021
…iver (#2564)

Add performance test for filelog based kubernetes container logs receiver

This PR contains one performance test (for kubernetes containerd logs with format autodetection)

**Link to tracking Issue:**
#2266
tigrannajaryan pushed a commit that referenced this issue Mar 16, 2021
**Description:** 
This PR adds example configuration to handle kubernetes container logs using filelog receiver
This depends on the #2564

**Link to tracking Issue:**
#2266 

**Testing:**
N/A
tests which were performed to get results are going to be issued in another PR

**Documentation:**
README.md
@tigrannajaryan
Member Author

I believe this is now done. Any additional requests can be filed as new issues. Closing this.

pmatyjasek-sumo referenced this issue in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021

pmatyjasek-sumo referenced this issue in pmatyjasek-sumo/opentelemetry-collector-contrib Apr 28, 2021