
Enable k8s log collection in Otel Helm Chart. #2536

Closed
tigrannajaryan opened this issue Mar 3, 2021 · 35 comments · Fixed by open-telemetry/opentelemetry-helm-charts#36

@tigrannajaryan
Member

tigrannajaryan commented Mar 3, 2021

Some requirements which I believe are important:

@tigrannajaryan
Member Author

@pmm-sumo @sumo-drosiek @rockb1017 please decide who is driving this and whether you want to split it into separate tasks so several people can contribute.

@rockb1017
Contributor

I am happy to cooperate, but I will only have availability around next week. I don't want to hinder the progress. Maybe I should take the perf test task?

@pmm-sumo
Contributor

pmm-sumo commented Mar 4, 2021

No worries. :-) @rockb1017 I think it's fine if you start it, and then when @sumo-drosiek comes back from sick leave (hopefully early next week) you two can cooperate. We need to enable the filelog receiver and add some small improvements there first anyway (I am working on that right now).

@rockb1017
Contributor

Okay, sure then, I will start on it later!

@sumo-drosiek
Member

With this commit (fb9660a) we should be fine with scraping k8s container logs.

I need to do more research on using journalctl in otc to scrape systemd logs.
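On the stanza/opentelemetry-log-collection side, the rough shape would be something like the sketch below; how (or whether) this can be wired into otc is exactly what still needs research, and the journal directory is an assumption:

pipeline:
  # read systemd logs via the journald_input operator (it shells out to journalctl)
  - type: journald_input
    directory: /var/log/journal
  # print entries for debugging; a real setup would forward them instead
  - type: stdout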

@andrzej-stencel
Member

I started working on this with some help from @sumo-drosiek.

Here's my branch in case anybody's interested (it's very much Work In Progress though) https://github.com/astencel-sumo/opentelemetry-helm-charts/commits/add-k8s-logs.

@rockb1017
Contributor

oh cool. thanks for sharing your work!

I have a general question: should logging be in the same daemonset as traces and metrics, or should it be deployed as a separate daemonset? The reasons and benefits of splitting them are:

  • The logging daemonset needs to run as the root user.
  • It must be a daemonset; it can't be a standalone deployment.
  • A failure in the logging pod doesn't hinder collection of metrics and traces.

@pmm-sumo
Contributor

pmm-sumo commented Mar 9, 2021

I have a general question: should logging be in the same daemonset as traces and metrics, or should it be deployed as a separate daemonset? The reasons and benefits of splitting them are:

  • The logging daemonset needs to run as the root user.
  • It must be a daemonset; it can't be a standalone deployment.
  • A failure in the logging pod doesn't hinder collection of metrics and traces.

This is a very good question, @rockb1017. Maybe @dashpole has some recommendations?

@dashpole
Contributor

dashpole commented Mar 9, 2021

I'm not very familiar with the current state of the helm chart, but I'd probably combine them, as long as the collector is being run as a daemonset. If it isn't, then you'd have to separate them.

@pmm-sumo
Contributor

pmm-sumo commented Mar 9, 2021

Thank you @dashpole!
My feeling is that we should start with the simplest approach (which I believe is a single daemonset for all 3 signals) and split it if that's not sufficient.

@rockb1017
Contributor

I will implement it in the combined way first; maybe we can separate them later if needed.

I have another question.
Why are we mapping hostPorts by default? https://github.com/open-telemetry/opentelemetry-helm-charts/blob/4e7490331219a5561834b59ba3de7ac9c70bd0bf/charts/opentelemetry-collector/values.yaml#L105
These ports can be accessed via the Service IP address from all pods, so I don't think we should use hostPort.
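To make that concrete, a hypothetical values.yaml shape without hostPort could look like this (illustrative names only, not necessarily the chart's exact schema at that commit):

ports:
  otlp:
    enabled: true
    containerPort: 4317
    servicePort: 4317
    protocol: TCP
    # no hostPort: other pods reach the collector through the Service IP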

@rockb1017
Contributor

 make docker-otelcontribcol
COMPONENT=otelcontribcol /Applications/Xcode.app/Contents/Developer/usr/bin/make docker-component
GOOS=linux GOARCH=amd64 /Applications/Xcode.app/Contents/Developer/usr/bin/make otelcontribcol
GO111MODULE=on CGO_ENABLED=0 go build -o ./bin/otelcontribcol_linux_amd64 \
                -ldflags "-X github.com/open-telemetry/opentelemetry-collector-contrib/internal/version.GitHash=656611365b89 -X github.com/open-telemetry/opentelemetry-collector-contrib/internal/version.Version=v0.21.0-96-g656611365b89 -X go.opentelemetry.io/collector/internal/version.BuildType=release" ./cmd/otelcontribcol
# go.opentelemetry.io/otel/trace
../../../go/pkg/mod/go.opentelemetry.io/otel/[email protected]/config.go:119:2: duplicate method private
note: module requires Go 1.14
make[2]: *** [otelcontribcol] Error 2
make[1]: *** [docker-component] Error 2
make: *** [docker-otelcontribcol] Error 2

When I try to build a custom image at this commit (fb9660a), I get this error. Could someone help?
For now, @astencel-sumo's image is working.

@rockb1017
Contributor

Users need to apply a different multiline concatenation configuration (line_start_pattern) for each container.
Can I get some help? The documentation is not clear; it seems I need to define a new file to tail in order to apply multiline concatenation.
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver
https://github.com/open-telemetry/opentelemetry-log-collection/blob/main/docs/operators/file_input.md

An example would be great. Thanks!
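What I have in mind, based on the file_input docs, is something like the following (untested sketch; the receiver name, include path, and pattern are made up):

receivers:
  filelog/my-app:
    include:
      - /var/log/pods/my-namespace_my-app-*/*/*.log
    multiline:
      # treat lines starting with a timestamp as the start of a new entry
      line_start_pattern: '^\d{4}-\d{2}-\d{2}'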

@sumo-drosiek
Member

sumo-drosiek commented Mar 10, 2021

When I try to build a custom image at this commit (fb9660a), I get this error. Could someone help?

@rockb1017 you can use the main branch or the 0.22.0 docker image: https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.22.0/images/sha256-8f92c311a3b330d74de0628df2600d97d4c890e49fd260341a6c3320e2b89a37?context=explore

@rockb1017
Contributor

k8s_metadata_decorator operator is not working?

Error: cannot setup pipelines: cannot build receivers: cannot create receiver filelog: unsupported type 'k8s_metadata_decorator'
2021/03/10 17:09:21 application run finished with error: cannot setup pipelines: cannot build receivers: cannot create receiver filelog: unsupported type 'k8s_metadata_decorator'

I added this operator under receivers.filelog.operators:

      # Enrich log with k8s metadata
      - type: k8s_metadata_decorator
        id: k8s-metadata-enrichment
        namespace_field: k8s.namespace.name
        pod_name_field: k8s.pod.name
        cache_ttl: 10m
        timeout: 10s

@rockb1017
Contributor

@rockb1017 you can use the main branch or the 0.22.0 docker image: https://hub.docker.com/layers/otel/opentelemetry-collector-contrib/0.22.0/images/sha256-8f92c311a3b330d74de0628df2600d97d4c890e49fd260341a6c3320e2b89a37?context=explore

I used the main branch to run make docker-otelcontribcol after adding the 3 lines from the commit you mentioned.

@sumo-drosiek
Member

@rockb1017
I meant that this is already merged into the current main. I have not investigated why it is not working for you.

I suppose we should use the otelcol processor for metadata enrichment. That's why I didn't add it to the PR. WDYT?

@rockb1017
Contributor

Oh, I see, my mistake. Thank you for pointing that out!

@rockb1017
Contributor

I suppose we should use the otelcol processor for metadata enrichment. That's why I didn't add it to the PR. WDYT?

You mean this: https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sprocessor
but not this? https://github.com/open-telemetry/opentelemetry-log-collection/blob/main/docs/operators/k8s_metadata_decorator.md

Is there any reference for using the otelcol processor for metadata enrichment?

@pmm-sumo
Contributor

Is there any reference for using the otelcol processor for metadata enrichment?

I think a good start might be looking at the sample config.yaml. More details are on the package docs site.

One caveat: the processor expects that attributes (such as the pod UID) are present at the Resource level. If they are present at the LogRecord level, they would need to be moved. With that in mind (among other use cases), we recently added groupbyattrsprocessor.
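Roughly, the chaining could look like this (sketch only; it assumes the pod UID has already been parsed into a k8s.pod.uid attribute and that groupbyattrsprocessor is available in the build):

processors:
  # promote the record-level pod UID to a Resource-level attribute
  groupbyattrs:
    keys:
      - k8s.pod.uid
  # then enrich the Resource with metadata from the API server
  k8s_tagger:
    passthrough: false
service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [groupbyattrs, k8s_tagger, batch]
      exporters: [logging]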

@rockb1017
Contributor

Users need to apply a different multiline concatenation configuration (line_start_pattern) for each container.
Can I get some help? The documentation is not clear; it seems I need to define a new file to tail in order to apply multiline concatenation.
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver
https://github.com/open-telemetry/opentelemetry-log-collection/blob/main/docs/operators/file_input.md

An example would be great. Thanks!

Can I get some guidance on this as well?

@rockb1017
Contributor

rockb1017 commented Mar 11, 2021

I suppose we should use the otelcol processor for metadata enrichment. That's why I didn't add it to the PR. WDYT?

@sumo-drosiek @pmm-sumo
I looked at the otelcol processor for metadata enrichment. It doesn't seem to have any caching implemented. It would cause too many API calls to the k8s apiserver pod, and it would end up harming the k8s cluster.
Can we just use k8s_metadata_decorator from stanza?
https://github.com/open-telemetry/opentelemetry-log-collection/blob/main/docs/operators/k8s_metadata_decorator.md

@pmm-sumo
Contributor

I suppose we should use the otelcol processor for metadata enrichment. That's why I didn't add it to the PR. WDYT?

@sumo-drosiek @pmm-sumo
I looked at the otelcol processor for metadata enrichment. It doesn't seem to have any caching implemented. It would cause too many API calls to the k8s apiserver pod, and it would end up harming the k8s cluster.

Each Pod update/create/delete event results in the k8s processor handling it and updating its internal cache.

Additionally, when run as a DaemonSet, node filtering can be applied.
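For example (sketch; KUBE_NODE_NAME has to be injected into the pod via the downward API):

processors:
  k8s_tagger:
    filter:
      # only watch pods scheduled on the same node as this collector pod
      node_from_env_var: KUBE_NODE_NAME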

@pmm-sumo
Contributor

Can we just use k8s_metadata_decorator from stanza?

That's a good question. My feeling is we should aim at having one way to achieve a given goal. The k8sprocessor could also be shared with traces and metrics.
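For instance, a single k8s_tagger instance could be referenced from all three pipelines (sketch; receiver and exporter names are placeholders):

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [k8s_tagger, batch]
      exporters: [logging]
    metrics:
      receivers: [prometheus]
      processors: [k8s_tagger, batch]
      exporters: [logging]
    traces:
      receivers: [otlp]
      processors: [k8s_tagger, batch]
      exporters: [logging]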

@sumo-drosiek
Member

sumo-drosiek commented Mar 11, 2021

Users need to apply a different multiline concatenation configuration (line_start_pattern) for each container.
Can I get some help? The documentation is not clear; it seems I need to define a new file to tail in order to apply multiline concatenation.
https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver
https://github.com/open-telemetry/opentelemetry-log-collection/blob/main/docs/operators/file_input.md
An example would be great. Thanks!

Can I get some guidance on this as well?

I see one issue with using multiline: container log files wrap each line with additional information like the date and stream, so for the following printed logs:

start multiline example: 10 Tue Feb 16 09:21:15 UTC 2021
end multiline example: 11 Tue Feb 16 09:21:16 UTC 2021

the log file will have the following content:

2021-02-16T09:21:15.518430714Z stdout F example: 10 Tue Feb 16 09:21:15 UTC 2021
2021-02-16T09:21:16.519721603Z stdout F example: 11 Tue Feb 16 09:21:16 UTC 2021

Because of that, it would be nice to perform multiline detection after parsing the log lines. I don't know if this is possible with the current capabilities.
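To make the ordering concrete: the wrapper would have to be stripped first, e.g. with something like the sketch below, and only then could any multiline/concatenation logic run on the extracted log field, which is the part I'm not sure is possible with file_input today:

operators:
  # strip the containerd/CRI wrapper so only the application message remains in "log"
  - type: regex_parser
    regex: '^(?P<time>[^ ]+) (?P<stream>stdout|stderr) (?P<logtag>[^ ]+) (?P<log>.*)$'
    timestamp:
      parse_from: time
      layout: '%Y-%m-%dT%H:%M:%S.%LZ'
  # any multiline/concatenation step would have to run here, on the "log" field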

I would focus on a simple (no multiline, no merging/splitting of logs) but working solution for now, and improve it in separate issues/PRs.
cc: @tigrannajaryan @djaglowski

@rockb1017
Contributor

@sumo-drosiek
I agree that it is best to handle that with separate issues/PRs.
@pmm-sumo oh, so caching IS implemented. Okay, then I no longer have a problem using the k8sprocessor. Thanks!

@pmm-sumo
Contributor

BTW, we have a bummer with groupbyattrsprocessor: it was previously merged into the wrong branch, and as a result it's not part of the build yet (nor part of v0.22.0).

@rockb1017
Contributor

I am implementing k8s enrichment in the pipeline.
This is the entire generated configmap:

exporters:
  logging:
    loglevel: debug
    sampling_initial: 5
    sampling_thereafter: 200
  splunk_hec:
    disable_compression: true
    endpoint: https://172.31.18.227:8088/services/collector
    index: k8s_log
    insecure_skip_verify: true
    max_connections: 2000
    source: otel
    sourcetype: otel
    timeout: 10s
    token: XXX
extensions:
  health_check: {}
processors:
  batch: {}
  k8s_tagger:
    auth_type: kubeConfig
    extract:
      annotations:
      - key: splunk.com/index
      labels:
      - key: hello
      metadata:
      - podName
      - podUID
      - deployment
      - cluster
      - namespace
      - node
      - startTime
    filter:
      node_from_env_var: KUBE_NODE_NAME
    passthrough: false
    pod_association:
    - from: resource_attribute
      name: k8s.pod.uid
  memory_limiter:
    ballast_size_mib: 204
    check_interval: 5s
    limit_mib: 409
    spike_limit_mib: 128
receivers:
  filelog:
    exclude:
    - /var/log/pods/default_otel-opentelemetry-collector-agent-*_*/opentelemetry-collector/*.log
    include:
    - /var/log/pods/*/*/*.log
    include_file_name: false
    include_file_path: true
    operators:
    - id: parser-docker
      output: extract_metadata_from_filepath
      timestamp:
        layout: '%Y-%m-%dT%H:%M:%S.%LZ'
        parse_from: time
      type: json_parser
    - id: extract_metadata_from_filepath
      parse_from: $$labels.file_path
      regex: ^\/var\/log\/pods\/(?P<namespace>[^_]+)_(?P<pod_name>[^_]+)_(?P<uid>[^\/]+)\/(?P<container_name>[^\._]+)\/(?P<run_id>\d+)\.log$
      type: regex_parser
    - attributes:
        k8s.container.name: EXPR($.container_name)
        k8s.namespace.name: EXPR($.namespace)
        k8s.pod.name: EXPR($.pod_name)
        k8s.pod.uid: EXPR($.uid)
        run_id: EXPR($.run_id)
        stream: EXPR($.stream)
      resource:
        k8s.pod.uid: EXPR($.uid)
      type: metadata
    - id: clean-up-log-record
      ops:
      - remove: logtag
      - remove: stream
      - remove: container_name
      - remove: namespace
      - remove: pod_name
      - remove: run_id
      - remove: uid
      type: restructure
    start_at: beginning
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
      thrift_http:
        endpoint: 0.0.0.0:14268
  otlp:
    protocols:
      grpc: null
      http: null
  prometheus:
    config:
      scrape_configs:
      - job_name: opentelemetry-collector
        scrape_interval: 10s
        static_configs:
        - targets:
          - ${MY_POD_IP}:8888
  zipkin:
    endpoint: 0.0.0.0:9411
service:
  extensions:
  - health_check
  pipelines:
    logs:
      exporters:
      - logging
      - splunk_hec
      processors:
      - batch
      - k8s_tagger
      receivers:
      - filelog
    metrics:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - prometheus
    traces:
      exporters:
      - logging
      processors:
      - memory_limiter
      - batch
      receivers:
      - jaeger
      - zipkin

But my pod fails with this message:

2021/03/12 03:07:05 application run finished with error: cannot load configuration: error reading processors configuration for k8s_tagger: 1 error(s) decoding:
* '' has invalid keys: pod_association

Can I get some help?

@rockb1017
Contributor

rockb1017 commented Mar 12, 2021

When I disable enrichment and just try to send data to Splunk, I am getting this error:

2021-03-12T03:38:00.883Z	info	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "splunk_hec", "component_name": "splunk_hec", "error": "Post \"https://172.31.18.227:8088/services/collector\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "8.250812912s"}

Same thing when I try the /services/collector/event endpoint:

2021-03-12T03:45:58.983Z	info	exporterhelper/queued_retry.go:276	Exporting failed. Will retry the request after interval.	{"component_kind": "exporter", "component_type": "splunk_hec", "component_name": "splunk_hec", "error": "Post \"https://172.31.18.227:8088/services/collector/event\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)", "interval": "4.168638234s"}

When I do a curl from the same node, it works:

curl -k  https://172.31.18.227:8088/services/collector -H "Authorization: Splunk XXX" -d '{"event": "hello world"}'
{"text":"Success","code":0}

curl -k  https://172.31.18.227:8088/services/collector/event -H "Authorization: Splunk XXX" -d '{"event": "hello world"}'
{"text":"Success","code":0}

@rockb1017
Contributor

Anyway, here is my progress so far. I couldn't get it to successfully ingest into Splunk.
rockb1017/opentelemetry-helm-charts@30094a5

@andrzej-stencel
Member

Folks, here's my update on open-telemetry/opentelemetry-helm-charts#36:

  • Switched to setting agentCollector.containerLogs.enabled to false by default (see the sketch after this list),
  • Switched back to using the core collector image in the chart,
  • Added docs on how to enable container log collection,
  • Added docs warning about looping logs from the exporter back into the receiver,
  • Changed the file exporter in the example to the otlphttp exporter, per code review remarks.
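For anyone following along, enabling it should then boil down to something like this in values.yaml (sketch based on the key named above):

agentCollector:
  containerLogs:
    enabled: true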

Please comment on the PR.

I can see @rockb1017 has created a similar PR open-telemetry/opentelemetry-helm-charts#38, trying to achieve the same goal in a very similar way. Probably best to focus on only one of these and try to get the best of both solutions in a single effort, right?

@tigrannajaryan
Member Author

I can see @rockb1017 has created a similar PR open-telemetry/opentelemetry-helm-charts#38, trying to achieve the same goal in a very similar way. Probably best to focus on only one of these and try to get the best of both solutions in a single effort, right?

I agree. @rockb1017 if you have additional improvement suggestions, please make them in open-telemetry/opentelemetry-helm-charts#36, or alternatively you can create a new PR on top of open-telemetry/opentelemetry-helm-charts#36 (the latter is probably preferable, to avoid delaying the first PR).

@rockb1017
Contributor

Oh, I see. I will create a PR on top of #36.

@rockb1017
Contributor

Created a bug report on k8s_tagger
#2719

@andrzej-stencel
Member

Folks, here's a document describing the performance tests that @sumo-drosiek and I have run for the Otel Helm chart: https://docs.google.com/document/d/1cEgAt5vBGzFZKooIIdQ3OHh9CEWBnVft_pmYVNlYAh4.

Please comment on the document. I would like to discuss it at the next Log SIG meeting (is that the correct SIG to talk about it?).
