Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory peaks on agent v1.19.2 #2452

Closed
zigmund opened this issue Sep 3, 2020 · 44 comments · Fixed by #2780
Closed

Memory peaks on agent v1.19.2 #2452

zigmund opened this issue Sep 3, 2020 · 44 comments · Fixed by #2780
Labels
bug needs-info The ticket does not contain sufficient information performance

Comments

@zigmund
Copy link

zigmund commented Sep 3, 2020

Describe the bug
One group of jaeger agents (3 replicas), deployed on kubernetes. Not so loaded system - 500-700 spans per second.
Used v1.13.1 for year without any issues, upgraded all componenet to 1.19.2 (also replaced Tchannel with gRPC) and agent memory usage become unstable. Before upgrade agent instances used as low as 16 mb with 64 limit, but after upgrade memory usage peaks appeared and agent got oomkilled. I raised limits, but even 512 mb is not enough.

After few hours and tens of oomkills, I downgraded agent to 1.18.1 version and see no issues so far.

To Reproduce
Steps to reproduce the behavior:

  1. Deploy agent v1.19.2
  2. Send 500-700 spans per second
  3. ???
  4. Observe memory usage peaks.

Expected behavior
Agent memory usage close to lower versions.

Screenshots
image

Version (please complete the following information):

  • OS: Kubernetes nodes based on Ubuntu 20.04 LTS, 5.4.0-39 kernel
  • Jaeger version: 1.19.2
  • Deployment: Kubernetes

What troubleshooting steps did you try?
Collected /metrics, will provide if needed.

Additional context

@zigmund zigmund added the bug label Sep 3, 2020
@ghost ghost added the needs-triage label Sep 3, 2020
@jpkrohling jpkrohling added needs-info The ticket does not contain sufficient information performance and removed needs-triage labels Sep 3, 2020
@jpkrohling
Copy link
Contributor

When you downgraded to 1.18.1, did you change any other configuration? Would you mind sharing your agent configuration?

@zigmund
Copy link
Author

zigmund commented Sep 3, 2020

@jpkrohling the only agent parameter used before and after downgrade is --reporter.grpc.host-port=jaeger-collector:14250.

Also played for a while after got first OOMkills with parameters, but reverted changes singe they didn't helped much:

        - --processor.jaeger-binary.server-queue-size=...
        - --processor.jaeger-binary.workers=...
        - --processor.jaeger-compact.server-queue-size=...
        - --processor.jaeger-compact.workers=...
        - --processor.zipkin-compact.server-queue-size=...
        - --processor.zipkin-compact.workers=...

@jpkrohling
Copy link
Contributor

I'll take a look at list of commits affecting the agent between 1.18.1 and 1.19.2, perhaps I can spot a suspect there. I'll also try to reproduce the problem here. Do you think it would be reproducible with minikube and a simple agent with 64Mi of requests , 256Mi limit? Jaeger's tracegen would probably generate enough load ;-)

@jpkrohling jpkrohling removed the needs-info The ticket does not contain sufficient information label Sep 3, 2020
@jpkrohling
Copy link
Contributor

jpkrohling commented Sep 3, 2020

@zigmund I'll need your help here. I tried to reproduce your problem but didn't succeed. I have a local minikube cluster with 20Gi of memory available, initialized with:

$ minikube start --vm-driver kvm2 --cpus 6 --memory 20480 --container-runtime=crio --addons=ingress

Using the Jaeger Operator, I provisioned a Jaeger instance with the following spec:

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: agent-as-daemonset
spec:
  agent:
    image: jaegertracing/jaeger-agent:latest
    strategy: DaemonSet
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 256Mi
  storage:
    options:
      memory:
        max-traces: 100000

And here's a trace generator, creating 1 trace per millisecond:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: tracegen
  annotations:
    "sidecar.jaegertracing.io/inject": "false"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tracegen
  template:
    metadata:
      labels:
        app: tracegen
    spec:
      containers:
      - name: tracegen
        image: jaegertracing/jaeger-tracegen:1.19.2
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        args:
        - -pause=1ms
        - -duration=1h
        - -workers=1

The agent pod was up for the whole time, and the metrics shows me that the agent didn't even blink. I changed the Jaeger CR to use 1.18.1, to see if latest was in the same ballpark, and I couldn't spot any differences:

Screenshot_2020-09-03 Prometheus Time Series Collection and Processing Server

The first part of the graph is for an early test, where I set the tracegen to have 10 workers and no pause. I also forgot to set the pod limits and left for lunch. If I'm reading this data correctly, it shows about 200Mi of memory consumption for about 4M spans per minute.

The second part is jaegertracing/jaeger-agent:latest, with a 64Mi of requests and 256Mi of limit. The tracegen is with a pause of 1ms and 1 worker.

The third part is the same a second part, only with jaegertracing/jaeger-agent:1.18.1 as the image.

Here's how I need your help: are you able to reproduce your performance problems based on this setup? If you can't, could you perhaps then run a git bisect with the first good commit being v1.18.1 and the bad as being v1.19.2? Like:

$ git bisect start 
$ git bisect good v1.18.1
$ git bisect bad v1.19.2
$ make docker # this is the easiest, but will build everything, which might take a while
$ podman tag jaegertracing/jaeger-agent:latest quay.io/jpkroehling/jaeger-agent:issue2452-step1
$ podman push quay.io/jpkroehling/jaeger-agent:issue2452-step1

Change the last two commands accordingly, to tag and push to a namespace you own. At this point, configure your agent to use the image you just pushed.

Then, test it:

$ # test it
$ git bisect good # if the performance is good, or use "bad" if it shows the problem already

Rinse and repeat, until git says that it found the problematic commit.

If you don't have Jaeger source code locally to run the previous steps, you should probably do this first (see CONTRIBUTING.md):

$ git clone https://github.com/jaegertracing/jaeger.git
$ cd jaeger
$ git submodule update --init --recursive
$ make install-tools

@jpkrohling jpkrohling added the needs-info The ticket does not contain sufficient information label Sep 3, 2020
@zigmund
Copy link
Author

zigmund commented Sep 4, 2020

@jpkrohling I'll try to test with git bisect, but it will take some time.

@zigmund
Copy link
Author

zigmund commented Sep 4, 2020

It's really hard to test good steps since it could take from minutes to much more time to catch first oom. But at least already reproduced oom's at the first two steps. At the moment I'm testing the third step for an hour already and I don't know if it is enough to be certain that the step is good.

@jpkrohling
Copy link
Contributor

How fast did you get OOMs for the first two steps? If they are at around 15m, it might be safe to say that a commit is good if it's not crashing after 30m.

@zigmund
Copy link
Author

zigmund commented Sep 4, 2020

@jpkrohling at the first two steps got first ooms in few minutes.
Got compile error at 4 and 5 steps, testing 6 step at the moment. It seems that be17169 affected memory usage behavior in out environment.

@zigmund
Copy link
Author

zigmund commented Sep 4, 2020

@jpkrohling completed 6 step with hour+ test period.

be17169a417954a347408708182519e267618500 is the first bad commit
commit be17169a417954a347408708182519e267618500
Author: Pavol Loffay <[email protected]>
Date:   Fri Jun 26 16:11:45 2020 +0200

    Upgrade thrift to 0.13 (#2311)
    
    * Upgrade thrift to 0.13
    
    Signed-off-by: Pavol Loffay <[email protected]>
    
    * Add context to sampling managers
    
    Signed-off-by: Pavol Loffay <[email protected]>

@jpkrohling
Copy link
Contributor

cc @pavolloffay

@pavolloffay
Copy link
Member

Here is the upgrade to thrift 0.13 PR #2311. The upgrade was done due to security issues in the old version and also it was needed in OpenTelemetry collector IIRC.

I am not sure what is causing the regression.

@jpkrohling
Copy link
Contributor

@zigmund are you able to get us a memory profile for the running agent? The admin port has pprof endpoints, which you can browse like this:

$ go tool pprof -http localhost:6060 http://your-agent-host:14271/debug/pprof/heap

This will get a web UI on http://localhost:6060 for browsing your agent's memory.

@zigmund
Copy link
Author

zigmund commented Sep 7, 2020

@jpkrohling I'll try but don't think it will help. Agent allocates large amount of memory and releases it next moment, so I have to catch that moment with profiler(?)

@jpkrohling
Copy link
Contributor

jpkrohling commented Sep 8, 2020

That's true. Ideally, you'd capture it every N minutes, so that we'd get at least one profile when the memory is ramping up. There are a few experimental tools that can help with that, but I'm really not sure about their state, like conprof (cc @brancz).

Once we understand better where this memory usage is coming from, we can probably try to reproduce it ourselves, which is the first step to fix the problem.

@brancz
Copy link

brancz commented Sep 8, 2020

Conprof is not well optimized itself but used by various organizations so it's definitely worth a try :)

@zigmund
Copy link
Author

zigmund commented Sep 9, 2020

@jpkrohling lowered replicas to 1 and collected profile every second via forwarded port with curl:

while true; do curl http://localhost:14271/debug/pprof/heap -o `date +%s`.pb.gz; sleep 1; done

Raised limit to 2Gb to avoid oom but agent still been killed. :D
But at least I got few identical profiles right before oom. Here is one of them

@jpkrohling
Copy link
Contributor

Looks interesting

image

I'll try to reproduce it here again. Just to confirm: data is flowing from the agent to the collector, right?

@zigmund
Copy link
Author

zigmund commented Sep 9, 2020

@jpkrohling yes, agent sends data to collector.

@jpkrohling
Copy link
Contributor

@zigmund it's still quite difficult to tell what's going on. The profile you sent seems to indicate that all memory is being allocated in the ReadString from Thrift's Compact protocol, which seems to indicate that you have a span bigger than 930MiB which I doubt is the case, especially because a batch has a max size equal to the limits of a UDP packet, typically 64KiB. Either that, or quite a bunch of spans are locked in that function, which also seems unlikely.

I tried to put as much pressure as I could on the agent, but it keeps with a minimal memory usage. I don't think Elasticsearch is part of the equation here, but I used it anyway as the backend, without luck. Even my memory profile is quite different than yours: every time I get a snapshot, it's mostly showing 1MiB here or 512KiB there, which is in line with the inflight data (like the image at the end).

Could you also get us a goroutine snapshot? If we have a single call to ReadString, it would probably be apparent there.

I would also like to ask you for your exact runtime configuration, like:

  1. where are you deploying it? Bare metal? VM? Kubernetes?
  2. the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)
  3. same for the Agent
  4. which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

image

@zigmund
Copy link
Author

zigmund commented Sep 9, 2020

@jpkrohling

Could you also get us a goroutine snapshot?

Do you mean profile from /debug/pprof/goroutine?debug=2?

where are you deploying it? Bare metal? VM? Kubernetes?

Kubernetes

the exact command used to start your Collector (or Deployment, in case of Kubernetes, or the Jaeger CR in case you provisioned it using the Jaeger Operator)

Env vars passed via configmap:

SPAN_STORAGE_TYPE=elasticsearch
ES_SERVER_URLS=http://es-node1:9200,http://es-node2:9200,http://es-node3:9200
ES_NUM_SHARDS=3
ES_NUM_REPLICAS=0
ES_MAX_SPAN_AGE=24h

same for the Agent

Single argument: --reporter.grpc.host-port=jaeger-collector:14250

which client are you using? Jaeger Go Client? OpenTelemetry? Something else?

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

@jpkrohling
Copy link
Contributor

Do you mean profile from /debug/pprof/goroutine?debug=2?

Yes, please

golang - https://github.com/jaegertracing/jaeger-client-go
node.js - https://github.com/jaegertracing/jaeger-client-node
php - https://github.com/jukylin/jaeger-php
nginx - https://github.com/opentracing-contrib/nginx-opentracing

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

@zigmund
Copy link
Author

zigmund commented Sep 9, 2020

Yes, please

I'll try to catch memory eating profile with method I used before.

Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead...

I don't think it's possible since we're have tens of services pointed to same jaeger agent endpoint. I probably could configure nginx and node.js services by myself... but all other services' are out of my control.

@jpkrohling jpkrohling added this to the Release 1.20.0 milestone Sep 10, 2020
@jpkrohling
Copy link
Contributor

@zigmund, do you have any news for us? I'm really curious on how this develops :-)

@zigmund
Copy link
Author

zigmund commented Sep 16, 2020

@jpkrohling sorry, I'm now in business trip for few days. Will test as soon as possible.

@zigmund
Copy link
Author

zigmund commented Sep 17, 2020

@jpkrohling, checked metrics.

sum(rate(jaeger_agent_thrift_udp_t_processor_handler_errors_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0
{model="jaeger",protocol="compact"} | 0.06667950988310203
{model="zipkin",protocol="compact"} | 0

sum(rate(jaeger_agent_thrift_udp_server_packets_processed_total[1m])) by (model, protocol):

{model="jaeger",protocol="binary"} | 0.35553778133262237
{model="jaeger",protocol="compact"} | 73.39500099980003
{model="zipkin",protocol="compact"} | 0

Currently unable to test debug image v2, will try to do it later.

@zigmund
Copy link
Author

zigmund commented Sep 17, 2020

@jpkrohling, I think I found two agent-breaking services, both are php-written. Will try to catch breaking traces, but don't know how at the moment. :D

@jpkrohling
Copy link
Contributor

I think I found two agent-breaking services, both are php-written

That's actually great news. This means that the Agent might not be at fault here at all for common cases. We still need to figure out what type of incoming data is causing the problem, but I believe it does change the criticality of the issue.

Will try to catch breaking traces, but don't know how at the moment.

I can't help much here, but perhaps you can start by checking which operations are expected to exist vs. which ones actually exist. The missing one(s) might be worth double-checking.

@zigmund
Copy link
Author

zigmund commented Sep 21, 2020

start by checking which operations are expected to exist vs. which ones actually exist.

These two services are pretty large monolith php applications, traces often consist of hundreds spans, so It's near impossible to catch missing.

@jpkrohling
Copy link
Contributor

Alright, let us know once you have more info then, or if there's anything we can do on our side to help you.

@jpkrohling
Copy link
Contributor

jpkrohling commented Sep 24, 2020

@zigmund, were you able to find a pattern already? I'm worried that we might have a bigger issue, as bad clients should never be able to bring down the agent...

@zigmund
Copy link
Author

zigmund commented Sep 24, 2020

@jpkrohling, unfortunately, didn't found any additional info. :(

@jpkrohling
Copy link
Contributor

Are you still experiencing those problems, or have you disabled tracing for the services that are causing problems?

@zigmund
Copy link
Author

zigmund commented Sep 24, 2020

@jpkrohling I'm using your extra verbose image for debug and reverting back to 1.18.1 for regular usage. The services' tracing is enabled.

@jpkrohling
Copy link
Contributor

Got it. We might have newer versions of the models/protobuf soon, which might help here. I'll ping you once we have it ready, in case you are able to try it out before we release it.

@zigmund
Copy link
Author

zigmund commented Sep 24, 2020

OK, I'm ready to test.

@jpkrohling
Copy link
Contributor

@zigmund We released 1.20 some days ago. Would you be able to try it out?

@zigmund
Copy link
Author

zigmund commented Oct 12, 2020

@jpkrohling checked 1.20 - same behavior as 1.19.2

@jpkrohling
Copy link
Contributor

Thanks for the confirmation. I'm leaving this open, but unless there's new information, there's not much we can do to reproduce and finally fix this problem.

@quinluong
Copy link

Hi @zigmund
Thank you so much because of this issue. I'm facing to the same problem. I'm also using jukylin/jaeger-php for my project. After a long day for debugging, I figure out that the reason is about my data. It isn't all data, just only malformed data - where percentage is very small. So it is really difficult to debug if you are on production environment.
You can read this comment #2638 (comment). Thank you so much @jpkrohling
Hope you can resolve this issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug needs-info The ticket does not contain sufficient information performance
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants