Memory peaks on agent v1.19.2 #2452
Comments
When you downgraded to 1.18.1, did you change any other configuration? Would you mind sharing your agent configuration? |
@jpkrohling the only agent parameter used before and after the downgrade is Also played for a while, after getting the first OOM kills, with parameters, but reverted the changes since they didn't help much:
|
I'll take a look at the list of commits affecting the agent between 1.18.1 and 1.19.2; perhaps I can spot a suspect there. I'll also try to reproduce the problem here. Do you think it would be reproducible with |
@zigmund I'll need your help here. I tried to reproduce your problem but didn't succeed. I have a local minikube cluster with 20Gi of memory available, initialized with:
$ minikube start --vm-driver kvm2 --cpus 6 --memory 20480 --container-runtime=crio --addons=ingress
Using the Jaeger Operator, I provisioned a Jaeger instance with the following spec:
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: agent-as-daemonset
spec:
  agent:
    image: jaegertracing/jaeger-agent:latest
    strategy: DaemonSet
    resources:
      requests:
        memory: 64Mi
      limits:
        memory: 256Mi
  storage:
    options:
      memory:
        max-traces: 100000
And here's a trace generator, creating 1 trace per millisecond:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tracegen
  annotations:
    "sidecar.jaegertracing.io/inject": "false"
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tracegen
  template:
    metadata:
      labels:
        app: tracegen
    spec:
      containers:
      - name: tracegen
        image: jaegertracing/jaeger-tracegen:1.19.2
        env:
        - name: JAEGER_AGENT_HOST
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        args:
        - -pause=1ms
        - -duration=1h
        - -workers=1
The agent pod was up for the whole time, and the metrics show me that the agent didn't even blink. I changed the Jaeger CR to use 1.18.1, to see if
The first part of the graph is for an early test, where I set the tracegen to have 10 workers and no pause. I also forgot to set the pod limits and left for lunch. If I'm reading this data correctly, it shows about 200Mi of memory consumption for about 4M spans per minute. The second part is
The third part is the same as the second part, only with
Here's how I need your help: are you able to reproduce your performance problems based on this setup? If you can't, could you perhaps then run a
$ git bisect start
$ git bisect good v1.18.1
$ git bisect bad v1.19.2
$ make docker # this is the easiest, but will build everything, which might take a while
$ podman tag jaegertracing/jaeger-agent:latest quay.io/jpkroehling/jaeger-agent:issue2452-step1
$ podman push quay.io/jpkroehling/jaeger-agent:issue2452-step1
Change the last two commands accordingly, to tag and push to a namespace you own. At this point, configure your agent to use the image you just pushed. Then, test it:
$ # test it
$ git bisect good # if the performance is good, or use "bad" if it shows the problem already
Rinse and repeat, until git says that it found the problematic commit. If you don't have Jaeger source code locally to run the previous steps, you should probably do this first (see CONTRIBUTING.md):
$ git clone https://github.com/jaegertracing/jaeger.git
$ cd jaeger
$ git submodule update --init --recursive
$ make install-tools |
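For readers who have not used it before: the bisect procedure described above is a binary search over the commits between the known-good and known-bad revisions, so the number of build-and-test rounds grows only logarithmically with the size of the range. The following Python sketch simulates that search. The commit names and the `is_bad` predicate are hypothetical stand-ins for "build the image, deploy it, and watch for OOM kills"; nothing here calls git or Jaeger.

```python
from typing import Callable, Sequence, Tuple


def first_bad_commit(
    commits: Sequence[str], is_bad: Callable[[str], bool]
) -> Tuple[str, int]:
    """Return (first bad commit, number of test rounds).

    Assumes commits are ordered oldest to newest, commits[0] is known good,
    commits[-1] is known bad, and every commit after the first bad one is
    also bad (the same assumption git bisect makes).
    """
    good, bad = 0, len(commits) - 1  # indices of the known-good / known-bad ends
    rounds = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        rounds += 1
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad], rounds


if __name__ == "__main__":
    # Hypothetical history of 100 commits with the regression introduced at c057.
    history = [f"c{i:03d}" for i in range(100)]
    culprit, rounds = first_bad_commit(history, lambda c: c >= "c057")
    print(culprit, rounds)
```

Because each round can take an hour or more when the symptom is intermittent, the small number of rounds is what makes this approach practical at all.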
@jpkrohling I'll try to test with git bisect, but it will take some time. |
It's really hard to test |
How fast did you get OOMs for the first two steps? If they are at around 15m, it might be safe to say that a commit is |
@jpkrohling in the first two steps I got the first OOMs within a few minutes. |
@jpkrohling completed 6 steps, each with a test period of an hour or more.
|
cc @pavolloffay |
Here is the upgrade to thrift 0.13 PR #2311. The upgrade was done due to security issues in the old version and also it was needed in OpenTelemetry collector IIRC. I am not sure what is causing the regression. |
@zigmund are you able to get us a memory profile for the running agent? The admin port has
$ go tool pprof -http localhost:6060 http://your-agent-host:14271/debug/pprof/heap
This will get a web UI on http://localhost:6060 for browsing your agent's memory. |
@jpkrohling I'll try, but I don't think it will help. The agent allocates a large amount of memory and releases it the next moment, so I'd have to catch that moment with the profiler(?) |
That's true. Ideally, you'd capture it every N minutes, so that we'd get at least one profile when the memory is ramping up. There are a few experimental tools that can help with that, but I'm really not sure about their state, like conprof (cc @brancz). Once we understand better where this memory usage is coming from, we can probably try to reproduce it ourselves, which is the first step to fix the problem. |
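As a rough illustration of the "capture it every N minutes" idea, here is a minimal Python sketch that polls the heap endpoint mentioned earlier and stores each response in a numbered file. The host, output directory, count, and interval are placeholder assumptions, and the `fetch` and `sleep` parameters exist only so the loop can be exercised without a running agent; this is not a Jaeger or conprof tool.

```python
import time
import urllib.request
from pathlib import Path
from typing import Callable, List


def fetch_heap_profile(url: str, timeout: float = 10.0) -> bytes:
    """Download one heap profile from a Go pprof HTTP endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def capture_profiles(
    url: str,
    out_dir: str,
    count: int,
    interval_s: float,
    fetch: Callable[[str], bytes] = fetch_heap_profile,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Path]:
    """Save `count` profiles, `interval_s` seconds apart; return the saved paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for i in range(count):
        try:
            data = fetch(url)
        except OSError as exc:  # URLError and socket timeouts are OSError subclasses
            print(f"capture {i} failed: {exc}")
        else:
            path = out / f"heap-{i:05d}-{int(time.time())}.pb.gz"
            path.write_bytes(data)
            saved.append(path)
        if i < count - 1:
            sleep(interval_s)
    return saved


if __name__ == "__main__":
    # Placeholder host; 14271 is the admin port used earlier in this thread.
    capture_profiles("http://your-agent-host:14271/debug/pprof/heap",
                     "profiles", count=60, interval_s=60)
```

A failed request is logged and skipped rather than aborting the run, since the agent may be restarting right when the interesting profile would have been taken.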
Conprof is not well optimized itself but used by various organizations so it's definitely worth a try :) |
@jpkrohling I lowered the replicas to 1 and collected a profile every second via a forwarded port with curl:
I raised the limit to 2 GB to avoid the OOM, but the agent was still killed. :D |
@jpkrohling yes, agent sends data to collector. |
@zigmund it's still quite difficult to tell what's going on. The profile you sent seems to indicate that all memory is being allocated in the
I tried to put as much pressure as I could on the agent, but it stays at minimal memory usage. I don't think Elasticsearch is part of the equation here, but I used it anyway as the backend, without luck. Even my memory profile is quite different from yours: every time I get a snapshot, it mostly shows 1MiB here or 512KiB there, which is in line with the in-flight data (like the image at the end).
Could you also get us a goroutine snapshot? If we have a single call to ReadString, it would probably be apparent there. I would also like to ask you for your exact runtime configuration, like:
|
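To make the "it would probably be apparent there" step concrete, here is a small Python sketch that scans a text goroutine dump for stacks mentioning a given function name. It assumes the dump was saved from Go's standard `/debug/pprof/goroutine?debug=2` endpoint, and that the agent's admin port serves it alongside the heap endpoint (an assumption, not confirmed in this thread). The sample dump and its package paths are invented for illustration.

```python
from typing import List


def goroutines_mentioning(dump: str, needle: str) -> List[str]:
    """Return the header line of each goroutine whose stack mentions `needle`.

    Assumes the text layout of a debug=2 goroutine dump: one block per
    goroutine, blocks separated by a blank line, and the first line of each
    block starting with "goroutine <id> [<state>]:".
    """
    hits = []
    for block in dump.strip().split("\n\n"):
        lines = block.splitlines()
        if not lines or not lines[0].startswith("goroutine "):
            continue
        if any(needle in frame for frame in lines[1:]):
            hits.append(lines[0])
    return hits


# Made-up sample for illustration only; not taken from the reporter's agent.
SAMPLE_DUMP = """goroutine 1 [running]:
main.main()
\t/app/main.go:42 +0x1d

goroutine 57 [runnable]:
example.com/proto.(*Reader).ReadString(0xc000010000)
\t/app/proto/reader.go:88 +0x9a
example.com/agent.handlePacket(0xc000020000)
\t/app/agent/server.go:120 +0x44
"""

if __name__ == "__main__":
    for header in goroutines_mentioning(SAMPLE_DUMP, "ReadString"):
        print(header)
```

If only one goroutine ever shows up for the function in question, that points at a single oversized read rather than many concurrent ones.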
Do you mean profile from
Kubernetes
Env vars passed via configmap:
Single argument:
golang - https://github.com/jaegertracing/jaeger-client-go |
Yes, please
Would it be possible to have one agent per client type? If we notice that it only happens with spans from one specific client, we might have a lead... |
I'll try to catch a memory-eating profile with the method I used before.
I don't think it's possible, since we have tens of services pointed to the same Jaeger agent endpoint. I could probably configure the nginx and node.js services myself... but all the other services are out of my control. |
@zigmund, do you have any news for us? I'm really curious about how this develops :-) |
@jpkrohling sorry, I'm on a business trip for a few days. I'll test as soon as possible. |
@jpkrohling, checked metrics.
Currently unable to test debug image v2, will try to do it later. |
@jpkrohling, I think I found two agent-breaking services, both written in PHP. I'll try to catch the breaking traces, but I don't know how at the moment. :D |
That's actually great news. This means that the Agent might not be at fault here at all for common cases. We still need to figure out what type of incoming data is causing the problem, but I believe it does change the criticality of the issue.
I can't help much here, but perhaps you can start by checking which operations are expected to exist vs. which ones actually exist. The missing one(s) might be worth double-checking. |
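The expected-versus-actual comparison suggested above amounts to a set difference. A minimal Python sketch follows; the operation names are hypothetical, and in practice the "observed" list would come from whatever the tracing backend reports for the service.

```python
from typing import Iterable, List


def missing_operations(expected: Iterable[str], observed: Iterable[str]) -> List[str]:
    """Return operations that were expected but never seen, sorted by name."""
    return sorted(set(expected) - set(observed))


if __name__ == "__main__":
    # Hypothetical operation names for illustration.
    expected = ["GET /cart", "GET /orders", "POST /checkout"]
    observed = ["GET /cart", "POST /checkout"]
    print(missing_operations(expected, observed))
```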
These two services are pretty large monolithic PHP applications, and traces often consist of hundreds of spans, so it's nearly impossible to spot the missing ones. |
Alright, let us know once you have more info then, or if there's anything we can do on our side to help you. |
@zigmund, were you able to find a pattern already? I'm worried that we might have a bigger issue, as bad clients should never be able to bring down the agent... |
@jpkrohling, unfortunately, I didn't find any additional info. :( |
Are you still experiencing those problems, or have you disabled tracing for the services that are causing problems? |
@jpkrohling I'm using your extra verbose image for debug and reverting back to 1.18.1 for regular usage. The services' tracing is enabled. |
Got it. We might have newer versions of the models/protobuf soon, which might help here. I'll ping you once we have it ready, in case you are able to try it out before we release it. |
OK, I'm ready to test. |
@zigmund We released 1.20 some days ago. Would you be able to try it out? |
@jpkrohling checked 1.20 - same behavior as 1.19.2 |
Thanks for the confirmation. I'm leaving this open, but unless there's new information, there's not much we can do to reproduce and finally fix this problem. |
Hi @zigmund |
Describe the bug
One group of Jaeger agents (3 replicas), deployed on Kubernetes. The system is not heavily loaded: 500-700 spans per second.
Used v1.13.1 for a year without any issues, then upgraded all components to 1.19.2 (also replaced TChannel with gRPC) and agent memory usage became unstable. Before the upgrade, agent instances used as little as 16 MB with a 64 MB limit, but after the upgrade memory usage peaks appeared and the agent got OOM-killed. I raised the limits, but even 512 MB is not enough.
After a few hours and tens of OOM kills, I downgraded the agent to version 1.18.1 and have seen no issues so far.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Agent memory usage close to lower versions.
Screenshots
Version (please complete the following information):
What troubleshooting steps did you try?
Collected /metrics, will provide if needed.
Additional context