Tempo Ingesters register to loki ring #2766
Comments
Created the same issue in Loki as well: grafana/loki#10172. You can find the Loki configs there as well!
Hi, we have experienced this issue too and work around it by setting a different memberlist cluster_label:

    memberlist:
      ...
      cluster_label: <cluster>.<namespace>

We set it to the cluster and namespace of the install, although anything will work as long as the values are different. Can you try setting this and see if it resolves the issue? It sounds like maybe it needs a default in the helm chart. The root issue (as far as I can remember) is that Tempo and Loki use the same generic ring names.
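For a concrete picture, here is a minimal sketch of the workaround with two installs in the same Kubernetes cluster; the label values (`tempo.observability`, `loki.observability`) are placeholders, not anything the charts set for you:

```yaml
# Sketch only: each install gets its own cluster_label, so its memberlist
# gossip rejects packets from the other install even if a pod IP is reused.

# tempo.yaml (Tempo install)
memberlist:
  cluster_label: tempo.observability
---
# loki.yaml (Loki install)
memberlist:
  cluster_label: loki.observability
```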
@mdisibio Odd that this would happen across namespaces. I assume it has to do with how DNS discovery is being done?
I believe this happens due to IP reuse in Kubernetes. After a node in memberlist disappears, the cluster will still reach out to it for a certain timeout period. If a Tempo cluster has a pod shut down and a Loki cluster has a pod start up and claim its IP within that timeout, the two memberlist clusters can join together. We generally saw this occurring when rolling out two clusters at the same time.
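If it helps to see where that timeout lives, the relevant dskit memberlist options look roughly like the sketch below; the values are just the defaults as I remember them, so double-check them against your Loki/Tempo version:

```yaml
# Sketch of the memberlist settings that control how long a departed member
# keeps being contacted -- the window in which a reused pod IP from another
# cluster can be mistaken for the old member.
memberlist:
  gossip_to_dead_nodes_time: 30s      # keep gossiping to a dead node for this long
  dead_node_reclaim_time: 0s          # how soon a dead node's name/address may be reclaimed
  cluster_label: tempo.observability  # the label check is what actually blocks the merge
```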
We have run into this too. Would it make sense to configure the cluster label (each of Mimir, Loki and Tempo having a different value) in the helm chart by default, to avoid people running into this?
Just ran into this myself in prod.
We are running Tempo/Loki/Mimir on spot nodes, so there is probably more than average turnover, and we have hit this too. I am wondering if that's why, but also why it is happening across namespaces. It basically causes everything to freeze up, so it seems like a pretty important issue. We've rolled out LGTM with the helm charts and I'm wondering if it makes sense to default the cluster labels at least, @joe-elliott? Or is this something better solved in the code? @mdisibio I'm not seeing a … For those using the charts, see the sketch below.
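A hedged sketch of setting the label per install through the charts; the structuredConfig paths are my assumption about how the tempo-distributed and loki charts pass extra config through (verify them against your chart versions), and the label values are placeholders:

```yaml
# tempo-distributed values.yaml (sketch -- verify the key path for your chart version)
tempo:
  structuredConfig:
    memberlist:
      cluster_label: tempo.<namespace>
---
# loki values.yaml (sketch)
loki:
  structuredConfig:
    memberlist:
      cluster_label: loki.<namespace>
```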
Yeah, I'd be on board with defaulting the cluster label in the helm chart. I don't know much about helm, but if there were some way to put a unique value in …
Both Loki and Tempo use the same memberlist config. That …
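One way a chart could derive that unique value, sketched with standard Helm release metadata; this is a guess at an approach, not something either chart currently does:

```yaml
# Hypothetical snippet inside the chart template that renders the config file
# (e.g. the ConfigMap for tempo.yaml) -- not current chart behaviour.
memberlist:
  cluster_label: "{{ .Release.Name }}.{{ .Release.Namespace }}"
```

Since the release name and namespace are unique per install within a cluster, two releases in the same Kubernetes cluster would never accidentally share a label.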
Thanks @joe-elliott, I'll get some PRs going!
Love it! Thanks @hobbsh
This issue has been automatically marked as stale because it has not had any activity in the past 60 days.
Still relevant.
Experienced this just recently; thanks @hobbsh for the solution.
Describe the bug
Tempo ingesters registered to Loki's ingester ring, which caused Loki to go down and stop returning logs.
To Reproduce
Steps to reproduce the behavior:
Unsure how to reproduce this issue, as it has never happened in our current deployment before.
Expected behavior
Loki ingesters should register to the Loki ring and Tempo ingesters should register to the Tempo ring.
Environment:
Current deployment uses the tempo-distributed helm chart on EKS. Attached are the tempo.yaml and the nginx conf for tempo-gateway.
Tempo.yaml:
tempo-gateway nginx:
Additional Context
The only log line that directed us to the issue was:
level=warn ts=2023-08-04T14:32:18.386282517Z caller=logging.go:86 traceID=54e1a62fbdffbc09 orgID=fake msg="POST /loki/api/v1/push (500) 4.35479ms Response: \"rpc error: code = Unimplemented desc = unknown service logproto.Pusher\\n\" ws: false; Connection: close; Content-Length: 177219; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "