Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gossip ring memberlist - cross-connect within namespace #4273

Closed
vkg23 opened this issue Nov 4, 2024 · 7 comments
Closed

gossip ring memberlist - cross-connect within namespace #4273

vkg23 opened this issue Nov 4, 2024 · 7 comments

Comments

@vkg23
Copy link

vkg23 commented Nov 4, 2024

Describe the bug
While it may not be ideal, i have two distributed instances of Tempo running in the same Namespace
Both are separated clearly by its own labels and its own gossip-ring service .
While everything was running fine for almost an year , suddenly i noticed Traces were getting created in both stores.
Even it was found in S3. Metrics were showing cross-tenant traces as well.

Memberlist was expected to working fine with this. However it didn't.

  join_members:
      - dns+mytempo-prod-gossip-ring:7946

Based on some events thrown by the members, A workaround was attempted to include this param

cluster_label: tempo-prod.<mynamespace>
cluster_label_verification_disabled: false

This helped to fix the issue .

Am wondering , why headless service didnt work as expected. Tried to explore the memberlist module , but haven't got a clarity yet.

To Reproduce
Deploy with in namespace , Check the member list via API query .

Expected behavior
Ideally only own headless service associated members only expected

Environment:
k8s. helm chart - tempo-distributed-1.7.3
Image: tempo:2.3.1-amd64

Additional Context

@vkg23
Copy link
Author

vkg23 commented Nov 4, 2024

While exploring memberlist module , came across a comment from @joe-elliott in this thread for LOKI

#2766 (comment) . it says

I believe this happens due to IP reuse in k8s. After a node in memberlist disappears the cluster will still reach out to it for a certain timeout period. If a Tempo cluster has a pod shutdown and a Loki cluster has a pod start up and claim its IP within the timeout then the two memberlist clusters can join together. We generally saw this occurring when rolling out two clusters at the same time.

Seems the case is same here .

Good to know this behaviour .
However , in my case TB's of data went across prod and pre-prod instances until the addition of cluster_label.

@joe-elliott
Copy link
Member

joe-elliott commented Nov 5, 2024

Thanks for reporting this and the solution. I'm considering submitting a PR to the helm charts repo that sets the cluster_label equal to this value:

https://github.com/grafana/helm-charts/blob/main/charts/tempo-distributed/values.yaml#L22

Do you think that would help? As long as people named their Tempo clusters differently it would prevent this and it would definitely prevent it from occurring between Tempo/Mimir or Tempo/Loki

@vkg23
Copy link
Author

vkg23 commented Nov 5, 2024

Thank You.
Why not directly include - "cluster_label" with the helm release name param?
cluster_label: <Releasename>.<namespace>

just like in this case of manual fix
cluster_label: tempo-prod.<mynamespace>.

I may be wrong, but wouldn't this be a simpler approach than Values.fullnameOverride.
Let me know your thoughts , Will be happy to help with the PR

@joe-elliott
Copy link
Member

I may be wrong, but wouldn't this be a simpler approach than Values.fullnameOverride.

I think you have the right idea. I honestly don't know much about the helm chart so I was just guessing. Will put up a PR if you could give it a review 🙏

@joe-elliott
Copy link
Member

joe-elliott commented Nov 6, 2024

Looks like someone already did:

https://github.com/grafana/helm-charts/pull/3058/files
https://github.com/grafana/helm-charts/pull/3059/files

I'm going to clean this person's PRs up and get them mergeable. Obviously we missed this.

EDIT: PRs updated. if you get a chance to review it would be appreciated!

@vkg23
Copy link
Author

vkg23 commented Nov 7, 2024

Thats great :) .
Np with review . Happy to help. Sure, Will attempt a test and confirm back to you shortly

@vkg23
Copy link
Author

vkg23 commented Nov 9, 2024

Tested and working good as expected.
Noticed the PR is already merged , Sharing the test results for anyone's reference.


# kubectl logs tempo-distributor-6455985c76-9ldcx  | grep cluster_label
level=info ts=2024-11-09T13:33:36.533000319Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" \ 
cluster_label=tempo.default node=tempo-distributor-6455985c76-9ldcx-9900f5f6

# kubectl exec nginx -- curl --silent http://tempo-querier:3100/memberlist  -H "Accept: application/json"  | jq '.SortedMembers[].Addr' | sort 
"10.42.0.72"
"10.42.0.73"
"10.42.0.75"
"10.42.1.64"
"10.42.1.65"
"10.42.2.43"
# 
# kubectl get ep tempo-gossip-ring  -o yaml  | grep ip: | sort                   //  ep == ring                                                                                      
  - ip: 10.42.0.72
  - ip: 10.42.0.73
  - ip: 10.42.0.75
  - ip: 10.42.1.64
  - ip: 10.42.1.65
  - ip: 10.42.2.43
# 

// Tempo-2 in same Namespace. 

# kubectl logs tempo-2-compactor-6d8db7d54f-f2cnk   | grep cluster_label
level=info ts=2024-11-09T13:33:52.315735243Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" \
 cluster_label=tempo-2.default node=tempo-2-compactor-6d8db7d54f-f2cnk-8cfa91d9


# kubectl exec nginx -- curl --silent http://tempo-2-querier:3100/memberlist  -H "Accept: application/json"  | jq '.SortedMembers[].Addr' | sort 
"10.42.0.76"
"10.42.0.77"
"10.42.0.78"
"10.42.1.67"
"10.42.1.69"
"10.42.2.44"
# kubectl get ep tempo-2-gossip-ring  -o yaml  | grep ip: | sort                 //  ep == ring                                                                         
  - ip: 10.42.0.76
  - ip: 10.42.0.77
  - ip: 10.42.0.78
  - ip: 10.42.1.67
  - ip: 10.42.1.69
  - ip: 10.42.2.44


# kubectl logs tempo-2-compactor-6d8db7d54f-f2cnk   | grep cluster_label
level=info ts=2024-11-09T13:33:52.315735243Z caller=memberlist_client.go:439 msg="Using memberlist cluster label and node name" cluster_label=tempo-2.default node=tempo-2-compactor-6d8db7d54f-f2cnk-8cfa91d9

@vkg23 vkg23 closed this as completed Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants