
Tempo ingesters register to Loki's ring #10172

Open
jlynch93 opened this issue Aug 4, 2023 · 4 comments
Labels
component/ingester component/ring type/bug Something is not working as expected

Comments

jlynch93 commented Aug 4, 2023

Describe the bug
Tempo ingesters registered to Loki's ingester ring, which caused Loki to go down and stop returning logs.

To Reproduce
Steps to reproduce the behavior:
Unsure how to reproduce this issue, as it had never happened in our current deployment before.

Expected behavior
Loki ingesters should register to Loki's ring, and Tempo ingesters should register to Tempo's ring.

Environment:
The current deployment uses the tempo-distributed Helm chart on EKS. Attached is the Loki config:

auth_enabled: false
chunk_store_config:
  chunk_cache_config:
    embedded_cache:
      enabled: true
      ttl: 24h
common:
  compactor_address: http://loki-loki-distributed-compactor:3100
compactor:
  compaction_interval: 10m
  deletion_mode: filter-and-delete
  retention_delete_delay: 10m
  retention_delete_worker_count: 150
  retention_enabled: true
  shared_store: s3
distributor:
  ring:
    kvstore:
      store: memberlist
frontend:
  compress_responses: true
  log_queries_longer_than: 0
  scheduler_address: loki-loki-distributed-query-scheduler:9095
  tail_proxy_url: http://loki-loki-distributed-querier:3100
frontend_worker:
  grpc_client_config:
    max_recv_msg_size: 1048576000
    max_send_msg_size: 1677721600
  match_max_concurrent: false
  parallelism: 500
  scheduler_address: loki-loki-distributed-query-scheduler:9095
ingester:
  autoforget_unhealthy: true
  chunk_encoding: snappy
  chunk_idle_period: 30m
  chunk_target_size: 262144
  lifecycler:
    ring:
      heartbeat_timeout: 0
      kvstore:
        store: memberlist
      replication_factor: 1
  max_chunk_age: 24h
  max_transfer_retries: 0
  query_store_max_look_back_period: 0
  sync_min_utilization: 0.5
  wal:
    dir: /var/loki/wal
ingester_client:
  pool_config:
    remote_timeout: 10s
  remote_timeout: 60s
limits_config:
  cardinality_limit: 1000000
  ingestion_burst_size_mb: 20000
  ingestion_rate_mb: 1000
  max_cache_freshness_per_query: 10m
  max_concurrent_tail_requests: 200
  max_entries_limit_per_query: 5000000
  max_global_streams_per_user: 0
  max_query_length: 0
  max_query_series: 500000
  max_streams_per_user: 0
  per_stream_rate_limit: 1000MB
  per_stream_rate_limit_burst: 20000MB
  reject_old_samples: false
  reject_old_samples_max_age: 168h
  retention_period: 365d
  split_queries_by_interval: 24h
memberlist:
  join_members:
  - loki-loki-distributed-memberlist.grafana-loki.svc.cluster.local
  randomize_node_name: false
querier:
  engine:
    max_look_back_period: 60m
    timeout: 60m
  max_concurrent: 500000
  query_ingester_only: false
  query_store_only: false
query_range:
  align_queries_with_step: true
  cache_results: true
  max_retries: 5
  results_cache:
    cache:
      default_validity: 24h
      embedded_cache:
        enabled: true
        ttl: 1h
      enable_fifocache: true
      fifocache:
        max_size_bytes: 10GB
        max_size_items: 0
        validity: 24h
query_scheduler:
  max_outstanding_requests_per_tenant: 500
  scheduler_ring:
    heartbeat_period: 0
    heartbeat_timeout: 0
    kvstore:
      store: memberlist
  use_scheduler_ring: false
ruler:
  alertmanager_url: http://prometheus-kube-prometheus-alertmanager.prometheus.svc:9093
  enable_api: true
  external_url: http://prometheus-kube-prometheus-alertmanager.prometheus.svc:9093
  ring:
    heartbeat_period: 0
    heartbeat_timeout: 0
    kvstore:
      store: inmemory
  rule_path: /opt/loki/ruler/scratch
  storage:
    local:
      directory: /opt/loki/ruler/rules
    type: local
schema_config:
  configs:
  - from: "2020-09-07"
    index:
      period: 24h
      prefix: loki_index_
    object_store: aws
    schema: v11
    store: boltdb-shipper
server:
  grpc_server_max_concurrent_streams: 0
  grpc_server_max_recv_msg_size: 419430400
  grpc_server_max_send_msg_size: 419430400
  http_listen_port: 3100
  http_server_idle_timeout: 15m
  http_server_read_timeout: 15m
  http_server_write_timeout: 15m
  log_level: info
storage_config:
  aws:
    backoff_config:
      max_period: 15s
      max_retries: 15
      min_period: 100ms
    bucketnames: HIDDEN
    http_config:
      idle_conn_timeout: 20m
    region: us-east-1
  boltdb_shipper:
    active_index_directory: /var/loki/index
    cache_location: /var/loki/cache
    cache_ttl: 24h
    query_ready_num_days: 1
    resync_interval: 5m
    shared_store: s3
  index_queries_cache_config: null
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s

Loki gateway nginx config

worker_processes  5;  ## Default: 1
error_log  /dev/stderr;
pid        /tmp/nginx.pid;
worker_rlimit_nofile 8192;
events {
  worker_connections  4096;  ## Default: 1024
}
http {
  proxy_read_timeout 90000s;
  proxy_connect_timeout 90000s;
  proxy_send_timeout 90000s;
  fastcgi_read_timeout 90000s;
  client_body_temp_path /tmp/client_temp;
  proxy_temp_path       /tmp/proxy_temp_path;
  fastcgi_temp_path     /tmp/fastcgi_temp;
  uwsgi_temp_path       /tmp/uwsgi_temp;
  scgi_temp_path        /tmp/scgi_temp;
  default_type application/octet-stream;
  log_format   main '$remote_addr - $remote_user [$time_local]  $status '
        '"$request" $body_bytes_sent "$http_referer" '
        '"$http_user_agent" "$http_x_forwarded_for"';
  access_log   /dev/stderr  main;
  sendfile     on;
  tcp_nopush   on;
  resolver kube-dns.kube-system.svc.cluster.local;
  server {
    listen             8080;
    location = / {
      return 200 'OK';
      auth_basic off;
    }
    location = /api/prom/push {
      proxy_pass       http://loki-loki-distributed-distributor.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /api/prom/tail {
      proxy_pass       http://loki-loki-distributed-querier.grafana-loki.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    # Ruler
    location ~ /prometheus/api/v1/alerts.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /prometheus/api/v1/rules.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/rules.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/alerts.* {
      proxy_pass       http://loki-loki-distributed-ruler.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location ~ /api/prom/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /loki/api/v1/push {
      proxy_pass       http://loki-loki-distributed-distributor.grafana-loki.svc.cluster.local:3100$request_uri;
    }
    location = /loki/api/v1/tail {
      proxy_pass       http://loki-loki-distributed-querier.grafana-loki.svc.cluster.local:3100$request_uri;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
    location ~ /loki/api/.* {
      proxy_pass       http://loki-loki-distributed-query-frontend.grafana-loki.svc.cluster.local:3100$request_uri;
    }
  }
}

Screenshots, Promtail config, or terminal output
The only log line that pointed us to the issue was:
level=warn ts=2023-08-04T14:32:18.386282517Z caller=logging.go:86 traceID=54e1a62fbdffbc09 orgID=fake msg="POST /loki/api/v1/push (500) 4.35479ms Response: \"rpc error: code = Unimplemented desc = unknown service logproto.Pusher\\n\" ws: false; Connection: close; Content-Length: 177219; Content-Type: application/x-protobuf; User-Agent: promtail/2.6.1; "
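That "unknown service logproto.Pusher" response is what a gRPC client gets when it calls Loki's Pusher service on an endpoint that does not implement it, which is consistent with the distributor picking up a Tempo ingester from the ring. Note that the memberlist block in the config above sets only join_members, with no cluster_label, so gossip from another dskit-based cluster is not rejected. Below is a minimal sketch of the mitigation suggested in the comments further down, assuming a Loki version whose memberlist config supports cluster_label; the label value is only a placeholder:

memberlist:
  # Tag all gossip for this cluster; members drop gossip carrying a different
  # label, so a Tempo or Mimir ring cannot merge into Loki's ring.
  # The value "loki" is only an illustrative choice.
  cluster_label: loki
  # When adding a label to an already-running cluster, the check can be relaxed
  # during rollout with cluster_label_verification_disabled: true, then
  # re-enabled once every member carries the label.
  join_members:
  - loki-loki-distributed-memberlist.grafana-loki.svc.cluster.local
  randomize_node_name: false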

jlynch93 (Author) commented Aug 4, 2023

I created the same issue in the Tempo repo: grafana/tempo#2766. You can find the Tempo configs there as well!

JStickler added the component/ingester, component/ring, and type/bug (Something is not working as expected) labels on Aug 10, 2023
@pawankkamboj

We have also faced the same issue; it has happened 3 times so far in the last year.

mzupan commented Oct 15, 2023

I've dealt with this happening randomly. What I found worked was a fully qualified hostname for the join address together with a cluster_label, like this:

    memberlistConfig:
      cluster_label: loki-dev
      join_members:
        - loki-memberlist.loki-dev.svc.cluster.local:7946

On the Mimir side you can do the same thing; pretty sure you can with Tempo also:

  memberlist:
    cluster_label: mimir
    join_members:
      - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}

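For reference, the Tempo side can carry its own distinct label so the two gossip clusters reject each other's traffic. A rough sketch against Tempo's raw config, assuming a Tempo version whose memberlist block (also from dskit) exposes cluster_label; the service name, namespace, and label value below are placeholders, not taken from this thread:

    memberlist:
      # Tempo members only accept gossip tagged "tempo", so they cannot merge
      # into a ring labeled "loki" (and vice versa). Values are illustrative.
      cluster_label: tempo
      join_members:
        - tempo-tempo-distributed-gossip-ring.tempo.svc.cluster.local:7946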

MikaelElkiaer commented Jan 19, 2024

I've dealt with this happening randomly. What I found worked was a fully qualified hostname for the join address together with a cluster_label, like this:

    memberlistConfig:
      cluster_label: loki-dev
      join_members:
        - loki-memberlist.loki-dev.svc.cluster.local:7946

On the Mimir side you can do the same thing; pretty sure you can with Tempo also:

  memberlist:
    cluster_label: mimir
    join_members:
      - dns+{{ include "mimir.fullname" . }}-gossip-ring.{{ .Release.Namespace }}.svc.{{ .Values.global.clusterDomain }}:{{ include "mimir.memberlistBindPort" . }}

Wow, this seems to do the trick, thanks!

But what a mess; how can this gotcha not be clearly documented somewhere?
Before finding this comment, I saw #10537, which was not very helpful.
grafana/mimir#2865 pointed me in the right direction for finding this issue.

Edit: Spoke too soon, the problem persists...

Edit 2: Not sure if it is fixed or not. At least I have not seen the error for a day now. It seemed to take the greater part of the weekend to stabilize.
