Promtail: 'entry out of order' errors sporadically, and also ran into a complete breakage of one stream #3425
CC: @LaikaN57
I scanned through the ingester logs; you can see that it was rejecting a percentage of our nginx logs from this pod, but not all of them. I used this query: Here is a small snapshot of the returned data:
Here are a few full-length log lines in case they help:
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Slack Thread: https://grafana.slack.com/archives/CEPJRLQNL/p1614732770319400
Describe the bug
We're running Promtail 2.1.0 into a Loki Distributed setup. The Promtail configuration tails pod logs only, and the underlying K8S hosts are running AWS BottleRocket v1.0.5 (which uses containerd for its CRI). We occasionally see short spurts of "entry out of order" messages from our various Promtail pods in the cluster. They tend to go away, but we understand that we have likely dropped some logs when they happen.
Yesterday a single Promtail pod became unable to send most of the log entries from a pod into Loki. Virtually every push failed with an "entry out of order" message. This went on for ~45m before we killed the Promtail pod and let it get recreated. Once recreated, it came up cleanly and worked smoothly.
Is it possible that some internal state in Promtail got corrupted with regard to the timestamps? Or is something wrong with our promtail.yaml file?
The odd triggering situation
The clusters these pods run on use Spot instances, so instances rotate out regularly. In this case, we happened to lose the instance that held the `loki-gateway` pod itself, which of course broke data ingestion while that pod was coming up on a new host. All of our other Promtail pods were already up and running, so they simply stopped sending data until the gateway was back. The Promtail pod that failed was the one that came up on the newly booted instance replacing the original one. When Promtail starts, you can see it first discovers the two log streams (jaeger and nginx in this case), but then fails to send data to the gateway service because it's not quite up yet. After a few seconds the gateway comes up, and all of the other log streams start flowing.
Note about the timestamps
I initially thought we just had a batch of early data that would eventually work its way out of the pipeline through failed retries/backoffs. However, we noticed that even 45 minutes into the problem, the "newly failed" timestamps were continuing to update and were current timestamps. So we weren't failing some early batch of data; we were failing data continually. This is what led us to restart the Promtail pod entirely.
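For context on why current timestamps can still be rejected: Loki versions of this era (pre-2.4) enforce strict per-stream ordering, so any entry older than the newest entry already accepted for that exact stream (same label set) is rejected, no matter how recent it is. Below is a minimal illustrative sketch of that rule, not Loki's actual code; the class and names are invented for illustration:

```python
# Illustrative model of Loki's (pre-2.4) per-stream ordering check.
# Each stream (unique label set) tracks the timestamp of the newest
# accepted entry; anything older is rejected as "entry out of order".

class StreamModel:
    def __init__(self):
        self.last_ts = 0  # nanosecond timestamp of newest accepted entry

    def push(self, ts, line):
        if ts < self.last_ts:
            # This is the condition that surfaces in promtail logs as
            # "entry out of order" (the entry is dropped after retries).
            raise ValueError(f"entry out of order: {ts} < {self.last_ts}")
        self.last_ts = ts

s = StreamModel()
s.push(100, "first line")
s.push(200, "second line")

rejected = False
try:
    # Even a "fresh" entry is rejected if something newer for the same
    # stream was already accepted -- which is why restarting promtail
    # (resetting what it sends first) can unstick a stream.
    s.push(150, "late line")
except ValueError:
    rejected = True
```

This would explain the observed behavior: if the ingester accepted some out-of-order or duplicated entries during the gateway outage, every subsequent entry behind that high-water mark for the stream keeps failing, even with current timestamps.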
To Reproduce
Steps to reproduce the behavior:
Environment:
Loki: 2.1.0
Promtail: 2.1.0
Kubernetes: EKS 1.19
Host OS: AWS BottleRocket 1.0.5
Deployment Tool: ArgoCD/Helm
Screenshots, Promtail config, or terminal output
promtail log from offending pod
promtail.yaml