Promtail: 'entry out of order' errors sporadically, and also ran into a complete breakage of one stream #3425
CC: @LaikaN57
I scanned through the ingester logs; you can see that it was rejecting a percentage of our nginx logs from this pod, but not all of them. I used this query: Here is a small snapshot of the returned data:
Here are a few full-length log lines in case they help:
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Slack Thread: https://grafana.slack.com/archives/CEPJRLQNL/p1614732770319400
Describe the bug
We're running Promtail 2.1.0 into a Loki Distributed setup. The Promtail configuration tails pod logs only, and the underlying K8S hosts are running AWS BottleRocket v1.0.5 (which uses containerd for its CRI). We occasionally see short spurts of "entry out of order" messages from our various Promtail pods in the cluster. They tend to go away, but we understand that we have likely dropped some logs when they happen.
Yesterday a single Promtail pod became unable to send most of the log entries from a pod into Loki. Virtually every push failed with an "entry out of order" message. This went on for ~45m before we killed the Promtail pod and let it get recreated. Once recreated, it came up cleanly and worked smoothly.
Is it possible that some internal state in Promtail got corrupted with regard to the timestamps? Or is something wrong with our promtail.yaml file?
The odd triggering situation
The clusters these pods run on use Spot instances, so instances rotate out regularly. In this case, we happened to lose the instance that held the `loki-gateway` pod itself, which of course broke data ingestion while that pod was coming up on a new host. All of our other Promtail pods were already up and running, so they simply stopped sending data until the gateway was back. The Promtail pod that failed was the one that came up on the newly booted instance replacing the original one. When Promtail starts, you can see it first discovers the two log streams (jaeger and nginx in this case), but then fails to send data to the gateway service because it's not quite up yet. After a few seconds the gateway comes up, and all of the other log streams start flowing.
Note about the timestamps
I initially thought we just had a batch of early data that would eventually work its way out of the pipeline through failed retries/backoffs. However, we noticed that even 45 minutes into the problem, the "newly failed" timestamps were continuing to update and were current timestamps. So we weren't failing some early batch of data; we were failing data continually. This is what led us to restart the Promtail pod entirely.
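For context on why current timestamps can still be rejected: Loki versions of this era (pre-2.4) enforce strict per-stream ordering, so any entry older than the newest entry already accepted for that exact stream (same label set) is rejected, no matter how recent it is. Below is a minimal illustrative sketch of that rule, not Loki's actual code; the class and names are invented for illustration:

```python
# Illustrative model of Loki's (pre-2.4) per-stream ordering check.
# Each stream (unique label set) tracks the timestamp of the newest
# accepted entry; anything older is rejected as "entry out of order".

class StreamModel:
    def __init__(self):
        self.last_ts = 0  # nanosecond timestamp of newest accepted entry

    def push(self, ts, line):
        if ts < self.last_ts:
            # This is the condition that surfaces in promtail logs as
            # "entry out of order" (the entry is dropped after retries).
            raise ValueError(f"entry out of order: {ts} < {self.last_ts}")
        self.last_ts = ts

s = StreamModel()
s.push(100, "first line")
s.push(200, "second line")

rejected = False
try:
    # Even a "fresh" entry is rejected if something newer for the same
    # stream was already accepted -- which is why restarting promtail
    # (resetting what it sends first) can unstick a stream.
    s.push(150, "late line")
except ValueError:
    rejected = True
```

This would explain the observed behavior: if the ingester accepted some out-of-order or duplicated entries during the gateway outage, every subsequent entry behind that high-water mark for the stream keeps failing, even with current timestamps.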
To Reproduce
Steps to reproduce the behavior:
Environment:
Loki: 2.1.0
Promtail: 2.1.0
Kubernetes: EKS 1.19
Host OS: AWS BottleRocket 1.0.5
Deployment Tool: ArgoCD/Helm
Screenshots, Promtail config, or terminal output
promtail log from offending pod
promtail.yaml