
Loki POD crashes randomly #605

Closed
gauravbist opened this issue May 20, 2019 · 16 comments
Comments

@gauravbist

Describe the bug
Loki pod crashes randomly with the error: "Liveness probe failed: HTTP probe failed with statuscode: 500"

To Reproduce
Searching in Grafana gives the error: "Unknown error during query transaction. Please check JS console logs". The Loki pod was then found crashed with the error: "Liveness probe failed: HTTP probe failed with statuscode: 500"

@gauravbist
Author

Please ignore it. I think it's because the node health check is not working properly.

@gauravbist
Author

I fixed the nodes' liveness probe issue, but the problem is still not resolved. As soon as I search logs in Grafana, the Loki pod crashes.

Please find the Loki pod details attached:

kubectl describe pod loki-d86549668-2c4r7 -n prometheus

lokierror.txt

@cyriltovena
Contributor

Hello @gauravbist ,

Can you share the logs of the crashing pod, please?

Thank you !

@gauravbist
Author

loki-v5.txt

Is that what you need?

@cyriltovena
Contributor

I don't see any crash log. Can you try removing the liveness and readiness checks for a while and see what happens?

@gauravbist
Author

How do I do that? Should I remove the lines below from the running Loki deployment YAML file?

"livenessProbe": {
  "httpGet": { "path": "/ready", "port": "http-metrics", "scheme": "HTTP" },
  "initialDelaySeconds": 45,
  "timeoutSeconds": 5,
  "periodSeconds": 10,
  "successThreshold": 1,
  "failureThreshold": 3
},
"readinessProbe": {
  "httpGet": { "path": "/ready", "port": "http-metrics", "scheme": "HTTP"
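[Editor's note: for reference, a minimal sketch of what the container spec looks like with both probe blocks taken out. Container name, image, and port follow the stock chart defaults and are assumptions; adjust to your own deployment.]

```yaml
# Sketch: Loki container spec with livenessProbe/readinessProbe removed.
# Everything else in the deployment stays as the chart rendered it.
containers:
  - name: loki
    image: grafana/loki:latest
    ports:
      - name: http-metrics
        containerPort: 3100
    # livenessProbe and readinessProbe intentionally omitted while debugging
```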

@cyriltovena
Contributor

Yes, please! I'm not sure why Loki returns a 500 yet.

@gauravbist
Author

gauravbist commented May 21, 2019

After removing the liveness and readiness checks, there have been no crashes so far, but old logs are not showing up in Grafana.
logsnew.txt

@gauravbist
Author

gauravbist commented May 22, 2019

It is still not stable. Sometimes it works, but most of the time it crashes.

loki-describe.txt

Please let me know how to troubleshoot or debug this.

I have attached the loki-promtail logs too:
loki-promtail-2q972.txt
loki-promtail-hjmst.txt
loki-promtail-z4tdv.txt
loki-promtail-zcgjl.txt

@cyriltovena
Contributor

I think your Loki is getting killed. Can you look at memory usage and Kubernetes events?

kubectl get events --all-namespaces

@gauravbist
Author

I don't think it is getting killed due to memory or CPU usage, because the node has 30 GB of memory with 4 vCPU cores, and only a few pods are running on it.

The events don't show it either:
events.txt

Is there any setting or configuration to enable debug mode, so that we can trace the root cause?
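[Editor's note: Loki's log verbosity can usually be raised via the standard `-log.level` server flag. A sketch of the container args follows; the config-file path is an assumption from the stock chart and may differ in this deployment.]

```yaml
# Sketch: pass -log.level=debug to the Loki container.
# /etc/loki/loki.yaml is an assumed config path; match your chart's mount.
args:
  - "-config.file=/etc/loki/loki.yaml"
  - "-log.level=debug"
```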

@cyriltovena
Contributor

cyriltovena commented May 23, 2019

Can you try upgrading your Helm deployment to the latest version, and also deleting all Loki pods once that's done?

@gregwebs

@gauravbist please check on the memory usage. We have seen an unbounded memory leak in Loki.

@christopher-wong

christopher-wong commented May 31, 2019

Ran into this problem today with the latest version of Loki. Health/liveness checks keep failing and k8s kills the pod. Loki itself seems fine, and I can query logs until k8s kills the pod.

level=info ts=2019-06-05T23:28:42.193600982Z caller=lifecycler.go:380 msg="entry not found in ring, adding with no tokens"
level=info ts=2019-06-05T23:28:42.19458173Z caller=lifecycler.go:310 msg="auto-joining cluster after timeout"
level=warn ts=2019-06-05T23:29:31.81085494Z caller=logging.go:49 traceID=fd87c75b9b0261 msg="GET /ready (500) 71.077µs Response: \"Not ready: waiting for 1m0s after startup\\n\" ws: false; Accept-Encoding: gzip; Connection: close; User-Agent: kube-probe/1.12; "
level=warn ts=2019-06-05T23:29:36.355412592Z caller=logging.go:49 traceID=1b4ca57b1d2169e msg="GET /ready (500) 53.988µs Response: \"Not ready: waiting for 1m0s after startup\\n\" ws: false; Accept-Encoding: gzip; Connection: close; User-Agent: kube-probe/1.12; "
level=warn ts=2019-06-05T23:29:41.810769639Z caller=logging.go:49 traceID=9dbe11edf5a667 msg="GET /ready (500) 54.502µs Response: \"Not ready: waiting for 1m0s after startup\\n\" ws: false; Accept-Encoding: gzip; Connection: close; User-Agent: kube-probe/1.12; "
$ kubectl get pods -n loki
NAME                  READY   STATUS    RESTARTS   AGE
loki-0                0/1     Running   3          24h
loki-promtail-4244l   1/1     Running   0          24h
loki-promtail-7mtg2   1/1     Running   0          24h
loki-promtail-9cdp2   1/1     Running   0          24h
loki-promtail-hqktb   1/1     Running   0          24h
loki-promtail-l4lvh   1/1     Running   0          24h
loki-promtail-vkzws   1/1     Running   0          24h

After a while it'll come back online.
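[Editor's note: since the 500s above are `Not ready: waiting for 1m0s after startup`, one workaround is to delay the liveness probe past that window. A sketch follows, reusing the probe shape quoted earlier in the thread with only the delay raised; the exact value is an assumption.]

```yaml
# Sketch: raise initialDelaySeconds past Loki's 1m0s post-startup wait,
# so the kubelet doesn't kill the pod while /ready still returns 500.
livenessProbe:
  httpGet:
    path: /ready
    port: http-metrics
  initialDelaySeconds: 75   # > 60s, covering "waiting for 1m0s after startup"
  timeoutSeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```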

@gregwebs

See #613.

@cyriltovena
Contributor

Yeah, let's keep only the first issue; this is a dup of #613.
