OOMKilled on every query #613
Comments
Could you delete the resource limit and try again? You can use logcli to query directly.
Without limits, it went up to 26G and got evicted. Basically it wanted all the memory on the VM, so the kubelet killed it.
Hello @agolomoodysaada, results should be limited to 1k log lines; also, two PRs were merged recently to lazily load chunks to avoid this situation. What are you using to query logs? What are the query, limit, and range, and how much data do you expect? What is the SHA of the Docker image you are using? Thank you
I'm using the Grafana 6.2.0 Explore tab.
I experienced this on several different versions of Loki, including the latest. But I don't even need to query Loki: it will leak memory at a rate of more than 100 MiB a minute and eventually be killed by K8s. It seems that there are certain containers (maybe producing certain logs?) that it doesn't like, because memory usage is flat until certain things are running. The amount of logs I am producing is very small; it doesn't seem possible to use gigabytes of memory even if all the log lines were held in memory.
Your helm chart is a bit old; try the latest. Also, set Loki's extraArgs in the chart to
In our case, we were able to work around the issue by setting the chunk retain period. Whereas previously memory would grow until being OOM killed, with a 3 minute retain period we reached a steady state of ~1.7 GB memory usage. It would be great to have some more in-depth documentation with respect to these configuration settings, or perhaps some operational guidance.
Just in case anyone tries this, the parameter is `chunk_retain_period`.
I'm still experiencing OOMKilled at query time with this config:

```yaml
loki:
  config:
    ingester:
      chunk_retain_period: 3m
```

and these chart dependencies:

```yaml
dependencies:
  - name: "loki"
    condition: loki.enabled
    repository: "@loki"
    version: "^0.8.4"
  - name: "promtail"
    condition: promtail.enabled
    repository: "@loki"
    version: "^0.7.3"
```
Can you share the full config?
Everything else is using the default helm chart config.
Can you send us some profiling information (I'm sorry, but we can't reproduce it) by doing this: first forward the Loki port locally to, say, 8080.
Then run:
And share the image with us! (It requires the Go toolchain installed, which shouldn't be difficult to install if you don't have it yet.) Thank you
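A minimal sketch of what those two steps might look like, assuming the Loki service is named `loki`, serves HTTP on port 3100, and exposes the standard Go `/debug/pprof` endpoints; the exact commands are not quoted in the comment above:

```sh
# Forward Loki's HTTP port to 8080 locally (service name is an assumption).
kubectl port-forward svc/loki 8080:3100

# Capture a heap profile and render it as a PNG to share
# (requires the Go toolchain and graphviz for image output).
go tool pprof -png http://localhost:8080/debug/pprof/heap > heap.png
```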
@Kuqd do the graphs make sense to you or would you like us to provide you with something else?
I don't see any problem except that ingesters and queriers are deployed together.
@Kuqd I think the total of our current log text size could not be more than a megabyte per minute, yet we are leaking memory at a rate of more than 100 megabytes per minute. Why does the deployment topology matter for this memory leak? We would really like to use Grafana for logs, but it is very difficult to do when the logging system crashes every 5 minutes after consuming all available memory.
It does matter because ingesters are meant to scale on their own; right now the querier and ingester share the same memory space. I will investigate.
For sure the helm chart is not production ready. Ksonnet is.
@gregwebs @agolomoodysaada can you try to tweak your config as follows?

```yaml
loki:
  config:
    ingester:
      chunk_idle_period: 3m
      chunk_retain_period: 1m
```

Also, can you give me a sense of how many lines per second you are ingesting? You can use a Prometheus query for that. Finally, what does the query look like: time range? LogQL selector? Thank you!
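A hedged example of such a rate query, assuming the Loki distributor is scraped by Prometheus; the metric name `loki_distributor_lines_received_total` and the Prometheus address are assumptions, not quoted from the thread:

```sh
# Approximate ingest rate in lines per second via the Prometheus HTTP API;
# adjust the address and metric name to your setup.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(loki_distributor_lines_received_total[1m]))'
```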
Using the new release 0.1.0, the new chart, and the suggested config seems to have resolved the problem.
I think you have too high a volume to run on a single node.
What's the amount of logs you're sending? With 30k lines/sec I'm all good with the config above, but the default config blows up in less than 5 minutes. I'm asking because we might want to update the default values in helm if this is a common problem.
I'm happy to confirm that I am able to query the logs with very few issues even several days after applying this config. I would prefer if these were set as the default values for now until we have a better configuration. It would be nice if others could confirm the fix works for them. The rate of logs in my case is around
I'll make those the default values and will work on a distributed version of the charts so you can scale higher if needed.
@gregwebs can you confirm?
My use case was similar to @agolomoodysaada's; we saw mostly 1-2k/s with spikes to 5k/s. A distributed helm chart deployment would be very welcome.
@Kuqd our logging needs are modest. Some downtime with log viewing is not a problem for us either. We really can't be deploying Consul, memcached, and all these other things; at that point we will have to do a resource comparison with Elasticsearch. Is there a deployment model that doesn't leak memory and doesn't require running lots of coordinated processes? Or is there a distributed deployment with low resource requirements?
Yes, definitely.
Update here... I'm sad to announce that after 4 days of logs, the issue came back. So it is still not resolved with the changes mentioned previously. :/
What was the query you used when it went OOM?
Practically any query... but specifically:
Adding my two cents: I'm having Loki consistently crash when I reach a certain time span with simple queries. Querying just one app out of around 10 running in my cluster for the last 24 hours results in OOM. The memory limit for Loki is 1 GiB, so it's entirely possible that all logs of that app since Loki was deployed would fit in memory.
Thank you for your patience; this is under investigation.
I've made further investigation and it seems like there is some sort of "log line of death" or corrupted storage chunk. I've copied the URL that LogCLI produces and manually queried Loki with smaller and smaller windows with this script:

```sh
from=$(date -d 2019-06-17T00:00:00Z +%s)
to=$(date -d 2019-06-26T00:00:00Z +%s)
step=$((60))
start=$from
end=$(($from+$step))
while [[ $end -le $to ]]; do
  echo $(date -d @$start +%c) $(date -d @$end +%c)
  curl 'http://localhost:3100/api/prom/query?query=<REDACTED>&limit=1000&start='$start'000000000&end='$end'000000000&direction=BACKWARD&regexp=<REDACTED>'
  start=$(($start+$step))
  end=$(($end+$step))
done
```

From Grafana's default 6 hour window down to 1 minute, Loki consistently crashes on the same window. Here is what a typical successful request log looks like:
And here is the diff between any successful and a crashing request:

```diff
1,2c1,2
< caller=logging.go:44 traceID=123 msg="GET /api/prom/query?query=%7Bjob%3D%22foo%2Fapp%22%7D&limit=1000&start=1560835500000000000&end=1560835560000000000&direction=BACKWARD&regexp=%2Fr%123 (200) 5.505192ms"
< caller=http.go:111 request="&QueryRequest{Query:{job=\"foo/app\"},Limit:1000,Start:2019-06-18 05:26:00 +0000 UTC,End:2019-06-18 05:27:00 +0000 UTC,Direction:BACKWARD,Regex:/r/123,}"
---
> caller=logging.go:44 traceID=123 msg="GET /api/prom/query?query=%7Bjob%3D%22foo%2Fapp%22%7D&limit=1000&start=1560835620000000000&end=1560835680000000000&direction=BACKWARD&regexp=%2Fr%123 (200) 7.023626ms"
> caller=http.go:111 request="&QueryRequest{Query:{job=\"foo/app\"},Limit:1000,Start:2019-06-18 05:28:00 +0000 UTC,End:2019-06-18 05:29:00 +0000 UTC,Direction:BACKWARD,Regex:/r/123,}"
8c8
< caller=grpc_logging.go:57 method=/logproto.Querier/Query duration=265.109µs msg="gRPC (success)"
---
> caller=grpc_logging.go:57 method=/logproto.Querier/Query duration=270.777µs msg="gRPC (success)"
18a19,27
> caller=consul_client_mock.go:99 msg=Get key=ring wait_index=0
> caller=consul_client_mock.go:121 msg=Get key=ring modify_index=168 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9a\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:72 msg=CAS key=ring modify_index=168 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9f\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:121 msg=Get key=ring modify_index=169 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9f\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:99 msg=Get key=ring wait_index=169
> caller=grpc_logging.go:40 duration=2.297117ms method=/logproto.Pusher/Push msg="gRPC (success)"
> caller=logging.go:44 traceID=423af07e4ffb991c msg="POST /api/prom/push (204) 131.668413ms"
> caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=73.791004ms msg="gRPC (success)"
> caller=logging.go:44 traceID=57c725206435677c msg="POST /api/prom/push (204) 189.645418ms"
```
Also unable to perform most queries beyond just a few lines, even something simple. Memory usage stayed high afterwards. Running on a single dedicated node with 18 promtail nodes and a modest (25-50K lines/sec) amount of logging.
This is normal; Go will release the memory slowly but reuse it if needed. 18 promtails at 50k/sec is a lot for a single node. Would you mind sharing how many streams you are targeting (unique log series the query will go through) and the exact query you used (including direction/limit/LogQL/start/end)?
Here's an example command line:

```sh
$ logcli --addr http://loki:3100 query --limit=1 --no-labels '{app="enclosure-core",channel="newdb",namespace="production"}'
```

Notice the `--limit=1`. How do I find "how many streams"? The query in question will match the output of a series of Kubernetes pods that have been doing rolling restarts over many days, so they'll have a lot of different pod labels. Why would ingest performance affect query performance? According to my Prometheus metrics, the Loki container is consuming about 2-4% CPU, which suggests it's able to handle the load.
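One possible way to answer the "how many streams" question, assuming Prometheus scrapes the ingester and the `loki_ingester_memory_streams` gauge is available; the metric name and Prometheus address are assumptions, not quoted from the thread:

```sh
# Count of active in-memory streams held across ingesters.
curl -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=sum(loki_ingester_memory_streams)'
```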
Alright, this is merged; you should not OOM anymore. Please give it a try and report back any issues. Also, there is another PR coming that will ensure that even if you run a query that touches many different unique label sets, it will load only a subset of them.
@Kuqd there is no chart version update or git tag. How should I update my installation?
@zarbis just set the image tag in the helm chart values to
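A minimal sketch of overriding the tag with Helm, assuming the chart exposes an `image.tag` value and that the release and chart names used here match your install; the actual tag value referred to above is not shown in the thread:

```sh
# Hypothetical: keep existing values and only swap the image tag.
LOKI_IMAGE_TAG="master"   # placeholder value, not from the thread
helm upgrade loki loki/loki --reuse-values --set image.tag="$LOKI_IMAGE_TAG"
```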
Note that since we merge the labels from storage, memory usage might still be high, but it should be acceptable. In our production cluster, a labels query over 6 hours of time consumes 1 to 2 GiB of memory during the request. We're planning work to improve this.
I can already report that queries that used to OOM right away are now working as expected.
I was experiencing a similar issue, maybe some nasty lines in some logs.
I'm going to close this issue as soon as we have the batch iterator in, unless someone still has issues. And thank you for the report! Appreciated!
I would like to second @bigbrozer in that.
Fixed entirely by 0.2.
Does anyone still have problems with Loki querier OOMs in 2022? We noticed this using Grafana Explore's log volume query with around 1000 log entries. Thanks for your help in advance.
I'm getting OOMs and rolling Loki restarts very frequently. The pod holds almost 4 GiB of memory even without any querying.
Hi, I am facing the same issue with Loki version 2.6.1.
Same issue here. A trivial query dies even with max lines set to 50.
We're also having this issue with Loki 2.6.1, deployed with helm chart 2.10.2.
We're also having this issue with Loki 2.7.0 when executing a query operation.
Describe the bug
Every time I make a query against Loki running the helm chart with 8G requests and limits, the server goes into OOMKilled and restarts.
To Reproduce
Steps to reproduce the behavior:
`{app="my-app"} exception`
Expected behavior
I should get results. Instead I got crashes.
Environment:
Related to #191