
OOMKilled on every query #613

Closed
agolomoodysaada opened this issue May 22, 2019 · 55 comments

Comments

@agolomoodysaada

Describe the bug
Every time I make a query against Loki (running via the Helm chart with 8G memory requests and limits), the server gets OOMKilled and restarts.

To Reproduce
Steps to reproduce the behavior:

  1. Started loki (SHA or version): master as of today
  2. Started promtail (SHA or version) to tail '...'
  3. Query: {app="my-app"} exception

Expected behavior
I should get results. Instead I got crashes.

Environment:

  • Infrastructure: Kubernetes
  • Deployment tool: helm

Related to #191

@daixiang0
Contributor

daixiang0 commented May 23, 2019

Could you delete the resource limit and try again?

You can try logcli to query directly.

@agolomoodysaada
Author

Without limits, it went up to 26G and got evicted. Basically it wanted all the memory on the VM, so the kubelet killed it.

@cyriltovena
Copy link
Contributor

Hello @agolomoodysaada,

Results should be limited to 1k log lines. Also, two PRs that lazily load chunks were merged recently to avoid this situation.

What are you using to query the logs? What are the query, limit, and range, and how much data do you expect? What is the SHA of the Docker image you are using?

Thank you

@agolomoodysaada
Author

agolomoodysaada commented May 23, 2019

I'm using the Grafana 6.2.0 Explore tab.
The limit is 3000.
The range is 1h.
Disk usage is 5G of chunks.
The version running is whatever is on master.

dependencies:
- name: "loki"
  condition: loki.enabled
  repository: "@loki"
  version: "^0.8.2"
- name: "promtail"
  condition: promtail.enabled
  repository: "@loki"
  version: "^0.7.2"

@gregwebs

I experienced this on several different versions of Loki, including the latest. But I don't even need to query Loki: it leaks memory at a rate of more than 100MiB per minute and is eventually killed by Kubernetes. It seems there are certain containers (maybe producing certain logs?) that it doesn't like, because memory usage is flat until certain things are running. The amount of logs I am producing is very small; it doesn't seem possible to use GiBs of memory even if all the log lines were held in memory.

@steven-sheehy
Contributor

Your Helm chart is a bit old; try the latest. Also, set Loki's extraArgs in the chart to log.level: debug to get more detailed logs, and upload them here.
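
For reference, enabling debug logging via values might look roughly like the snippet below; the exact extraArgs structure is an assumption about the chart version in use, so check it against your chart's values.yaml:

loki:
  extraArgs:
    log.level: debug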

@jlerche

jlerche commented May 24, 2019

In our case, we were able to work around the issue by setting the chunks_retain_period in the config to 3 minutes. The default is 15 minutes according to https://github.com/grafana/loki/blob/master/pkg/ingester/ingester.go#L48

Whereas previously memory would grow until the process was OOM killed, with a 3-minute retain period we reached a steady state of ~1.7GB memory usage.

It would be great to have some more in-depth documentation on these configuration settings, or perhaps some operational guidance.

@danieldabate
Contributor

In our case, we were able to work around the issue by setting the chunks_retain_period in the config to 3 minutes. The default is 15 minutes according to https://github.com/grafana/loki/blob/master/pkg/ingester/ingester.go#L48

Whereas previously memory would grow until the process was OOM killed, with a 3-minute retain period we reached a steady state of ~1.7GB memory usage.

It would be great to have some more in-depth documentation on these configuration settings, or perhaps some operational guidance.

Just in case anyone tries this: the parameter is chunk_retain_period, without the s.

@agolomoodysaada
Author

agolomoodysaada commented May 28, 2019

I'm still experiencing OOMKills at query time with this config:

loki:
  config:
    ingester:
      chunk_retain_period: 3m

dependencies

dependencies:
- name: "loki"
  condition: loki.enabled
  repository: "@loki"
  version: "^0.8.4"
- name: "promtail"
  condition: promtail.enabled
  repository: "@loki"
  version: "^0.7.3"

@cyriltovena
Contributor

Can you share the full config?

@agolomoodysaada
Author

agolomoodysaada commented May 30, 2019

Everything else is using the default Helm chart config.

@cyriltovena
Contributor

Can you send us some profiling information (I'm sorry, but we can't reproduce it) by doing this:

First, forward the Loki port locally, to say 8080:

kubectl port-forward loki 8080:3100

Then run:

go tool pprof -png http://localhost:8080/debug/pprof/heap > out.png

And share the image with us! (This requires the Go toolchain, which shouldn't be difficult to install if you don't have it yet.)

Thank you

@gregwebs

[image: loki-mem-leak (pprof heap profile)]

@agolomoodysaada
Author

agolomoodysaada commented May 31, 2019

Before the query:

[image: loki_before (pprof heap profile)]

Right after the query:

[image: loki_query1 (pprof heap profile)]

A few seconds after the query, right before the crash:

[image: loki_query2 (pprof heap profile)]

Reminder: I'm running with 8G of requests and limits in k8s
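
(In Helm values, those 8G memory requests and limits would look roughly like the sketch below; the loki.resources key is an assumption about the chart layout, not something stated in this thread.)

loki:
  resources:
    requests:
      memory: 8Gi
    limits:
      memory: 8Gi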

@gregwebs

@Kuqd do the graphs make sense to you or would you like us to provide you with something else?

@cyriltovena
Contributor

I don't see any problem, except that the ingesters and queriers are deployed together.

@gregwebs

@Kuqd I think the total size of our current log text could not be more than a megabyte per minute, yet we are leaking memory at a rate of more than 100 megabytes per minute. Why does the deployment topology matter for this memory leak?

We would really like to use Grafana for logs, but it is very difficult to do when the logging system crashes every 5 minutes after consuming all available memory.

@cyriltovena
Contributor

It does matter because ingesters are meant to scale on their own; right now the querier and ingester share the same memory space. I will investigate.

@cyriltovena
Contributor

To be clear, the Helm chart is not production-ready. The Ksonnet deployment is.

@cyriltovena
Contributor

cyriltovena commented Jun 14, 2019

@gregwebs @agolomoodysaada can you try tweaking your config as follows?

loki:
  config:
    ingester:
      chunk_idle_period: 3m
      chunk_retain_period: 1m

Also, can you give me a sense of how many lines per second you are ingesting?

You can use this Prometheus query: sum(rate(loki_distributor_lines_received_total[5m]))

Finally, what does the query look like? Time range? LogQL selector?

Thank you!

@agolomoodysaada
Author

Using the new release 0.1.0, the new chart, and the suggested config seems to have resolved the problem.
Thank you so much! I will update again after a few days, once more logs get stored.

@cyriltovena
Contributor

I think your volume is too high to run on a single node:

chunk_idle_period allows you to flush chunks that don't receive new data. It shouldn't matter too much, but for a single node I think 3m is good.

chunk_retain_period lets you keep flushed chunks in memory for a while; this avoids querying the index storage too often for fresh data, but at the price of more memory used (think of it as a sliding cache; 15m might be too much).

What's the amount of logs you're sending? At 30k lines/sec I'm fine with the config above, but the default config blows up in less than 5 minutes.

I'm asking because we might want to update the default values in the Helm chart if this is a common problem.

@agolomoodysaada
Author

agolomoodysaada commented Jun 17, 2019

I'm happy to confirm that I am able to query the logs with very few issues, even several days after applying this config. I would prefer that these were set as the default values for now, until we have a better configuration. It would be nice if others could confirm the fix works for them.

The rate of logs in my case is around 1k/sec, but in some cases I will need up to 5k/sec.

@cyriltovena
Contributor

I'll make those default values and will work on a distributed version of the charts so you can scale higher if needed.

@cyriltovena
Contributor

@gregwebs can you confirm?

@jlerche

jlerche commented Jun 17, 2019

My use case was similar to @agolomoodysaada's: we saw mostly 1-2k lines/s with spikes to 5k/s. I did notice that querying the stderr stream reliably caused the pod to very quickly consume all memory and get OOMKilled.

A distributed Helm chart deployment would be very welcome.

@gregwebs

@Kuqd our logging needs are modest, and some downtime in log viewing is not a problem for us either. We really can't be deploying Consul, memcached, and all these other things; at that point we would have to do a resource comparison with Elasticsearch.

Is there a deployment model that doesn't leak memory and doesn't require running lots of coordinated processes? Or is there a distributed deployment with low resource requirements?

@cyriltovena
Contributor

Yes, definitely.

@agolomoodysaada
Author

An update here: I'm sad to report that after 4 days of logs, the issue came back, so it is still not resolved by the changes mentioned previously. :/

@cyriltovena
Contributor

What was the query that caused the OOM?

@agolomoodysaada
Author

Practically any query, but specifically: {release="myapp"} or {release="myapp"} my custom search string

@zarbis

zarbis commented Jun 26, 2019

Adding my two cents: Loki consistently crashes when I reach a certain time span with simple queries like {job="foo/bar"}12345. The total amount of logs accumulated since Loki was first deployed a month ago is around 20GB.

Querying just one app, out of around 10 running in my cluster, for the last 24 hours results in an OOM. The memory limit for Loki is 1GiB, so it's entirely possible that all logs of that app since the Loki deployment would fit in memory.

@cyriltovena
Contributor

Thank you for your patience; this is under investigation.

@zarbis

zarbis commented Jun 26, 2019

I've investigated further, and it seems like there is some sort of "log line of death" or a corrupted storage chunk.

I copied the URL that LogCLI produces and manually queried Loki with smaller and smaller windows using this script:

# Sweep the time range in 60-second windows to find the window that crashes Loki.
from=$(date -d 2019-06-17T00:00:00Z +%s)
to=$(date -d 2019-06-26T00:00:00Z +%s)
step=60

start=$from
end=$(($from+$step))
while [[ $end -le $to ]]; do
  echo $(date -d @$start +%c) $(date -d @$end +%c)
  # start/end are Unix seconds; the API expects nanoseconds, hence the appended zeros.
  curl 'http://localhost:3100/api/prom/query?query=<REDACTED>&limit=1000&start='$start'000000000&end='$end'000000000&direction=BACKWARD&regexp=<REDACTED>'
  start=$(($start+$step))
  end=$(($end+$step))
done

From Grafana's default 6-hour window down to 1 minute, Loki consistently crashes on the same window.

Here is what a typical successful request log looks like:

caller=logging.go:44 traceID=123 msg="GET /api/prom/query?query=%7Bjob%3D%22foo%2Fapp%22%7D&limit=1000&start=1560835500000000000&end=1560835560000000000&direction=BACKWARD&regexp=%2Fr%2F123 (200) 5.505192ms"
caller=http.go:111 request="&QueryRequest{Query:{job=\"foo/app\"},Limit:1000,Start:2019-06-18 05:26:00 +0000 UTC,End:2019-06-18 05:27:00 +0000 UTC,Direction:BACKWARD,Regex:/r/123,}"
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.GetChunkRefs level=debug metric=logs
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatchers level=debug metricName=logs matchers=1
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug metricName=logs matcher="job=\"foo/app\""
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug queries=1
caller=spanlogger.go:37 org_id=fake trace_id=123 method=cachingIndexClient.cacheFetch level=debug hits=0 misses=1
caller=grpc_logging.go:57 method=/logproto.Querier/Query duration=265.109µs msg="gRPC (success)"
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug entries=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatcher level=debug ids=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupSeriesByMetricNameMatchers level=debug msg="post intersection" ids=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.GetChunkRefs level=debug series-ids=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupChunksBySeries level=debug seriesIDs=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupChunksBySeries level=debug queries=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=cachingIndexClient.cacheFetch level=debug hits=0 misses=2
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.lookupChunksBySeries level=debug entries=90
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.GetChunkRefs level=debug chunk-ids=90
caller=spanlogger.go:37 org_id=fake trace_id=123 method=SeriesStore.GetChunkRefs level=debug chunks-post-filtering=2

And here is the diff between a successful request and the crashing one:

1,2c1,2
< caller=logging.go:44 traceID=123 msg="GET /api/prom/query?query=%7Bjob%3D%22foo%2Fapp%22%7D&limit=1000&start=1560835500000000000&end=1560835560000000000&direction=BACKWARD&regexp=%2Fr%123 (200) 5.505192ms"
< caller=http.go:111 request="&QueryRequest{Query:{job=\"foo/app\"},Limit:1000,Start:2019-06-18 05:26:00 +0000 UTC,End:2019-06-18 05:27:00 +0000 UTC,Direction:BACKWARD,Regex:/r/123,}"
---
> caller=logging.go:44 traceID=123 msg="GET /api/prom/query?query=%7Bjob%3D%22foo%2Fapp%22%7D&limit=1000&start=1560835620000000000&end=1560835680000000000&direction=BACKWARD&regexp=%2Fr%123 (200) 7.023626ms"
> caller=http.go:111 request="&QueryRequest{Query:{job=\"foo/app\"},Limit:1000,Start:2019-06-18 05:28:00 +0000 UTC,End:2019-06-18 05:29:00 +0000 UTC,Direction:BACKWARD,Regex:/r/123,}"
8c8
< caller=grpc_logging.go:57 method=/logproto.Querier/Query duration=265.109µs msg="gRPC (success)"
---
> caller=grpc_logging.go:57 method=/logproto.Querier/Query duration=270.777µs msg="gRPC (success)"
18a19,27
> caller=consul_client_mock.go:99 msg=Get key=ring wait_index=0
> caller=consul_client_mock.go:121 msg=Get key=ring modify_index=168 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9a\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:72 msg=CAS key=ring modify_index=168 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9f\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:121 msg=Get key=ring modify_index=169 value="\"\\x9b\\x10\\xa8\\n!\\n\\x06loki-0\\x12\\x17\\n\\x0f10.60.2.25:9095\\x10\\x9f\\xdf\\xcd\\xe8\\x05\\x12\\r\""
> caller=consul_client_mock.go:99 msg=Get key=ring wait_index=169
> caller=grpc_logging.go:40 duration=2.297117ms method=/logproto.Pusher/Push msg="gRPC (success)"
> caller=logging.go:44 traceID=423af07e4ffb991c msg="POST /api/prom/push (204) 131.668413ms"
> caller=grpc_logging.go:40 method=/logproto.Pusher/Push duration=73.791004ms msg="gRPC (success)"
> caller=logging.go:44 traceID=57c725206435677c msg="POST /api/prom/push (204) 189.645418ms"

@atombender

I'm also unable to perform most queries beyond just a few lines.

Even something like --limit=1000 makes the Loki process's RAM explode to more than 4GB in just a second or two. If I add a time window such as --since=1h, I'm able to get entries back, though memory usage then climbs to about 400MB and stays there.

I had chunk_idle_period: 15m and no chunk_retain_period set. I switched to more conservative values, but I'm still seeing the problem.

This is running on a single dedicated node with 18 promtail nodes and a modest (25-50K lines/sec) amount of logging.

@cyriltovena
Contributor

cyriltovena commented Jun 27, 2019

though memory usage then climbs to about 400MB and stays there.

This is normal; Go will release the memory slowly but reuse it if needed.

18 promtails at 50k lines/sec is a lot for a single node. Would you mind sharing how many streams you are targeting (unique log series the query will go through) and the exact query you used (including direction/limit/LogQL/start/end)?

@atombender

Here's an example command line:

$ logcli --addr http://loki:3100 query --limit=1 --no-labels '{app="enclosure-core",channel="newdb",namespace="production"}'

Notice --limit=1. Even that causes Loki to consume 4GB and be OOM-killed.

How do I find "how many streams"? The query in question will match the output of a series of Kubernetes pods that have been doing rolling restarts over many days, so they'll have a lot of different pod_name labels and so on — I assume they count as separate streams?

Why would ingest performance affect query performance? According to my Prometheus metrics, the Loki container is consuming about 2-4% CPU, which suggests it's able to handle the load.

@cyriltovena
Contributor

Alright, this is merged; you should not OOM anymore. Please give it a try and report back any issues.

Also, there is another PR coming that will ensure that even if you run a query touching many different unique label sets, it will load only a subset of them.

@zarbis

zarbis commented Jul 16, 2019

@Kuqd there is no chart version update or git tag. How should I update my installation?

@steven-sheehy
Contributor

@zarbis just set the image tag in the Helm chart values to master-c08ec79 for both promtail and loki, then run helm upgrade.
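
A values override along these lines should do it; the exact image.tag keys are an assumption about the chart layout, so double-check against the chart's values.yaml:

loki:
  image:
    tag: master-c08ec79
promtail:
  image:
    tag: master-c08ec79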

@cyriltovena
Contributor

Note that since we merge the labels from storage, memory usage might still be high, but it should be acceptable. In our production cluster, a labels query over a 6-hour range consumes 1 to 2 GiB of memory during the request.

We're planning work to improve this.

@jlerche

jlerche commented Jul 23, 2019

I can already report that queries that used to OOM right away are now working as expected.

@bigbrozer

I was experiencing a similar issue, maybe due to some nasty lines in certain logs.
Upgrading the loki + promtail image tags from 0.1.0 to master-7deb02b is now working perfectly.

@cyriltovena
Contributor

cyriltovena commented Jul 25, 2019

I'm going to close this issue as soon as we have the batch iterator in, unless someone still has issues.

And thank you for the report! Appreciated!

@agolomoodysaada
Author

I would like to second @bigbrozer: master-7deb02b appears to have fixed the OOMKill events for us as well.

@cyriltovena
Contributor

Fixed entirely by 0.2.

@renatosis

renatosis commented Jun 10, 2022

Does anyone still have problems with Loki querier OOMs in 2022? We noticed this using Grafana Explore's log volume view when querying around 1,000 log entries. Thanks in advance for any help.

@elsamhaa-g2-travel

I'm getting OOMs and rolling Loki restarts very frequently. The pod holds almost 4Gi of memory even without any querying.

@LinTechSo
Contributor

LinTechSo commented Aug 10, 2022

Hi, I am facing the same issue with Loki version 2.6.1.
Any updates?

@Yatekii

Yatekii commented Aug 27, 2022

Same issue here. A trivial query dies even with the max lines set to 50.

@michalziobro

We're also having this issue with Loki 2.6.1, deployed with Helm chart 2.10.2.

@ningyougang

ningyougang commented Dec 13, 2022

We're also having this issue with Loki 2.7.0 when executing a query:

loki-0                                           0/1     CrashLoopBackOff   3 (13s ago)     91m
loki-1                                           1/1     Running            0               92m
loki-2                                           0/1     OOMKilled          4 (2m54s ago)   90m
