
Memory usage / documentation #616

Closed
nigeisel opened this issue Aug 5, 2020 · 10 comments · Fixed by #662

nigeisel commented Aug 5, 2020

We have a rather small setup with ~1 event / minute and usually no more than one or two users using the web interface at the same time. However, we recorded a combined memory usage well above 4 GB, with both Kafka and ClickHouse using more than 1 GB each.

Since the docs mention 2400 MB of memory as a minimum, I was wondering whether this is actually realistic or if something is going wrong on our side? Are there ways to configure the setup to reduce memory usage? Otherwise, maybe the docs should be updated to show a more realistic minimum requirement (about 5 GB?).

Memory usage: top graphs are Kafka and ClickHouse, followed by the worker and the web container.
[Screenshot: Screen Shot 2020-08-05 at 17 45 50]

Memory usage combined:
[Screenshot: Screen Shot 2020-08-05 at 17 45 19]

@BYK BYK self-assigned this Aug 6, 2020
McSneaky (Contributor) commented Aug 18, 2020

You can reduce system load quite a bit by reducing Kafka's footprint.
Here's an issue about it that also has some workarounds in the comments: #502

Haven't checked exact resource usage before and after, but at least it's not crashing constantly anymore 🙂

renchap (Contributor) commented Aug 24, 2020

I am having a look at the memory usage, as you need over 4 GB of memory to run this without crashes after a few hours/days due to out-of-memory errors and the OOM killer killing either ClickHouse or Kafka.

For Kafka, limiting the number of partitions using KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS=2 as described in #502 really helps.
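For reference, a minimal sketch of how that variable could be applied through a compose override (the "kafka" service name is assumed from the onpremise docker-compose.yml and may differ in your setup; the setting only takes effect when the offsets topic is first created):

```yaml
# docker-compose.override.yml — sketch only; service name assumed
services:
  kafka:
    environment:
      # applies when the __consumer_offsets topic is first created
      KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS: "2"
```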

For ClickHouse, I looked at the settings and found this option: https://clickhouse.tech/docs/en/operations/server-configuration-parameters/settings/#max_server_memory_usage_to_ram_ratio

It allows you to cap ClickHouse's memory use at a fraction of the total memory available on the host. Without it, ClickHouse does not limit the memory used by queries and defaults to a maximum of 10 GB, which is far more than most on-premise Sentry installations need.

I would suggest configuring max_server_memory_usage_to_ram_ratio to something like 0.3, or max_server_memory_usage to 1 GB.

The easiest way to do this seems to be to build a custom ClickHouse container, with one additional file in /etc/clickhouse-server/config.d/sentry.xml containing the custom settings, something like:

<yandex>
  <max_server_memory_usage_to_ram_ratio>0.3</max_server_memory_usage_to_ram_ratio>
</yandex>

I made this change in my local config and will monitor ClickHouse's memory usage over the next day.

If this works I will submit a PR with the change.
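As an alternative to building a custom image, the same settings file could be bind-mounted into the stock container. A sketch, assuming the service is named "clickhouse" and the file sits next to the compose file (both assumptions, adjust to your setup):

```yaml
# docker-compose.override.yml — sketch only; service name and host path assumed
services:
  clickhouse:
    volumes:
      # ClickHouse merges any *.xml dropped into config.d/ into its config
      - ./clickhouse/sentry.xml:/etc/clickhouse-server/config.d/sentry.xml:ro
```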

renchap (Contributor) commented Aug 25, 2020

Update after one day: memory usage is stable, and ClickHouse no longer goes over 30% of total memory on the server.

@BYK can you check with your ClickHouse admins to see if they recommend a specific setting here, like a minimum?

I am not sure if we should set this to a fixed value, or to a percentage of total memory. I will open a PR once you tell me what you / they prefer :)

BYK (Member) commented Aug 25, 2020

@BYK can you check with your ClickHouse admins to see if they recommend a specific setting here, like a minimum?

Calling for a @JTCunning. Please report to the nearest text box.

JTCunning commented

Hello.

max_server_memory_usage_to_ram_ratio is a relatively new setting in ClickHouse (as in order-of-months new, via ClickHouse/ClickHouse@ea67432).

I'd probably set it by default since it tracks more than just running queries, but we've never run with this value limited and cannot comment on what happens to a production system when the limit is reached.

In general, ClickHouse is not happy being bound to such a small amount (less than 10 GB) of RAM. I wouldn't be surprised if we're back here with others asking "how do I get it to return my query results while still keeping memory low?"

renchap (Contributor) commented Aug 25, 2020

In general, ClickHouse is not happy being bound to such a small amount (less than 10 GB) of RAM.

Ouch. I guess most on-premise installs are quite small (a few events per second at most), and the 2400 MB of memory mentioned in the docs fits within this. If an on-premise Sentry install required 12 GB of memory (10 GB for ClickHouse, 2 GB for the other processes), then this should be written in bold in the README, IMO.

JTCunning commented

I wouldn't go so far as to say it's required; I chose "not happy" because the overwhelming majority of development focus within ClickHouse is geared toward scalable production systems where ClickHouse is deployed on isolated machines with far larger resources.

I can't personally comment on what will happen when ClickHouse is deployed with a smaller constraint and expected to perform linearly because that's not my area of focus inside of the Sentry organization, nor is it anyone's at the moment.

I'm stating that if "Things Get Weird" we'd come back to this thread and apply a more scrutinized troubleshooting procedure to the example deployment beyond "set this one setting and see what happens".

BYK (Member) commented Aug 26, 2020

@renchap I'd say let's try this new setting out for a while on your setup and then we can make it the default with a configuration option for larger deployments.

Are you able to share your average and/or peak load for reference?

renchap (Contributor) commented Aug 29, 2020

My load is very very low, 20 events / minute maximum.

I submitted a proposal in #651

renchap added a commit to renchap/sentry-onpremise that referenced this issue Aug 29, 2020
By default, this will configure ClickHouse with a max memory of 30% of the host memory.

Related to getsentry#616
JTCunning commented

@renchap: Please do let us know if you end up hitting any high watermarks for memory that prevent Sentry from functioning properly. The newer memory management techniques in ClickHouse are enticing for us but are not without their fair share of bear traps (like ClickHouse/ClickHouse#12583 being the reason we pin this repo to <20.4).

There are certain queries Sentry can issue that have the potential to use an annoyingly large amount of memory for aggregation and sorting. Asking your team members to "tone it down on the number of distinct tag keys and values" might not be the easiest ask, but if I could peer into my crystal ball and guess what would end up yielding QueryMemoryLimitExceeded, it would definitely be a high cardinality of custom tag key-value pairs.

There are some settings that will instruct ClickHouse to return inaccurate/incomplete results (what it collected up until the limit) instead of throwing an exception. I'd personally say it won't be worth proactively applying those since it would be difficult to tell the difference between a successfully executed query and a query that returned early.
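For anyone who does want to experiment with that trade-off anyway, the relevant knobs appear to be the per-profile overflow settings for aggregation. A hypothetical users.d fragment (the file name and the limit value are illustrative assumptions, not recommendations):

```xml
<!-- /etc/clickhouse-server/users.d/overflow.xml — illustrative values only -->
<yandex>
  <profiles>
    <default>
      <!-- stop aggregating once this many GROUP BY keys have been seen... -->
      <max_rows_to_group_by>1000000</max_rows_to_group_by>
      <!-- ...and return partial results instead of throwing an exception -->
      <group_by_overflow_mode>any</group_by_overflow_mode>
    </default>
  </profiles>
</yandex>
```

As noted above, a query that silently hit this limit is indistinguishable from one that completed, which is why applying it proactively is questionable.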

BYK added a commit that referenced this issue Sep 8, 2020
Closes #616, supersedes #651.

Adds an option to reduce the max memory usage of the ClickHouse server. Sets it to 30% of all available RAM as the default.

Co-authored-by: Renaud Chaput <[email protected]>
@BYK BYK closed this as completed in #662 Sep 8, 2020
@github-actions github-actions bot locked and limited conversation to collaborators Dec 14, 2020