
Relay Server: Not Enough Memory on Health Check even though stats show otherwise #3330

Closed
LordSimal opened this issue Sep 17, 2024 · 47 comments

Comments

@LordSimal

Self-Hosted Version

24.8.0

CPU Architecture

x86_64

Docker Version

27.2.1

Docker Compose Version

2.29.2

Steps to Reproduce

Can't really tell how to reproduce, since it just happens out of nowhere.

Expected Result

Sentry receives errors again

Actual Result

Sentry stops receiving errors after 1-2 days of normal usage.

Checking the docker logs, there are a lot of these entries present:

relay-1                                         | 2024-09-17T06:33:00.437945Z ERROR relay_server::services::health_check: Not enough memory, 32351698944 / 33568419840 (96.38% >= 95.00%)
relay-1                                         | 2024-09-17T06:33:00.437982Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed

but checking htop shows that we have enough RAM
Image

and checking docker container stats shows no containers using > 95% RAM

CONTAINER ID   NAME                                                                CPU %     MEM USAGE / LIMIT     MEM %     NET I/O           BLOCK I/O         PIDS
48244b39b406   sentry-self-hosted-nginx-1                                          0.10%     11.74MiB / 31.26GiB   0.04%     829MB / 832MB     11.8MB / 131kB    13
4188cd46c1fe   sentry-self-hosted-relay-1                                          0.54%     515.6MiB / 31.26GiB   1.61%     896MB / 2.06GB    363MB / 284MB     40
ca96eedda64d   sentry-self-hosted-generic-metrics-consumer-1                       0.56%     342.4MiB / 31.26GiB   1.07%     177MB / 290MB     17.8MB / 82MB     21
5dedcba8621c   sentry-self-hosted-monitors-clock-tick-1                            0.28%     162.4MiB / 31.26GiB   0.51%     36MB / 33.5MB     35.8MB / 29.5MB   6
8e98edce4698   sentry-self-hosted-subscription-consumer-generic-metrics-1          0.28%     323.9MiB / 31.26GiB   1.01%     37.7MB / 34.5MB   13.2MB / 68.7MB   13
ca542ffd958a   sentry-self-hosted-attachments-consumer-1                           0.48%     500.2MiB / 31.26GiB   1.56%     16.9MB / 15.2MB   26.8MB / 66.6MB   19
b300b32205a5   sentry-self-hosted-snuba-replacer-1                                 0.28%     115.6MiB / 31.26GiB   0.36%     35MB / 31.5MB     20.2MB / 68.9MB   5
1a77958a745c   sentry-self-hosted-ingest-monitors-1                                0.37%     169.8MiB / 31.26GiB   0.53%     17.2MB / 15.8MB   41.3MB / 19.8MB   11
b089c100ada2   sentry-self-hosted-worker-1                                         5.47%     1.444GiB / 31.26GiB   4.62%     9.41GB / 13.3GB   238MB / 126MB     227
8a69077b8025   sentry-self-hosted-snuba-replays-consumer-1                         0.46%     157.2MiB / 31.26GiB   0.49%     35.4MB / 31.9MB   2.92MB / 123MB    30
d77331751542   sentry-self-hosted-events-consumer-1                                0.37%     281.7MiB / 31.26GiB   0.88%     974MB / 1.02GB    18.1MB / 114MB    17
9414826f3339   sentry-self-hosted-subscription-consumer-transactions-1             0.26%     258.9MiB / 31.26GiB   0.81%     37.3MB / 34MB     27.9MB / 137MB    13
e59e3a288fbf   sentry-self-hosted-vroom-1                                          0.00%     11.75MiB / 31.26GiB   0.04%     233kB / 0B        34.8MB / 4.35MB   11
0acb817cf62d   sentry-self-hosted-snuba-subscription-consumer-events-1             0.39%     147.5MiB / 31.26GiB   0.46%     40.9MB / 34.8MB   7.13MB / 23.4MB   9
f996440e6e0d   sentry-self-hosted-post-process-forwarder-issue-platform-1          0.57%     274.9MiB / 31.26GiB   0.86%     71.8MB / 65.6MB   22.7MB / 121MB    18
c9578898e6f1   sentry-self-hosted-sentry-cleanup-1                                 0.00%     7.328MiB / 31.26GiB   0.02%     260kB / 28.5kB    136MB / 557kB     6
c148234ce38f   sentry-self-hosted-subscription-consumer-events-1                   0.27%     226.4MiB / 31.26GiB   0.71%     36.4MB / 33.1MB   16.8MB / 171MB    13
4651f280662d   sentry-self-hosted-metrics-consumer-1                               0.50%     339.4MiB / 31.26GiB   1.06%     34.8MB / 31.4MB   21.2MB / 65.4MB   21
91bc0382285e   sentry-self-hosted-ingest-profiles-1                                0.27%     155.3MiB / 31.26GiB   0.49%     33.5MB / 30MB     15.7MB / 37MB     6
82f8557caafa   sentry-self-hosted-snuba-metrics-consumer-1                         0.55%     191.5MiB / 31.26GiB   0.60%     34.6MB / 31.4MB   2.88MB / 96.5MB   34
1c7c176be51e   sentry-self-hosted-transactions-consumer-1                          0.40%     347.4MiB / 31.26GiB   1.09%     30.6MB / 29.6MB   26.3MB / 42.5MB   17
7f04ed421f85   sentry-self-hosted-post-process-forwarder-errors-1                  0.69%     350MiB / 31.26GiB     1.09%     32.8MB / 21.3MB   18.2MB / 43.4MB   23
d8ad278dcd0a   sentry-self-hosted-ingest-occurrences-1                             0.52%     146.3MiB / 31.26GiB   0.46%     36.9MB / 33.3MB   31.5MB / 58.7MB   16
a77658238b91   sentry-self-hosted-snuba-subscription-consumer-metrics-1            0.39%     132.6MiB / 31.26GiB   0.41%     36.2MB / 33.2MB   9.47MB / 41.3MB   9
8c9d85e099b8   sentry-self-hosted-web-1                                            0.12%     734.6MiB / 31.26GiB   2.29%     90.4MB / 332MB    346MB / 194MB     41
74ad5875b983   sentry-self-hosted-monitors-clock-tasks-1                           0.25%     147.3MiB / 31.26GiB   0.46%     35.3MB / 31.9MB   16.9MB / 49MB     6
12994ef232b6   sentry-self-hosted-billing-metrics-consumer-1                       0.46%     157.6MiB / 31.26GiB   0.49%     63.6MB / 37.2MB   13.5MB / 36MB     9
f7f6a96c18ec   sentry-self-hosted-ingest-replay-recordings-1                       0.43%     159.8MiB / 31.26GiB   0.50%     36.2MB / 32.7MB   21.8MB / 35.5MB   13
c0d724db862f   sentry-self-hosted-snuba-issue-occurrence-consumer-1                0.55%     333.4MiB / 31.26GiB   1.04%     34.8MB / 31.3MB   25.1MB / 61.4MB   41
37fb6a1dbf5e   sentry-self-hosted-cron-1                                           0.00%     179.1MiB / 31.26GiB   0.56%     17.4MB / 146MB    28.9MB / 45.1MB   3
6c46a69fa981   sentry-self-hosted-snuba-outcomes-billing-consumer-1                0.35%     200.9MiB / 31.26GiB   0.63%     18.3MB / 16.5MB   11MB / 49.2MB     26
564d6cedc04a   sentry-self-hosted-post-process-forwarder-transactions-1            0.64%     397.1MiB / 31.26GiB   1.24%     5.76GB / 462MB    21.2MB / 104MB    23
f135db607aec   sentry-self-hosted-subscription-consumer-metrics-1                  0.28%     298.9MiB / 31.26GiB   0.93%     36.5MB / 33.3MB   19.1MB / 94MB     13
c3ebde813802   sentry-self-hosted-snuba-transactions-consumer-1                    0.47%     322.3MiB / 31.26GiB   1.01%     21.4MB / 17MB     21.4MB / 60.4MB   37
4e4475215c07   sentry-self-hosted-ingest-feedback-events-1                         0.39%     238.4MiB / 31.26GiB   0.74%     35.7MB / 32.1MB   12.8MB / 156MB    15
d266f38dc1aa   sentry-self-hosted-snuba-spans-consumer-1                           0.35%     179MiB / 31.26GiB     0.56%     142MB / 582MB     11.3MB / 101MB    26
756587f4a091   sentry-self-hosted-snuba-generic-metrics-gauges-consumer-1          0.54%     181.1MiB / 31.26GiB   0.57%     117MB / 37.3MB    3.19MB / 120MB    34
ea958e5724d3   sentry-self-hosted-snuba-profiling-profiles-consumer-1              0.34%     128.8MiB / 31.26GiB   0.40%     35MB / 31.5MB     2.27MB / 120MB    26
cc77019cca49   sentry-self-hosted-symbolicator-cleanup-1                           0.00%     4.367MiB / 31.26GiB   0.01%     234kB / 0B        35.2MB / 0B       6
8c1ca1507b75   sentry-self-hosted-snuba-profiling-functions-consumer-1             0.34%     134.2MiB / 31.26GiB   0.42%     35MB / 31.5MB     2.53MB / 116MB    26
c479e6353a01   sentry-self-hosted-snuba-generic-metrics-counters-consumer-1        0.57%     255.2MiB / 31.26GiB   0.80%     18MB / 16.3MB     10.8MB / 23.2MB   34
a64b4393273e   sentry-self-hosted-snuba-errors-consumer-1                          0.57%     233.1MiB / 31.26GiB   0.73%     360MB / 317MB     4.6MB / 57.9MB    34
475833b0af89   sentry-self-hosted-snuba-generic-metrics-sets-consumer-1            0.67%     213.9MiB / 31.26GiB   0.67%     118MB / 41.6MB    6.73MB / 95.7MB   34
8a412f926058   sentry-self-hosted-snuba-generic-metrics-distributions-consumer-1   0.56%     270.6MiB / 31.26GiB   0.85%     124MB / 559MB     4.73MB / 60.4MB   34
4be03fe74a97   sentry-self-hosted-snuba-outcomes-consumer-1                        0.33%     162.1MiB / 31.26GiB   0.51%     33.8MB / 30.4MB   2.17MB / 92.4MB   26
f9689ba0b412   sentry-self-hosted-snuba-group-attributes-consumer-1                0.47%     322MiB / 31.26GiB     1.01%     34.8MB / 31.7MB   17.9MB / 86.7MB   37
acb9904b4aa4   sentry-self-hosted-snuba-subscription-consumer-transactions-1       0.38%     123.5MiB / 31.26GiB   0.39%     42.5MB / 36.3MB   8.68MB / 41.4MB   9
c44eb35c0248   sentry-self-hosted-vroom-cleanup-1                                  0.00%     3.602MiB / 31.26GiB   0.01%     234kB / 0B        8.87MB / 0B       6
e896238b7399   sentry-self-hosted-memcached-1                                      0.03%     23.07MiB / 31.26GiB   0.07%     673MB / 1.81GB    8.15MB / 2.45MB   10
f5170f49f238   sentry-self-hosted-snuba-api-1                                      0.05%     113.7MiB / 31.26GiB   0.36%     9.83MB / 16.3MB   66.3MB / 68.9MB   5
f7f090a153a3   sentry-self-hosted-symbolicator-1                                   0.00%     35MiB / 31.26GiB      0.11%     307kB / 59.4kB    25.5MB / 142MB    38
dec490aea24f   sentry-self-hosted-smtp-1                                           0.00%     1.371MiB / 31.26GiB   0.00%     255kB / 15.7kB    28.8MB / 4.1kB    2
3e57a3611024   sentry-self-hosted-postgres-1                                       0.01%     231.6MiB / 31.26GiB   0.72%     1.53GB / 728MB    17.6GB / 13.2MB   53
c5eaa992ea9b   sentry-self-hosted-kafka-1                                          1.97%     1.257GiB / 31.26GiB   4.02%     5.5GB / 11.8GB    1.73GB / 461MB    111
4bac36975fe0   sentry-self-hosted-clickhouse-1                                     0.39%     469MiB / 31.26GiB     1.47%     1.71GB / 83.7MB   3.21GB / 93.3MB   481
f1f35f95d7a7   sentry-self-hosted-redis-1                                          0.17%     56.23MiB / 31.26GiB   0.18%     11.9GB / 7.22GB   763MB / 2.97MB    5
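
For completeness, here is a quick way to cross-check the numbers Relay logs against what the host itself reports (whether Relay counts cached/buffered memory differently from htop is just a guess on my part):

# Total/used/available memory in bytes, as the kernel reports it
free -b

# MemAvailable is what htop-style "used" is effectively derived from
grep -E 'MemTotal|MemAvailable|MemFree|Buffers|^Cached' /proc/meminfo

# On a cgroup v2 host, the cgroup-level view can be checked as well
cat /sys/fs/cgroup/memory.max /sys/fs/cgroup/memory.current 2>/dev/null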

docker_compose_logs.txt
latest_install_logs.txt

Event ID

No response

@LordSimal
Author

Duplicate of #3327

@barisyild

I think it's a memory leak.

@LordSimal
Author

LordSimal commented Sep 17, 2024

What I can confirm is that upgrading to 24.9.0 did NOT fix the issue.
After a few hours of incoming events I repeatedly get the same

relay-1                                         | 2024-09-17T14:16:40.186772Z ERROR relay_server::services::health_check: Not enough memory, 32202633216 / 33568419840 (95.93% >= 95.00%)
relay-1                                         | 2024-09-17T14:16:40.186811Z ERROR relay_server::services::health_check: Health check probe 'system memory' failed
relay-1                                         | 2024-09-17T14:16:40.449995Z ERROR relay_server::endpoints::common: error handling request error=failed to queue envelope

error in the docker compose logs.

@hubertdeng123
Member

What does the event volume look like for you? Did this start happening after upgrading to 24.8.0?

@LordSimal
Author

LordSimal commented Sep 18, 2024

We did the 24.8.0 Sentry update on the 1st of September.
This is our stats page for the last 30 days
Image

As you can see, there are sections where it works fine, but then for a few hours, sometimes even for days, no events are processed.

Image

@hubertdeng123
Member

Could you track your RAM/CPU usage as well? Wondering if there is a correlation there.
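
If you don't have monitoring set up yet, even a simple loop like this would already help (just a sketch; the log file name is arbitrary):

# Log per-container CPU and memory once a minute
while true; do
  date >> docker-stats.log
  docker stats --no-stream --format '{{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}' >> docker-stats.log
  sleep 60
done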

@LordSimal
Author

I can also see errors related to getsentry/snuba#5707 in my logs

postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] ERROR:  duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] DETAIL:  Key (project_id, environment_id)=(76, 1) already exists.
postgres-1                                      | 2024-09-20 11:04:34.677 UTC [660670] STATEMENT:  INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (76, 1, NULL) RETURNING "sentry_environmentproject"."id"
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] ERROR:  duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] DETAIL:  Key (group_id, release_id, environment)=(385, 413, production) already exists.
postgres-1                                      | 2024-09-20 11:04:34.692 UTC [660670] STATEMENT:  INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 385, 413, 'production', '2024-09-20T11:04:33.517322+00:00'::timestamptz, '2024-09-20T11:04:33.517322+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] ERROR:  duplicate key value violates unique constraint "sentry_grouprelease_group_id_release_id_envi_044354c8_uniq"
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] DETAIL:  Key (group_id, release_id, environment)=(6909, 413, production) already exists.
postgres-1                                      | 2024-09-20 11:04:34.699 UTC [660673] STATEMENT:  INSERT INTO "sentry_grouprelease" ("project_id", "group_id", "release_id", "environment", "first_seen", "last_seen") VALUES (76, 6909, 413, 'production', '2024-09-20T11:04:33.688391+00:00'::timestamptz, '2024-09-20T11:04:33.688391+00:00'::timestamptz) RETURNING "sentry_grouprelease"."id"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/49/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:38 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] ERROR:  duplicate key value violates unique constraint "sentry_environmentprojec_project_id_environment_i_91da82f2_uniq"
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] DETAIL:  Key (project_id, environment_id)=(49, 14) already exists.
postgres-1                                      | 2024-09-20 11:04:39.833 UTC [660680] STATEMENT:  INSERT INTO "sentry_environmentproject" ("project_id", "environment_id", "is_hidden") VALUES (49, 14, NULL) RETURNING "sentry_environmentproject"."id"
clickhouse-1                                    | 2024.09.20 11:04:40.173549 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x0000000013154417 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 4. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 8. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1                                    | 9. ? @ 0x00007f2c86de5353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:45 +0000] "POST /api/2/envelope/ HTTP/1.0" 200 41 "-" "sentry.php/4.9.0" "2a01:aea0:df3:1::153"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:46 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
nginx-1                                         | 144.208.193.56 - - [20/Sep/2024:11:04:48 +0000] "POST /api/14/envelope/ HTTP/1.0" 200 41 "-" "sentry.php.wordpress/8.1.0" "91.227.205.222"
clickhouse-1                                    | 2024.09.20 11:04:48.941366 [ 188178 ] {} <Error> ServerErrorHandler: Poco::Exception. Code: 1000, e.code() = 107, Net Exception: Socket is not connected, Stack trace (when copying this message, always include the lines below):
clickhouse-1                                    | 
clickhouse-1                                    | 0. Poco::Net::SocketImpl::error(int, String const&) @ 0x0000000015b3dbf2 in /usr/bin/clickhouse
clickhouse-1                                    | 1. Poco::Net::SocketImpl::peerAddress() @ 0x0000000015b40376 in /usr/bin/clickhouse
clickhouse-1                                    | 2. DB::ReadBufferFromPocoSocket::ReadBufferFromPocoSocket(Poco::Net::Socket&, unsigned long) @ 0x000000000c896cc6 in /usr/bin/clickhouse
clickhouse-1                                    | 3. DB::HTTPServerRequest::HTTPServerRequest(std::shared_ptr<DB::IHTTPContext>, DB::HTTPServerResponse&, Poco::Net::HTTPServerSession&) @ 0x000000001315451b in /usr/bin/clickhouse
clickhouse-1                                    | 4. DB::HTTPServerConnection::run() @ 0x0000000013152ba4 in /usr/bin/clickhouse
clickhouse-1                                    | 5. Poco::Net::TCPServerConnection::start() @ 0x0000000015b42834 in /usr/bin/clickhouse
clickhouse-1                                    | 6. Poco::Net::TCPServerDispatcher::run() @ 0x0000000015b43a31 in /usr/bin/clickhouse
clickhouse-1                                    | 7. Poco::PooledThread::run() @ 0x0000000015c7a667 in /usr/bin/clickhouse
clickhouse-1                                    | 8. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000015c7893c in /usr/bin/clickhouse
clickhouse-1                                    | 9. ? @ 0x00007f2c86ec0609 in ?
clickhouse-1                                    | 10. ? @ 0x00007f2c86de5353 in ?
clickhouse-1                                    |  (version 23.8.11.29.altinitystable (altinity build))

@LordSimal
Author

I will have to get some system metrics monitoring running to give you the requested info.

@aldy505
Collaborator

aldy505 commented Sep 21, 2024

Hey @LordSimal can you try this:

On your relay/config.yml file (https://github.com/getsentry/self-hosted/blob/master/relay/config.example.yml), add a health section, so it'd be:

relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
logging:
  level: WARN
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
health:
  max_memory_percent: 1.0

Then run sudo docker compose up -d relay (or sudo docker compose --env-file .env.custom up -d relay). If nothing changes, try restarting the relay container.

Thanks to @Dav1dde
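
To verify the new limit is actually picked up, you can poll Relay's health endpoint from inside the compose network once it has restarted (the endpoint path is from memory, double-check it against the Relay docs; the network name is the compose default):

# Exit code 0 means Relay reports itself as ready
docker run --rm --network sentry-self-hosted_default curlimages/curl \
  -sf http://relay:3000/api/relay/healthcheck/ready/ && echo "relay reports ready"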

@bijancot

Hey @LordSimal can you try this:

On your relay/config.yml file (https://github.com/getsentry/self-hosted/blob/master/relay/config.example.yml), add a health section, so it'd be:

relay:
  upstream: "http://web:9000/"
  host: 0.0.0.0
  port: 3000
logging:
  level: WARN
processing:
  enabled: true
  kafka_config:
    - {name: "bootstrap.servers", value: "kafka:9092"}
    - {name: "message.max.bytes", value: 50000000} # 50MB
  redis: redis://redis:6379
  geoip_path: "/geoip/GeoLite2-City.mmdb"
health:
  max_memory_percent: 1.0
Then do sudo docker compose up -d relay (or sudo docker compose --env-file .env.custom up -d relay), if things didn't change, try restarting the relay container.

Thanks to @Dav1dde

Interesting. If you don't mind, could you elaborate on the changes? Is it the same as setting resource limits for each container?

@LordSimal
Author

Just wanna post my current stats of the last 3 days before I do this change

Image

The 21st is not shown in the stats here for some reason, but there haven't been any events since yesterday at 1 PM (1 day and 7 hours ago).

Image

I don't really understand the RAM usage graph here since htop says there is only 13.7 GB used of 31.3 GB
Image

But maybe NGINX Amplify uses a different RAM usage stat than htop.

Do we really have a RAM usage error here? Is 32GB for sentry not enough? This worked fine in older versions with exactly this server.

@LordSimal
Author

Adjusted the relay/config.yml and executed docker compose up -d relay to restart the relay container.

Nothing has changed so far, even though there are definitely events that should be coming in.

Should I try to just run ./install.sh again to do a "fresh restart"? I think this worked in the past.

@barisyild

Just wanna post my current stats of the last 3 days before I do this change

Image

The 21st is not shown in the stats here for some reason but there haven't been any events since yesterday 1PM (1 day 7 hours)

Image

I don't really understand the RAM usage graph here since htop says there is only 13.7 GB used of 31.3 GB
Image

But maybe NGINX Amplify uses a different RAM usage stat than htop.

Do we really have a RAM usage error here? Is 32GB for sentry not enough? This worked fine in older versions with exactly this server.

What is the java process?

@LordSimal
Author

What is the java process?

It's the kafka process:

root@scarecrow:~# ps -ax | grep java
   1969 pts/0    S+     0:00 grep java
  17703 ?        Ssl  183:39 java -Xmx1G -Xms1G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xlog:gc*:file=/var/log/kafka/kafkaServer-gc.log:time,tags:filecount=10,filesize=100M -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-telemetry/* kafka.Kafka /etc/kafka/kafka.properties

@LordSimal
Author

LordSimal commented Sep 22, 2024

I've got some news... I just executed

docker compose down
./install.sh
docker compose up -d

and suddenly the stats page updated and there seem to be events present which were not there previously...

Image

Events are also being processed right now, and my server is pinned at 100% usage.

Image

Seems like something prevented the queue worker from processing the queued events.

After around 15 minutes all queued-up events seem to have been processed, and the load is back to normal. New Sentry events are also showing up pretty much instantly in the UI, as they did in the past.

I will now wait and see if the problem re-occurs.

@barisyild

Yeah

@LordSimal
Author

Will try that the next time it breaks. I already restarted the whole thing.

@bijancot

Just my two cents, but it looks like a max open ports/connections issue could be the root cause: one of the services is too slow at processing requests (or the request volume is simply too high), so nothing gets processed, work piles up, and eventually all resources like memory and CPU end up 100% used.

Is there any method or way to throttle Snuba?
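
If you want to sanity-check the open-connections theory, these host-side numbers are worth a look while it is in the broken state (just a sketch):

# Socket summary (TCP states, orphaned sockets, etc.)
ss -s

# System-wide open file descriptors: allocated vs. maximum
cat /proc/sys/fs/file-nr

# Conntrack table usage, if the nf_conntrack module is loaded
cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max 2>/dev/null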

@LordSimal
Author

The mystery doesn't stop... No events have been dropped for 4 days, but now the stats page broke 😂

Image

This is fine by me since I don't care about the stats page and I get all my events, but still something weird is going on.

@aldy505
Collaborator

aldy505 commented Oct 2, 2024

The mystery doesn't stop... No events have been dropped since 4 days but only the stats page broke 😂

Image

This is fine by me since I don't care about the stats page and I get all my events, but still something weird is going on.

@LordSimal the stats are handled by the snuba outcomes billing consumer, here:

snuba-outcomes-billing-consumer:
  <<: *snuba_defaults
  command: rust-consumer --storage outcomes_raw --consumer-group snuba-consumers --auto-offset-reset=earliest --max-batch-time-ms 750 --no-strict-offset-reset --raw-events-topic outcomes-billing

Did you see any errors or anything weird coming from that specific container?
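
Something along these lines should surface them (adjust the service name if your compose project names it differently):

docker compose logs --since 24h snuba-outcomes-billing-consumer 2>&1 | grep -iE 'error|panic|exception'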

@LordSimal
Author

snuba.log

Seems like it panicked

@LordSimal
Author

Today at 6 AM (10 hours ago), events stopped coming in again.

Here, again for consistency, are the logs of all containers from the last 12 hours.

12h_logs.txt.gz

I restarted the kafka container via docker compose restart kafka, BUT no events are being processed. So it's not kafka.
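
While it is stuck, it might also be worth checking whether the consumers are lagging on the Kafka side (the tool should ship with the Confluent Kafka image, but that's an assumption on my part):

docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --all-groups | head -n 50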

@LordSimal
Author

But what DOES fix the problem is simply restarting all containers via

docker compose down
docker compose up -d

This of course won't help you understand the root cause, but I don't know what other information I can provide to debug this problem.

@aldy505
Collaborator

aldy505 commented Oct 6, 2024

@LordSimal I saw these in the logs:

relay-1                                  | 2024-10-03T04:20:05.847679Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1                                  | 2024-10-03T04:20:16.953551Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=7
relay-1                                  | 2024-10-03T04:20:21.378061Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1                                  | 2024-10-03T04:20:24.240305Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=8
relay-1                                  | 2024-10-03T04:20:31.549396Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1                                  | 2024-10-03T04:20:31.549355Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=9
relay-1                                  | 2024-10-03T04:20:34.051337Z ERROR relay_server::services::project_upstream: error fetching project state 9c7d767dcc9c4d1e8047bd2ca8f1b4c3: deadline exceeded errors=3 pending=6 tags.did_error=true tags.was_pending=true tags.project_key="9c7d767dcc9c4d1e8047bd2ca8f1b4c3"
relay-1                                  | 2024-10-03T04:20:37.062657Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1                                  | 2024-10-03T04:20:47.608323Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1                                  | 2024-10-03T04:21:03.537675Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1                                  | 2024-10-03T04:21:03.537676Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1                                  | 2024-10-03T04:21:09.048235Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1                                  | %5|1727929270.490|REQTMOUT|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Timed out ProduceRequest in flight (after 60316ms, timeout #0)
relay-1                                  | %5|1727929270.512|REQTMOUT|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Timed out ProduceRequest in flight (after 60173ms, timeout #1)
relay-1                                  | %4|1727929270.512|REQTMOUT|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: Timed out 3 in-flight, 0 retry-queued, 0 out-queue, 0 partially-sent requests
relay-1                                  | %3|1727929270.514|FAIL|rdkafka#producer-2| [thrd:kafka:9092/bootstrap]: kafka:9092/1001: 3 request(s) timed out: disconnect (after 507721942ms in state UP)

Since relay fails to connect to the web and kafka containers, I suppose this is an internal Docker networking issue. Have you checked whether your Docker engine needs an upgrade?

Found this issue, but this is specifically for Windows: docker/for-win#8861
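
A quick way to test container-to-container connectivity while it is broken, without relying on tools inside the Sentry images (network name is the compose default, adjust if yours differs):

# Attach a throwaway curl container to the compose network and hit the web service
docker run --rm --network sentry-self-hosted_default curlimages/curl -sv -o /dev/null http://web:9000/

# Inspect the network itself for anything odd
docker network inspect sentry-self-hosted_default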

@LordSimal
Author

We use Debian 12 with Docker

root@scarecrow:~# docker --version
Docker version 27.3.1, build ce12230

root@scarecrow:~# dockerd --version
Docker version 27.3.1, build 41ca978

root@scarecrow:~# uptime
 13:30:51 up 9 days,  4:15,  1 user,  load average: 1.02, 0.79, 0.74

root@scarecrow:~# uname -a
Linux scarecrow 6.1.0-25-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.106-3 (2024-08-26) x86_64 GNU/Linux

with this official docker repository

deb [arch=amd64 signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/debian   bookworm stable

@LordSimal
Author

Something is definitely wrong with the docker internal network. Even a simple docker compose down fails when trying to remove the network:

 ✘ Network sentry-self-hosted_default                                           Error                                                                                                    0.0s 
failed to remove network sentry-self-hosted_default: Error response from daemon: error while removing network: network sentry-self-hosted_default id 59fc75ac724a7e9c2bdd2d271826c1c720746cf629f3218f4fa0ed1a2878360f has active endpoints

I had to restart the whole docker service via systemctl restart docker to fix this and get sentry up and running again (and yes, events stopped coming in again 10 hours ago).
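
When that happens, it should also be possible to see which endpoints are still attached and force-disconnect them instead of bouncing the whole daemon (container name is a placeholder):

# List containers still attached to the network
docker network inspect sentry-self-hosted_default --format '{{range .Containers}}{{.Name}} {{end}}'

# Force-disconnect a stale endpoint, then retry docker compose down
docker network disconnect -f sentry-self-hosted_default <container-name>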

@HolgerHatGarKeineNode

Something is definitely wrong with the docker internal network. Even a simple docker compose down fails when trying to remove the network:

 ✘ Network sentry-self-hosted_default                                           Error                                                                                                    0.0s 
failed to remove network sentry-self-hosted_default: Error response from daemon: error while removing network: network sentry-self-hosted_default id 59fc75ac724a7e9c2bdd2d271826c1c720746cf629f3218f4fa0ed1a2878360f has active endpoints

I had to restart the whole docker service via systemctl restart docker to fix this issue and get sentry up and running again (and yes, 10h ago events have stopped coming in again)

We have the same problem. Restarting the docker daemon helps. But this is not a good solution right now.

failed to remove network sentry-self-hosted_default: Error response from daemon: error while removing network: network sentry-self-hosted_default id b866806e46f2e918e029e25d297f534c8d1b40b717cb5fb84e95bc31b4ee9f5d has active endpoints

@LordSimal
Author

Just to be extra sure it's not a RAM issue: we just upgraded from 32 GB to 64 GB, but the problem still persists. Sentry just always takes half of the available RAM as a base, no matter how much you have.

Image

But looking at the previous comments this indeed seems like a docker internal network problem.

root@scarecrow:~/sentry# docker compose logs relay
relay-1  | 2024-10-07T06:52:37.232062Z ERROR relay_server::services::health_check: Health check probe 'auth' failed
relay-1  | 2024-10-07T21:24:28.523152Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=2
relay-1  | 2024-10-07T21:24:35.310055Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=3
relay-1  | 2024-10-07T21:25:43.760583Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-10-07T21:25:58.969227Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1  | 2024-10-07T21:26:44.230481Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-10-07T21:26:49.357327Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1  | 2024-10-07T21:26:55.457333Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1.5s
relay-1  | 2024-10-07T21:27:30.477847Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-10-07T21:27:44.876577Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=3
relay-1  | 2024-10-07T21:27:45.636250Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1  | 2024-10-07T21:27:52.102830Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=4
relay-1  | 2024-10-07T21:27:59.095101Z ERROR relay_server::services::project_upstream: error fetching project states error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out] attempts=5
relay-1  | 2024-10-07T21:27:59.095092Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-10-07T21:28:01.014227Z ERROR relay_server::services::global_config: failed to fetch global config from upstream error=could not send request to upstream error.sources=[error sending request for url (http://web:9000/api/0/relays/projectconfigs/?version=3), operation timed out]
relay-1  | 2024-10-07T21:28:04.372670Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s
relay-1  | 2024-10-09T06:31:31.420631Z  WARN relay_server::services::upstream: network outage, scheduling another check in 0ns
relay-1  | 2024-10-09T06:31:37.501585Z  WARN relay_server::services::upstream: network outage, scheduling another check in 1s

@reaper

reaper commented Oct 10, 2024

Hey @LordSimal, did you configure a low swappiness setting? I encountered a similar issue running Sentry on a virtualized machine when the swappiness was set to 10. Changing it back to the default value (60) resolved the problem for now—though I’ll continue monitoring it.
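
For reference, checking the current value and persisting a change looks like this (the file name under /etc/sysctl.d is arbitrary):

# Current value
cat /proc/sys/vm/swappiness

# Set it back to the default at runtime
sudo sysctl -w vm.swappiness=60

# Persist it across reboots
echo 'vm.swappiness=60' | sudo tee /etc/sysctl.d/99-swappiness.conf
sudo sysctl --system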

@LordSimal
Author

  1. I am not in a virtual machine; I am on bare metal with Debian 12 installed directly.
  2. We have not changed the swappiness config, as can be seen here:
root@scarecrow:~# cat /proc/sys/vm/swappiness
60

@LordSimal
Author

Just to inform you: Sentry regularly stops receiving events after 1-2 days, so I added a crontab entry to automatically restart all containers at 02:00 in the morning to work around this issue.
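
For anyone who wants to replicate the workaround, the crontab entry looks roughly like this (the compose directory and log path are just examples):

# crontab -e (as root): full restart of all Sentry containers at 02:00
0 2 * * * cd /opt/sentry/self-hosted && /usr/bin/docker compose down && /usr/bin/docker compose up -d >> /var/log/sentry-nightly-restart.log 2>&1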

Will try to disable that automatic restart after the next self-hosted update, whenever that releases.

@LordSimal
Author

So we have been using 24.10.0 for a week now and it seems to be more stable. I think we restarted the containers once manually because no events were being processed anymore, but it's not nearly as frequent as it used to be. I'll close this issue since it doesn't seem to be that widespread, so it may be something related to our setup/network.

Thanks to everyone who participated and tried to find this weird bug 👍🏻 hopefully it doesn't return anytime soon.

@jgelens

jgelens commented Nov 4, 2024

I actually experience the same issues (even using 24.10.0). Setting

health:
  max_memory_percent: 1.0

fixes it. But I guess that's more of a workaround than a real solution...
